pdf_info {pdftools} | R Documentation
Utilities based on libpoppler for extracting text, fonts, attachments and metadata from a pdf file.
pdf_info(pdf, opw = "", upw = "")
pdf_text(pdf, opw = "", upw = "")
pdf_data(pdf, opw = "", upw = "")
pdf_fonts(pdf, opw = "", upw = "")
pdf_attachments(pdf, opw = "", upw = "")
pdf_toc(pdf, opw = "", upw = "")
pdf_pagesize(pdf, opw = "", upw = "")
pdf | file path or raw vector with pdf data
opw | string with owner password to open pdf
upw | string with user password to open pdf
The pdf_text function renders all textboxes on a text canvas and returns a character vector of equal length to the number of pages in the PDF file. On the other hand, pdf_data is more low level and returns one data frame per page, containing one row for each textbox in the PDF.
Note that pdf_data requires a recent version of libpoppler which might not be available on all Linux systems. When using pdf_data in R packages, make its use conditional on poppler_config()$has_pdf_data, which shows whether this function can be used on the current system.
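The conditional use described above might look like the following minimal sketch. The helper name extract_pages is hypothetical and not part of pdftools; it simply falls back to pdf_text when pdf_data is unavailable, so the return type differs between the two branches.

```r
library(pdftools)

# Hypothetical helper (not part of pdftools): prefer the per-textbox
# data frames from pdf_data(), fall back to plain page text otherwise.
extract_pages <- function(pdf, opw = "", upw = "") {
  if (isTRUE(poppler_config()$has_pdf_data)) {
    # List with one data frame per page, one row per textbox
    pdf_data(pdf, opw = opw, upw = upw)
  } else {
    # Character vector with one element per page
    pdf_text(pdf, opw = opw, upw = upw)
  }
}

# Same sample file as in the examples; only run if it is present
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
if (file.exists(pdf_file)) {
  pages <- extract_pages(pdf_file)
  length(pages)  # one entry per page in either branch
}
```

Callers that need a consistent return type may prefer to stop with an informative error in the fallback branch instead of switching to pdf_text.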
Poppler is pretty verbose when encountering minor errors in PDF files, especially in pdf_text. These messages are usually safe to ignore; use suppressMessages to hide them altogether.
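As a rough illustration, the call can be wrapped as shown below. The wrapper name quiet_pdf_text is hypothetical. Note that suppressMessages only silences R message conditions, so this sketch assumes the Poppler diagnostics reach R as messages, as the paragraph above implies; warnings and errors are not affected.

```r
library(pdftools)

# Hypothetical wrapper (not part of pdftools): same arguments as
# pdf_text(), but R messages raised during extraction are discarded.
quiet_pdf_text <- function(pdf, opw = "", upw = "") {
  suppressMessages(pdf_text(pdf, opw = opw, upw = upw))
}

# Same sample file as in the examples; only run if it is present
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
if (file.exists(pdf_file)) {
  text <- quiet_pdf_text(pdf_file)
  length(text)  # number of pages
}
```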
Other pdftools: pdf_render_page
# Just a random pdf file
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
text <- pdf_text(pdf_file)
fonts <- pdf_fonts(pdf_file)
files <- pdf_attachments(pdf_file)