pdf_info {pdftools}    R Documentation

PDF utilities

Description

Utilities based on libpoppler for extracting text, fonts, attachments and metadata from a PDF file.

Usage

pdf_info(pdf, opw = "", upw = "")

pdf_text(pdf, opw = "", upw = "")

pdf_data(pdf, opw = "", upw = "")

pdf_fonts(pdf, opw = "", upw = "")

pdf_attachments(pdf, opw = "", upw = "")

pdf_toc(pdf, opw = "", upw = "")

pdf_pagesize(pdf, opw = "", upw = "")

Arguments

pdf

file path or raw vector with pdf data

opw

string with owner password to open pdf

upw

string with user password to open pdf
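
As a minimal sketch of the two accepted forms of the pdf argument, the code below passes the same document first as a file path and then as a raw vector. The two-page test file is generated with base R graphics purely for illustration, and the names pdf_path and pdf_bytes are hypothetical. The file is not encrypted, so opw and upw stay at their default of "".

```r
library(pdftools)

# Build a small two-page PDF with base R graphics so the example is self-contained.
pdf_path <- tempfile(fileext = ".pdf")
grDevices::pdf(pdf_path)
plot.new(); text(0.5, 0.5, "First page")
plot.new(); text(0.5, 0.5, "Second page")
invisible(grDevices::dev.off())

# Form 1: pass the file path.
info_from_path <- pdf_info(pdf_path)

# Form 2: read the file into a raw vector and pass the bytes.
pdf_bytes <- readBin(pdf_path, what = "raw", n = file.size(pdf_path))
info_from_raw <- pdf_info(pdf_bytes)

# Both calls describe the same document.
info_from_path$pages
info_from_raw$pages
```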

Details

The pdf_text function renders all textboxes on a text canvas and returns a character vector with one element per page of the PDF file. The pdf_data function is lower level: it returns one data frame per page, containing one row for each textbox in the PDF.
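
The page-wise shape of the pdf_text result can be checked with a short sketch like the one below. It assumes a throwaway two-page PDF created with base R graphics; pdf_path, page_text and n_pages are illustrative names only.

```r
library(pdftools)

# Build a small two-page PDF so the example does not depend on external files.
pdf_path <- tempfile(fileext = ".pdf")
grDevices::pdf(pdf_path)
plot.new(); text(0.5, 0.5, "First page")
plot.new(); text(0.5, 0.5, "Second page")
invisible(grDevices::dev.off())

# pdf_text returns one string per page.
page_text <- pdf_text(pdf_path)
n_pages <- pdf_info(pdf_path)$pages

# The vector length matches the page count reported by pdf_info.
length(page_text) == n_pages

# Print the rendered text of the first page.
cat(page_text[1])
```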

Note that pdf_data requires a recent version of libpoppler, which might not be available on all Linux systems. When using pdf_data in R packages, make its use conditional on poppler_config()$has_pdf_data, which indicates whether this function can be used on the current system.
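
One way to write such a guard is sketched below. Returning NULL when pdf_data is unavailable is an assumption made for this example; a package might instead fall back to pdf_text or skip the feature. The test PDF and the name word_boxes are likewise illustrative.

```r
library(pdftools)

# Build a small two-page PDF so the example does not depend on external files.
pdf_path <- tempfile(fileext = ".pdf")
grDevices::pdf(pdf_path)
plot.new(); text(0.5, 0.5, "First page")
plot.new(); text(0.5, 0.5, "Second page")
invisible(grDevices::dev.off())

# Call pdf_data only when the installed libpoppler supports it.
word_boxes <- if (isTRUE(poppler_config()$has_pdf_data)) {
  pdf_data(pdf_path)
} else {
  NULL  # assumption for this sketch: signal "not available" with NULL
}

# When available, the result is a list with one data frame per page.
if (!is.null(word_boxes)) {
  length(word_boxes)
  head(word_boxes[[1]])
}
```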

Poppler is pretty verbose when encountering minor errors in PDF files, especially in pdf_text. These messages are usually safe to ignore; use suppressMessages to hide them altogether.
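
A minimal sketch of hiding these messages follows. The generated test file is well formed, so it is not expected to trigger any messages; the point is only to show where suppressMessages wraps the call. The names pdf_path and quiet_text are hypothetical.

```r
library(pdftools)

# Build a small two-page PDF so the example does not depend on external files.
pdf_path <- tempfile(fileext = ".pdf")
grDevices::pdf(pdf_path)
plot.new(); text(0.5, 0.5, "First page")
plot.new(); text(0.5, 0.5, "Second page")
invisible(grDevices::dev.off())

# Wrap the extraction call; the return value is passed through unchanged.
quiet_text <- suppressMessages(pdf_text(pdf_path))
length(quiet_text)
```

Since this hides every message raised during the call, it may be worth applying it only to files already known to produce harmless noise.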

See Also

Other pdftools: pdf_render_page

Examples

# Any PDF file will do; here, the NEWS.pdf from R's doc directory
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
text <- pdf_text(pdf_file)
fonts <- pdf_fonts(pdf_file)
files <- pdf_attachments(pdf_file)

[Package pdftools version 2.1 Index]