Class PDF::Reader
In: lib/pdf/reader/page_text_receiver.rb
lib/pdf/reader/text_receiver.rb
lib/pdf/reader/token.rb
lib/pdf/reader/buffer.rb
lib/pdf/reader/encoding.rb
lib/pdf/reader/register_receiver.rb
lib/pdf/reader/page.rb
lib/pdf/reader/stream.rb
lib/pdf/reader/filter.rb
lib/pdf/reader/cmap.rb
lib/pdf/reader/lzw.rb
lib/pdf/reader/abstract_strategy.rb
lib/pdf/reader/glyph_hash.rb
lib/pdf/reader/reference.rb
lib/pdf/reader/metadata_strategy.rb
lib/pdf/reader/parser.rb
lib/pdf/reader/font.rb
lib/pdf/reader/error.rb
lib/pdf/reader/resource_methods.rb
lib/pdf/reader/object_hash.rb
lib/pdf/reader/standard_security_handler.rb
lib/pdf/reader/print_receiver.rb
lib/pdf/reader/form_xobject.rb
lib/pdf/reader/object_stream.rb
lib/pdf/reader/object_cache.rb
lib/pdf/reader/xref.rb
lib/pdf/reader/pages_strategy.rb
lib/pdf/reader/page_state.rb
lib/pdf/reader.rb
Parent: Object

The Reader class serves as an entry point for parsing a PDF file.

PDF is a page-based file format. Some data is associated with the document as a whole (metadata, bookmarks, etc.), but all visible content is stored under a Page object.

In most use cases for extracting and examining the contents of a PDF, it makes sense to traverse the information using page-based iteration.

In addition to the documentation here, check out the PDF::Reader::Page class.

File Metadata

  reader = PDF::Reader.new("somefile.pdf")

  puts reader.pdf_version
  puts reader.info
  puts reader.metadata
  puts reader.page_count

Iterating over page content

  reader = PDF::Reader.new("somefile.pdf")

  reader.pages.each do |page|
    puts page.fonts
    puts page.images
    puts page.text
  end

Extracting all text

  reader = PDF::Reader.new("somefile.pdf")

  reader.pages.map(&:text)
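
The call above returns an array with one string per page. If a single string is more convenient, the pieces can be joined. The following is a minimal sketch; the helper name full_text and its separator parameter are illustrative and not part of PDF::Reader. It is written against any object that responds to pages, so it does not depend on the gem being loaded.

```ruby
# Hypothetical helper (not part of PDF::Reader): joins the text of every
# page into one string. `reader` is any object that responds to #pages,
# where each page responds to #text.
def full_text(reader, separator = "\n\n")
  reader.pages.map(&:text).join(separator)
end

# Usage with the real library (requires the pdf-reader gem):
#   require 'pdf/reader'
#   puts full_text(PDF::Reader.new("somefile.pdf"))
```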

Extracting content from a single page

  reader = PDF::Reader.new("somefile.pdf")

  page = reader.page(1)
  puts page.fonts
  puts page.images
  puts page.text

Low-level callbacks (à la the current version of PDF::Reader)

  reader = PDF::Reader.new("somefile.pdf")

  page = reader.page(1)
  page.walk(receiver)
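
The receiver in the snippet above is not defined there. As a rough sketch, a receiver can be a plain Ruby object that implements only the callbacks it cares about. The class name TextCollector is hypothetical, and the callback names show_text and show_text_with_positioning are assumed here to correspond to the PDF Tj and TJ text operators; check the PDF::Reader::Page documentation for the callbacks your version emits. Strings arrive as they appear in the content stream, so they may not be UTF-8.

```ruby
# Hypothetical receiver (not part of PDF::Reader): collects the raw string
# operands of text-showing operators in the order they are reported.
class TextCollector
  attr_reader :chunks

  def initialize
    @chunks = []
  end

  # Assumed callback for the Tj operator: one string operand.
  def show_text(string)
    @chunks << string
  end

  # Assumed callback for the TJ operator: an array that mixes strings
  # with numeric position adjustments. Only the strings are kept.
  def show_text_with_positioning(params)
    params.each { |item| @chunks << item if item.is_a?(String) }
  end
end

# Usage with the real library (requires the pdf-reader gem):
#   receiver = TextCollector.new
#   PDF::Reader.new("somefile.pdf").page(1).walk(receiver)
#   p receiver.chunks
```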

Encrypted Files

Depending on the algorithm, it may be possible to parse an encrypted file. For standard PDF encryption you'll need the :password option:

  reader = PDF::Reader.new("somefile.pdf", :password => "apples")

Methods

file   info   metadata   new   object   object_file   object_string   open   page   page_count   pages   parse   pdf_version   string  

Classes and Modules

Module PDF::Reader::ResourceMethods
Class PDF::Reader::Buffer
Class PDF::Reader::EncryptedPDFError
Class PDF::Reader::Font
Class PDF::Reader::FormXObject
Class PDF::Reader::InvalidObjectError
Class PDF::Reader::MalformedPDFError
Class PDF::Reader::ObjectCache
Class PDF::Reader::ObjectHash
Class PDF::Reader::Page
Class PDF::Reader::PageState
Class PDF::Reader::PageTextReceiver
Class PDF::Reader::Parser
Class PDF::Reader::PrintReceiver
Class PDF::Reader::Reference
Class PDF::Reader::RegisterReceiver
Class PDF::Reader::StandardSecurityHandler
Class PDF::Reader::Stream
Class PDF::Reader::TextReceiver
Class PDF::Reader::UnsupportedFeatureError
Class PDF::Reader::XRef

Attributes

objects  [R]  low-level, hash-like access to all objects in the underlying PDF
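
Because objects is described as hash-like, it can be explored with ordinary iteration. The sketch below tallies objects by Ruby class, which gives a quick overview of what a file contains. The helper name count_objects_by_class is illustrative, and it assumes that each yields a reference and an object as a pair, the way a Hash does.

```ruby
# Hypothetical helper (not part of PDF::Reader): counts the values of a
# hash-like collection by their Ruby class. Assumes #each yields
# (reference, object) pairs, as Hash#each does.
def count_objects_by_class(objects)
  counts = Hash.new(0)
  objects.each { |_ref, obj| counts[obj.class] += 1 }
  counts
end

# Usage with the real library (requires the pdf-reader gem):
#   reader = PDF::Reader.new("somefile.pdf")
#   p count_objects_by_class(reader.objects)
```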

Public Class methods

DEPRECATED: this method was deprecated in version 1.0.0 and will eventually be removed.

Parse the file with the given name, sending events to the given receiver.

Creates a new document reader for the provided PDF.

The input can be either a filename or an IO-ish object (StringIO, File, etc.) containing a PDF.

  reader = PDF::Reader.new("somefile.pdf")

  File.open("somefile.pdf","rb") do |file|
    reader = PDF::Reader.new(file)
  end

If the source file is encrypted, you can provide a password for decrypting it:

  reader = PDF::Reader.new("somefile.pdf", :password => "apples")

DEPRECATED: this method was deprecated in version 1.0.0 and will eventually be removed.

Parse the file with the given name, returning an unmarshalled Ruby version of the requested PDF object.

DEPRECATED: this method was deprecated in version 1.0.0 and will eventually be removed.

Parse the given string, returning an unmarshalled Ruby version of the requested PDF object.

Syntactic sugar for opening a PDF file. Accepts the same arguments as new().

  PDF::Reader.open("somefile.pdf") do |reader|
    puts reader.pdf_version
  end

or

  PDF::Reader.open("somefile.pdf", :password => "apples") do |reader|
    puts reader.pdf_version
  end

DEPRECATED: this method was deprecated in version 1.0.0 and will eventually be removed.

Parse the given string, sending events to the given receiver.

Public Instance methods

DEPRECATED: this method was deprecated in version 1.0.0 and will eventually be removed.

Given an IO object that contains PDF data, return the contents of a single object

Returns a single PDF::Reader::Page for the specified page. Use this instead of the pages method when you need to access just a single page.

  reader = PDF::Reader.new("somefile.pdf")
  page   = reader.page(10)

  puts page.text

See the docs for PDF::Reader::Page to read more about the methods available on each page
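
When only a handful of pages is needed, page and page_count can be combined so that only those pages are touched. The following is a sketch under stated assumptions: the helper name text_for_pages is hypothetical, page numbers are taken to start at 1 as in the examples above, and the requested range is clamped to the document rather than relying on how page behaves for out-of-range numbers.

```ruby
# Hypothetical helper (not part of PDF::Reader): returns the text of pages
# first..last (1-based, inclusive), clamped to the pages that exist.
# `reader` must respond to #page_count and #page(n).
def text_for_pages(reader, first, last)
  first = [first, 1].max
  last  = [last, reader.page_count].min
  (first..last).map { |n| reader.page(n).text }
end

# Usage with the real library (requires the pdf-reader gem):
#   reader = PDF::Reader.new("somefile.pdf")
#   puts text_for_pages(reader, 2, 4)
```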

Returns an array of PDF::Reader::Page objects, one for each page in the source PDF.

  reader = PDF::Reader.new("somefile.pdf")

  reader.pages.each do |page|
    puts page.fonts
    puts page.images
    puts page.text
  end

See the docs for PDF::Reader::Page to read more about the methods available on each page

DEPRECATED: this method was deprecated in version 1.0.0 and will eventually be removed.

Given an IO object that contains PDF data, parse it.
