pikepdf API¶
Primary objects¶
-
class
pikepdf.
Pdf
¶ In-memory representation of a PDF
-
Root
¶ the /Root object of the PDF
-
check_linearization
(self: pikepdf._qpdf.Pdf, stream: object=sys.stderr) → None¶ Reports information on the PDF’s linearization
Parameters: stream – A stream to write this information to; must implement .write() and .flush() methods. Defaults to sys.stderr.
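Because the stream only needs .write() and .flush(), the report can be captured in memory instead of being printed. The following is a minimal sketch; the helper name linearization_report is hypothetical, and it assumes the report is written as text (str) rather than bytes.

```python
import io


def linearization_report(pdf):
    """Return the text that pdf.check_linearization() writes, as a str.

    `pdf` is assumed to expose check_linearization(stream) as documented
    above; this helper is illustrative and not part of pikepdf.
    """
    buffer = io.StringIO()  # io.StringIO implements both .write() and .flush()
    pdf.check_linearization(buffer)
    return buffer.getvalue()
```

With a real document this would be called as linearization_report(pdf) after opening the file.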
-
copy_foreign
(self: pikepdf._qpdf.Pdf, arg0: QPDFObjectHandle) → QPDFObjectHandle¶ Copy object from foreign PDF to this one.
-
docinfo
¶ access the document information dictionary
-
filename
¶ the source filename of an existing PDF, when available
-
get_object
(*args, **kwargs)¶ Overloaded function.
get_object(self: pikepdf._qpdf.Pdf, arg0: Tuple[int, int]) -> QPDFObjectHandle
Look up an object by ID and generation number
- Returns:
pikepdf.Object
get_object(self: pikepdf._qpdf.Pdf, arg0: int, arg1: int) -> QPDFObjectHandle
Look up an object by ID and generation number
- Returns:
pikepdf.Object
-
get_warnings
(self: pikepdf._qpdf.Pdf) → List[QPDFExc]¶
-
is_linearized
¶ Returns True if the PDF is linearized.
Specifically returns True iff the file starts with a linearization parameter dictionary. Does no additional validation.
-
make_indirect
(*args, **kwargs)¶ Overloaded function.
make_indirect(self: pikepdf._qpdf.Pdf, arg0: QPDFObjectHandle) -> QPDFObjectHandle
Attach an object to the Pdf as an indirect object
Direct objects appear inline in the binary encoding of the PDF. Indirect objects appear inline as references (in English, “look up object 4 generation 0”) and then read from another location in the file. The PDF specification requires that certain objects are indirect - consult the PDF specification to confirm.
Generally a resource that is shared should be attached as an indirect object.
pikepdf.Stream objects are always indirect, and creating one automatically attaches it to the Pdf.
- See Also:
pikepdf.Object.is_indirect()
- Returns:
pikepdf.Object
make_indirect(self: pikepdf._qpdf.Pdf, arg0: object) -> QPDFObjectHandle
Encode a Python object and attach to this Pdf as an indirect object
- Returns:
pikepdf.Object
-
new
() → pikepdf._qpdf.Pdf¶ create a new empty PDF from scratch
-
open
(filename_or_stream: object, password: str='', hex_password: bool=False, ignore_xref_streams: bool=False, suppress_warnings: bool=True, attempt_recovery: bool=True, inherit_page_attributes: bool=True) → pikepdf._qpdf.Pdf¶ Open an existing file at filename_or_stream.
If filename_or_stream is path-like, the file will be opened. The file should not be modified by another process while it is open in pikepdf.
If filename_or_stream has .read() and .seek() methods, the file will be accessed as a readable binary stream. pikepdf will read the entire stream into a private buffer.
Parameters: - filename_or_stream (os.PathLike) – Filename of PDF to open
- password (str or bytes) – User or owner password to open an encrypted PDF. If a str is given it will be converted to UTF-8.
- hex_password (bool) – If True, interpret the password as a hex-encoded version of the exact encryption key to use, without performing the normal key computation. Useful in forensics.
- ignore_xref_streams (bool) – If True, ignore cross-reference streams. See qpdf documentation.
- suppress_warnings (bool) – If True (default), warnings are not printed to stderr. Use get_warnings() to retrieve warnings.
- attempt_recovery (bool) – If True (default), attempt to recover from PDF parsing errors.
- inherit_page_attributes (bool) – If True (default), push attributes set on a group of pages to individual pages
Raises: - pikepdf.PasswordError – If the password failed to open the file.
- pikepdf.PdfError – If for other reasons we could not open the file.
- TypeError – If the type of filename_or_stream is not usable.
- FileNotFoundError – If the file was not found.
-
open_metadata
(set_pikepdf_as_editor=True, update_docinfo=True)¶ Open the PDF’s XMP metadata for editing
Recommended for use in a with block. Changes are committed to the PDF when the block exits.
Example
>>> with pdf.open_metadata() as meta:
...     meta['dc:title'] = 'Set the Dublin Core Title'
...     meta['dc:description'] = 'Put the Abstract here'
Returns: pikepdf.models.PdfMetadata
-
pdf_version
¶ the PDF standard version, such as ‘1.7’
-
remove_unreferenced_resources
(self: pikepdf._qpdf.Pdf) → None¶ Remove from /Resources of each page any object not referenced in page’s contents
PDF pages may share resource dictionaries with other pages. If pikepdf is used for page splitting, pages may reference resources in their /Resources dictionary that are not actually required. This purges all unnecessary resource entries.
Suggested before saving.
-
root
¶ alias for .Root, the /Root object of the PDF
-
save
(self: pikepdf._qpdf.Pdf, filename: object, static_id: bool=False, preserve_pdfa: bool=True, min_version: str='', force_version: str='', compress_streams: bool=True, stream_decode_level: pikepdf._qpdf.StreamDecodeLevel=StreamDecodeLevel.generalized, object_stream_mode: pikepdf._qpdf.ObjectStreamMode=ObjectStreamMode.preserve, normalize_content: bool=False, linearize: bool=False, qdf: bool=False, progress: object=None) → None¶ Save all modifications to this pikepdf.Pdf
Parameters: - filename (str or stream) – Where to write the output. If a file exists in this location it will be overwritten.
- static_id (bool) – Indicates that the /ID metadata, normally calculated as a hash of certain PDF contents and metadata including the current time, should instead be generated deterministically. Normally for debugging.
- preserve_pdfa (bool) – Ensures that the file is generated in a manner compliant with PDF/A and other stricter variants. This should be True, the default, in most cases.
- min_version (str) – Sets the minimum version of PDF specification that should be required. If left alone QPDF will decide.
- force_version (str) – Override the version recommended by QPDF, potentially creating an invalid file that does not display in old versions. See QPDF manual for details.
- object_stream_mode (pikepdf.ObjectStreamMode) – disable prevents the use of object streams. preserve keeps object streams from the input file. generate uses object streams wherever possible, creating the smallest files but requiring PDF 1.5+.
- compress_streams (bool) – Enables or disables the compression of stream objects in the PDF. Metadata is never compressed. By default this is set to True, and should be except for debugging.
- stream_decode_level (pikepdf.StreamDecodeLevel) – Specifies how to encode stream objects. See documentation for StreamDecodeLevel.
- normalize_content (bool) – Enables parsing and reformatting the content stream within PDFs. This may make debugging PDFs easier.
- linearize (bool) – Enables creating linearized or “fast web view” files, where the file’s contents are organized sequentially so that a viewer can begin rendering before it has the whole file. As a drawback, it tends to make files larger.
- qdf (bool) – Save output in QDF mode. QDF mode is a special output mode in QPDF to allow editing of PDFs in a text editor. Use the program fix-qdf to convert back to a standard PDF.
You may call .save() multiple times with different parameters to generate different versions of a file, and you may continue to modify the file after saving it. .save() does not modify the Pdf object in memory.
Note
Calling pikepdf.Pdf.remove_unreferenced_resources() before saving may eliminate unnecessary resources from the output file, so calling this method before saving is recommended. This is not done automatically because .save() is intended to be idempotent.
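As a rough illustration of calling .save() more than once on the same object, the sketch below writes a linearized copy and a debugging copy. The helper name save_variants, the stem argument and the output naming scheme are hypothetical; only the keyword arguments come from the signature documented above.

```python
def save_variants(pdf, stem):
    """Write a "fast web view" copy and a debugging copy of the same Pdf.

    `pdf` is assumed to behave like pikepdf.Pdf. The file naming scheme
    is an assumption made for this example.
    """
    # Purge unused /Resources entries first; .save() does not do this itself.
    pdf.remove_unreferenced_resources()

    web_name = f"{stem}-web.pdf"
    debug_name = f"{stem}-debug.pdf"

    # Linearized output, which tends to be somewhat larger.
    pdf.save(web_name, linearize=True)

    # Uncompressed, normalized content streams are easier to inspect.
    pdf.save(debug_name, compress_streams=False, normalize_content=True)

    return web_name, debug_name
```

Since .save() leaves the in-memory object unchanged, the second call starts from the same state as the first.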
-
show_xref_table
(self: pikepdf._qpdf.Pdf) → None¶ Pretty-print the Pdf’s xref (cross-reference table)
-
trailer
¶ Provides access to the PDF trailer object.
See section 7.5.5 of the PDF reference manual. Generally speaking, the trailer should not be modified with pikepdf, and modifying it may not work. Some of the values in the trailer are automatically changed when a file is saved.
-
-
pikepdf.
open
(*args, **kwargs)¶ Alias for pikepdf.Pdf.open().
-
class
pikepdf.
ObjectStreamMode
¶ Options for saving object streams within PDFs, which are a more compact way of saving certain types of data and were added in PDF 1.5. All modern PDF viewers support object streams, but some third party tools and libraries cannot read them.
-
disable
¶ Disable the use of object streams. If any object streams exist in the file, remove them when the file is saved.
-
preserve
¶ Preserve any existing object streams in the original file. This is the default behavior.
-
generate
¶ Generate object streams.
-
-
class
pikepdf.
StreamDecodeLevel
¶ -
none
¶ Do not attempt to apply any filters. Streams remain as they appear in the original file. Note that uncompressed streams may still be compressed on output. You can disable that by calling setCompressStreams(false).
-
generalized
¶ This is the default. libqpdf will apply LZWDecode, ASCII85Decode, ASCIIHexDecode, and FlateDecode filters on the input. When combined with setCompressStreams(true), which is the default, the effect of this is that streams filtered with these older and less efficient filters will be recompressed with the Flate filter. As a special case, if a stream is already compressed with FlateDecode and setCompressStreams is enabled, the original compressed data will be preserved.
-
specialized
¶ In addition to uncompressing the generalized compression formats, supported non-lossy compression will also be decoded. At present, this includes the RunLengthDecode filter.
-
all
¶ In addition to generalized and non-lossy specialized filters, supported lossy compression filters will be applied. At present, this includes DCTDecode (JPEG) compression. Note that compressing the resulting data with DCTDecode again will accumulate loss, so avoid multiple compression and decompression cycles. This is mostly useful for retrieving image data.
-
-
exception
pikepdf.
PdfError
¶
-
exception
pikepdf.
PasswordError
¶
Object construction¶
-
class
pikepdf.
Object
¶ -
as_dict
(self: pikepdf._qpdf.Object) → pikepdf._qpdf._ObjectMapping¶
-
as_list
(self: pikepdf._qpdf.Object) → pikepdf._qpdf._ObjectList¶
-
get
(*args, **kwargs)¶ Overloaded function.
- get(self: pikepdf._qpdf.Object, key: str, default_: object=None) -> object
for dictionary objects, behave as dict.get(key, default=None)
- get(self: pikepdf._qpdf.Object, key: pikepdf._qpdf.Object, default_: object=None) -> object
for dictionary objects, behave as dict.get(key, default=None)
-
get_raw_stream_buffer
(self: pikepdf._qpdf.Object) → pikepdf._qpdf.Buffer¶ Return a buffer protocol buffer describing the raw, encoded stream
-
get_stream_buffer
(self: pikepdf._qpdf.Object) → pikepdf._qpdf.Buffer¶ Return a buffer protocol buffer describing the decoded stream
-
is_owned_by
(self: pikepdf._qpdf.Object, arg0: pikepdf._qpdf.Pdf) → bool¶ Test if this object is owned by the indicated possible_owner.
-
items
(self: pikepdf._qpdf.Object) → iterable¶
-
keys
(self: pikepdf._qpdf.Object) → Set[str]¶
-
objgen
¶ Return the object-generation number pair for this object
If this is a direct object, then the returned value is (0, 0). By definition, if this is an indirect object, it has an “objgen”, and can be looked up using this in the cross-reference (xref) table. Direct objects cannot necessarily be looked up.
The generation number is usually 0, except for PDFs that have been incrementally updated.
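The relationship between objgen and Pdf.get_object() can be sketched as follows. The helper names is_direct and lookup_again are hypothetical; the sketch assumes objgen is a 2-tuple and relies on the tuple overload of get_object() listed earlier.

```python
def is_direct(obj):
    """True if obj reports the (0, 0) pair used for direct objects."""
    return tuple(obj.objgen) == (0, 0)


def lookup_again(pdf, obj):
    """Fetch an indirect object from pdf by its objgen pair.

    Direct objects have no xref entry, so they are returned unchanged.
    `pdf` is assumed to behave like pikepdf.Pdf.
    """
    if is_direct(obj):
        return obj
    return pdf.get_object(obj.objgen)
```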
-
page_contents_add
(self: pikepdf._qpdf.Object, contents: pikepdf._qpdf.Object, prepend: bool=False) → None¶ Append or prepend to an existing page’s content stream.
-
page_contents_coalesce
(self: pikepdf._qpdf.Object) → None¶
-
parse
(stream: str, description: str='') → pikepdf._qpdf.Object¶ Parse PDF binary representation into PDF objects.
-
read_bytes
(self: pikepdf._qpdf.Object) → bytes¶ Decode and read the content stream associated with this object
-
read_raw_bytes
(self: pikepdf._qpdf.Object) → bytes¶ Read the content stream associated with this object without decoding
-
unparse
(self: pikepdf._qpdf.Object, resolved: bool=False) → bytes¶ Convert PDF objects into their binary representation, optionally resolving indirect objects.
-
write
(self: pikepdf._qpdf.Object, arg0: bytes, *args, **kwargs) → None¶ Replace the content stream with data, compressed according to filter and decode_parms
Parameters: - data (bytes) – the new data to use for replacement
- filter – The filter(s) with which the data is (already) encoded
- decode_parms – Parameters for the filters with which the object is encoded
If only one filter is specified, it may be a name such as Name(‘/FlateDecode’). If there are multiple filters, then an array of names should be given.
If there is only one filter, decode_parms is a Dictionary of parameters for that filter. If there are multiple filters, then decode_parms is an Array of Dictionary, where each array index corresponds to the filter at the same index.
-
-
class
pikepdf.
Name
¶ Constructs a PDF Name object
Names can be constructed with two notations:
Name.Resources
Name('/Resources')
The two are semantically equivalent. The former is preferred for names that are normally expected to be in a PDF. The latter is preferred for dynamic names and attributes.
-
static
__new__
(cls, name)¶ Create and return a new object. See help(type) for accurate signature.
-
class
pikepdf.
String
¶ Constructs a PDF String object
-
class
pikepdf.
Array
¶ Constructs a PDF Array object
-
static
__new__
(cls, a=None)¶ Parameters: a (iterable) – A list of objects. All objects must be either pikepdf.Object or convertible to pikepdf.Object. Returns: pikepdf.Object
-
-
class
pikepdf.
Dictionary
¶ Constructs a PDF Dictionary object
-
static
__new__
(cls, d=None, **kwargs)¶ Constructs a PDF Dictionary from either a Python dict or keyword arguments.
These two examples are equivalent:
pikepdf.Dictionary({'/NameOne': 1, '/NameTwo': 'Two'})
pikepdf.Dictionary(NameOne=1, NameTwo='Two')
In either case, the keys must be strings, and the strings correspond to the desired Names in the PDF Dictionary. The values must all be convertible to pikepdf.Object.
Returns: pikepdf.Object
-
-
class
pikepdf.
Stream
¶ Constructs a PDF Stream object
-
static
__new__
(cls, owner, obj)¶ Parameters: - owner (pikepdf.Pdf) – The Pdf to which this stream shall be attached.
- obj (bytes or list) – If bytes, the data bytes for the stream. If list, a list of (operands, operator) tuples such as returned by pikepdf.parse_content_stream().
Returns: pikepdf.Object
-
-
class
pikepdf.
Operator
(arg0: str) → pikepdf._qpdf.Object¶ Construct a PDF Operator object for use in content streams
Support models¶
-
pikepdf.
parse_content_stream
(page_or_stream, operators='')¶ Parse a PDF content stream into a sequence of instructions.
A PDF content stream is a list of instructions that describe where to render the text and graphics in a PDF. This is the starting point for analyzing PDFs.
If the input is a page and page.Contents is an array, then the content stream is automatically treated as one coalesced stream.
Each instruction contains at least one operator and zero or more operands.
Parameters: - page_or_stream (pikepdf.Object) – A page object, or the content stream attached to another object such as a Form XObject.
- operators (str) – A space-separated string of operators to whitelist. For example ‘q Q cm Do’ will return only operators that pertain to drawing images. Use ‘BI ID EI’ for inline images. All other operators and associated tokens are ignored. If blank, all tokens are accepted.
Returns: List of (operands, command) tuples where command is an operator (str) and operands is a tuple of str; the PDF drawing command and the command’s operands, respectively.
Return type: list
Example
>>> import pikepdf
>>> pdf = pikepdf.Pdf.open(input_pdf)
>>> page = pdf.pages[0]
>>> for operands, command in pikepdf.parse_content_stream(page):
...     print(command)
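The returned list of (operands, command) tuples can be post-processed with ordinary Python. As a small sketch, the hypothetical helper below tallies operators and accepts the same space-separated whitelist format as the operators parameter. It works on any iterable of such tuples and does not call pikepdf itself.

```python
from collections import Counter


def count_operators(instructions, operators=""):
    """Count how often each operator occurs in parsed instructions.

    `instructions` is an iterable of (operands, command) tuples.
    `operators` is a space-separated whitelist; if blank, every
    operator is counted.
    """
    wanted = set(operators.split())
    counts = Counter()
    for _operands, command in instructions:
        name = str(command)  # tolerate operator objects as well as str
        if not wanted or name in wanted:
            counts[name] += 1
    return counts
```

For example, count_operators(instructions, "q Q cm Do") would tally only the operators related to drawing images.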
-
class
pikepdf.
PdfMatrix
(*args)¶ Support class for PDF content stream matrices
PDF content stream matrices are 3x3 matrices summarized by a shorthand (a, b, c, d, e, f) which corresponds to the first two column vectors. The final column vector is always (0, 0, 1) since this is using homogeneous coordinates.
PDF uses row vectors. That is, vr @ A' gives the effect of transforming a row vector vr=(x, y, 1) by the matrix A'. Most textbook treatments use A @ vc where the column vector vc=(x, y, 1)'.
(@ is the Python matrix multiplication operator added in Python 3.5.)
Addition and other operations are not implemented because they’re not that meaningful in a PDF context (they can be defined and are mathematically meaningful in general).
PdfMatrix objects are immutable. All transformations on them produce a new matrix.
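The row-vector convention can be worked through in plain Python without PdfMatrix. This is only an illustration of the arithmetic; the function names transform and concatenate are hypothetical and are not part of pikepdf.

```python
def transform(point, shorthand):
    """Apply a matrix, given as (a, b, c, d, e, f), to the point (x, y).

    Computes the row vector (x, y, 1) times the 3x3 matrix whose rows
    are (a, b, 0), (c, d, 0) and (e, f, 1).
    """
    x, y = point
    a, b, c, d, e, f = shorthand
    return (a * x + c * y + e, b * x + d * y + f)


def concatenate(first, second):
    """Return the shorthand of `first` followed by `second`.

    With row vectors, applying `first` and then `second` is the matrix
    product first @ second.
    """
    a1, b1, c1, d1, e1, f1 = first
    a2, b2, c2, d2, e2, f2 = second
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


scale = (2, 0, 0, 2, 0, 0)  # double both axes
shift = (1, 0, 0, 1, 5, 5)  # move by (5, 5)
```

Order matters: concatenate(scale, shift) maps the point (1, 1) to (7, 7), while concatenate(shift, scale) maps it to (12, 12).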
-
a
¶
-
b
¶
-
c
¶
-
d
¶
-
e
¶
-
f
¶ Return one of the six “active values” of the matrix.
-
encode
()¶ Encode this matrix in binary suitable for including in a PDF
-
static
identity
()¶ Constructs and returns an identity matrix
-
rotated
(angle_degrees_ccw)¶ Concatenates a rotation matrix on this matrix
-
scaled
(x, y)¶ Concatenates a scaling matrix on this matrix
-
shorthand
¶ Return the 6-tuple (a,b,c,d,e,f) that describes this matrix
-
translated
(x, y)¶ Translates this matrix
-
-
class
pikepdf.
PdfImage
(obj)¶ Support class to provide a consistent API for manipulating PDF images
The data structure for images inside PDFs is irregular and flexible, making it difficult to work with without introducing errors for less typical cases. This class addresses these difficulties by providing a regular, Pythonic API similar in spirit to (and convertible to) the Python Pillow imaging library.
-
as_pil_image
()¶ Extract the image as a Pillow Image, using decompression as necessary
Returns: PIL.Image.Image
-
extract_to
(*, stream)¶ Attempt to extract the image directly to a usable image file
If possible, the compressed data is extracted and inserted into a compressed image file format without transcoding the compressed content. If this is not possible, the data will be decompressed and extracted to an appropriate format.
Because it is not known until attempted what image format will be extracted, users should not assume what format they are getting back. When saving the image to a file, use a temporary filename, and then rename the file to its final name based on the returned file extension.
Parameters: stream – Writable stream to write data to Returns: The file format extension Return type: str
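The temporary-file-then-rename advice above might look like the following sketch. The helper name extract_image and its directory and stem arguments are hypothetical, and the sketch assumes the returned extension includes its leading dot (for example ".png").

```python
import os
import tempfile
from pathlib import Path


def extract_image(image, directory, stem):
    """Extract `image` into `directory` as stem + <returned extension>.

    `image` is assumed to expose extract_to(*, stream) as documented
    above and to return an extension with a leading dot.
    """
    directory = Path(directory)
    tmp = tempfile.NamedTemporaryFile(dir=directory, delete=False)
    try:
        with tmp:
            # The format is only known once extraction has been attempted.
            extension = image.extract_to(stream=tmp)
    except Exception:
        os.unlink(tmp.name)  # do not leave a partial file behind
        raise
    final_path = directory / (stem + extension)
    os.replace(tmp.name, final_path)  # rename into place
    return final_path
```

Creating the temporary file in the destination directory keeps the final rename on one filesystem.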
-
get_stream_buffer
()¶ Access this image with the buffer protocol
-
is_inline
¶ False for image XObject
-
read_bytes
()¶ Decompress this image and return it as unencoded bytes
-
show
()¶ Show the image however PIL wants to
-
-
class
pikepdf.
PdfInlineImage
(*, image_data, image_object: tuple)¶ Support class for PDF inline images
-
class
pikepdf.models.
PdfMetadata
(pdf, pikepdf_mark=True, sync_docinfo=True)¶ Read and edit the metadata associated with a PDF
The PDF specification contains two types of metadata, the newer XMP (Extensible Metadata Platform, XML-based) and older DocumentInformation dictionary. The PDF 2.0 specification deprecates the DocumentInformation dictionary.
This primarily works with XMP metadata, but includes methods to generate XMP from DocumentInformation and will also coordinate updates to DocumentInformation so that the two are kept consistent.
XMP metadata fields may be accessed using the full XML namespace URI or the short name. For example metadata['dc:description'] and metadata['{http://purl.org/dc/elements/1.1/}description'] both refer to the same field. Several common XML namespaces are registered automatically.
See the XMP specification for details of allowable fields.
To update metadata, use a with block.
with pdf.open_metadata() as records:
    records['dc:title'] = 'New Title'
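The equivalence between short names and full namespace URIs can be sketched with a small lookup table. This is not pikepdf's implementation; expand_name and PREFIXES are hypothetical names, and only the dc namespace shown above is registered here.

```python
# Prefix table for this illustration only; pikepdf registers its own set.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}


def expand_name(name, prefixes=PREFIXES):
    """Convert 'prefix:local' to '{namespace-uri}local'.

    Names already written in the '{uri}local' form are returned unchanged.
    """
    if name.startswith("{"):
        return name
    prefix, _, local = name.partition(":")
    return "{" + prefixes[prefix] + "}" + local
```

Both spellings of a field reduce to the same expanded key, which is why either can be used as an index.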
-
load_from_docinfo
(docinfo, delete_missing=False)¶ Populate the XMP metadata object with DocumentInfo
A few entries in the deprecated DocumentInfo dictionary are considered approximately equivalent to certain XMP records. This method copies those entries into the XMP metadata.
-
pdfa_status
¶ Returns the PDF/A conformance level claimed by this PDF, or False
A PDF may claim to be PDF/A compliant without this being true. Use an independent verifier such as veraPDF to test if a PDF is truly conformant.
Returns: The conformance level of the PDF/A, or an empty string if the PDF does not claim PDF/A conformance. Possible valid values are: 1A, 1B, 2A, 2B, 2U, 3A, 3B, 3U. Return type: str
-
pdfx_status
¶ Returns the PDF/X conformance level claimed by this PDF, or False
A PDF may claim to be PDF/X compliant without this being true. Use an independent verifier such as veraPDF to test if a PDF is truly conformant.
Returns: The conformance level of the PDF/X, or an empty string if the PDF does not claim PDF/X conformance. Return type: str
-