Python API

Preface

Incompatibility with Threading

PDFium is inherently not thread-safe. It is not allowed to call pdfium functions simultaneously across different threads, not even with different documents. [1] However, you may still use pdfium in a threaded context if it is ensured that only a single pdfium call can be made at a time (e.g. via mutex). It is fine to do pdfium work in one thread and other work in other threads.

The same applies to pypdfium2’s helpers, or any wrapper calling pdfium, whether directly or indirectly, unless protected by mutex.

To parallelize expensive pdfium tasks such as rendering, consider processes instead of threads.
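The mutex approach mentioned above could look roughly like the following sketch. It uses only the standard library; `PDFIUM_LOCK`, `call_serialized` and `fake_pdfium_call` are hypothetical names introduced for illustration, and the stand-in function takes the place of a real pdfium call.

```python
import threading

# One process-wide lock; every call that reaches pdfium would have to hold it.
PDFIUM_LOCK = threading.Lock()


def call_serialized(func, *args, **kwargs):
    """Run func while holding the global lock, so such calls never overlap."""
    with PDFIUM_LOCK:
        return func(*args, **kwargs)


# Stand-in for a pdfium call (hypothetical). It records how many calls overlap.
active = 0
max_active = 0


def fake_pdfium_call(n):
    global active, max_active
    active += 1
    max_active = max(max_active, active)
    result = n * 2
    active -= 1
    return result


results = []


def worker(n):
    results.append(call_serialized(fake_pdfium_call, n))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock only serializes the calls; it does not make them faster. For CPU-bound work such as rendering, separate processes remain the more promising route.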

API layers

pypdfium2 provides multiple API layers:

The raw PDFium API, to be used with ctypes (pypdfium2.raw or pypdfium2_raw [2]).
The support model API, which is a set of Python helper classes around the raw API (pypdfium2).
Additionally, there is the internal API, which contains various utilities that aid with using the raw API and are accessed by helpers, but do not fit in the support model namespace itself (pypdfium2.internal).

Wrapper objects provide a raw attribute to access the underlying ctypes object. In addition, helpers automatically resolve to raw if used as C function parameter. [3] This makes it convenient to use helpers where available, while the raw API remains accessible as needed.
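One mechanism ctypes offers for this kind of automatic resolution is the `_as_parameter_` attribute. The sketch below demonstrates it with the standard library only. The `Wrapper` class and the callback-based `c_add_one` function are hypothetical stand-ins, and whether pypdfium2's helpers are implemented exactly this way is an assumption here.

```python
import ctypes


class Wrapper:
    """Hypothetical helper: keeps a raw value and exposes it to ctypes."""

    def __init__(self, raw):
        self.raw = raw

    @property
    def _as_parameter_(self):
        # ctypes consults this attribute when the object is passed to a foreign function.
        return self.raw


# A C-callable function built from a Python callback, standing in for a pdfium function.
PROTO = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int)
c_add_one = PROTO(lambda value: value + 1)

handle = Wrapper(41)
result_via_wrapper = c_add_one(handle)   # the wrapper resolves to its raw value
result_via_raw = c_add_one(handle.raw)   # explicit raw access makes the same call
```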

The raw API is quite stable and provides a high level of backwards compatibility (seeing as PDFium is well-tested and relied on by popular projects), but it can be difficult to use, and special care needs to be taken with memory management.

The support model API is still in beta stage. It only covers a subset of pdfium features. Backwards incompatible changes may be applied occasionally, although we try to contain them within major releases. On the other hand, it is supposed to be safer and easier to use (“pythonic”), abstracting the finicky interaction with C functions.

Memory management

Note

This section covers the support model. It is not applicable to the raw API alone!

PDFium objects commonly need to be closed by the caller to release allocated memory. [4] Where necessary, pypdfium2’s helper classes implement automatic closing on garbage collection using weakref.finalize. Additionally, they provide close() methods that can be used to release memory explicitly.

It may be advantageous to close objects explicitly instead of relying on Python garbage collection behaviour, to release allocated memory and acquired file handles immediately. [5]

Closed objects must not be accessed anymore. As a safeguard, closing an object sets the underlying raw attribute to None, which should prevent illegal use of closed raw handles. Attempts to re-close an already closed object are silently ignored. Closing a parent object will automatically close any open children (e.g. pages derived from a pdf).

Raw objects must not be detached from their wrappers. Accessing a raw object after it was closed, whether explicitly or on garbage collection of the wrapper, is illegal (use after free). Due to limitations in weakref, finalizers can only be attached to wrapper objects, although they logically belong to the raw object.
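The closing rules described in this section can be illustrated with a small standard-library sketch built on weakref.finalize. This is not pypdfium2's actual implementation. The `Handle` class and the `closed` list are hypothetical and only mimic the documented behaviour: explicit close(), ignored re-close, children closed with the parent, and closing on garbage collection.

```python
import gc
import weakref


class Handle:
    """Hypothetical wrapper owning a raw resource; raw becomes None once closed."""

    def __init__(self, raw, close_func, parent=None):
        self.raw = raw
        self._kids = []
        # The finalizer receives the raw object, not self: a reference to self
        # would keep the wrapper alive forever.
        self._finalizer = weakref.finalize(self, close_func, raw)
        if parent is not None:
            parent._kids.append(weakref.ref(self))

    def close(self):
        if self.raw is None:
            return  # re-closing is silently ignored
        for ref in self._kids:
            kid = ref()
            if kid is not None:
                kid.close()  # children are closed before the parent
        self._kids.clear()
        self._finalizer()  # runs close_func(raw) at most once
        self.raw = None


closed = []  # records the order in which raw objects get released
doc = Handle("doc", closed.append)
page = Handle("page", closed.append, parent=doc)
doc.close()
doc.close()  # no effect

tmp = Handle("tmp", closed.append)
del tmp       # no explicit close(): the finalizer releases it on collection
gc.collect()
```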

Version

Note

Version info can be fooled. See it as orientation rather than inherently reliable data.

PYPDFIUM_INFO = 5.0.0b2+3.g07bc3a9

pypdfium2 helpers version.

It is suggested to compare against api_tag and possibly also beta (see below).

Parameters:

version (str) – Joined tag and desc, forming the full version.
tag (str) – Version ciphers joined as str, including possible beta. Corresponds to the latest release tag at install time.
desc (str) – Non-cipher descriptors represented as str.
api_tag (tuple[int]) – Version ciphers joined as tuple, excluding possible beta.
major (int) – Major cipher.
minor (int) – Minor cipher.
patch (int) – Patch cipher.
beta (int | None) – Beta cipher, or None if not a beta version.
n_commits (int) – Number of commits after tag at install time. 0 for release.
hash (str | None) – Hash of head commit if n_commits > 0, None otherwise.
dirty (bool) – True if there were uncommitted changes at install time, False otherwise.
data_source (str) –
Source of this version info. Possible values:
- git: Parsed from git describe. Always used if available. Highest accuracy.
- given: Pre-supplied version file (e.g. packaged with sdist, or else created by caller).
- record: Parsed from autorelease record. Implies that possible changes after tag are unknown.
is_editable (bool | None) –
True for editable install, False otherwise. None if unknown.

If True, the version info is the one captured at install time. An arbitrary number of forward or reverse changes may have happened since.
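A comparison against api_tag and beta might be sketched as follows. The `info` object is a stand-in that mirrors only the two documented fields, with values taken from the example version string above, and `is_at_least` is a hypothetical helper, not part of pypdfium2.

```python
from types import SimpleNamespace

# Stand-in mirroring the documented api_tag and beta fields (5.0.0b2).
info = SimpleNamespace(api_tag=(5, 0, 0), beta=2)


def is_at_least(info, api_tag, beta=None):
    """True if info describes a version >= the given api_tag (and beta, if any).

    A final release (beta is None) ranks above any beta of the same api_tag.
    """
    if info.api_tag != api_tag:
        return info.api_tag > api_tag  # tuples compare element by element
    if beta is None:
        return info.beta is None  # the final release is required
    return info.beta is None or info.beta >= beta
```

With the stand-in above, `is_at_least(info, (4, 30, 0))` is true, while `is_at_least(info, (5, 0, 0))` is false because 5.0.0b2 precedes the final 5.0.0 release.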

PDFIUM_INFO = 140.0.7323.0

PDFium version.

It is suggested to compare against build (see below).

Parameters:

version (str) – Joined tag and desc, forming the full version.
tag (str) – Version ciphers joined as string.
desc (str) – Descriptors (origin, flags) as string.
api_tag (tuple[int]) – Version ciphers grouped as tuple.
major (int) – Chromium major cipher.
minor (int) – Chromium minor cipher.
build (int) – Chromium/pdfium build cipher. This value uniquely identifies the pdfium version.
patch (int) – Chromium patch cipher.
n_commits (int) – Number of commits after tag at install time. 0 for tagged build commit.
hash (str | None) – Hash of head commit if n_commits > 0, None otherwise.
origin (str) – The pdfium binary’s origin.
flags (tuple[str]) – Tuple of pdfium feature flags. Empty for default build. (V8, XFA) for pdfium-binaries V8 build.

Document

class PdfDocument(input, password=None, autoclose=False)[source]

Bases: AutoCloseable

Document helper class.

Parameters:

input (str | pathlib.Path | bytes | ctypes.Array | BinaryIO | FPDF_DOCUMENT) – The input PDF given as file path, bytes, ctypes array, byte stream, or raw PDFium document handle. A byte stream is defined as an object that implements seek(), tell(), read() and readinto().
password (str | None) – A password to unlock the PDF, if encrypted. Otherwise, None or an empty string may be passed. If a password is given but the PDF is not encrypted, it will be ignored (as of PDFium 5418).
autoclose (bool) – Whether byte stream input should be automatically closed on finalization.

Raises:

PdfiumError – Raised if the document failed to load. The exception is annotated with the reason reported by PDFium (via message and err_code).
FileNotFoundError – Raised if an invalid or non-existent file path was given.

Hint

Documents may be used in a with-block, closing the document on context manager exit. This is recommended when the input is a file path, to safely and immediately release the bound file handle.
len() may be called to get a document’s number of pages.
Pages may be loaded using list index access.
Looping over a document will yield its pages from beginning to end.
The del keyword and list index access may be used to delete pages.
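The hints above might be combined as in the following sketch. `page_sizes` and `summarize_document` are hypothetical helper names. The import sits inside the function so that the block loads even where pypdfium2 is not installed, and only calls documented in this reference are used.

```python
def page_sizes(pdf):
    """Return (width, height) for each page, using len() and get_page_size()."""
    return [pdf.get_page_size(i) for i in range(len(pdf))]


def summarize_document(path, password=None):
    """Open a PDF, gather a few facts, and close it on leaving the with-block."""
    import pypdfium2 as pdfium  # third-party package, imported lazily

    with pdfium.PdfDocument(path, password=password) as pdf:
        n_pages = len(pdf)
        # List index access loads a page.
        first_page = pdf[0] if n_pages > 0 else None
        return {
            "n_pages": n_pages,
            "sizes": page_sizes(pdf),
            "version": pdf.get_version(),
            "first_rotation": None if first_page is None else first_page.get_rotation(),
        }
```

A call such as `summarize_document("example.pdf")` (the file name is a placeholder) would return a plain dict, so nothing derived from the document is accessed after it has been closed.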

raw

The underlying PDFium document handle.

Type:: FPDF_DOCUMENT

formenv

Form env, if the document has forms and init_forms() was called.

Type:: PdfFormEnv | None

property parent

classmethod new()[source]

Returns:: A new, empty document.
Return type:: PdfDocument

init_forms(config=None)[source]

Initialize a form env, if the document has forms. If already initialized, nothing will be done. See the formenv attribute.

Attention

If form rendering is desired, this method shall be called right after document construction, before getting document length or page handles.

Parameters:: config (FPDF_FORMFILLINFO | None) – Custom form config interface to use (optional).

get_formtype()[source]

Returns:: PDFium form type that applies to the document (FORMTYPE_*). FORMTYPE_NONE if the document has no forms.
Return type:: int

get_pagemode()[source]

Returns:: Page displaying mode (PAGEMODE_*).
Return type:: int

is_tagged()[source]

Returns:: Whether the document is tagged (cf. PDF 1.7, 10.7 “Tagged PDF”).
Return type:: bool

save(dest, version=None, flags=0)[source]

Save the document at its current state.

Parameters:

dest (str | pathlib.Path | io.BytesIO) – File path or byte stream the document shall be written to.
version (int | None) – The PDF version to use, given as an integer (14 for 1.4, 15 for 1.5, …). If None (the default), PDFium will set a version automatically.
flags (int) – PDFium saving flags (defaults to 0).
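Since version is passed as an integer, a small conversion helper can make calls easier to read. The sketch below assumes a dotted version string with a single-digit minor part. `pdf_version_to_int` and `save_with_version` are hypothetical names, and `pdf` is expected to be a PdfDocument.

```python
def pdf_version_to_int(version):
    """Convert a dotted PDF version such as "1.7" to the integer form 17."""
    major, minor = version.split(".")
    return int(major) * 10 + int(minor)


def save_with_version(pdf, dest, version=None):
    """Save pdf to dest; version may be None or a string such as "1.7"."""
    target = None if version is None else pdf_version_to_int(version)
    pdf.save(dest, version=target)
```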

get_identifier(type=pdfium_c.FILEIDTYPE_PERMANENT)[source]

Parameters:: type (int) – The identifier type to retrieve (FILEIDTYPE_*), either permanent or changing. If the file was updated incrementally, the permanent identifier stays the same, while the changing identifier is re-calculated.
Returns:: Unique file identifier from the PDF’s trailer dictionary. See PDF 1.7, Section 14.4 “File Identifiers”.
Return type:: bytes

get_version()[source]

Returns:: The PDF version of the document (14 for 1.4, 15 for 1.5, …), or None if the document is new or its version could not be determined.
Return type:: int | None

get_metadata_value(key)[source]

Returns:: Value of the given key in the PDF’s metadata dictionary. If the key is not contained, an empty string will be returned.
Return type:: str

METADATA_KEYS = ('Title', 'Author', 'Subject', 'Keywords', 'Creator', 'Producer', 'CreationDate', 'ModDate')

get_metadata_dict(skip_empty=False)[source]

Get the document’s metadata as dictionary.

Parameters:: skip_empty (bool) – If True, skip items whose value is an empty string.
Returns:: PDF metadata.
Return type:: dict

count_attachments()[source]

Returns:: The number of embedded files in the document.
Return type:: int

get_attachment(index)[source]

Returns:: The attachment at given index (zero-based).
Return type:: PdfAttachment

new_attachment(name)[source]

Add a new attachment to the document. It may appear at an arbitrary index (as of PDFium 5418).

Parameters:: name (str) – The name the attachment shall have. Usually a file name with extension.
Returns:: Handle to the new, empty attachment.
Return type:: PdfAttachment

del_attachment(index)[source]

Unlink the attachment at given index (zero-based). It will be hidden from the viewer, but is still present in the file (as of PDFium 5418). Following attachments shift one slot to the left in the array representation used by PDFium’s API.

Handles to the attachment in question received from get_attachment() must not be accessed anymore after this method has been called.

get_page(index)[source]

Returns:: The page at given index (zero-based).
Return type:: PdfPage

Note

This calls FORM_OnAfterLoadPage() if the document has an active form env. In that case, note that closing the formenv would implicitly close the page.

new_page(width, height, index=None)[source]

Insert a new, empty page into the document.

Parameters:

width (float) – Target page width (horizontal size).
height (float) – Target page height (vertical size).
index (int | None) – Suggested zero-based index at which the page shall be inserted. If None or larger than the document’s current last index, the page will be appended to the end.

Returns:

The newly created page.

Return type:

PdfPage

del_page(index)[source]: Remove the page at given index (zero-based). It is recommended to close any open handles to the page before calling this method.

import_pages(pdf, pages=None, index=None)[source]

Import pages from a foreign document.

Parameters:

pdf (PdfDocument) – The document from which to import pages.
pages (list[int] | str | None) – The pages to include. It may either be a list of zero-based page indices, or a string of one-based page numbers and ranges. If None, all pages will be included.
index (int | None) – Zero-based index at which to insert the given pages. If None, they are appended to the end of the document.

get_page_size(index)[source]

Returns:: Width and height of the page at given index (zero-based), in PDF canvas units.
Return type:: (float, float)

get_page_label(index)[source]

Returns:: Label of the page at given index (zero-based). (A page label is essentially an alias that may be displayed instead of the page number.)
Return type:: str

page_as_xobject(index, dest_pdf)[source]

Capture a page as XObject and attach it to a document’s resources.

Parameters:

index (int) – Zero-based index of the page.
dest_pdf (PdfDocument) – Target document to which the XObject shall be added.

Returns:

The page as XObject.

Return type:

PdfXObject

get_toc(max_depth=15, parent=None, level=0, seen=None)[source]

Iterate through the bookmarks in the document’s table of contents (TOC).

Parameters:: max_depth (int) – Maximum recursion depth to consider.
Yields:: PdfBookmark

class PdfFormEnv(raw, pdf, config)[source]

Bases: AutoCloseable

Form environment helper class.

raw

The underlying PDFium form env handle.

Type:: FPDF_FORMHANDLE

config

Accompanying form configuration interface, to be kept alive.

Type:: FPDF_FORMFILLINFO

pdf

Parent document this form env belongs to.

Type:: PdfDocument

property parent

class PdfXObject(raw, pdf)[source]

Bases: AutoCloseable

XObject helper class.

raw

The underlying PDFium XObject handle.

Type:: FPDF_XOBJECT

pdf

Reference to the document this XObject belongs to.

Type:: PdfDocument

property parent

as_pageobject()[source]

Returns:: An independent pageobject representation of the XObject. If multiple pageobjects are created from an XObject, they share resources. Returned pageobjects remain valid after the XObject is closed.
Return type:: PdfObject

class PdfBookmark(raw, pdf, level)[source]

Bases: AutoCastable

Bookmark helper class.

raw

The underlying PDFium bookmark handle.

Type:: FPDF_BOOKMARK

pdf

Reference to the document this bookmark belongs to.

Type:: PdfDocument

level

The bookmark’s nesting level in the TOC tree (zero-based). Corresponds to the number of parent bookmarks.

Type:: int

get_title()[source]

Returns:: The bookmark’s title string.
Return type:: str

get_count()[source]

Returns:: Signed number of child bookmarks that would be visible if the bookmark were open (i.e. recursively counting children of open children). The bookmark’s initial state is open (expanded) if the number is positive, closed (collapsed) if negative. Zero if the bookmark has no descendants.
Return type:: int

get_dest()[source]

Returns:: The bookmark’s destination (an object providing page index and viewport), or None on failure.
Return type:: PdfDest | None

class PdfDest(raw, pdf)[source]

Bases: AutoCastable

Destination helper class.

raw

The underlying PDFium destination handle.

Type:: FPDF_DEST

pdf

Reference to the document this dest belongs to.

Type:: PdfDocument

get_index()[source]

Returns:: Zero-based index of the page the dest points to, or None on failure.
Return type:: int | None

get_view()[source]

Returns:: A tuple of (view_mode, view_pos). view_mode is a constant (one of PDFDEST_VIEW_*) defining how view_pos shall be interpreted. view_pos is the target position on the page the dest points to. It may contain between 0 and 4 float coordinates, depending on the view mode.
Return type:: (int, list[float])
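The bookmark and destination helpers above can be combined into a simple text outline. In this sketch, `describe_count` and `outline_lines` are hypothetical names, `pdf` is expected to be a PdfDocument, and only methods documented in this section are called.

```python
def describe_count(count):
    """Interpret the signed value documented for get_count()."""
    if count == 0:
        return "no descendants"
    state = "open" if count > 0 else "closed"
    return f"{state}, {abs(count)} visible when expanded"


def outline_lines(pdf, max_depth=15):
    """Build one indented text line per bookmark yielded by pdf.get_toc()."""
    lines = []
    for bm in pdf.get_toc(max_depth=max_depth):
        dest = bm.get_dest()  # may be None on failure
        index = dest.get_index() if dest is not None else None
        page = "?" if index is None else str(index + 1)  # one-based for display
        indent = "    " * bm.level
        lines.append(f"{indent}{bm.get_title()} -> page {page} ({describe_count(bm.get_count())})")
    return lines
```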

Page

class PdfPage(raw, pdf, formenv)[source]

Bases: AutoCloseable

Page helper class.

raw

The underlying PDFium page handle.

Type:: FPDF_PAGE

pdf

Reference to the document this page belongs to.

Type:: PdfDocument

formenv

Formenv handle, if the parent pdf had an active formenv at the time of page retrieval. None otherwise.

Type:: PdfFormEnv | None

property parent

get_width()[source]

Returns:: Page width (horizontal size), in PDF canvas units.
Return type:: float

get_height()[source]

Returns:: Page height (vertical size), in PDF canvas units.
Return type:: float

get_size()[source]

Returns:: Page width and height, in PDF canvas units.
Return type:: (float, float)

get_rotation()[source]

Returns:: Clockwise page rotation in degrees.
Return type:: int

set_rotation(rotation)[source]: Define the absolute, clockwise page rotation (0, 90, 180, or 270 degrees).

get_mediabox(fallback_ok=True)[source]

Returns:: The page MediaBox in PDF canvas units, consisting of four coordinates (usually x0, y0, x1, y1). If MediaBox is not defined, returns ANSI A (0, 0, 612, 792) if fallback_ok=True, None otherwise.
Return type:: (float, float, float, float) | None

Known issue

Due to quirks in PDFium, all get_*box() functions except get_bbox() do not inherit from parent nodes in the page tree (as of PDFium 5418).

set_mediabox(l, b, r, t)[source]: Set the page’s MediaBox by passing four float coordinates (usually x0, y0, x1, y1).

get_cropbox(fallback_ok=True)[source]

Returns:: The page’s CropBox (if not defined, falls back to MediaBox).

set_cropbox(l, b, r, t)[source]: Set the page’s CropBox.

get_bleedbox(fallback_ok=True)[source]

Returns:: The page’s BleedBox (if not defined, falls back to CropBox).

set_bleedbox(l, b, r, t)[source]: Set the page’s BleedBox.

get_trimbox(fallback_ok=True)[source]

Returns:: The page’s TrimBox (if not defined, falls back to CropBox).

set_trimbox(l, b, r, t)[source]: Set the page’s TrimBox.

get_artbox(fallback_ok=True)[source]

Returns:: The page’s ArtBox (if not defined, falls back to CropBox).

set_artbox(l, b, r, t)[source]: Set the page’s ArtBox.
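The fallback chain of the box getters can be modelled in a few lines. This sketch works on a plain dict of the boxes defined on a page rather than on a real PdfPage. `resolve_box` and `FALLBACK` are hypothetical names, and returning None for every box when fallback_ok is False is an assumption extrapolated from the get_mediabox() description.

```python
DEFAULT_MEDIABOX = (0, 0, 612, 792)  # fallback documented for get_mediabox()

# The box each page box falls back to when it is not defined.
FALLBACK = {
    "CropBox": "MediaBox",
    "BleedBox": "CropBox",
    "TrimBox": "CropBox",
    "ArtBox": "CropBox",
}


def resolve_box(defined, name, fallback_ok=True):
    """Resolve box `name` from `defined`, a dict of box name -> 4 coordinates."""
    if name in defined:
        return defined[name]
    if not fallback_ok:
        return None
    if name == "MediaBox":
        return DEFAULT_MEDIABOX
    return resolve_box(defined, FALLBACK[name], fallback_ok)
```

For example, a page that only defines a CropBox resolves its TrimBox to that CropBox, and a page that defines nothing resolves every box to the default MediaBox.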

get_bbox()[source]

Returns:: The bounding box of the page (the intersection between its media box and crop box).

get_textpage()[source]

Returns:: A new text page handle for this page.
Return type:: PdfTextPage

insert_obj(pageobj)[source]

Insert a pageobject into the page.

The pageobject must not belong to a page yet. If it belongs to a PDF, the target page must be part of that PDF.

Position and form are defined by the object’s matrix. If it is the identity matrix, the object will appear as-is on the bottom left corner of the page.

Parameters:: pageobj (PdfObject) – The pageobject to insert.

remove_obj(pageobj)[source]

Remove a pageobject from the page. As of PDFium 5692, detached pageobjects may only be re-inserted into existing pages of the same document. If the pageobject is not re-inserted into a page, its close() method may be called.

Note

If the object’s type is FPDF_PAGEOBJ_TEXT, any PdfTextPage handles to the page should be closed before removing the object.

Parameters:: pageobj (PdfObject) – The pageobject to remove.

gen_content()[source]

Generate page content to apply additions, removals or modifications of pageobjects.

If page content was changed, this function should be called once before saving the document or re-loading the page.

get_objects(filter=None, max_depth=15, form=None, level=0)[source]

Iterate through the pageobjects on this page.

Parameters:

filter (list[int] | None) – An optional list of pageobject types to filter (FPDF_PAGEOBJ_*). Any objects whose type is not contained will be skipped. If None or empty, all objects will be provided, regardless of their type.
max_depth (int) – Maximum recursion depth to consider when descending into Form XObjects.

Yields:

PdfObject – A pageobject.

flatten(flag=pdfium_c.FLAT_NORMALDISPLAY)[source]

Flatten form fields and annotations into page contents.

Attention

init_forms() must have been called on the parent pdf, before the page was retrieved, for this method to work. In other words, PdfPage.formenv must be non-null.
Flattening may invalidate existing handles to the page, so you’ll want to re-initialize these afterwards.

Parameters:: flag (int) – PDFium flattening target (FLAT_*)
Returns:: PDFium flattening status (FLATTEN_*). FLATTEN_FAIL is handled internally.
Return type:: int

render(scale=1, rotation=0, crop=(0, 0, 0, 0), may_draw_forms=True, bitmap_maker=PdfBitmap.new_native, color_scheme=None, fill_to_stroke=False, **kwargs)[source]

Rasterize the page to a PdfBitmap.

Parameters:

scale (float) – A factor scaling the number of pixels per PDF canvas unit. This defines the resolution of the image. To convert a DPI value to a scale factor, multiply it by the size of 1 canvas unit in inches (usually 1/72in). [6]
rotation (int) – Additional rotation in degrees (0, 90, 180, or 270).
crop (tuple[float, float, float, float]) – Amount in PDF canvas units to cut off from page borders (left, bottom, right, top). Crop is applied after rotation.
may_draw_forms (bool) – If True, render form fields (provided the document has forms and init_forms() was called).
bitmap_maker (Callable) – Callback function used to create the PdfBitmap.
fill_color (tuple[int, int, int, int]) – Color the bitmap will be filled with before rendering. This uses RGBA syntax regardless of the pixel format used, with values from 0 to 255. If the fill color is not opaque (i.e. has transparency), {BGR,RGB}A will be used.
grayscale (bool) – If True, render in grayscale mode.
optimize_mode (None | str) – Page rendering optimization mode (None, “lcd”, “print”).
draw_annots (bool) – If True, render page annotations.
no_smoothtext (bool) – If True, disable text anti-aliasing. Overrides optimize_mode="lcd".
no_smoothimage (bool) – If True, disable image anti-aliasing.
no_smoothpath (bool) – If True, disable path anti-aliasing.
force_halftone (bool) – If True, always use halftone for image stretching.
limit_image_cache (bool) – If True, limit image cache size.
rev_byteorder (bool) – If True, render with reverse byte order, leading to RGB{A/x} output rather than BGR{A/x}. Other pixel formats are not affected.
prefer_bgrx (bool) – If True, use 4-byte {BGR/RGB}x rather than 3-byte {BGR/RGB} (i.e. add an unused byte). Other pixel formats are not affected.
use_bgra_on_transparency (bool) – If True, use a pixel format with alpha channel (i.e. {BGR/RGB}A) if page content has transparency. This is recommended for performance in these cases, but as page-dependent format selection is somewhat unexpected, it is not enabled by default.
force_bitmap_format (int | None) – If given, override automatic pixel format selection and enforce use of the given format (one of the FPDFBitmap_* constants). In this case, you should not pass any other format selection options, except potentially rev_byteorder.
extra_flags (int) – Additional PDFium rendering flags. May be combined with bitwise OR (| operator).
color_scheme (PdfColorScheme | None) – A custom pdfium color scheme. Note that this may flatten different colors into one, so the usability of this is limited.
fill_to_stroke (bool) – If a color_scheme is given, whether to only draw borders around fill areas using the path_stroke color, instead of filling with the path_fill color.

Returns:

Bitmap of the rendered page.

Return type:

PdfBitmap
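The DPI conversion described for the scale parameter can be wrapped in a helper. In this sketch, `dpi_to_scale` and `render_at_dpi` are hypothetical names, the default of 72 canvas units per inch reflects the usual case mentioned above, and `page` is expected to be a PdfPage.

```python
def dpi_to_scale(dpi, units_per_inch=72):
    """Scale factor for a target DPI, given the number of canvas units per inch."""
    return dpi / units_per_inch


def render_at_dpi(page, dpi, **kwargs):
    """Render page at roughly the given DPI; extra kwargs are passed to render()."""
    return page.render(scale=dpi_to_scale(dpi), **kwargs)
```

A call such as `render_at_dpi(page, 300, grayscale=True)` would then correspond to a scale of about 4.17.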

Format selection

This is the format selection hierarchy used by render(), from lowest to highest priority:

default: BGR
prefer_bgrx=True: BGRx
grayscale=True: L
use_bgra_on_transparency=True: BGRA if the page has transparency, else the format selected otherwise
fill_color[3] < 255: BGRA (background color with transparency)
force_bitmap_format=...: any format supported by pdfium

Additionally, rev_byteorder will swap BGR{A/x} to RGB{A/x} if applicable.
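The hierarchy above can be read as a chain of overrides, applied from lowest to highest priority. The following sketch models it with format names as strings. `select_format` and `page_has_transparency` are hypothetical, the opaque white default for fill_color is an assumption, and the real method works with FPDFBitmap_* integer constants rather than strings.

```python
def select_format(fill_color=(255, 255, 255, 255), grayscale=False,
                  prefer_bgrx=False, use_bgra_on_transparency=False,
                  page_has_transparency=False, force_bitmap_format=None,
                  rev_byteorder=False):
    """Return the pixel format name chosen by the documented priority order."""
    fmt = "BGR"                                   # default
    if prefer_bgrx:
        fmt = "BGRx"
    if grayscale:
        fmt = "L"
    if use_bgra_on_transparency and page_has_transparency:
        fmt = "BGRA"
    if fill_color[3] < 255:                       # background with transparency
        fmt = "BGRA"
    if force_bitmap_format is not None:           # highest priority
        fmt = force_bitmap_format
    if rev_byteorder and fmt.startswith("BGR"):   # swap BGR{A/x} to RGB{A/x}
        fmt = "RGB" + fmt[3:]
    return fmt
```

Later checks overwrite earlier ones, so a transparent fill color wins over grayscale, and grayscale wins over prefer_bgrx.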

class PdfColorScheme(path_fill, path_stroke, text_fill, text_stroke)[source]

Bases: object

Rendering color scheme. Each color shall be provided as a list of values for red, green, blue and alpha, ranging from 0 to 255.

convert(rev_byteorder)[source]

Returns:: The color scheme as FPDF_COLORSCHEME object.

Pageobjects

class PdfObject(raw, *args, **kwargs)[source]

Bases: AutoCloseable

Pageobject helper class.

When constructing a PdfObject, an instance of a more specific subclass may be returned instead, depending on the object’s type (e.g. PdfImage).

Note

PdfObject.close() only takes effect on loose pageobjects. It is a no-op otherwise, because pageobjects that are part of a page are owned by pdfium, not the caller.

raw

The underlying PDFium pageobject handle.

Type:: FPDF_PAGEOBJECT

type

The object’s type (FPDF_PAGEOBJ_*).

Type:: int

page

Reference to the page this pageobject belongs to. May be None if not part of a page (e.g. new or detached object).

Type:: PdfPage

pdf

Reference to the document this pageobject belongs to. May be None if the object does not belong to a document yet. This attribute is always set if page is set.

Type:: PdfDocument

container

PdfObject handle to parent Form XObject, if the pageobject is nested in a Form XObject, None otherwise.

Type:: PdfObject | None

level

Nesting level signifying the number of parent Form XObjects, at the time of construction. Zero if the object is not nested in a Form XObject.

Type:: int

property parent

get_bounds()[source]

Get the bounds of the object on the page.

Returns:: Left, bottom, right and top, in PDF page coordinates.
Return type:: tuple[float * 4]

get_quad_points()[source]

Get the object’s quadrilateral points (i.e. the positions of its corners). For transformed objects, this may provide tighter bounds than a rectangle (e.g. rotation by a non-multiple of 90°, shear).

Note

This function only supports image and text objects.

Returns:: Corner positions as (x, y) tuples, counter-clockwise from origin, i.e. bottom-left, bottom-right, top-right, top-left, in PDF page coordinates.
Return type:: tuple[tuple[float*2] * 4]
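To see why quad points can be tighter than a rectangle, compare the area of a rotated square with the area of its axis-aligned bounding rectangle. The sketch is pure geometry on tuples shaped like the documented return value. All function names and the sample coordinates are made up for illustration.

```python
def bounding_rect(quad):
    """Axis-aligned (left, bottom, right, top) enclosing the four corners."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return (min(xs), min(ys), max(xs), max(ys))


def rect_area(rect):
    left, bottom, right, top = rect
    return (right - left) * (top - bottom)


def quad_area(quad):
    """Shoelace formula for a simple polygon given as (x, y) corner tuples."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(quad, quad[1:] + quad[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2


# A square rotated by 45 degrees, corners listed counter-clockwise.
rotated_square = ((1, 0), (2, 1), (1, 2), (0, 1))
tight = quad_area(rotated_square)
loose = rect_area(bounding_rect(rotated_square))
```

Here the bounding rectangle covers twice the area actually occupied by the object.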

get_matrix()[source]

Returns:: The pageobject’s current transform matrix.
Return type:: PdfMatrix

set_matrix(matrix)[source]

Parameters:: matrix (PdfMatrix) – Set this matrix as the pageobject’s transform matrix.

transform(matrix)[source]

Parameters:: matrix (PdfMatrix) – Multiply the pageobject’s current transform matrix by this matrix.

class PdfImage(raw, *args, **kwargs)[source]

Bases: PdfObject

Image object helper class (specific kind of pageobject).

SIMPLE_FILTERS = ('ASCIIHexDecode', 'ASCII85Decode', 'RunLengthDecode', 'FlateDecode', 'LZWDecode'): Filters applied by FPDFImageObj_GetImageDataDecoded(), referred to as “simple filters”. Other filters are considered “complex filters”.

classmethod new(pdf)[source]

Parameters:: pdf (PdfDocument) – The document to which the new image object shall be added.
Returns:: Handle to a new, empty image. Note that position and size of the image are defined by its matrix, which defaults to the identity matrix. This means that new images will appear as a tiny square of 1x1 canvas units on the bottom left corner of the page. Use PdfMatrix and set_matrix() to adjust size and position.
Return type:: PdfImage

get_metadata()[source]

Retrieve image metadata including DPI, bits per pixel, color space, and size. If the image does not belong to a page yet, bits per pixel and color space will be unset (0).

Note

The DPI values signify the resolution of the image on the PDF page, not the DPI metadata embedded in the image file.
Due to issues in pdfium, this function might be slow on some kinds of images. If you only need size, prefer get_px_size() instead.

Returns:: Image metadata structure
Return type:: FPDF_IMAGEOBJ_METADATA

get_px_size()[source]

Returns:: Image dimensions as a tuple of (width, height).
Return type:: (int, int)

load_jpeg(source, pages=None, inline=False, autoclose=True)[source]

Set a JPEG as the image object’s content.

Parameters:

source (str | pathlib.Path | BinaryIO) – Input JPEG, given as file path or readable byte stream.
pages (list[PdfPage] | None) – If replacing an image, pass in a list of loaded pages that might contain it, to update their cache. (The same image may be shown multiple times in different transforms across a PDF.) May be None or an empty sequence if the image is not shared.
inline (bool) – Whether to load the image content into memory. If True, the buffer may be closed after this function call. Otherwise, the buffer needs to remain open until the PDF is closed.
autoclose (bool) – If the input is a buffer, whether it should be automatically closed once not needed by the PDF anymore.

set_bitmap(bitmap, pages=None)[source]

Set a bitmap as the image object’s content. The pixel data will be flate compressed (as of PDFium 5418).

Parameters:

bitmap (PdfBitmap) – The bitmap to inject into the image object.
pages (list[PdfPage] | None) – A list of loaded pages that might contain the image object. See load_jpeg().

get_bitmap(render=False, scale_to_original=True)[source]

Get a bitmap rasterization of the image.

Parameters:

render (bool) – Whether the image should be rendered, thereby applying possible transform matrices and alpha masks.
scale_to_original (bool) – If render is True, whether to temporarily scale the image to its native resolution, or close to that (defaults to True). This should improve output quality. Ignored if render is False.

Returns:

Image bitmap (with a buffer allocated by PDFium).

Return type:

PdfBitmap

get_data(decode_simple=False)[source]

Parameters:: decode_simple (bool) – If True, decode simple filters (see SIMPLE_FILTERS), so only complex filters will remain, if any. If there are no complex filters, this provides the decoded pixel data. If False, the raw stream data will be returned instead.
Returns:: The data of the image stream (as c_ubyte array).
Return type:: ctypes.Array

get_filters(skip_simple=False)[source]

Parameters:: skip_simple (bool) – If True, exclude simple filters.
Returns:: A list of image filters, to be applied in order (from lowest to highest index).
Return type:: list[str]

extract(dest, *args, **kwargs)[source]

Extract the image into an independently usable file or byte stream, attempting to avoid re-encoding or quality loss, as far as pdfium’s limited API permits.

This method can only extract DCTDecode (JPEG) and JPXDecode (JPEG 2000) images directly. Otherwise, the pixel data is decoded and re-encoded using PIL, which is slower and loses the original encoding. For images with simple filters only, get_data(decode_simple=True) is used to preserve higher bit depth or special color formats not supported by FPDF_BITMAP. For images with complex filters other than those extracted directly, we have to resort to get_bitmap().

Note, this method is not able to account for alpha masks, and potentially other data stored separately from the main image stream, which might lead to incorrect representation of the image.

Tip

The pikepdf library is capable of preserving the original encoding in many cases where this method is not.

Parameters:

dest (str | pathlib.Path | io.BytesIO) – File path prefix or byte stream to which the image shall be written.
fb_format (str) – The image format to use in case it is necessary to (re-)encode the data.
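The decision logic described for extract() can be approximated from a filter list such as the one returned by get_filters(). This is a simplified reading of the description, not the library's code. `extraction_strategy`, `DIRECT_FILTERS` and the returned labels are hypothetical, and treating a JPEG stream preceded by simple filters as directly extractable is an assumption.

```python
SIMPLE_FILTERS = ("ASCIIHexDecode", "ASCII85Decode", "RunLengthDecode",
                  "FlateDecode", "LZWDecode")
DIRECT_FILTERS = ("DCTDecode", "JPXDecode")  # JPEG and JPEG 2000


def extraction_strategy(filters):
    """Pick an extraction path from a list of image filter names."""
    complex_filters = [f for f in filters if f not in SIMPLE_FILTERS]
    if not complex_filters:
        # Simple filters only: decoded pixel data via get_data(decode_simple=True).
        return "decode-simple"
    if len(complex_filters) == 1 and complex_filters[0] in DIRECT_FILTERS:
        # The stream is a JPEG or JPEG 2000 file that can be written out as-is.
        return "direct"
    # Any other complex filter: fall back to get_bitmap() and re-encode.
    return "bitmap"
```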

Text Page

class PdfTextPage(raw, page)[source]

Bases: AutoCloseable

Text page helper class.

Hint

(py)pdfium itself does not implement layout analysis, such as detecting words/lines/paragraphs. However, there may be third-party extensions for this job, e.g.: https://github.com/VikParuchuri/pdftext

raw

The underlying PDFium textpage handle.

Type:: FPDF_TEXTPAGE

page

Reference to the page this textpage belongs to.

Type:: PdfPage

property parent

get_text_bounded(left=None, bottom=None, right=None, top=None, errors='ignore')[source]

Extract text from given boundaries, in PDF canvas units. If a boundary value is None, it defaults to the corresponding value of PdfPage.get_bbox().

Parameters:: errors (str) – Error treatment when decoding the data (see bytes.decode()).
Returns:: The text on the page area in question, or an empty string if no text was found.
Return type:: str

get_text_range(index=0, count=-1, errors='ignore')[source]

Extract text from a given range.

Warning

This method is limited to UCS-2, whereas get_text_bounded() provides full Unicode support.

Parameters:

index (int) – Index of the first char to include.
count (int) – Number of chars to cover, relative to the internal char list. Defaults to -1 for all remaining chars after index.
errors (str) – Error handling when decoding the data (see bytes.decode()).

Returns:

The text in the range in question, or an empty string if no text was found.

Return type:

str

Note

The length of the returned text is not guaranteed to match count, although it will for most PDFs. The underlying API may exclude or insert chars relative to the internal char list, though this is rare in practice. For example, if the char at i is excluded, get_text_range(i, 2)[1] will raise an IndexError. Pdfium provides the raw APIs FPDFText_GetTextIndexFromCharIndex() / FPDFText_GetCharIndexFromTextIndex() to translate between the two views and identify excluded/inserted chars.
In case of leading/trailing excluded characters, pypdfium2 modifies index and count accordingly to prevent pdfium from unexpectedly reading beyond range(index, index+count).
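The pitfall described in this note can be reproduced without pdfium. In the sketch below, FakeTextPage is a hypothetical stand-in that pretends the char at index 1 is excluded, and char_at is an illustrative helper (not a pypdfium2 API) that avoids indexing into a possibly shorter result.

```python
class FakeTextPage:
    """Hypothetical stand-in for PdfTextPage; None marks an excluded char."""

    _chars = ["a", None, "b"]

    def get_text_range(self, index=0, count=-1, errors="ignore"):
        end = len(self._chars) if count == -1 else index + count
        return "".join(c for c in self._chars[index:end] if c is not None)


def char_at(textpage, i):
    """Return the text of char i, or None if nothing was returned for it."""
    text = textpage.get_text_range(i, 1)
    return text if text else None


page = FakeTextPage()
print(char_at(page, 0), char_at(page, 1), char_at(page, 2))
# Two chars were requested, but only one came back:
print(len(page.get_text_range(1, 2)))
```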

count_chars()[source]

Returns:: The number of characters on the text page.
Return type:: int

count_rects(index=0, count=-1)[source]

Parameters:

index (int) – Start character index.
count (int) – Character count to consider (defaults to -1 for all remaining).

Returns:

The number of text rectangles in the given character range.

Return type:

int

get_index(x, y, x_tol, y_tol)[source]

Get the index of a character by position.

Parameters:

x (float) – Horizontal position (in PDF canvas units).
y (float) – Vertical position.
x_tol (float) – Horizontal tolerance.
y_tol (float) – Vertical tolerance.

Returns:

The index of the character at or near the point (x, y). May be None if there is no character. If an internal error occurred, an exception will be raised.

Return type:

int | None

get_charbox(index, loose=False)[source]

Get the bounding box of a single character.

Parameters:

index (int) – Index of the character to work with, in the page’s character array.
loose (bool) – Get a more comprehensive box covering the entire font bounds, as opposed to the default tight box specific to the one character.

Returns:

Float values for left, bottom, right and top in PDF canvas units.

get_rect(index)[source]

Get the bounding box of a text rectangle at the given index.

Attention

count_rects() must be called once with default params before subsequent get_rect() calls for this function to work.

Returns:: Float values for left, bottom, right and top in PDF canvas units.
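The call order required above can be wrapped in a small generator. This is a sketch only: iter_rects is a hypothetical helper, and FakeTextPage is a stand-in whose refusal to serve get_rect() before count_rects() is an assumption made for illustration, not a statement about how pypdfium2 fails.

```python
def iter_rects(textpage):
    """Call count_rects() once with default params, then yield every rect."""
    n_rects = textpage.count_rects()
    for i in range(n_rects):
        yield textpage.get_rect(i)


class FakeTextPage:
    """Hypothetical stand-in that enforces the documented call order."""

    def __init__(self, rects):
        self._rects = rects
        self._counted = False

    def count_rects(self, index=0, count=-1):
        self._counted = True
        return len(self._rects)

    def get_rect(self, index):
        if not self._counted:
            raise RuntimeError("count_rects() must be called first")
        return self._rects[index]


rects = list(iter_rects(FakeTextPage([(72.0, 700.0, 300.0, 712.0),
                                      (72.0, 686.0, 250.0, 698.0)])))
print(rects)
```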

search(text, index=0, match_case=False, match_whole_word=False, consecutive=False, flags=0)[source]

Locate text on the page.

Parameters:

text (str) – The string to search for.
index (int) – Character index at which to start searching.
match_case (bool) – If True, the search will be case-sensitive (upper and lower case letters are treated as different characters).
match_whole_word (bool) – If True, substring occurrences will be ignored (e.g. cat would not match category).
consecutive (bool) – If False (the default), search() will skip past the current match to look for the next match. If True, parts of the previous match may be caught again (e.g. searching for aa in aaaa would match 3 rather than 2 times).
flags (int) – Passthrough of raw pdfium searching flags. Note that you may want to use the boolean options instead.

Returns:

A helper object to search text.

Return type:

PdfTextSearcher
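The difference made by the consecutive option can be illustrated on plain Python strings. The function find_all below is a hypothetical helper that mimics the described matching behaviour with str.find(); it does not call pdfium.

```python
def find_all(haystack, needle, consecutive=False):
    """Return the start indices of needle in haystack."""
    if not needle:
        raise ValueError("needle must not be empty")
    hits, pos = [], 0
    while True:
        pos = haystack.find(needle, pos)
        if pos == -1:
            return hits
        hits.append(pos)
        # Advance by one char to allow overlaps, else skip the whole match.
        pos += 1 if consecutive else len(needle)


print(find_all("aaaa", "aa"))                    # skips past each match
print(find_all("aaaa", "aa", consecutive=True))  # overlapping matches allowed
```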

class PdfTextSearcher(raw, textpage)[source]

Bases: AutoCloseable

Text searcher helper class.

raw

The underlying PDFium searcher handle.

Type:: FPDF_SCHHANDLE

textpage

Reference to the textpage this searcher belongs to.

Type:: PdfTextPage

property parent

get_next()[source]

Returns:: Start character index and count of the next occurrence, or None if the last occurrence was passed.
Return type:: (int, int) | None

get_prev()[source]

Returns:: Start character index and count of the previous occurrence (i. e. the one before the last valid occurrence), or None if the last occurrence was passed.
Return type:: (int, int) | None
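Since get_next() signals exhaustion by returning None, all occurrences can be collected with the two-argument form of iter(). The sketch below uses FakeSearcher, a hypothetical stand-in for PdfTextSearcher, so that it runs without a document.

```python
def iter_matches(searcher):
    """Yield (index, count) tuples until get_next() returns None."""
    yield from iter(searcher.get_next, None)


class FakeSearcher:
    """Hypothetical stand-in that replays a fixed list of occurrences."""

    def __init__(self, matches):
        self._matches = iter(matches)

    def get_next(self):
        return next(self._matches, None)


matches = list(iter_matches(FakeSearcher([(0, 3), (10, 3), (42, 3)])))
print(matches)
```

Each (index, count) pair could then be passed on to methods that take a character range, such as get_text_range() or count_rects().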

Bitmap

class PdfBitmap(raw, buffer, width, height, stride, format, rev_byteorder, needs_free)[source]

Bases: AutoCloseable

Bitmap helper class.

Warning

bitmap.close(), which frees the buffer of foreign bitmaps, is not validated for safety. A bitmap must not be closed while other objects still depend on its buffer!

raw

The underlying PDFium bitmap handle.

Type:: FPDF_BITMAP

buffer

A ctypes array representation of the pixel data (each item is an unsigned byte, i. e. a number ranging from 0 to 255).

Type:: c_ubyte

width

Width of the bitmap (horizontal size).

Type:: int

height

Height of the bitmap (vertical size).

Type:: int

stride

Number of bytes per line in the bitmap buffer. Depending on how the bitmap was created, there may be a padding of unused bytes at the end of each line, so this value can be greater than width * n_channels.

Type:: int
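To show why stride rather than width * n_channels has to be used when addressing pixels, here is a minimal sketch on a plain ctypes array. The sizes are made up for illustration, and pixel_offset is a hypothetical helper, not a pypdfium2 function.

```python
import ctypes


def pixel_offset(x, y, stride, n_channels):
    """Byte offset of pixel (x, y) in a buffer with the given stride."""
    return y * stride + x * n_channels


width, height, n_channels = 3, 2, 4
stride = 16  # 12 bytes of pixel data plus 4 bytes of padding per line

buffer = (ctypes.c_ubyte * (stride * height))()

# Write one 4-channel pixel at column 2 of the second line.
offset = pixel_offset(2, 1, stride, n_channels)
buffer[offset:offset + n_channels] = (255, 0, 0, 255)

print(offset, list(buffer[offset:offset + n_channels]))
```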

format

PDFium bitmap format constant (FPDFBitmap_*)

Type:: int

rev_byteorder

Whether the bitmap is using reverse byte order.

Type:: bool

n_channels

Number of channels per pixel.

Type:: int

mode

The bitmap format as string (see PIL Modes).

Type:: str

property parent

classmethod from_raw(raw, rev_byteorder=False, ex_buffer=None)[source]

Construct a PdfBitmap wrapper around a raw PDFium bitmap handle.

Note

This method is primarily meant for bitmaps provided by pdfium (as in PdfImage.get_bitmap()). For bitmaps created by the caller, where the parameters are already known, it may be preferable to call the PdfBitmap constructor directly.

Parameters:

raw (FPDF_BITMAP) – PDFium bitmap handle.
rev_byteorder (bool) – Whether the bitmap uses reverse byte order.
ex_buffer (c_ubyte | None) – If the bitmap was created from a buffer allocated by Python/ctypes, pass in the ctypes array to keep it referenced.

classmethod new_native(width, height, format, rev_byteorder=False, buffer=None, stride=None)[source]

Create a new bitmap using FPDFBitmap_CreateEx(), with a buffer allocated by Python/ctypes, or provided by the caller.

If buffer and stride are None, a packed buffer is created.
If a custom buffer is given but no stride, the buffer is assumed to be packed.
If a custom stride is given but no buffer, a stride-agnostic buffer is created.
If both custom buffer and stride are given, they are used as-is.

A caller-provided buffer or stride is checked for logical consistency.
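The four buffer/stride cases can be sketched as follows. This is one plausible reading, not the library's code: resolve_native_buffer is a hypothetical helper, the two checks are assumptions about what a consistency check might cover, and a buffer created for a custom stride is assumed to hold stride * height bytes.

```python
import ctypes


def resolve_native_buffer(width, height, n_channels, buffer=None, stride=None):
    """Return (buffer, stride) for the four cases described above."""
    packed = width * n_channels
    if stride is None:
        stride = packed  # no stride given: assume a packed layout
    elif stride < packed:  # assumed check: a line must fit into its stride
        raise ValueError("stride is smaller than width * n_channels")
    needed = stride * height
    if buffer is None:
        buffer = (ctypes.c_ubyte * needed)()  # zero-initialised ctypes array
    elif len(buffer) < needed:  # assumed check: buffer must hold all lines
        raise ValueError("buffer is too small for stride * height")
    return buffer, stride


packed_buf, packed_stride = resolve_native_buffer(3, 2, 4)
padded_buf, padded_stride = resolve_native_buffer(3, 2, 4, stride=16)
print(len(packed_buf), packed_stride, len(padded_buf), padded_stride)
```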

classmethod new_foreign(width, height, format, rev_byteorder=False, force_packed=False)[source]

Create a new bitmap using FPDFBitmap_CreateEx(), with a buffer allocated by PDFium. There may be a padding of unused bytes at line end, unless force_packed=True is given.

Note, the recommended default bitmap creation strategy is new_native().

classmethod new_foreign_simple(width, height, use_alpha, rev_byteorder=False)[source]

Create a new bitmap using FPDFBitmap_Create(). The buffer is allocated by PDFium.

PDFium docs specify that each line uses width * 4 bytes, with no gap between adjacent lines, i.e. the resulting buffer should be packed.

Contrary to the other PdfBitmap.new_*() methods, this method does not take a format constant, but a use_alpha boolean. If True, the format will be FPDFBitmap_BGRA, otherwise FPDFBitmap_BGRx. Other bitmap formats cannot be used with this method.

Note, the recommended default bitmap creation strategy is new_native().

fill_rect(color, left, top, width, height)[source]

Fill a rectangle on the bitmap with the given color. The coordinate system’s origin is the top left corner of the image.

Note

This function replaces the color values in the given rectangle. It does not perform alpha compositing.

Parameters:: color (tuple[int, int, int, int]) – RGBA fill color (a tuple of 4 integers ranging from 0 to 255).
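On the raw side, FPDFBitmap_FillRect() takes its color as a single 32-bit ARGB value, so an RGBA tuple has to be packed before it reaches pdfium. The sketch below shows such a packing for the non-reversed byte order only; pack_argb is a hypothetical helper, and how rev_byteorder is handled is deliberately left out.

```python
def pack_argb(color):
    """Pack an (R, G, B, A) tuple of 0..255 integers into a 32-bit ARGB value."""
    if len(color) != 4 or not all(isinstance(c, int) and 0 <= c <= 255 for c in color):
        raise ValueError("color must consist of 4 integers ranging from 0 to 255")
    r, g, b, a = color
    return (a << 24) | (r << 16) | (g << 8) | b


opaque_red = pack_argb((255, 0, 0, 255))
print(hex(opaque_red))
```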

to_numpy()[source]

Get a numpy array view of the bitmap.

The array contains as many rows as the bitmap is high. Each row contains as many pixels as the bitmap is wide. Each pixel will be an array holding the channel values, or just a value if there is only one channel (see n_channels and format).

The resulting array is supposed to share memory with the original bitmap buffer, so changes to the buffer should be reflected in the array, and vice versa.

Returns:: NumPy array (representation of the bitmap buffer).
Return type:: numpy.ndarray

to_pil()[source]

Get a PIL image of the bitmap, using PIL.Image.frombuffer().

For RGBA, RGBX and L bitmaps, PIL is supposed to share memory with the original buffer, so changes to the buffer should be reflected in the image, and vice versa. Otherwise, PIL will make a copy of the data.

Returns:: PIL image (representation or copy of the bitmap buffer).
Return type:: PIL.Image.Image

classmethod from_pil(pil_image)[source]

Convert a PIL image to a PDFium bitmap. Due to the limited number of color formats and bit depths supported by FPDF_BITMAP, this may be a lossy operation.

Bitmaps returned by this function should be treated as immutable.

Parameters:: pil_image (PIL.Image.Image) – The image.
Returns:: PDFium bitmap (with a copy of the PIL image’s data).
Return type:: PdfBitmap

get_posconv(page)[source]

Acquire a PdfPosConv object to translate between coordinates on the bitmap and the page it was rendered from.

This method requires passing in the page explicitly, to avoid holding a strong reference, so that bitmap and page can be independently freed by finalizer.

class PdfPosConv(page, pos_args)[source]

Bases: object

PDF coordinate translator.

Hint

You may want to use PdfBitmap.get_posconv() to obtain an instance of this class.

Parameters:

page (PdfPage) – Handle to the page.
pos_args (tuple[int*5]) – pdfium canvas args (start_x, start_y, size_x, size_y, rotate), as in FPDF_RenderPageBitmap() etc.

to_page(bitmap_x, bitmap_y)[source]: Translate coordinates from bitmap to page.

to_bitmap(page_x, page_y)[source]: Translate coordinates from page to bitmap.
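The idea behind the translation can be sketched for the simplest case. The functions below are hypothetical and assume start_x = start_y = 0, no rotation, a page whose origin is the bottom left corner, and a bitmap whose origin is the top left corner. They return floats, whereas the real conversion goes through pdfium and may round differently.

```python
def to_bitmap(page_x, page_y, page_size, bitmap_size):
    """Page coordinates (origin bottom left) to bitmap coordinates (origin top left)."""
    (page_w, page_h), (size_x, size_y) = page_size, bitmap_size
    return (page_x * size_x / page_w, (page_h - page_y) * size_y / page_h)


def to_page(bitmap_x, bitmap_y, page_size, bitmap_size):
    """Bitmap coordinates (origin top left) to page coordinates (origin bottom left)."""
    (page_w, page_h), (size_x, size_y) = page_size, bitmap_size
    return (bitmap_x * page_w / size_x, page_h - bitmap_y * page_h / size_y)


page_size = (612, 792)       # page size in PDF canvas units
bitmap_size = (1224, 1584)   # rendered at scale 2

print(to_bitmap(0, 792, page_size, bitmap_size))   # top left of the page
print(to_page(612, 792, page_size, bitmap_size))   # centre of the bitmap
```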

Matrix

class PdfMatrix(a=1, b=0, c=0, d=1, e=0, f=0)[source]

Bases: object

PDF transformation matrix helper class.

See the PDF 1.7 specification, Section 8.3.3 (“Common Transformations”).

Note

The PDF format uses row vectors.
Transformations operate from the origin of the coordinate system (PDF coordinates: commonly bottom left, but can be any corner in principle. Device coordinates: top left).
Matrix calculations are implemented independently in Python.
Matrix objects are immutable, so transforming methods return a new matrix.
Matrix objects implement ctypes auto-conversion to FS_MATRIX for easy use as C function parameter.

a

Matrix value [0][0].

Type:: float

b

Matrix value [0][1].

Type:: float

c

Matrix value [1][0].

Type:: float

d

Matrix value [1][1].

Type:: float

e

Matrix value [2][0] (X translation).

Type:: float

f

Matrix value [2][1] (Y translation).

Type:: float
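With row vectors, a point is transformed as [x' y' 1] = [x y 1] × M, which expands to x' = a·x + c·y + e and y' = b·x + d·y + f. A minimal sketch on plain tuples (apply_matrix is a hypothetical helper, independent of PdfMatrix):

```python
def apply_matrix(m, x, y):
    """Apply the matrix (a, b, c, d, e, f) to the point (x, y), row-vector style."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


identity = (1, 0, 0, 1, 0, 0)
shift = (1, 0, 0, 1, 10, 20)   # e, f carry the translation
stretch = (2, 0, 0, 3, 0, 0)   # a, d carry the scale factors

print(apply_matrix(identity, 1, 2))
print(apply_matrix(shift, 1, 2))
print(apply_matrix(stretch, 1, 2))
```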

get()[source]: Get the matrix as tuple of the form (a, b, c, d, e, f).

classmethod from_raw(raw)[source]: Load a PdfMatrix from a raw FS_MATRIX object.

to_raw()[source]: Convert the matrix to a raw FS_MATRIX object.

multiply(other)[source]: Multiply this matrix by another PdfMatrix, to concatenate transformations.
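Concatenation is order-sensitive. The sketch below multiplies two matrices given as (a, b, c, d, e, f) tuples under the row-vector convention, so that in matmul(m1, m2) the transformation m1 takes effect first. That multiply(other) corresponds to matmul(self, other) is an assumption here; matmul itself is a hypothetical helper.

```python
def matmul(m1, m2):
    """Return m1 x m2 for 3x3 affine matrices stored as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


double = (2, 0, 0, 2, 0, 0)   # scale both axes by 2
shift = (1, 0, 0, 1, 10, 0)   # move 10 units to the right

print(matmul(double, shift))  # scale first, then shift
print(matmul(shift, double))  # shift first, then scale (the shift is doubled too)
```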

translate(x, y)[source]

Parameters:

x (float) – Horizontal shift (<0: left, >0: right).
y (float) – Vertical shift.

scale(x, y)[source]

Parameters:

x (float) – A factor to scale the X axis (<1: compress, >1: stretch).
y (float) – A factor to scale the Y axis.

rotate(angle, ccw=False, rad=False)[source]

Parameters:

angle (float) – Angle by which to rotate the matrix.
ccw (bool) – If True, rotate counter-clockwise.
rad (bool) – If True, interpret the angle as radians.
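For reference, the rotation matrices given in the PDF specification can be written out as below. In PDF coordinates (Y axis pointing up), (cos, sin, -sin, cos, 0, 0) rotates counter-clockwise; the clockwise matrix is its inverse. The helper names are hypothetical, and whether rotate() builds exactly these matrices is an assumption.

```python
import math


def rotation_matrix(angle, ccw=False, rad=False):
    """Return a rotation as (a, b, c, d, e, f), for a Y-up coordinate system."""
    if not rad:
        angle = math.radians(angle)
    c, s = math.cos(angle), math.sin(angle)
    return (c, s, -s, c, 0, 0) if ccw else (c, -s, s, c, 0, 0)


def apply_matrix(m, x, y):
    """Apply the matrix to a point using the row-vector convention."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def rounded(point):
    """Round away floating-point noise for display."""
    return tuple(round(v, 9) for v in point)


print(rounded(apply_matrix(rotation_matrix(90, ccw=True), 1, 0)))
print(rounded(apply_matrix(rotation_matrix(90), 1, 0)))
```

In device coordinates, where the Y axis points down, the same matrices appear to rotate in the opposite direction.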

mirror(invert_x, invert_y)[source]

Parameters:

invert_x (bool) – If True, invert X coordinates (horizontal transform). Corresponds to flipping around the Y axis.
invert_y (bool) – If True, invert Y coordinates (vertical transform). Corresponds to flipping around the X axis.

Note

Flipping around a vertical axis leads to a horizontal transform, and vice versa.

skew(x_angle, y_angle, rad=False)[source]

Parameters:

x_angle (float) – Inner angle to skew the X axis.
y_angle (float) – Inner angle to skew the Y axis.
rad (bool) – If True, interpret the angles as radians.

on_point(x, y)[source]

Returns:: Transformed point.
Return type:: (float, float)

on_rect(left, bottom, right, top)[source]

Returns:: Transformed rectangle.
Return type:: (float, float, float, float)
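One way to transform a rectangle is to transform its four corners and take the bounding box of the results, which keeps left ≤ right and bottom ≤ top even after mirroring or rotation. The sketch below follows that approach on plain tuples; it is not necessarily how on_rect() is implemented, and both helpers are hypothetical.

```python
def apply_matrix(m, x, y):
    """Apply the matrix (a, b, c, d, e, f) to a point, row-vector style."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def transform_rect(m, left, bottom, right, top):
    """Transform all four corners and return their bounding box."""
    corners = [apply_matrix(m, x, y) for x in (left, right) for y in (bottom, top)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


mirror_x = (-1, 0, 0, 1, 0, 0)     # invert X coordinates
quarter_turn = (0, 1, -1, 0, 0, 0)  # 90 degrees counter-clockwise (Y up)

print(transform_rect(mirror_x, 10, 20, 30, 40))
print(transform_rect(quarter_turn, 0, 0, 2, 1))
```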

Attachment

class PdfAttachment(raw, pdf)[source]

Bases: AutoCastable

Attachment helper class. See PDF 1.7, Section 7.11 “File Specifications”.

raw

The underlying PDFium attachment handle.

Type:: FPDF_ATTACHMENT

pdf

Reference to the document this attachment belongs to. Must remain valid as long as the attachment is used.

Type:: PdfDocument

get_name()[source]

Returns:: Name of the attachment.
Return type:: str

get_data()[source]

Returns:: The attachment’s file data (as c_char array).
Return type:: ctypes.Array
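When converting a c_char array to bytes, note that ctypes offers two accessors with different behaviour: raw returns every byte, while value stops at the first null byte. The sketch below builds such an array by hand (the payload is made up) rather than calling get_data().

```python
import ctypes

payload = b"PK\x03\x04 data\x00tail"  # made-up binary content with an embedded null

# Stand-in for the array returned by get_data().
data = (ctypes.c_char * len(payload)).from_buffer_copy(payload)

full = data.raw          # all bytes, including anything after a null byte
truncated = data.value   # stops at the first null byte

print(len(full), len(truncated))
```

For binary attachment data, raw (or bytes(data)) is therefore the safer choice.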

set_data(data)[source]

Set the attachment’s file data. If this function is called on an existing attachment, it will be changed to point at the new data, but the previous data will not be removed from the file (as of PDFium 5418).

Parameters:: data (bytes | ctypes.Array) – New file data for the attachment. May be any data type that can be implicitly converted to c_void_p.

has_key(key)[source]

Parameters:: key (str) – A key to look for in the attachment’s params dictionary.
Returns:: True if key is contained in the params dictionary, False otherwise.
Return type:: bool

get_value_type(key)[source]

Returns:: Type of the value of key in the params dictionary (FPDF_OBJECT_*).
Return type:: int

get_str_value(key)[source]

Returns:: The value of key in the params dictionary, if it is a string or name. Otherwise, an empty string will be returned. On other failures, an exception will be raised.
Return type:: str

set_str_value(key, value)[source]

Set the attribute specified by key to the string value.

Parameters:: value (str) – New string value for the attribute.

Miscellaneous

exception PdfiumError(msg, err_code=None)[source]

Bases: RuntimeError

An exception from the PDFium library, detected by function return code.

err_code

PDFium error code, for programmatic handling of error subtypes, if provided by the API in question (e.g. document loading). None otherwise.

Type:: int | None
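A caller might use err_code roughly as sketched below. FakePdfiumError is a hypothetical stand-in with the same constructor signature, so that the example runs without pdfium; the ErrorToStr table is copied from the internal mapping documented further down, and describe_error is an illustrative helper.

```python
ErrorToStr = {
    0: "Success",
    1: "Unknown error",
    2: "File access error",
    3: "Data format error",
    4: "Incorrect password error",
    5: "Unsupported security scheme error",
    6: "Page not found or content error",
}


class FakePdfiumError(RuntimeError):
    """Hypothetical stand-in mirroring PdfiumError(msg, err_code=None)."""

    def __init__(self, msg, err_code=None):
        super().__init__(msg)
        self.err_code = err_code


def describe_error(exc):
    """Build a message, adding the error subtype if a code is available."""
    code = getattr(exc, "err_code", None)
    if code is None:
        return str(exc)
    return f"{exc} (code {code}: {ErrorToStr.get(code, 'unrecognised code')})"


try:
    raise FakePdfiumError("Failed to load document", err_code=4)
except FakePdfiumError as exc:
    message = describe_error(exc)

print(message)
```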

class PdfUnspHandler[source]

Bases: object

Unsupported feature handler helper class.

handlers

A dictionary of named handler functions to be called with an unsupported code (FPDF_UNSP_*) when PDFium detects an unsupported feature.

Type:: dict[str, Callable]

setup(add_default=True)[source]

Attach the handler to PDFium, and register an exit function to keep the object alive for the rest of the session.

Parameters:: add_default (bool) – If True, add a default callback that will log unsupported features as warning.

Internal

Warning

The following helpers are considered internal, so their API may change at any time. They are isolated in their own namespace (pypdfium2.internal).

RotationToConst = {0: 0, 90: 1, 180: 2, 270: 3}: Convert a rotation value in degrees to a PDFium constant.

RotationToDegrees = {0: 0, 1: 90, 2: 180, 3: 270}: Convert a PDFium rotation constant to a value in degrees. Inversion of RotationToConst.
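The relationship between the two rotation tables can be checked with plain dictionaries. The tables below are copied from above rather than imported, and rotation_const is a hypothetical helper that additionally normalises arbitrary multiples of 90 degrees.

```python
RotationToConst = {0: 0, 90: 1, 180: 2, 270: 3}

# Inverting the mapping yields the table documented as RotationToDegrees.
RotationToDegrees = {const: degrees for degrees, const in RotationToConst.items()}


def rotation_const(degrees):
    """Map a rotation in degrees to its constant, wrapping around at 360."""
    degrees %= 360
    if degrees not in RotationToConst:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return RotationToConst[degrees]


print(RotationToDegrees)
print(rotation_const(-90), rotation_const(450))
```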

BitmapTypeToNChannels = {1: 1, 2: 3, 3: 4, 4: 4, 5: 4}: Get the number of channels for a PDFium bitmap format. (FPDFBitmap_Unknown is deliberately not handled.)

BitmapTypeToStr = {1: 'L', 2: 'BGR', 3: 'BGRX', 4: 'BGRA', 5: 'BGRa'}: Convert a PDFium bitmap format to string, assuming BGR byte order. (FPDFBitmap_Unknown is deliberately not handled.)

BitmapTypeToStrReverse = {1: 'L', 2: 'RGB', 3: 'RGBX', 4: 'RGBA', 5: 'RGBa'}: Convert a PDFium bitmap format to string, assuming RGB byte order. (FPDFBitmap_Unknown is deliberately not handled.)

BitmapStrToConst = {'BGR': 2, 'BGRA': 4, 'BGRX': 3, 'BGRa': 5, 'L': 1}: Convert a string to PDFium bitmap format, assuming BGR byte order. Inversion of BitmapTypeToStr.

BitmapStrReverseToConst = {'L': 1, 'RGB': 2, 'RGBA': 4, 'RGBX': 3, 'RGBa': 5}: Convert a string to PDFium bitmap format, assuming RGB byte order. Inversion of BitmapTypeToStrReverse.
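A sketch of how these tables fit together, using copies of the dictionaries above instead of importing pypdfium2.internal. The helper describe_format is hypothetical; it picks the mode string according to byte order and returns it with the channel count.

```python
BitmapTypeToNChannels = {1: 1, 2: 3, 3: 4, 4: 4, 5: 4}
BitmapTypeToStr = {1: "L", 2: "BGR", 3: "BGRX", 4: "BGRA", 5: "BGRa"}
BitmapTypeToStrReverse = {1: "L", 2: "RGB", 3: "RGBX", 4: "RGBA", 5: "RGBa"}


def describe_format(fmt, rev_byteorder=False):
    """Return (mode string, number of channels) for a bitmap format constant."""
    modes = BitmapTypeToStrReverse if rev_byteorder else BitmapTypeToStr
    return modes[fmt], BitmapTypeToNChannels[fmt]


# The string-to-constant tables are plain inversions of the tables above.
BitmapStrToConst = {mode: fmt for fmt, mode in BitmapTypeToStr.items()}

print(describe_format(4))
print(describe_format(4, rev_byteorder=True))
print(BitmapStrToConst["BGRX"])
```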

FormTypeToStr = {0: 'None', 1: 'AcroForm', 2: 'XFA', 3: 'XFAF'}: Convert a PDFium form type (FORMTYPE_*) to string.

ColorspaceToStr = {0: '?', 1: 'DeviceGray', 2: 'DeviceRGB', 3: 'DeviceCMYK', 4: 'CalGray', 5: 'CalRGB', 6: 'Lab', 7: 'ICCBased', 8: 'Separation', 9: 'DeviceN', 10: 'Indexed', 11: 'Pattern'}: Convert a PDFium color space constant (FPDF_COLORSPACE_*) to string.

ViewmodeToStr = {0: '?', 1: 'XYZ', 2: 'Fit', 3: 'FitH', 4: 'FitV', 5: 'FitR', 6: 'FitB', 7: 'FitBH', 8: 'FitBV'}: Convert a PDFium view mode constant (PDFDEST_VIEW_*) to string.

ObjectTypeToStr = {0: '?', 1: 'text', 2: 'path', 3: 'image', 4: 'shading', 5: 'form'}: Convert a PDFium object type constant (FPDF_PAGEOBJ_*) to string.

ObjectTypeToConst = {'?': 0, 'form': 5, 'image': 3, 'path': 2, 'shading': 4, 'text': 1}: Convert an object type string to a PDFium constant. Inversion of ObjectTypeToStr.

PageModeToStr = {-1: '?', 0: 'None', 1: 'Outline', 2: 'Thumbnails', 3: 'Full-screen', 4: 'Layers', 5: 'Attachments'}: Convert a PDFium page mode constant (PAGEMODE_*) to string.

ErrorToStr = {0: 'Success', 1: 'Unknown error', 2: 'File access error', 3: 'Data format error', 4: 'Incorrect password error', 5: 'Unsupported security scheme error', 6: 'Page not found or content error'}: Convert a PDFium error constant (FPDF_ERR_*) to string.

UnsupportedInfoToStr = {1: 'XFA form', 2: 'Portable collection', 3: 'Attachment (incomplete support)', 4: 'Security', 5: 'Shared review', 6: 'Shared form (acrobat)', 7: 'Shared form (filesystem)', 8: 'Shared form (email)', 11: '3D annotation', 12: 'Movie annotation', 13: 'Sound annotation', 14: 'Screen media annotation', 15: 'Screen rich media annotation', 16: 'Attachment annotation', 17: 'Signature annotation'}: Convert a PDFium unsupported constant (FPDF_UNSP_*) to string.

class AutoCastable[source]: Bases: object

class AutoCloseable(close_func, *args, obj=None, needs_free=True, **kwargs)[source]

Bases: AutoCastable

close(_by_parent=False)[source]

color_tohex(color, rev_byteorder)[source]

set_callback(struct, fname, callback)[source]

is_stream(buf, spec='r')[source]

get_buffer(ptr, size)[source]

get_bufreader(buffer)[source]

get_bufwriter(buffer)[source]

pages_c_array(pages)[source]