PDFium is not thread-safe. PDFium functions must not be called simultaneously from different threads, not even with different documents. [1]
However, you may still use pdfium in a threaded context, provided a mutex or similar mechanism ensures that only one pdfium call runs at a time.
It is fine to do pdfium work in one thread and other work in other threads.
The same applies to pypdfium2’s helpers, or any wrapper calling pdfium, whether directly or indirectly, unless protected by mutex.
To parallelize expensive pdfium tasks such as rendering, consider processes instead of threads.
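The single-call-at-a-time rule can be sketched with a shared lock. This is a minimal illustration only: fake_pdfium_call is a hypothetical stand-in for real pdfium work, and pdfium_lock and guarded_call are made-up names, not part of pypdfium2.

```python
import threading
import time

# Hypothetical names for illustration; none of this is pypdfium2 API.
pdfium_lock = threading.Lock()
_active = 0      # number of stand-in calls currently running
max_active = 0   # highest concurrency observed

def fake_pdfium_call(i):
    """Stand-in for a pdfium call; records how many calls run at once."""
    global _active, max_active
    _active += 1
    max_active = max(max_active, _active)
    time.sleep(0.001)  # pretend to do some work
    _active -= 1
    return i * 2

def guarded_call(i):
    # Only one thread may be inside the pdfium section at a time.
    with pdfium_lock:
        return fake_pdfium_call(i)

results = [None] * 8

def worker(i):
    results[i] = guarded_call(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every call goes through the lock, the threads gain no parallelism for the pdfium part itself, which is why processes are the better fit for expensive tasks.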
The raw PDFium API, to be used with ctypes (pypdfium2.raw or pypdfium2_raw[2]).
The support model API, which is a set of Python helper classes around the raw API (pypdfium2).
Additionally, there is the internal API, which contains various utilities that aid with using the raw API and are accessed by helpers, but do not fit in the support model namespace itself (pypdfium2.internal).
Wrapper objects provide a raw attribute to access the underlying ctypes object.
In addition, helpers automatically resolve to raw when used as a C function parameter. [3]
This makes it convenient to use helpers where available, while the raw API can still be accessed as needed.
The raw API is quite stable and provides a high level of backwards compatibility (seeing as PDFium is well-tested and relied on by popular projects), but it can be difficult to use, and special care needs to be taken with memory management.
The support model API is still in beta stage. It only covers a subset of pdfium features. Backwards incompatible changes may be applied occasionally, although we try to contain them within major releases. On the other hand, it is supposed to be safer and easier to use (“pythonic”), abstracting the finicky interaction with C functions.
This section covers the support model. It is not applicable to the raw API alone!
PDFium objects commonly need to be closed by the caller to release allocated memory. [4]
Where necessary, pypdfium2’s helper classes implement automatic closing on garbage collection using weakref.finalize. Additionally, they provide close() methods that can be used to release memory explicitly.
It may be advantageous to close objects explicitly instead of relying on Python garbage collection behaviour, to release allocated memory and acquired file handles immediately. [5]
Closed objects must not be accessed anymore.
However, closing an object sets its raw attribute to None, which should prevent illegal use of closed raw handles.
Attempts to re-close an already closed object are silently ignored.
Closing a parent object will automatically close any open children (e.g. pages derived from a pdf).
Raw objects must not be detached from their wrappers. Accessing a raw object after it was closed, whether explicitly or on garbage collection of the wrapper, is illegal (use after free).
Due to limitations in weakref, finalizers can only be attached to wrapper objects, although they logically belong to the raw object.
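The closing behaviour described above can be sketched with a toy wrapper built on weakref.finalize. This is an illustration of the pattern under stated assumptions, not pypdfium2's actual implementation: Wrapper, _release and closed_handles are hypothetical names, and a string stands in for a raw handle.

```python
import gc
import weakref

closed_handles = []  # records which stand-in handles were released

def _release(handle):
    # Stand-in for a PDFium close function (hypothetical).
    closed_handles.append(handle)

class Wrapper:
    """Toy wrapper; illustrates the pattern only."""

    def __init__(self, handle):
        self.raw = handle
        # The finalizer is attached to the wrapper, but is given the raw
        # handle. It must not reference self, or the wrapper could never
        # be garbage collected.
        self._finalizer = weakref.finalize(self, _release, handle)

    def close(self):
        if self.raw is None:
            return  # re-closing is silently ignored
        self._finalizer()  # runs _release at most once
        self.raw = None

w = Wrapper("handle-1")
w.close()
w.close()  # no effect

w2 = Wrapper("handle-2")
del w2        # wrapper becomes unreachable
gc.collect()  # the finalizer has run by now
```

Setting raw to None after closing means that a later attempt to pass the wrapper's handle on fails early instead of touching freed memory.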
True for editable install, False otherwise. None if unknown.
If True, the version info is the one captured at install time. An arbitrary number of forward or reverse changes may have happened since. The actual current state is unknown.
input_data (str | pathlib.Path | bytes | ctypes.Array | BinaryIO | FPDF_DOCUMENT) – The input PDF given as file path, bytes, ctypes array, byte buffer, or raw PDFium document handle.
A byte buffer is defined as an object that implements seek(), tell(), read() and readinto().
password (str | None) – A password to unlock the PDF, if encrypted. Otherwise, None or an empty string may be passed.
If a password is given but the PDF is not encrypted, it will be ignored (as of PDFium 5418).
autoclose (bool) – Whether byte buffer input should be automatically closed on finalization.
Raises:
PdfiumError – Raised if the document failed to load. The exception message is annotated with the reason reported by PDFium.
FileNotFoundError – Raised if an invalid or non-existent file path was given.
Hint
len() may be called to get a document’s number of pages.
Looping over a document will yield its pages from beginning to end.
Pages may be loaded using list index access.
The del keyword and list index access may be used to delete pages.
version (int | None) – The PDF version to use, given as an integer (14 for 1.4, 15 for 1.5, …).
If None (the default), PDFium will set a version automatically.
flags (int) – PDFium saving flags (defaults to FPDF_NO_INCREMENTAL).
type (int) – The identifier type to retrieve (FILEIDTYPE_*), either permanent or changing.
If the file was updated incrementally, the permanent identifier stays the same,
while the changing identifier is re-calculated.
Returns:
Unique file identifier from the PDF’s trailer dictionary.
See PDF 1.7, Section 14.4 “File Identifiers”.
Unlink the attachment at index (zero-based).
It will be hidden from the viewer, but is still present in the file (as of PDFium 5418).
Following attachments shift one slot to the left in the array representation used by PDFium’s API.
Handles to the attachment in question received from get_attachment()
must not be accessed anymore after this method has been called.
index (int | None) – Suggested zero-based index at which the page shall be inserted.
If None or larger than the document’s current last index, the page will be appended to the end.
pdf (PdfDocument) – The document from which to import pages.
pages (list[int] | str | None) – The pages to include. It may either be a list of zero-based page indices, or a string of one-based page numbers and ranges.
If None, all pages will be included.
index (int | None) – Zero-based index at which to insert the given pages. If None, they are appended to the end of the document.
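The relationship between the two forms of the pages parameter can be sketched as follows. The exact string grammar is not specified here, so this example assumes comma-separated numbers and hyphenated inclusive ranges such as "1,3,5-7"; parse_page_string is a hypothetical helper, not a pypdfium2 function.

```python
def parse_page_string(text):
    """Convert one-based page numbers and ranges to zero-based indices.

    Assumed grammar (not confirmed by the documentation above):
    comma-separated items, each either "N" or "N-M" (inclusive).
    """
    indices = []
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(p) for p in part.split("-"))
            # One-based inclusive range -> zero-based half-open range.
            indices.extend(range(first - 1, last))
        else:
            indices.append(int(part) - 1)
    return indices

page_indices = parse_page_string("1,3,5-7")
```

Under that assumption, the string "1,3,5-7" and the list [0, 2, 4, 5, 6] describe the same selection of pages.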
Deprecated since version 4.19: This method will be removed with the next major release due to serious issues rooted in the original API design. Use PdfPage.render() instead.
Note that the CLI provides parallel rendering using a proper caller-side process pool with inline saving in rendering jobs.
Changed in version 4.25: Removed the original process pool implementation and turned this into a wrapper for linear rendering, due to the serious conceptual issues and possible memory load escalation, especially with expensive receiving code (e.g. PNG encoding) or long documents. See the changelog for more info.
An independent page object representation of the XObject.
If multiple page objects are created from one XObject, they share resources.
Page objects created from an XObject remain valid after the XObject is closed.
is_closed (bool) – True if child items shall be collapsed, False if they shall be expanded.
None if the item has no descendants (i. e. n_kids==0).
n_kids (int) – Absolute number of child items, according to the PDF.
page_index (int | None) – Zero-based index of the page the bookmark points to.
May be None if the bookmark has no target page (or it could not be determined).
view_mode (int) – A view mode constant (PDFDEST_VIEW_*) defining how the coordinates of view_pos shall be interpreted.
view_pos (list[float]) – Target position on the page the viewport should jump to when the bookmark is clicked.
It is a sequence of float values in PDF canvas units.
Depending on view_mode, it may contain between 0 and 4 coordinates.
The page MediaBox in PDF canvas units, consisting of four coordinates (usually x0, y0, x1, y1).
If MediaBox is not defined, returns ANSI A (0, 0, 612, 792) if fallback_ok=True, None otherwise.
Due to quirks in PDFium’s public API, all get_*box() functions except get_bbox()
do not inherit from parent nodes in the page tree (as of PDFium 5418).
The page object must not belong to a page yet. If it belongs to a PDF, this page must be part of the PDF.
Position and form are defined by the object’s matrix.
If it is the identity matrix, the object will appear as-is on the bottom left corner of the page.
Remove a page object from the page.
As of PDFium 5692, detached page objects may only be re-inserted into existing pages of the same document.
If the page object is not re-inserted into a page, its close() method may be called.
filter (list[int] | None) – An optional list of page object types to filter (FPDF_PAGEOBJ_*).
Any objects whose type is not contained will be skipped.
If None or empty, all objects will be provided, regardless of their type.
max_depth (int) – Maximum recursion depth to consider when descending into Form XObjects.
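How a type filter and a depth limit interact can be sketched on a toy object tree. Everything here is an assumption made for illustration: the dictionaries stand in for page objects, "form" stands in for a Form XObject, and the choices to count the top level as depth 1 and to descend into forms even when they are filtered out are this sketch's own, not confirmed behaviour of pypdfium2.

```python
# Toy object tree; "form" entries stand in for Form XObjects holding children.
tree = [
    {"type": "text"},
    {"type": "form", "children": [
        {"type": "image"},
        {"type": "form", "children": [{"type": "text"}]},
    ]},
]

def iter_objects(objects, type_filter=None, max_depth=2, level=1):
    """Yield objects depth-first, descending into forms up to max_depth levels."""
    for obj in objects:
        # None or an empty filter lets every type through.
        if not type_filter or obj["type"] in type_filter:
            yield obj
        # Forms are descended into regardless of the filter (assumption).
        if obj["type"] == "form" and level < max_depth:
            yield from iter_objects(obj["children"], type_filter, max_depth, level + 1)

all_types = [o["type"] for o in iter_objects(tree)]
text_only = [o["type"] for o in iter_objects(tree, type_filter=["text"], max_depth=3)]
```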
scale (float) – A factor scaling the number of pixels per PDF canvas unit. This defines the resolution of the image.
To convert a DPI value to a scale factor, multiply it by the size of 1 canvas unit in inches (usually 1/72in). [6]
rotation (int) – Additional rotation in degrees (0, 90, 180, or 270).
crop (tuple[float, float, float, float]) – Amount in PDF canvas units to cut off from page borders (left, bottom, right, top). Crop is applied after rotation.
may_draw_forms (bool) – If True, render form fields (provided the document has forms and init_forms() was called).
bitmap_maker (Callable) – Callback function used to create the PdfBitmap.
color_scheme (PdfColorScheme | None) – An optional, custom rendering color scheme.
fill_to_stroke (bool) – If True and rendering with custom color scheme, fill paths will be stroked.
fill_color (tuple[int, int, int, int]) – Color the bitmap will be filled with before rendering (RGBA values from 0 to 255).
grayscale (bool) – If True, render in grayscale mode.
draw_annots (bool) – If True, render page annotations.
no_smoothtext (bool) – If True, disable text anti-aliasing. Overrides optimize_mode="lcd".
no_smoothimage (bool) – If True, disable image anti-aliasing.
no_smoothpath (bool) – If True, disable path anti-aliasing.
force_halftone (bool) – If True, always use halftone for image stretching.
limit_image_cache (bool) – If True, limit image cache size.
rev_byteorder (bool) – If True, render with reverse byte order, leading to RGB(A/X) output instead of BGR(A/X).
Other pixel formats are not affected.
prefer_bgrx (bool) – If True, prefer four-channel over three-channel pixel formats, even if the alpha byte is unused.
Other pixel formats are not affected.
force_bitmap_format (int | None) – If given, override automatic pixel format selection and enforce use of the given format (one of the FPDFBitmap_* constants).
extra_flags (int) – Additional PDFium rendering flags. May be combined with bitwise OR (| operator).
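The DPI conversion described for the scale parameter is plain arithmetic and can be sketched as below. dpi_to_scale is a hypothetical helper name, and the default of 1/72 inch per canvas unit is the usual value mentioned above, not a guarantee for every page.

```python
def dpi_to_scale(dpi, unit_inches=1 / 72):
    """Convert a DPI value to a scale factor (pixels per canvas unit)."""
    return dpi * unit_inches

scale_300 = dpi_to_scale(300)  # a little over 4 pixels per canvas unit
scale_72 = dpi_to_scale(72)    # about 1 pixel per canvas unit

# Approximate pixel width of a 612-unit-wide page at 300 DPI, before any rounding.
width_px = 612 * scale_300
```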
Reference to the document this pageobject belongs to. May be None if the object does not belong to a document yet.
This attribute is always set if page is set.
Filters applied by FPDFImageObj_GetImageDataDecoded(). Hereafter referred to as “simple filters”, while non-simple filters will be called “complex filters”.
pdf (PdfDocument) – The document to which the new image object shall be added.
Returns:
Handle to a new, empty image.
Note that position and size of the image are defined by its matrix, which defaults to the identity matrix.
This means that new images will appear as a tiny square of 1x1 units on the bottom left corner of the page.
Use PdfMatrix and set_matrix() to adjust size and position.
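The effect of the matrix on size and position can be illustrated with plain arithmetic on the six PDF matrix coefficients (a, b, c, d, e, f). This sketch deliberately does not use PdfMatrix; apply_matrix is a hypothetical helper, and the values 200, 100 and 50 are arbitrary example numbers.

```python
def apply_matrix(m, x, y):
    """Map a point through a PDF transformation matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

IDENTITY = (1, 0, 0, 1, 0, 0)
# Scale the unit square to 200x100 units and move it to (50, 50).
PLACED = (200, 0, 0, 100, 50, 50)

# An image occupies the unit square (0, 0)-(1, 1) before transformation.
identity_corner = apply_matrix(IDENTITY, 1, 1)
placed_origin = apply_matrix(PLACED, 0, 0)
placed_corner = apply_matrix(PLACED, 1, 1)
```

With the identity matrix the far corner stays at (1, 1), which is the tiny 1x1 square mentioned above; with the second matrix the image spans from (50, 50) to (250, 150).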
Retrieve image metadata including DPI, bits per pixel, color space, and size.
If the image does not belong to a page yet, bits per pixel and color space will be unset (0).
Note
The DPI values signify the resolution of the image on the PDF page, not the DPI metadata embedded in the image file.
Due to issues in PDFium, this function can be slow. If you only need image size, prefer the faster get_size() instead.
source (str | pathlib.Path | BinaryIO) – Input JPEG, given as file path or readable byte buffer.
pages (list[PdfPage] | None) – If replacing an image, pass in a list of loaded pages that might contain it, to update their cache.
(The same image may be shown multiple times in different transforms across a PDF.)
May be None or an empty sequence if the image is not shared.
inline (bool) – Whether to load the image content into memory. If True, the buffer may be closed after this function call.
Otherwise, the buffer needs to remain open until the PDF is closed.
autoclose (bool) – If the input is a buffer, whether it should be automatically closed once not needed by the PDF anymore.
Extract the image into an independently usable file or byte buffer.
Where possible within PDFium’s limited public API, it will be attempted to transfer the image data directly,
avoiding an unnecessary layer of decoding and re-encoding.
Otherwise, the fully decoded data will be retrieved and (re-)encoded using PIL.
As PDFium does not expose all required information, only DCTDecode (JPEG) and JPXDecode (JPEG 2000) images can be extracted directly.
For images with complex filters, the bitmap data is used. Otherwise, get_data(decode_simple=True) is used, which avoids lossy conversion for images whose bit depth or colour format is not supported by PDFium’s bitmap implementation.
Parameters:
dest (str | io.BytesIO) – File prefix or byte buffer to which the image shall be written.
fb_format (str) – The image format to use in case it is necessary to (re-)encode the data.
fb_render (bool) – Whether the image should be rendered if falling back to bitmap-based extraction.
Changed in version 4.28: For various reasons, calling this method with default params now implicitly translates to get_text_bounded() (pass force_this=True to circumvent).
The returned text’s length does not have to match count, even if it will for most PDFs.
This is because the underlying API may exclude/insert chars compared to the internal list, although rare in practice.
This means, if the char at i is excluded, get_text_range(i,2)[1] will raise an index error.
Pdfium provides raw APIs FPDFText_GetTextIndexFromCharIndex() / FPDFText_GetCharIndexFromTextIndex() to translate between the two views and identify excluded/inserted chars.
In case of leading/trailing excluded characters, pypdfium2 modifies index and count accordingly to prevent pdfium from unexpectedly reading beyond range(index,index+count).
Get the bounding box of a text rectangle at the given index.
Note that count_rects() must be called once with default parameters
before subsequent get_rect() calls for this function to work (due to PDFium’s API).
Returns:
Float values for left, bottom, right and top in PDF canvas units.
index (int) – Character index at which to start searching.
match_case (bool) – If True, the search will be case-specific (upper and lower letters treated as different characters).
match_whole_word (bool) – If True, substring occurrences will be ignored (e. g. cat would not match category).
consecutive (bool) – If False (the default), search() will skip past the current match to look for the next match.
If True, parts of the previous match may be caught again (e. g. searching for aa in aaaa would match 3 rather than 2 times).
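The difference between the two stepping modes can be reproduced on a plain Python string. find_all is a hypothetical helper that mimics the described behaviour of the consecutive parameter; it does not call PDFium.

```python
def find_all(text, needle, consecutive=False):
    """Return the start indices of all matches of needle in text."""
    if not needle:
        raise ValueError("needle must not be empty")
    starts = []
    pos = text.find(needle)
    while pos != -1:
        starts.append(pos)
        # Either advance by one character, or skip past the whole match.
        step = 1 if consecutive else len(needle)
        pos = text.find(needle, pos + step)
    return starts

overlapping = find_all("aaaa", "aa", consecutive=True)
skipping = find_all("aaaa", "aa", consecutive=False)
```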
Start character index and count of the previous occurrence (i. e. the one before the last valid occurrence),
or None if the last occurrence was passed.
This class provides built-in converters (e. g. to_pil(), to_numpy()) that may be used to create a different representation of the bitmap.
Converters can be applied on PdfBitmap objects either as bound method (bitmap.to_*()), or as function (PdfBitmap.to_*(bitmap)).
The second pattern is useful for API methods that need to apply a caller-provided converter (e. g. PdfDocument.render()).
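The two calling patterns rely on ordinary Python method semantics and can be shown with a toy class. ToyBitmap, to_list and toy_render are hypothetical stand-ins made up for this sketch; they are not pypdfium2 names.

```python
class ToyBitmap:
    """Stand-in for a bitmap class; not pypdfium2's PdfBitmap."""

    def __init__(self, data):
        self.data = data

    def to_list(self):
        """Toy converter: represent the buffer as a list of ints."""
        return list(self.data)

def toy_render(converter):
    """Hypothetical API method that applies a caller-provided converter."""
    bitmap = ToyBitmap(b"\x01\x02\x03")
    return converter(bitmap)

bitmap = ToyBitmap(b"\x01\x02\x03")
as_bound = bitmap.to_list()              # bound method pattern
as_function = ToyBitmap.to_list(bitmap)  # function pattern
via_render = toy_render(ToyBitmap.to_list)
```

Passing the unbound function lets the receiving code create the bitmap itself and convert it afterwards, which a bound method cannot do.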
Note
All attributes of PdfBitmapInfo are available in this class as well.
Warning
bitmap.close(), which frees the buffer of foreign bitmaps, is not validated for safety.
A bitmap must not be closed when other objects still depend on its buffer!
Construct a PdfBitmap wrapper around a raw PDFium bitmap handle.
Parameters:
raw (FPDF_BITMAP) – PDFium bitmap handle.
rev_byteorder (bool) – Whether the bitmap uses reverse byte order.
ex_buffer (c_ubyte | None) – If the bitmap was created from a buffer allocated by Python/ctypes, pass in the ctypes array to keep it referenced.
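Reverse byte order means that channels are stored as RGB(A/X) rather than BGR(A/X). The difference can be sketched on a raw four-channel buffer; bgra_to_rgba is a hypothetical helper operating on plain bytes and is not part of pypdfium2.

```python
def bgra_to_rgba(data):
    """Swap the first and third byte of every 4-byte pixel."""
    if len(data) % 4:
        raise ValueError("buffer length must be a multiple of 4")
    out = bytearray(data)
    # Blue and red trade places; green and alpha stay where they are.
    out[0::4], out[2::4] = data[2::4], data[0::4]
    return bytes(out)

# One pixel with B=10, G=20, R=30, A=255.
bgra_pixel = bytes([10, 20, 30, 255])
rgba_pixel = bgra_to_rgba(bgra_pixel)
```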
classmethod new_native(width, height, format, rev_byteorder=False, buffer=None)
Create a new bitmap using FPDFBitmap_CreateEx(), with a buffer allocated by Python/ctypes.
Bitmaps created by this function are always packed (no unused bytes at line end).
classmethod new_foreign(width, height, format, rev_byteorder=False, force_packed=False)
Create a new bitmap using FPDFBitmap_CreateEx(), with a buffer allocated by PDFium.
Using this method is discouraged. Prefer new_native() instead.
Create a new bitmap using FPDFBitmap_Create(). The buffer is allocated by PDFium.
The resulting bitmap is supposed to be packed (i. e. no gap of unused bytes between lines).
Using this method is discouraged. Prefer new_native() instead.
The array contains as many rows as the bitmap is high.
Each row contains as many pixels as the bitmap is wide.
The length of each pixel corresponds to the number of channels.
The resulting array is supposed to share memory with the original bitmap buffer,
so changes to the buffer should be reflected in the array, and vice versa.
Returns:
NumPy array (representation of the bitmap buffer).
For RGBA, RGBX and L buffers, PIL is supposed to share memory with
the original bitmap buffer, so changes to the buffer should be reflected in the image, and vice versa.
Otherwise, PIL will make a copy of the data.
Returns:
PIL image (representation or copy of the bitmap buffer).
Convert a PIL image to a PDFium bitmap.
Due to the restricted number of color formats and bit depths supported by PDFium’s
bitmap implementation, this may be a lossy operation.
Bitmaps returned by this function should be treated as immutable (i.e. don’t call fill_rect()).
Number of bytes per line in the bitmap buffer.
Depending on how the bitmap was created, there may be a padding of unused bytes at the end of each line, so this value can be greater than width*n_channels.
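When reading a buffer by hand, the stride rather than width*n_channels determines where each line starts. The following sketch uses a plain bytes object with made-up contents; iter_rows is a hypothetical helper, not a pypdfium2 function.

```python
def iter_rows(buffer, width, height, n_channels, stride):
    """Yield the meaningful bytes of each line, skipping padding at line end."""
    row_len = width * n_channels
    for y in range(height):
        start = y * stride  # lines begin at multiples of the stride
        yield bytes(buffer[start:start + row_len])

# 2x2 single-channel bitmap with a stride of 4:
# two pixel bytes per line, followed by two padding bytes.
buf = bytes([1, 2, 0, 0,
             3, 4, 0, 0])
rows = list(iter_rows(buf, width=2, height=2, n_channels=1, stride=4))
```

For a packed bitmap the stride equals width*n_channels and the same loop simply yields consecutive slices.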
Set the attachment’s file data.
If this function is called on an existing attachment, it will be changed to point at the new data,
but the previous data will not be removed from the file (as of PDFium 5418).
Parameters:
data (bytes | ctypes.Array) – New file data for the attachment. May be any data type that can be implicitly converted to c_void_p.
The value of key in the params dictionary, if it is a string or name.
Otherwise, an empty string will be returned. On other failures, an exception will be raised.