PDFium is inherently not thread-safe. It is not allowed to call pdfium functions simultaneously across different threads, not even with different documents. [1]
However, you may still use pdfium in a threaded context if it is ensured that only a single pdfium call can be made at a time (e.g. via mutex).
It is fine to do pdfium work in one thread and other work in other threads.
The same applies to pypdfium2’s helpers, or any wrapper calling pdfium, whether directly or indirectly, unless protected by mutex.
To parallelize expensive pdfium tasks such as rendering, consider processes instead of threads.
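The mutex approach described above can be sketched as follows. This is only an illustration: `PDFIUM_LOCK`, `call_serialized` and `fake_pdfium_call` are hypothetical names, and the stand-in function takes the place of a real pdfium-backed call.

```python
import threading

# One process-wide lock guarding every pdfium call (an assumption of this sketch).
PDFIUM_LOCK = threading.Lock()

def call_serialized(func, *args, **kwargs):
    """Run func while holding the lock, so at most one call is active at a time."""
    with PDFIUM_LOCK:
        return func(*args, **kwargs)

# Hypothetical stand-in for a pdfium-backed operation.
# It records how many calls were active simultaneously.
_active = 0
max_active = 0

def fake_pdfium_call(n):
    global _active, max_active
    _active += 1
    max_active = max(max_active, _active)
    result = n * 2
    _active -= 1
    return result

results = []

def worker(n):
    results.append(call_serialized(fake_pdfium_call, n))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every call goes through the same lock, the threads take turns. This keeps the calls safe but does not make them run in parallel, which is why processes are suggested for expensive work.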
The raw PDFium API, to be used with ctypes (pypdfium2.raw or pypdfium2_raw[2]).
The support model API, which is a set of Python helper classes around the raw API (pypdfium2).
Additionally, there is the internal API, which contains various utilities that aid with using the raw API and are accessed by helpers, but do not fit in the support model namespace itself (pypdfium2.internal).
Wrapper objects provide a raw attribute to access the underlying ctypes object.
In addition, helpers automatically resolve to raw if used as a C function parameter. [3]
This makes it convenient to use helpers where available, while the raw API can still be accessed as needed.
The raw API is quite stable and provides a high level of backwards compatibility (seeing as PDFium is well-tested and relied on by popular projects), but it can be difficult to use, and special care needs to be taken with memory management.
The support model API is still in beta stage. It only covers a subset of pdfium features. Backwards incompatible changes may be applied occasionally, although we try to contain them within major releases. On the other hand, it is supposed to be safer and easier to use (“pythonic”), abstracting the finicky interaction with C functions.
This section covers the support model. It is not applicable to the raw API alone!
PDFium objects commonly need to be closed by the caller to release allocated memory. [4]
Where necessary, pypdfium2’s helper classes implement automatic closing on garbage collection using weakref.finalize. Additionally, they provide close() methods that can be used to release memory explicitly.
It may be advantageous to close objects explicitly instead of relying on Python garbage collection behaviour, to release allocated memory and acquired file handles immediately. [5]
Closed objects must not be accessed anymore.
However, closing an object sets the underlying raw attribute to None, which should prevent illegal use of closed raw handles.
Attempts to re-close an already closed object are silently ignored.
Closing a parent object will automatically close any open children (e.g. pages derived from a pdf).
Raw objects must not be detached from their wrappers. Accessing a raw object after it was closed, whether explicitly or on garbage collection of the wrapper, is illegal (use after free).
Due to limitations in weakref, finalizers can only be attached to wrapper objects, although they logically belong to the raw object.
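The closing behaviour described in this section can be illustrated with a minimal sketch built on weakref.finalize. It is not pypdfium2's implementation: `HandleWrapper`, `_release` and the string handles are hypothetical, and `released` merely records which handles were freed.

```python
import weakref

released = []  # records freed handles, for demonstration only

def _release(handle):
    """Hypothetical stand-in for the C function that frees a raw handle."""
    released.append(handle)

class HandleWrapper:
    def __init__(self, handle):
        self.raw = handle
        # The finalizer is attached to the wrapper, but receives the raw handle.
        # It must not hold a reference to self, or the wrapper could never be collected.
        self._finalizer = weakref.finalize(self, _release, handle)

    def close(self):
        if self.raw is None:
            return  # already closed: silently ignored
        self._finalizer()  # runs _release at most once
        self.raw = None    # guards against use of the closed handle

a = HandleWrapper("handle-a")
a.close()
a.close()  # second call is a no-op

b = HandleWrapper("handle-b")
del b  # the finalizer runs when the wrapper is garbage collected
```

Calling a weakref.finalize object marks it as dead, so the release function cannot run a second time when the wrapper is later collected.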
input_data (str | pathlib.Path | bytes | ctypes.Array | BinaryIO | FPDF_DOCUMENT) – The input PDF given as file path, bytes, ctypes array, byte stream, or raw PDFium document handle.
A byte stream is defined as an object that implements seek(), tell(), read() and readinto().
password (str | None) – A password to unlock the PDF, if encrypted. Otherwise, None or an empty string may be passed.
If a password is given but the PDF is not encrypted, it will be ignored (as of PDFium 5418).
autoclose (bool) – Whether byte stream input should be automatically closed on finalization.
Raises:
PdfiumError – Raised if the document failed to load. The exception is annotated with the reason reported by PDFium (via message and err_code).
FileNotFoundError – Raised if an invalid or non-existent file path was given.
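The byte stream definition given for input_data amounts to a duck-typing check. A minimal sketch of such a check follows; `is_byte_stream` is a hypothetical helper written for illustration, not a pypdfium2 function.

```python
import io

REQUIRED_METHODS = ("seek", "tell", "read", "readinto")

def is_byte_stream(obj):
    """Return True if obj provides all four methods named in the definition."""
    return all(callable(getattr(obj, name, None)) for name in REQUIRED_METHODS)

binary_buffer = io.BytesIO(b"%PDF-1.7")
text_buffer = io.StringIO("text")  # lacks readinto(), so it does not qualify

binary_ok = is_byte_stream(binary_buffer)
text_ok = is_byte_stream(text_buffer)
```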
Hint
Documents may be used in a with-block, closing the document on context manager exit.
This is recommended when input_data is a file path, to safely and immediately release the bound file handle.
len() may be called to get a document’s number of pages.
Pages may be loaded using list index access.
Looping over a document will yield its pages from beginning to end.
The del keyword and list index access may be used to delete pages.
version (int | None) – The PDF version to use, given as an integer (14 for 1.4, 15 for 1.5, …).
If None (the default), PDFium will set a version automatically.
flags (int) – PDFium saving flags (defaults to FPDF_NO_INCREMENTAL).
type (int) – The identifier type to retrieve (FILEIDTYPE_*), either permanent or changing.
If the file was updated incrementally, the permanent identifier stays the same,
while the changing identifier is re-calculated.
Returns:
Unique file identifier from the PDF’s trailer dictionary.
See PDF 1.7, Section 14.4 “File Identifiers”.
Unlink the attachment at given index (zero-based).
It will be hidden from the viewer, but is still present in the file (as of PDFium 5418).
Following attachments shift one slot to the left in the array representation used by PDFium’s API.
Handles to the attachment in question received from get_attachment()
must not be accessed anymore after this method has been called.
index (int | None) – Suggested zero-based index at which the page shall be inserted.
If None or larger than the document’s current last index, the page will be appended to the end.
pdf (PdfDocument) – The document from which to import pages.
pages (list[int] | str | None) – The pages to include. It may either be a list of zero-based page indices, or a string of one-based page numbers and ranges.
If None, all pages will be included.
index (int | None) – Zero-based index at which to insert the given pages. If None, they are appended to the end of the document.
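The two forms accepted for pages differ in their numbering: list entries are zero-based indices, while the string uses one-based page numbers. The sketch below converts one form into the other. `page_string_to_indices` is a hypothetical helper, and the grammar it accepts (comma-separated numbers and inclusive `a-b` ranges) is an assumption; the exact syntax pdfium understands is not specified here.

```python
def page_string_to_indices(spec):
    """Convert one-based page numbers and ranges into zero-based indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(value) for value in part.split("-", 1))
            # One-based inclusive range a-b maps to zero-based a-1 .. b-1
            indices.extend(range(first - 1, last))
        else:
            indices.append(int(part) - 1)
    return indices

indices = page_string_to_indices("1-3,5")
```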
An independent pageobject representation of the XObject.
If multiple pageobjects are created from an XObject, they share resources.
Returned pageobjects remain valid after the XObject is closed.
Signed number of child bookmarks that would be visible if the bookmark were open (i.e. recursively counting children of open children).
The bookmark’s initial state is open (expanded) if the number is positive, closed (collapsed) if negative.
Zero if the bookmark has no descendants.
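The sign convention above can be decoded with a few lines of Python. `describe_bookmark_count` is a hypothetical helper for illustration, not part of the API.

```python
def describe_bookmark_count(count):
    """Split the signed count into an initial state and a number of children."""
    if count > 0:
        return ("open", count)
    if count < 0:
        return ("closed", -count)
    return ("no descendants", 0)

state, n_visible = describe_bookmark_count(-3)
```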
A tuple of (view_mode, view_pos).
view_mode is a constant (one of PDFDEST_VIEW_*) defining how view_pos shall be interpreted.
view_pos is the target position on the page the dest points to.
It may contain between 0 and 4 float coordinates, depending on the view mode.
The page MediaBox in PDF canvas units, consisting of four coordinates (usually x0, y0, x1, y1).
If MediaBox is not defined, returns ANSI A (0, 0, 612, 792) if fallback_ok=True, None otherwise.
The pageobject must not belong to a page yet. If it belongs to a PDF, the target page must be part of that PDF.
Position and form are defined by the object’s matrix.
If it is the identity matrix, the object will appear as-is on the bottom left corner of the page.
Remove a pageobject from the page.
As of PDFium 5692, detached pageobjects may only be re-inserted into existing pages of the same document.
If the pageobject is not re-inserted into a page, its close() method may be called.
Note
If the object’s type is FPDF_PAGEOBJ_TEXT, any PdfTextPage handles to the page should be closed before removing the object.
filter (list[int] | None) – An optional list of pageobject types to filter (FPDF_PAGEOBJ_*).
Any objects whose type is not contained will be skipped.
If None or empty, all objects will be provided, regardless of their type.
max_depth (int) – Maximum recursion depth to consider when descending into Form XObjects.
Flatten form fields and annotations into page contents.
Attention
init_forms() must have been called on the parent pdf, before the page was retrieved, for this method to work. In other words, PdfPage.formenv must be non-null.
Flattening may invalidate existing handles to the page, so you may want to re-initialize these afterwards.
scale (float) – A factor scaling the number of pixels per PDF canvas unit. This defines the resolution of the image.
To convert a DPI value to a scale factor, multiply it by the size of 1 canvas unit in inches (usually 1/72in). [6]
rotation (int) – Additional rotation in degrees (0, 90, 180, or 270).
crop (tuple[float, float, float, float]) – Amount in PDF canvas units to cut off from page borders (left, bottom, right, top). Crop is applied after rotation.
may_draw_forms (bool) – If True, render form fields (provided the document has forms and init_forms() was called).
bitmap_maker (Callable) – Callback function used to create the PdfBitmap.
fill_color (tuple[int, int, int, int]) – Color the bitmap will be filled with before rendering. This uses RGBA syntax regardless of the pixel format used, with values from 0 to 255.
If the fill color is not opaque (i.e. has transparency), {BGR/RGB}A will be used.
grayscale (bool) – If True, render in grayscale mode.
draw_annots (bool) – If True, render page annotations.
no_smoothtext (bool) – If True, disable text anti-aliasing. Overrides optimize_mode="lcd".
no_smoothimage (bool) – If True, disable image anti-aliasing.
no_smoothpath (bool) – If True, disable path anti-aliasing.
force_halftone (bool) – If True, always use halftone for image stretching.
limit_image_cache (bool) – If True, limit image cache size.
rev_byteorder (bool) – If True, render with reverse byte order, leading to RGB{A/x} output rather than BGR{A/x}.
Other pixel formats are not affected.
prefer_bgrx (bool) – If True, use 4-byte {BGR/RGB}x rather than 3-byte {BGR/RGB} (i.e. add an unused byte).
Other pixel formats are not affected.
use_bgra_on_transparency (bool) – If True, use a pixel format with alpha channel (i.e. {BGR/RGB}A) if page content has transparency.
This is recommended for performance in these cases, but as page-dependent format selection is somewhat unexpected, it is not enabled by default.
force_bitmap_format (int | None) – If given, override automatic pixel format selection and enforce use of the given format (one of the FPDFBitmap_* constants). In this case, you should not pass any other format selection options, except potentially rev_byteorder.
extra_flags (int) – Additional PDFium rendering flags. May be combined with bitwise OR (| operator).
color_scheme (PdfColorScheme | None) – A custom pdfium color scheme. Note that this may flatten different colors into one, so the usability of this is limited.
fill_to_stroke (bool) – If a color_scheme is given, whether to only draw borders around fill areas using the path_stroke color, instead of filling with the path_fill color.
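The DPI conversion mentioned for scale can be sketched as below. Multiplying by 1/72 inch per canvas unit is written here as a division by 72 units per inch. `dpi_to_scale` and `scaled_size` are hypothetical helpers, and rounding the pixel size up is an assumption of this sketch, not documented behaviour.

```python
import math

def dpi_to_scale(dpi, units_per_inch=72):
    """Convert a DPI value to a scale factor (pixels per canvas unit)."""
    return dpi / units_per_inch

def scaled_size(width_units, height_units, scale):
    """Estimate the bitmap size in pixels for a page size given in canvas units."""
    # Rounding up is an assumption made for this illustration.
    return math.ceil(width_units * scale), math.ceil(height_units * scale)

scale = dpi_to_scale(144)
width_px, height_px = scaled_size(612, 792, scale)  # ANSI A page at 144 DPI
```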
When constructing a PdfObject, an instance of a more specific subclass may be returned instead, depending on the object’s type (e. g. PdfImage).
Note
PdfObject.close() only takes effect on loose pageobjects.
It is a no-op otherwise, because pageobjects that are part of a page are owned by pdfium, not the caller.
Reference to the document this pageobject belongs to. May be None if the object does not belong to a document yet.
This attribute is always set if page is set.
Get the object’s quadrilateral points (i.e. the positions of its corners).
For transformed objects, this may provide tighter bounds than a rectangle (e.g. rotation by a non-multiple of 90°, shear).
Note
This function only supports image and text objects.
Returns:
Corner positions as (x, y) tuples, counter-clockwise from origin, i.e. bottom-left, bottom-right, top-right, top-left, in PDF page coordinates.
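To see why quad points can give tighter bounds than a rectangle, the sketch below compares the area of a quadrilateral with that of its axis-aligned bounding box. `quad_bounds` and `quad_area` are hypothetical helpers, and the diamond-shaped example quad is made up for illustration.

```python
def quad_bounds(quad):
    """Axis-aligned bounding box (left, bottom, right, top) of four corner points."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return min(xs), min(ys), max(xs), max(ys)

def quad_area(quad):
    """Area of the quadrilateral, computed with the shoelace formula."""
    total = 0.0
    for i, (x0, y0) in enumerate(quad):
        x1, y1 = quad[(i + 1) % len(quad)]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2

# A square rotated by 45 degrees, corners listed counter-clockwise.
quad = ((50, 0), (100, 50), (50, 100), (0, 50))
left, bottom, right, top = quad_bounds(quad)
box_area = (right - left) * (top - bottom)
area = quad_area(quad)
```

For this example the bounding box covers twice the area of the object itself.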
pdf (PdfDocument) – The document to which the new image object shall be added.
Returns:
Handle to a new, empty image.
Note that position and size of the image are defined by its matrix, which defaults to the identity matrix.
This means that new images will appear as a tiny square of 1x1 canvas units on the bottom left corner of the page.
Use PdfMatrix and set_matrix() to adjust size and position.
Retrieve image metadata including DPI, bits per pixel, color space, and size.
If the image does not belong to a page yet, bits per pixel and color space will be unset (0).
Note
The DPI values signify the resolution of the image on the PDF page, not the DPI metadata embedded in the image file.
Due to issues in pdfium, this function might be slow on some kinds of images. If you only need size, prefer get_px_size() instead.
source (str | pathlib.Path | BinaryIO) – Input JPEG, given as file path or readable byte stream.
pages (list[PdfPage] | None) – If replacing an image, pass in a list of loaded pages that might contain it, to update their cache.
(The same image may be shown multiple times in different transforms across a PDF.)
May be None or an empty sequence if the image is not shared.
inline (bool) – Whether to load the image content into memory. If True, the buffer may be closed after this function call.
Otherwise, the buffer needs to remain open until the PDF is closed.
autoclose (bool) – If the input is a buffer, whether it should be automatically closed once not needed by the PDF anymore.
render (bool) – Whether the image should be rendered, thereby applying possible transform matrices and alpha masks.
scale_to_original (bool) – If render is True, whether to temporarily scale the image to its native resolution, or close to that (defaults to True). This should improve output quality. Ignored if render is False.
decode_simple (bool) – If True, decode simple filters (see SIMPLE_FILTERS), so only complex filters will remain, if any. If there are no complex filters, this provides the decoded pixel data.
If False, the raw stream data will be returned instead.
Extract the image into an independently usable file or byte stream, attempting to avoid re-encoding or quality loss, as far as pdfium’s limited API permits.
This method can only extract DCTDecode (JPEG) and JPXDecode (JPEG 2000) images directly.
Otherwise, the pixel data is decoded and re-encoded using PIL, which is slower and loses the original encoding.
For images with simple filters only, get_data(decode_simple=True) is used to preserve higher bit depth or special color formats not supported by FPDF_BITMAP.
For images with complex filters other than those extracted directly, we have to resort to get_bitmap().
Note, this method is not able to account for alpha masks, and potentially other data stored separately from the main image stream, which might lead to incorrect representation of the image.
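The strategy selection described above can be summarised as a small decision function. This sketch is one reading of the prose, not the library's code: `choose_strategy` is hypothetical, and the contents of `SIMPLE_FILTERS` listed here are an assumption.

```python
# Filters whose data can be written out as-is, with the matching file format.
DIRECT_FILTERS = {"DCTDecode": "jpg", "JPXDecode": "jp2"}

# Assumed set of simple filters; consult SIMPLE_FILTERS for the actual values.
SIMPLE_FILTERS = ("ASCIIHexDecode", "ASCII85Decode", "RunLengthDecode", "FlateDecode", "LZWDecode")

def choose_strategy(filters):
    """Pick an extraction path from the image's list of stream filters."""
    # Simple filters can be decoded, so only the complex ones matter.
    complex_filters = [f for f in filters if f not in SIMPLE_FILTERS]
    if not complex_filters:
        return ("decoded_data", None)  # get_data(decode_simple=True), then re-encode
    if len(complex_filters) == 1 and complex_filters[0] in DIRECT_FILTERS:
        return ("direct", DIRECT_FILTERS[complex_filters[0]])
    return ("bitmap", None)  # get_bitmap(), then re-encode

jpeg_strategy = choose_strategy(["DCTDecode"])
flate_strategy = choose_strategy(["FlateDecode"])
fax_strategy = choose_strategy(["CCITTFaxDecode"])
```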
Tip
The pikepdf library is capable of preserving the original encoding in many cases where this method is not.
Parameters:
dest (str | pathlib.Path | io.BytesIO) – File path prefix or byte stream to which the image shall be written.
fb_format (str) – The image format to use in case it is necessary to (re-)encode the data.
(py)pdfium itself does not implement layout analysis, such as detecting words/lines/paragraphs.
However, there may be third-party extensions for this job, e.g.: https://github.com/VikParuchuri/pdftext
The returned text’s length does not have to match count, even if it will for most PDFs.
This is because the underlying API may exclude/insert chars compared to the internal list, although this is rare in practice.
This means, if the char at i is excluded, get_text_range(i,2)[1] will raise an index error.
Pdfium provides raw APIs FPDFText_GetTextIndexFromCharIndex() / FPDFText_GetCharIndexFromTextIndex() to translate between the two views and identify excluded/inserted chars.
In case of leading/trailing excluded characters, pypdfium2 modifies index and count accordingly to prevent pdfium from unexpectedly reading beyond range(index,index+count).
The index of the character at or nearby the point (x, y).
May be None if there is no character. If an internal error occurred, an exception will be raised.
index (int) – Character index at which to start searching.
match_case (bool) – If True, the search will be case-specific (upper and lower letters treated as different characters).
match_whole_word (bool) – If True, substring occurrences will be ignored (e. g. cat would not match category).
consecutive (bool) – If False (the default), search() will skip past the current match to look for the next match.
If True, parts of the previous match may be caught again (e. g. searching for aa in aaaa would match 3 rather than 2 times).
flags (int) – Passthrough of raw pdfium searching flags. Note that you may want to use the boolean options instead.
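The effect of the consecutive option can be reproduced on a plain Python string. The sketch below does not use pdfium; `find_all` is a hypothetical helper that returns (index, count) pairs.

```python
def find_all(text, query, consecutive=False):
    """Collect (index, count) pairs for every occurrence of query in text."""
    if not query:
        raise ValueError("query must not be empty")
    matches = []
    pos = text.find(query)
    while pos != -1:
        matches.append((pos, len(query)))
        # consecutive=True advances by one char, so matches may overlap.
        # Otherwise the search resumes after the end of the current match.
        step = 1 if consecutive else len(query)
        pos = text.find(query, pos + step)
    return matches

skipping = find_all("aaaa", "aa")
overlapping = find_all("aaaa", "aa", consecutive=True)
```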
Start character index and count of the previous occurrence (i. e. the one before the last valid occurrence), or None if the last occurrence was passed.
bitmap.close(), which frees the buffer of foreign bitmaps, is not validated for safety.
A bitmap must not be closed while other objects still depend on its buffer!
Number of bytes per line in the bitmap buffer.
Depending on how the bitmap was created, there may be a padding of unused bytes at the end of each line, so this value can be greater than width*n_channels.
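When reading pixels from a raw buffer, the stride rather than width*n_channels determines where each line starts. The sketch below shows the arithmetic. The helper names are hypothetical, and the 4-byte alignment is only an example of how padding might arise, not a statement about what pdfium does.

```python
def packed_stride(width, n_channels):
    """Bytes per line when lines follow each other without padding."""
    return width * n_channels

def padded_stride(width, n_channels, alignment=4):
    """Bytes per line when each line is rounded up to a multiple of alignment."""
    packed = width * n_channels
    return (packed + alignment - 1) // alignment * alignment

def pixel_offset(x, y, stride, n_channels):
    """Byte offset of the pixel at column x, row y."""
    return y * stride + x * n_channels

packed = packed_stride(5, 3)   # 5 pixels wide, 3 channels
padded = padded_stride(5, 3)   # one unused byte at the end of each line
offset = pixel_offset(2, 1, padded, 3)
```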
Construct a PdfBitmap wrapper around a raw PDFium bitmap handle.
Note
This method is primarily meant for bitmaps provided by pdfium (as in PdfImage.get_bitmap()). For bitmaps created by the caller, where the parameters are already known, it may be preferable to call the PdfBitmap constructor directly.
Parameters:
raw (FPDF_BITMAP) – PDFium bitmap handle.
rev_byteorder (bool) – Whether the bitmap uses reverse byte order.
ex_buffer (c_ubyte | None) – If the bitmap was created from a buffer allocated by Python/ctypes, pass in the ctypes array to keep it referenced.
classmethod new_native(width, height, format, rev_byteorder=False, buffer=None, stride=None)
Create a new bitmap using FPDFBitmap_CreateEx(), with a buffer allocated by Python/ctypes, or provided by the caller.
If buffer and stride are None, a packed buffer is created.
If a custom buffer is given but no stride, the buffer is assumed to be packed.
If a custom stride is given but no buffer, a stride-agnostic buffer is created.
If both custom buffer and stride are given, they are used as-is.
Caller-provided buffer/stride are subject to a logical validation.
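The four buffer/stride combinations listed above can be sketched with plain ctypes. `resolve_buffer` is a hypothetical helper, and the two checks it performs are an assumption about what a logical validation might include.

```python
import ctypes

def resolve_buffer(width, height, n_channels, buffer=None, stride=None):
    """Return a (buffer, stride) pair covering all four input combinations."""
    packed = width * n_channels
    if stride is None:
        stride = packed  # no stride given: assume packed lines
    elif stride < packed:
        raise ValueError("stride is smaller than one line of pixels")
    needed = stride * height
    if buffer is None:
        buffer = (ctypes.c_ubyte * needed)()  # allocate to fit the stride
    elif ctypes.sizeof(buffer) < needed:
        raise ValueError("buffer is too small for the given size and stride")
    return buffer, stride

# Neither given: packed buffer
buf_a, stride_a = resolve_buffer(4, 2, 4)
# Stride only: buffer sized to the stride
buf_b, stride_b = resolve_buffer(4, 2, 4, stride=20)
# Buffer only: assumed packed
own = (ctypes.c_ubyte * 32)()
buf_c, stride_c = resolve_buffer(4, 2, 4, buffer=own)
```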
classmethod new_foreign(width, height, format, rev_byteorder=False, force_packed=False)
Create a new bitmap using FPDFBitmap_CreateEx(), with a buffer allocated by PDFium.
There may be a padding of unused bytes at line end, unless force_packed=True is given.
Note, the recommended default bitmap creation strategy is new_native().
Create a new bitmap using FPDFBitmap_Create(). The buffer is allocated by PDFium.
PDFium docs specify that each line uses width * 4 bytes, with no gap between adjacent lines, i.e. the resulting buffer should be packed.
Contrary to the other PdfBitmap.new_*() methods, this method does not take a format constant, but a use_alpha boolean. If True, the format will be FPDFBitmap_BGRA, FPDFBitmap_BGRx otherwise. Other bitmap formats cannot be used with this method.
Note, the recommended default bitmap creation strategy is new_native().
The array contains as many rows as the bitmap is high.
Each row contains as many pixels as the bitmap is wide.
Each pixel will be an array holding the channel values, or just a value if there is only one channel (see n_channels and format).
The resulting array is supposed to share memory with the original bitmap buffer,
so changes to the buffer should be reflected in the array, and vice versa.
Returns:
NumPy array (representation of the bitmap buffer).
For RGBA, RGBX and L bitmaps, PIL is supposed to share memory with
the original buffer, so changes to the buffer should be reflected in the image, and vice versa.
Otherwise, PIL will make a copy of the data.
Returns:
PIL image (representation or copy of the bitmap buffer).
Convert a PIL image to a PDFium bitmap.
Due to the limited number of color formats and bit depths supported by FPDF_BITMAP, this may be a lossy operation.
Bitmaps returned by this function should be treated as immutable.
Acquire a PdfPosConv object to translate between coordinates on the bitmap and the page it was rendered from.
This method requires passing in the page explicitly, to avoid holding a strong reference, so that bitmap and page can be independently freed by finalizer.
See the PDF 1.7 specification, Section 8.3.3 (“Common Transformations”).
Note
The PDF format uses row vectors.
Transformations operate from the origin of the coordinate system
(PDF coordinates: commonly bottom left, but can be any corner in principle. Device coordinates: top left).
Matrix calculations are implemented independently in Python.
Matrix objects are immutable, so transforming methods return a new matrix.
Matrix objects implement ctypes auto-conversion to FS_MATRIX for easy use as a C function parameter.
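With row vectors, a point is transformed as [x' y' 1] = [x y 1] × M, so applying M1 first and M2 second corresponds to the product M1 × M2. The sketch below works this out on plain (a, b, c, d, e, f) tuples. It does not use the library's matrix class, and `multiply` and `apply` are hypothetical helper names.

```python
def multiply(m1, m2):
    """Return m1 x m2, i.e. the transform that applies m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def apply(m, x, y):
    """Transform the point (x, y) with the matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

scale = (200, 0, 0, 100, 0, 0)     # stretch the unit square to 200x100
translate = (1, 0, 0, 1, 50, 60)   # move by (50, 60)

scale_then_translate = multiply(scale, translate)
translate_then_scale = multiply(translate, scale)

# Top-right corner of the unit square after scaling, then translating
corner = apply(scale_then_translate, 1, 1)
```

The two products differ because the scaling also acts on a translation that was applied before it, which is why the order of operations matters when placing objects such as images.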
Set the attachment’s file data.
If this function is called on an existing attachment, it will be changed to point at the new data,
but the previous data will not be removed from the file (as of PDFium 5418).
Parameters:
data (bytes | ctypes.Array) – New file data for the attachment. May be any data type that can be implicitly converted to c_void_p.
The value of key in the params dictionary, if it is a string or name.
Otherwise, an empty string will be returned. On other failures, an exception will be raised.