Shell API
pypdfium2 can also be used from the command line.
Version
$ pypdfium2 --version
pypdfium2 5.0.0+4.gc74b8f4
pdfium 143.0.7483.0 at /opt/hostedtoolcache/Python/3.12.11/x64/lib/python3.12/site-packages/pypdfium2_raw/libpdfium.so
Main Help
$ pypdfium2 --help
usage: pypdfium2 [-h] [--version]
{arrange,attachments,extract-images,extract-text,imgtopdf,pageobjects,pdfinfo,render,tile,toc}
...
Command line interface to the pypdfium2 library (Python binding to PDFium)
positional arguments:
{arrange,attachments,extract-images,extract-text,imgtopdf,pageobjects,pdfinfo,render,tile,toc}
arrange Rearrange/merge documents
attachments List/extract/edit embedded files
extract-images Extract images
extract-text Extract text
imgtopdf Convert images to PDF
pageobjects Print info on pageobjects
pdfinfo Print info on document and pages
render Rasterize pages
tile Tile pages (N-up)
toc Print table of contents
options:
-h, --help show this help message and exit
--version, -v show program's version number and exit
Arranger
$ pypdfium2 arrange --help
usage: pypdfium2 arrange [-h] [--pages PAGES [PAGES ...]]
[--passwords PASSWORDS [PASSWORDS ...]] --output
OUTPUT
inputs [inputs ...]
Rearrange/merge documents
positional arguments:
inputs Sequence of PDF files.
options:
-h, --help show this help message and exit
--pages PAGES [PAGES ...]
Sequence of page texts, defining the pages to include from each PDF. Use '_' as placeholder for all pages.
--passwords PASSWORDS [PASSWORDS ...]
Passwords to unlock encrypted PDFs. Any placeholder may be used for non-encrypted documents.
--output OUTPUT, -o OUTPUT
Target path for the output document
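The help text above implies that `--pages` takes one page text per input file, with '_' standing for "all pages". As a minimal sketch of that pairing, the following builds an argument list for `pypdfium2 arrange` from Python. The helper name `build_arrange_argv` and the strict length check are assumptions made for illustration; they are not part of pypdfium2.

```python
from typing import List, Optional, Sequence


def build_arrange_argv(
    inputs: Sequence[str],
    output: str,
    pages: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build an argv list for `pypdfium2 arrange` (hypothetical helper)."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    if pages is None:
        # '_' is the documented placeholder for "all pages" of a document.
        pages = ["_"] * len(inputs)
    if len(pages) != len(inputs):
        # Assumption: one page text per input, as the help text suggests.
        raise ValueError("expected one page text per input file")
    # Only flags that appear in the help output above are used here.
    return ["pypdfium2", "arrange", *inputs, "--pages", *pages, "--output", output]


argv = build_arrange_argv(["a.pdf", "b.pdf"], "merged.pdf")
print(argv)
```

The resulting list could be passed to `subprocess.run(argv, check=True)` on a system where pypdfium2 is installed and the input files exist.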
Attachments
$ pypdfium2 attachments --help
usage: pypdfium2 attachments [-h] [--password PASSWORD]
input {list,extract,edit} ...
List/extract/edit embedded files
positional arguments:
input Input PDF document
{list,extract,edit}
options:
-h, --help show this help message and exit
--password PASSWORD A password to unlock the PDF, if encrypted
$ pypdfium2 attachments file.pdf list --help
usage: pypdfium2 attachments input list [-h]
options:
-h, --help show this help message and exit
$ pypdfium2 attachments file.pdf extract --help
usage: pypdfium2 attachments input extract [-h] [--numbers NUMBERS]
--output-dir OUTPUT_DIR
options:
-h, --help show this help message and exit
--numbers NUMBERS
--output-dir OUTPUT_DIR, -o OUTPUT_DIR
$ pypdfium2 attachments file.pdf edit --help
usage: pypdfium2 attachments input edit [-h] [--del-numbers DEL_NUMBERS]
[--add-files F [F ...]] --output
OUTPUT
options:
-h, --help show this help message and exit
--del-numbers DEL_NUMBERS, -d DEL_NUMBERS
--add-files F [F ...], -a F [F ...]
--output OUTPUT, -o OUTPUT
Image Extractor
$ pypdfium2 extract-images --help
usage: pypdfium2 extract-images [-h] [--password PASSWORD] [--pages PAGES]
--output-dir OUTPUT_DIR
[--max-depth MAX_DEPTH] [--use-bitmap]
[--format FORMAT] [--render]
[--scale-to-original | --no-scale-to-original]
input
Extract images
positional arguments:
input Input PDF document
options:
-h, --help show this help message and exit
--password PASSWORD A password to unlock the PDF, if encrypted
--pages PAGES Page numbers and ranges to include
--output-dir OUTPUT_DIR, -o OUTPUT_DIR
Output directory to take the extracted images
--max-depth MAX_DEPTH
Maximum recursion depth to consider when looking for pageobjects.
--use-bitmap Enforce the use of bitmaps rather than attempting a smart extraction of the image.
--format FORMAT Image format to use when saving bitmaps. (Fallback if doing smart extraction.)
--render When --use-bitmap is given, whether to get rendered bitmaps, taking masks and transform matrices into account.
--scale-to-original, --no-scale-to-original
When --use-bitmap --render is given, whether to scale the image so it is rendered at its native resolution, or close to that. This should improve output quality. The default is True, but you may opt out.
Text Extractor
$ pypdfium2 extract-text --help
usage: pypdfium2 extract-text [-h] [--password PASSWORD] [--pages PAGES]
[--strategy {range,bounded}]
input
Extract text
Note that PDFium outputs CRLF (\r\n) style line breaks.
This may be undesirable or confusing in some situations, e.g. when processing the output with an (unaware) parser on the command line.
If this is an issue, run e.g. `dos2unix` on the output, or use the Python API.
positional arguments:
input Input PDF document
options:
-h, --help show this help message and exit
--password PASSWORD A password to unlock the PDF, if encrypted
--pages PAGES Page numbers and ranges to include
--strategy {range,bounded}
PDFium text extraction strategy (range, bounded).
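The note above says PDFium emits CRLF (\r\n) line breaks and suggests `dos2unix` or the Python API as a remedy. As a rough sketch of the same normalization in plain Python, assuming the extracted text is already available as a string (the function name `crlf_to_lf` is made up for this example):

```python
def crlf_to_lf(text: str) -> str:
    """Replace CRLF line breaks with LF; other characters are left untouched."""
    return text.replace("\r\n", "\n")


sample = "first line\r\nsecond line\r\n"
print(repr(crlf_to_lf(sample)))  # 'first line\nsecond line\n'
```

Only the two-character CRLF sequence is replaced, so a lone `\r` would be left as-is.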
Image Converter
$ pypdfium2 imgtopdf --help
usage: pypdfium2 imgtopdf [-h] --output OUTPUT [--inline] images [images ...]
Convert images to PDF
positional arguments:
images Input images
options:
-h, --help show this help message and exit
--output OUTPUT, -o OUTPUT
Target path for the new PDF
--inline If JPEG, whether to use PDFium's inline loading function.
Pageobjects Info
$ pypdfium2 pageobjects --help
usage: pypdfium2 pageobjects [-h] [--password PASSWORD] [--pages PAGES]
[--n-digits N_DIGITS] [--filter T [T ...]]
[--max-depth MAX_DEPTH]
[--info {pos,imginfo} [{pos,imginfo} ...]]
input
Print info on pageobjects
positional arguments:
input Input PDF document
options:
-h, --help show this help message and exit
--password PASSWORD A password to unlock the PDF, if encrypted
--pages PAGES Page numbers and ranges to include
--n-digits N_DIGITS Number of digits to which coordinates/sizes shall be rounded
--filter T [T ...] Object types to include. Choices: ['?', 'text', 'path', 'image', 'shading', 'form']
--max-depth MAX_DEPTH
Maximum recursion depth to consider when descending into Form XObjects.
--info {pos,imginfo} [{pos,imginfo} ...]
Object details to show.
Document Info
$ pypdfium2 pdfinfo --help
usage: pypdfium2 pdfinfo [-h] [--password PASSWORD] [--pages PAGES]
[--n-digits N_DIGITS]
input
Print info on document and pages
positional arguments:
input Input PDF document
options:
-h, --help show this help message and exit
--password PASSWORD A password to unlock the PDF, if encrypted
--pages PAGES Page numbers and ranges to include
--n-digits N_DIGITS Number of digits to which coordinates/sizes shall be rounded
Renderer
$ pypdfium2 render --help
usage: pypdfium2 render [-h] [--password PASSWORD] [--pages PAGES] --output
OUTPUT [--prefix PREFIX] [--format FORMAT]
[--engine ENGINE_CLS] [--scale SCALE]
[--rotation {0,90,180,270}] [--fill-color C C C C]
[--optimize-mode {lcd,print}] [--crop C C C C]
[--draw-annots | --no-draw-annots]
[--draw-forms | --no-draw-forms]
[--no-antialias {text,image,path} [{text,image,path} ...]]
[--force-halftone]
[--bitmap-maker {native,foreign,foreign_packed,foreign_simple}]
[--grayscale] [--byteorder REV_BYTEORDER]
[--x-channel | --no-x-channel]
[--maybe-alpha | --no-maybe-alpha] [--linear [LINEAR]]
[--processes PROCESSES]
[--parallel-strategy {spawn,forkserver,fork}]
[--parallel-lib {mp,ft}] [--parallel-map PARALLEL_MAP]
[--sample-theme] [--path-fill C C C C]
[--path-stroke C C C C] [--text-fill C C C C]
[--text-stroke C C C C] [--fill-to-stroke]
[--invert-lightness] [--exclude-images]
input
Rasterize pages
positional arguments:
input Input PDF document
options:
-h, --help show this help message and exit
--password PASSWORD A password to unlock the PDF, if encrypted
--pages PAGES Page numbers and ranges to include
--output OUTPUT, -o OUTPUT
Output directory where the serially numbered images shall be placed.
--prefix PREFIX Custom prefix for the images. Defaults to the input filename's stem.
--format FORMAT, -f FORMAT
The image format to use (default: conditional).
--engine ENGINE_CLS The saver engine to use ('pil', 'numpy+pil', 'numpy+cv2')
--scale SCALE Define the resolution of the output images. By default, one PDF point (1/72in) is rendered to 1x1 pixel. This factor scales the number of pixels that represent one point.
--rotation {0,90,180,270}
Rotate pages by 90, 180 or 270 degrees.
--fill-color C C C C Color the bitmap will be filled with before rendering. Shall be given in RGBA format as a sequence of integers ranging from 0 to 255. Defaults to white.
--optimize-mode {lcd,print}
The rendering optimisation mode. None if not given.
--crop C C C C Amount to crop from (left, bottom, right, top).
--draw-annots, --no-draw-annots
Whether annotations may be shown (default: true).
--draw-forms, --no-draw-forms
Whether forms may be shown (default: true).
--no-antialias {text,image,path} [{text,image,path} ...]
Item types that shall not be smoothed.
--force-halftone Always use halftone for image stretching.
Bitmap options:
Bitmap config, including pixel format.
--bitmap-maker {native,foreign,foreign_packed,foreign_simple}
The bitmap maker to use.
--grayscale Whether to render in grayscale mode (no colors).
--byteorder REV_BYTEORDER
Whether to use BGR or RGB byteorder (default: conditional).
--x-channel, --no-x-channel
Whether to prefer BGRx/RGBx over BGR/RGB (default: conditional).
--maybe-alpha, --no-maybe-alpha
Whether to use BGRA if page content has transparency. Note, this makes format selection page-dependent. As this behavior can be confusing, it is not currently the default, but recommended for performance in these cases.
Parallelization:
Options for rendering with multiple processes.
--linear [LINEAR] Render non-parallel if page count is less than or equal to the specified value (default: 4). If this flag is given without a value, then render linear regardless of document length.
--processes PROCESSES
The maximum number of parallel rendering processes. Defaults to the number of CPU cores.
--parallel-strategy {spawn,forkserver,fork}
The process start method to use. ('fork' is discouraged due to stability issues.)
--parallel-lib {mp,ft}
The parallelization module to use (mp = multiprocessing, ft = concurrent.futures).
--parallel-map PARALLEL_MAP
The map function to use (backend specific, the default is an iterative map).
Flat color scheme:
Options for using pdfium's color scheme renderer. Note that this may flatten different colors into one, so the usability of this is limited. Alternatively, consider post-processing with lightness inversion (see below).
--sample-theme Use a dark background sample theme as base. Explicit color params override selectively.
--path-fill C C C C
--path-stroke C C C C
--text-fill C C C C
--text-stroke C C C C
--fill-to-stroke When rendering with custom color scheme, only draw borders around fill areas using the `path_stroke` color, instead of filling with the `path_fill` color. This is actually recommended, since with a single fill color for paths the boundaries of adjacent fill paths are less visible.
Post processing:
Options to post-process rendered images. Note, this may have a strongly negative impact on performance.
--invert-lightness Invert lightness using the HLS color space (e.g. white<->black, dark_blue<->light_blue). The intent is to achieve a dark theme for documents with light background, while providing better visual results than classical color inversion or a flat pdfium color scheme. However, note that --optimize-mode lcd is not recommendable when inverting lightness.
--exclude-images Whether to exclude PDF images from lightness inversion.
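Two of the render options above lend themselves to small worked sketches. First, since `--scale` counts pixels per PDF point and a point is 1/72 in, a target DPI corresponds to a scale of dpi / 72. Second, `--invert-lightness` is described as inverting lightness in the HLS color space; the code below illustrates that idea on a single RGB pixel with the standard library's `colorsys` module. Both function names are invented for this example, and the pixel routine is an illustration of the concept, not pypdfium2's actual implementation.

```python
import colorsys
from typing import Tuple

POINTS_PER_INCH = 72  # one PDF point is 1/72 in, as stated in the --scale help


def scale_for_dpi(dpi: float) -> float:
    """Return the --scale value that corresponds to the given DPI."""
    return dpi / POINTS_PER_INCH


def invert_lightness(rgb: Tuple[int, int, int]) -> Tuple[int, int, int]:
    """Invert the HLS lightness of one 8-bit RGB pixel, keeping hue and saturation."""
    r, g, b = (channel / 255 for channel in rgb)
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    inverted = colorsys.hls_to_rgb(h, 1.0 - l, s)
    return tuple(round(channel * 255) for channel in inverted)


print(scale_for_dpi(144))                 # 2.0
print(invert_lightness((255, 255, 255)))  # (0, 0, 0)
print(invert_lightness((0, 0, 128)))      # dark blue becomes a light blue
```

Because hue and saturation are preserved, colors keep their identity while light backgrounds turn dark, which is the difference from a classical per-channel inversion.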
Page Tiler
$ pypdfium2 tile --help
usage: pypdfium2 tile [-h] [--password PASSWORD] --output OUTPUT --rows ROWS
--cols COLS --width WIDTH --height HEIGHT [--unit UNIT]
input
Tile pages (N-up)
positional arguments:
input Input PDF document
options:
-h, --help show this help message and exit
--password PASSWORD A password to unlock the PDF, if encrypted
--output OUTPUT, -o OUTPUT
Target path for the new document
--rows ROWS, -r ROWS Number of rows (horizontal tiles)
--cols COLS, -c COLS Number of columns (vertical tiles)
--width WIDTH Target width
--height HEIGHT Target height
--unit UNIT, -u UNIT Unit for target width and height (pt, mm, cm, in)
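The `--unit` option accepts pt, mm, cm and in for the target width and height. As a hedged sketch of how such values relate to PDF points, the following uses the standard conversion factors (72 pt per inch, 25.4 mm per inch). The table `POINTS_PER_UNIT` and the function `to_points` are hypothetical names for this example; how pypdfium2 converts units internally is not shown in the help output.

```python
# Points per unit, based on 1 in = 72 pt and 1 in = 25.4 mm.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "in": 72.0,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
}


def to_points(value: float, unit: str) -> float:
    """Convert a length in the given unit to PDF points."""
    try:
        factor = POINTS_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"unsupported unit: {unit!r}") from None
    return value * factor


# A4 paper is 210 mm x 297 mm.
width_pt = to_points(210, "mm")
height_pt = to_points(297, "mm")
print(round(width_pt, 2), round(height_pt, 2))  # 595.28 841.89
```

This kind of conversion is handy for checking what `--width 210 --height 297 --unit mm` amounts to in points.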
TOC Reader
$ pypdfium2 toc --help
usage: pypdfium2 toc [-h] [--password PASSWORD] [--n-digits N_DIGITS]
[--max-depth MAX_DEPTH]
input
Print table of contents
positional arguments:
input Input PDF document
options:
-h, --help show this help message and exit
--password PASSWORD A password to unlock the PDF, if encrypted
--n-digits N_DIGITS Number of digits to which coordinates/sizes shall be rounded
--max-depth MAX_DEPTH
Maximum recursion depth to consider when parsing the table of contents