Shell API

pypdfium2 can also be used from the command line.

Version

$ pypdfium2 --version
pypdfium2 5.0.0+4.gc74b8f4
pdfium 143.0.7483.0 at /opt/hostedtoolcache/Python/3.12.11/x64/lib/python3.12/site-packages/pypdfium2_raw/libpdfium.so

Main Help

$ pypdfium2 --help
usage: pypdfium2 [-h] [--version]
                 {arrange,attachments,extract-images,extract-text,imgtopdf,pageobjects,pdfinfo,render,tile,toc}
                 ...

Command line interface to the pypdfium2 library (Python binding to PDFium)

positional arguments:
  {arrange,attachments,extract-images,extract-text,imgtopdf,pageobjects,pdfinfo,render,tile,toc}
    arrange             Rearrange/merge documents
    attachments         List/extract/edit embedded files
    extract-images      Extract images
    extract-text        Extract text
    imgtopdf            Convert images to PDF
    pageobjects         Print info on pageobjects
    pdfinfo             Print info on document and pages
    render              Rasterize pages
    tile                Tile pages (N-up)
    toc                 Print table of contents

options:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit

Arranger

$ pypdfium2 arrange --help
usage: pypdfium2 arrange [-h] [--pages PAGES [PAGES ...]]
                         [--passwords PASSWORDS [PASSWORDS ...]] --output
                         OUTPUT
                         inputs [inputs ...]

Rearrange/merge documents

positional arguments:
  inputs                Sequence of PDF files.

options:
  -h, --help            show this help message and exit
  --pages PAGES [PAGES ...]
                        Sequence of page texts, defining the pages to include from each PDF. Use '_' as placeholder for all pages.
  --passwords PASSWORDS [PASSWORDS ...]
                        Passwords to unlock encrypted PDFs. Any placeholder may be used for non-encrypted documents.
  --output OUTPUT, -o OUTPUT
                        Target path for the output document
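
As a minimal sketch of the options above, the wrapper below merges several PDFs, taking a page selection from the first document and all pages from the rest. The function name `merge_pdfs`, the `PYPDFIUM2` override variable and the file names are hypothetical. The `1-3` range syntax for a page text is an assumption, since the help text only documents the `_` placeholder.

```shell
# Hypothetical wrapper around `pypdfium2 arrange`.
# Usage: merge_pdfs OUTPUT FIRST_PAGES INPUT [INPUT ...]
# FIRST_PAGES applies to the first input; every further input gets '_' (all pages).
merge_pdfs() {
    output=$1
    first_pages=$2
    shift 2
    pages=$first_pages
    i=1
    # Append one '_' placeholder for each input after the first.
    while [ "$i" -lt "$#" ]; do
        pages="$pages _"
        i=$((i + 1))
    done
    # $pages is deliberately unquoted so that it splits into one argument per input.
    # PYPDFIUM2 is an assumed override hook, e.g. PYPDFIUM2=echo for a dry run.
    "${PYPDFIUM2:-pypdfium2}" arrange "$@" --pages $pages --output "$output"
}
```

Calling `PYPDFIUM2=echo merge_pdfs merged.pdf 1-3 a.pdf b.pdf` prints the arguments that would be passed instead of running the tool, which is a cheap way to check that the number of page texts matches the number of inputs.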

Attachments

$ pypdfium2 attachments --help
usage: pypdfium2 attachments [-h] [--password PASSWORD]
                             input {list,extract,edit} ...

List/extract/edit embedded files

positional arguments:
  input                Input PDF document
  {list,extract,edit}

options:
  -h, --help           show this help message and exit
  --password PASSWORD  A password to unlock the PDF, if encrypted

$ pypdfium2 attachments file.pdf list --help
usage: pypdfium2 attachments input list [-h]

options:
  -h, --help  show this help message and exit

$ pypdfium2 attachments file.pdf extract --help
usage: pypdfium2 attachments input extract [-h] [--numbers NUMBERS]
                                           --output-dir OUTPUT_DIR

options:
  -h, --help            show this help message and exit
  --numbers NUMBERS
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR

$ pypdfium2 attachments file.pdf edit --help
usage: pypdfium2 attachments input edit [-h] [--del-numbers DEL_NUMBERS]
                                        [--add-files F [F ...]] --output
                                        OUTPUT

options:
  -h, --help            show this help message and exit
  --del-numbers DEL_NUMBERS, -d DEL_NUMBERS
  --add-files F [F ...], -a F [F ...]
  --output OUTPUT, -o OUTPUT
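
The subcommands above take the input document first and the action second. The sketch below chains `list` and `extract` for one document. The function name `dump_attachments` and the `PYPDFIUM2` override variable are hypothetical, and it assumes that `extract` without `--numbers` extracts every embedded file, which the help text does not state.

```shell
# Hypothetical helper: list the embedded files of a PDF, then extract them.
# Usage: dump_attachments INPUT.pdf OUTPUT_DIR
dump_attachments() {
    pdf=$1
    outdir=$2
    # Create the target directory up front so the extract step has somewhere to write.
    mkdir -p "$outdir"
    # PYPDFIUM2 is an assumed override hook, e.g. PYPDFIUM2=echo for a dry run.
    "${PYPDFIUM2:-pypdfium2}" attachments "$pdf" list
    "${PYPDFIUM2:-pypdfium2}" attachments "$pdf" extract --output-dir "$outdir"
}
```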

Image Extractor

$ pypdfium2 extract-images --help
usage: pypdfium2 extract-images [-h] [--password PASSWORD] [--pages PAGES]
                                --output-dir OUTPUT_DIR
                                [--max-depth MAX_DEPTH] [--use-bitmap]
                                [--format FORMAT] [--render]
                                [--scale-to-original | --no-scale-to-original]
                                input

Extract images

positional arguments:
  input                 Input PDF document

options:
  -h, --help            show this help message and exit
  --password PASSWORD   A password to unlock the PDF, if encrypted
  --pages PAGES         Page numbers and ranges to include
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Output directory to take the extracted images
  --max-depth MAX_DEPTH
                        Maximum recursion depth to consider when looking for pageobjects.
  --use-bitmap          Enforce the use of bitmaps rather than attempting a smart extraction of the image.
  --format FORMAT       Image format to use when saving bitmaps. (Fallback if doing smart extraction.)
  --render              When --use-bitmap is given, whether to get rendered bitmaps, taking masks and transform matrices into account.
  --scale-to-original, --no-scale-to-original
                        When --use-bitmap --render is given, whether to scale the image so it is rendered at its native resolution, or close to that. This should improve output quality. The default is True, but you may opt out.

Text Extractor

$ pypdfium2 extract-text --help
usage: pypdfium2 extract-text [-h] [--password PASSWORD] [--pages PAGES]
                              [--strategy {range,bounded}]
                              input

Extract text

Note that PDFium outputs CRLF (\r\n) style line breaks.
This may be undesirable or confusing in some situations, e.g. when processing the output with an (unaware) parser on the command line.
If this is an issue, run e.g. `dos2unix` on the output, or use the Python API.

positional arguments:
  input                 Input PDF document

options:
  -h, --help            show this help message and exit
  --password PASSWORD   A password to unlock the PDF, if encrypted
  --pages PAGES         Page numbers and ranges to include
  --strategy {range,bounded}
                        PDFium text extraction strategy (range, bounded).
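
The CRLF note above can also be handled inside a pipeline when `dos2unix` is not available. A minimal sketch using the standard `tr` utility follows. The function name `strip_cr` is hypothetical, and it removes every carriage return, not only those that precede a line feed.

```shell
# Delete all carriage returns from standard input, turning CRLF into LF.
strip_cr() {
    tr -d '\r'
}

# Demonstration on a stand-in for the tool's output (no PDF needed).
printf 'first line\r\nsecond line\r\n' | strip_cr
```

With a real document this would be used as `pypdfium2 extract-text input.pdf | strip_cr`, where `input.pdf` is a placeholder name.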

Image Converter

$ pypdfium2 imgtopdf --help
usage: pypdfium2 imgtopdf [-h] --output OUTPUT [--inline] images [images ...]

Convert images to PDF

positional arguments:
  images                Input images

options:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        Target path for the new PDF
  --inline              If JPEG, whether to use PDFium's inline loading function.

Pageobjects Info

$ pypdfium2 pageobjects --help
usage: pypdfium2 pageobjects [-h] [--password PASSWORD] [--pages PAGES]
                             [--n-digits N_DIGITS] [--filter T [T ...]]
                             [--max-depth MAX_DEPTH]
                             [--info {pos,imginfo} [{pos,imginfo} ...]]
                             input

Print info on pageobjects

positional arguments:
  input                 Input PDF document

options:
  -h, --help            show this help message and exit
  --password PASSWORD   A password to unlock the PDF, if encrypted
  --pages PAGES         Page numbers and ranges to include
  --n-digits N_DIGITS   Number of digits to which coordinates/sizes shall be rounded
  --filter T [T ...]    Object types to include. Choices: ['?', 'text', 'path', 'image', 'shading', 'form']
  --max-depth MAX_DEPTH
                        Maximum recursion depth to consider when descending into Form XObjects.
  --info {pos,imginfo} [{pos,imginfo} ...]
                        Object details to show.

Document Info

$ pypdfium2 pdfinfo --help
usage: pypdfium2 pdfinfo [-h] [--password PASSWORD] [--pages PAGES]
                         [--n-digits N_DIGITS]
                         input

Print info on document and pages

positional arguments:
  input                Input PDF document

options:
  -h, --help           show this help message and exit
  --password PASSWORD  A password to unlock the PDF, if encrypted
  --pages PAGES        Page numbers and ranges to include
  --n-digits N_DIGITS  Number of digits to which coordinates/sizes shall be rounded

Renderer

$ pypdfium2 render --help
usage: pypdfium2 render [-h] [--password PASSWORD] [--pages PAGES] --output
                        OUTPUT [--prefix PREFIX] [--format FORMAT]
                        [--engine ENGINE_CLS] [--scale SCALE]
                        [--rotation {0,90,180,270}] [--fill-color C C C C]
                        [--optimize-mode {lcd,print}] [--crop C C C C]
                        [--draw-annots | --no-draw-annots]
                        [--draw-forms | --no-draw-forms]
                        [--no-antialias {text,image,path} [{text,image,path} ...]]
                        [--force-halftone]
                        [--bitmap-maker {native,foreign,foreign_packed,foreign_simple}]
                        [--grayscale] [--byteorder REV_BYTEORDER]
                        [--x-channel | --no-x-channel]
                        [--maybe-alpha | --no-maybe-alpha] [--linear [LINEAR]]
                        [--processes PROCESSES]
                        [--parallel-strategy {spawn,forkserver,fork}]
                        [--parallel-lib {mp,ft}] [--parallel-map PARALLEL_MAP]
                        [--sample-theme] [--path-fill C C C C]
                        [--path-stroke C C C C] [--text-fill C C C C]
                        [--text-stroke C C C C] [--fill-to-stroke]
                        [--invert-lightness] [--exclude-images]
                        input

Rasterize pages

positional arguments:
  input                 Input PDF document

options:
  -h, --help            show this help message and exit
  --password PASSWORD   A password to unlock the PDF, if encrypted
  --pages PAGES         Page numbers and ranges to include
  --output OUTPUT, -o OUTPUT
                        Output directory where the serially numbered images shall be placed.
  --prefix PREFIX       Custom prefix for the images. Defaults to the input filename's stem.
  --format FORMAT, -f FORMAT
                        The image format to use (default: conditional).
  --engine ENGINE_CLS   The saver engine to use ('pil', 'numpy+pil', 'numpy+cv2')
  --scale SCALE         Define the resolution of the output images. By default, one PDF point (1/72in) is rendered to 1x1 pixel. This factor scales the number of pixels that represent one point.
  --rotation {0,90,180,270}
                        Rotate pages by 90, 180 or 270 degrees.
  --fill-color C C C C  Color the bitmap will be filled with before rendering. Shall be given in RGBA format as a sequence of integers ranging from 0 to 255. Defaults to white.
  --optimize-mode {lcd,print}
                        The rendering optimisation mode. None if not given.
  --crop C C C C        Amount to crop from (left, bottom, right, top).
  --draw-annots, --no-draw-annots
                        Whether annotations may be shown (default: true).
  --draw-forms, --no-draw-forms
                        Whether forms may be shown (default: true).
  --no-antialias {text,image,path} [{text,image,path} ...]
                        Item types that shall not be smoothed.
  --force-halftone      Always use halftone for image stretching.

Bitmap options:
  Bitmap config, including pixel format.

  --bitmap-maker {native,foreign,foreign_packed,foreign_simple}
                        The bitmap maker to use.
  --grayscale           Whether to render in grayscale mode (no colors).
  --byteorder REV_BYTEORDER
                        Whether to use BGR or RGB byteorder (default: conditional).
  --x-channel, --no-x-channel
                        Whether to prefer BGRx/RGBx over BGR/RGB (default: conditional).
  --maybe-alpha, --no-maybe-alpha
                        Whether to use BGRA if page content has transparency. Note, this makes format selection page-dependent. As this behavior can be confusing, it is not currently the default, but recommended for performance in these cases.

Parallelization:
  Options for rendering with multiple processes.

  --linear [LINEAR]     Render non-parallel if page count is less than or equal to the specified value (default: 4). If this flag is given without a value, then render linear regardless of document length.
  --processes PROCESSES
                        The maximum number of parallel rendering processes. Defaults to the number of CPU cores.
  --parallel-strategy {spawn,forkserver,fork}
                        The process start method to use. ('fork' is discouraged due to stability issues.)
  --parallel-lib {mp,ft}
                        The parallelization module to use (mp = multiprocessing, ft = concurrent.futures).
  --parallel-map PARALLEL_MAP
                        The map function to use (backend specific, the default is an iterative map).

Flat color scheme:
  Options for using pdfium's color scheme renderer. Note that this may flatten different colors into one, so the usability of this is limited. Alternatively, consider post-processing with lightness inversion (see below).

  --sample-theme        Use a dark background sample theme as base. Explicit color params override selectively.
  --path-fill C C C C
  --path-stroke C C C C
  --text-fill C C C C
  --text-stroke C C C C
  --fill-to-stroke      When rendering with custom color scheme, only draw borders around fill areas using the `path_stroke` color, instead of filling with the `path_fill` color. This is actually recommended, since with a single fill color for paths the boundaries of adjacent fill paths are less visible.

Post processing:
  Options to post-process rendered images. Note, this may have a strongly negative impact on performance.

  --invert-lightness    Invert lightness using the HLS color space (e.g. white<->black, dark_blue<->light_blue). The intent is to achieve a dark theme for documents with light background, while providing better visual results than classical color inversion or a flat pdfium color scheme. However, note that --optimize-mode lcd is not recommended when inverting lightness.
  --exclude-images      Whether to exclude PDF images from lightness inversion.
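
Since `--scale` counts pixels per PDF point and a point is 1/72 inch, a target resolution in DPI corresponds to a scale of DPI / 72. The sketch below does that conversion and passes the result to `render`. The function names `dpi_to_scale` and `render_at_dpi` and the `PYPDFIUM2` override variable are hypothetical.

```shell
# Convert a target resolution in DPI to a --scale factor (72 DPI corresponds to scale 1).
dpi_to_scale() {
    # LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
    LC_ALL=C awk -v dpi="$1" 'BEGIN { printf "%.4f\n", dpi / 72 }'
}

# Hypothetical wrapper: render INPUT into OUTPUT_DIR at the given DPI.
# Usage: render_at_dpi INPUT.pdf OUTPUT_DIR DPI
render_at_dpi() {
    # PYPDFIUM2 is an assumed override hook, e.g. PYPDFIUM2=echo for a dry run.
    "${PYPDFIUM2:-pypdfium2}" render "$1" --output "$2" --scale "$(dpi_to_scale "$3")"
}
```

For example, `dpi_to_scale 144` yields `2.0000`, so each point is drawn as 2x2 pixels.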

Page Tiler

$ pypdfium2 tile --help
usage: pypdfium2 tile [-h] [--password PASSWORD] --output OUTPUT --rows ROWS
                      --cols COLS --width WIDTH --height HEIGHT [--unit UNIT]
                      input

Tile pages (N-up)

positional arguments:
  input                 Input PDF document

options:
  -h, --help            show this help message and exit
  --password PASSWORD   A password to unlock the PDF, if encrypted
  --output OUTPUT, -o OUTPUT
                        Target path for the new document
  --rows ROWS, -r ROWS  Number of rows (horizontal tiles)
  --cols COLS, -c COLS  Number of columns (vertical tiles)
  --width WIDTH         Target width
  --height HEIGHT       Target height
  --unit UNIT, -u UNIT  Unit for target width and height (pt, mm, cm, in)
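
As an illustration of how the required options combine, the sketch below tiles a document onto A4 sheets (210 x 297 mm). The function name `tile_on_a4`, the `PYPDFIUM2` override variable and the file names are hypothetical. Only flags listed in the help output above are used.

```shell
# Hypothetical wrapper: place ROWS x COLS source pages on each A4 sheet.
# Usage: tile_on_a4 INPUT.pdf OUTPUT.pdf ROWS COLS
tile_on_a4() {
    # PYPDFIUM2 is an assumed override hook, e.g. PYPDFIUM2=echo for a dry run.
    "${PYPDFIUM2:-pypdfium2}" tile "$1" --output "$2" --rows "$3" --cols "$4" \
        --width 210 --height 297 --unit mm
}
```

A call such as `tile_on_a4 input.pdf tiled.pdf 2 2` would request four source pages per sheet.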

TOC Reader

$ pypdfium2 toc --help
usage: pypdfium2 toc [-h] [--password PASSWORD] [--n-digits N_DIGITS]
                     [--max-depth MAX_DEPTH]
                     input

Print table of contents

positional arguments:
  input                 Input PDF document

options:
  -h, --help            show this help message and exit
  --password PASSWORD   A password to unlock the PDF, if encrypted
  --n-digits N_DIGITS   Number of digits to which coordinates/sizes shall be rounded
  --max-depth MAX_DEPTH
                        Maximum recursion depth to consider when parsing the table of contents