pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

8,362

775

Top Related Projects

camelot

3,457

A Python library to extract tabular data from PDFs

pdfminer

5,299

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PyMuPDF

7,705

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

pdfminer.six

6,743

Community maintained fork of pdfminer - we fathom PDF

pypdf

9,443

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

tabula

7,198

Tabula is a tool for liberating data tables trapped inside PDF files

Quick Overview

PDFPlumber is a Python library for extracting information from PDF files. It goes beyond simple text extraction by providing tools to analyze the structure, layout, and content of PDFs, including tables, images, and form fields.

Pros

Powerful table extraction capabilities
Ability to extract text with precise positioning information
Support for extracting images and form fields
Detailed documentation and examples

Cons

Can be slower than some other PDF libraries for large documents
May struggle with complex layouts or poorly formatted PDFs
Requires additional dependencies for image extraction
Limited support for encrypted PDFs

Code Examples

Extracting all text from a PDF:

import pdfplumber

with pdfplumber.open('example.pdf') as pdf:
    # Join pages with newlines; "or ''" guards against pages that yield no text
    text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
    print(text)

Extracting tables from a specific page:

import pdfplumber

with pdfplumber.open('example.pdf') as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)
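
The rows printed above can also be written out as CSV. The following is a minimal sketch using only the standard library; `table_to_csv` and `sample_table` are hypothetical names, and the sample data stands in for one table returned by `page.extract_tables()`, assuming each table is a list of rows and that empty cells may come back as `None`.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows) to CSV text, blanking out None cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical stand-in for one entry of page.extract_tables()
sample_table = [["Name", "Qty"], ["Widget", "3"], ["Gadget", None]]
print(table_to_csv(sample_table))
```

Building the CSV in memory keeps the helper independent of the filesystem; the returned string can then be written wherever it is needed.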

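Extracting images from a PDF:

import pdfplumber

with pdfplumber.open('example.pdf') as pdf:
    for page in pdf.pages:
        for index, image in enumerate(page.images):
            # The raw stream bytes are not guaranteed to be valid PNG data,
            # so render the region of the page that the image occupies instead.
            bbox = (image['x0'], image['top'], image['x1'], image['bottom'])
            region = page.crop(bbox, strict=False)
            region.to_image(resolution=150).save(f'page{page.page_number}_image{index}.png')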

Getting Started

Install PDFPlumber:
```
pip install pdfplumber
```

Basic usage:

import pdfplumber

with pdfplumber.open('example.pdf') as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())

For more advanced usage, refer to the official documentation.

Competitor Comparisons

camelot

3,457

A Python library to extract tabular data from PDFs

Pros of Camelot

Often handles complex table structures well (text-based PDFs only; Camelot does not process scanned documents)
Offers both stream-based and lattice-based extraction methods
Provides a command-line interface for quick extractions

Cons of Camelot

Slower performance compared to PDFPlumber
More complex setup and dependencies
Less flexible for non-tabular data extraction

Code Comparison

PDFPlumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

Camelot:

import camelot

tables = camelot.read_pdf("example.pdf")
table = tables[0].df

Both libraries offer straightforward ways to extract tables from PDFs, but Camelot provides more options for fine-tuning the extraction process. PDFPlumber's approach is simpler and more intuitive for basic use cases, while Camelot offers more advanced features for complex table structures.

PDFPlumber is generally faster and easier to set up, making it a good choice for simpler PDFs or when processing speed is a priority. Camelot excels at more complex table layouts, at the cost of increased complexity and slower performance. Like pdfplumber, it works on text-based rather than scanned PDFs.

pdfminer

5,299

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Pros of pdfminer

More established and mature project with a longer history
Offers lower-level control and flexibility for advanced users
Supports a wider range of PDF features and structures

Cons of pdfminer

Less user-friendly and requires more setup and configuration
Documentation can be sparse and outdated in some areas
Performance may be slower for certain operations compared to pdfplumber

Code Comparison

pdfminer:

from pdfminer.high_level import extract_text

text = extract_text('document.pdf')
print(text)

pdfplumber:

import pdfplumber

with pdfplumber.open('document.pdf') as pdf:
    text = pdf.pages[0].extract_text()
    print(text)

Both libraries offer text extraction capabilities, but pdfplumber provides a more straightforward API for common tasks. pdfminer requires additional setup for more complex operations, while pdfplumber abstracts away some of the complexity.

pdfplumber is generally easier to use for beginners and offers convenient methods for extracting tables and working with page layouts. pdfminer, on the other hand, provides more granular control over the PDF parsing process, making it suitable for advanced use cases and custom implementations.

PyMuPDF

7,705

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Pros of PyMuPDF

Faster performance, especially for large PDFs
More comprehensive PDF manipulation capabilities (editing, merging, etc.)
Better support for complex PDF structures and annotations

Cons of PyMuPDF

Steeper learning curve due to more complex API
Less focused on text extraction, which may require more code for simple tasks
Larger library size and more dependencies

Code Comparison

PyMuPDF text extraction:

import fitz
doc = fitz.open("example.pdf")
text = ""
for page in doc:
    text += page.get_text()

pdfplumber text extraction:

import pdfplumber
with pdfplumber.open("example.pdf") as pdf:
    text = ""
    for page in pdf.pages:
        text += page.extract_text()

Both libraries offer efficient ways to extract text from PDFs, but PyMuPDF generally provides faster performance for larger documents. pdfplumber, on the other hand, offers a more straightforward API for simple text extraction tasks. PyMuPDF excels in comprehensive PDF manipulation, while pdfplumber focuses on data extraction and analysis. The choice between the two depends on the specific requirements of your project, balancing between performance, ease of use, and additional PDF processing needs.

pdfminer.six

6,743

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

More comprehensive and lower-level PDF parsing capabilities
Better support for complex PDF structures and encodings
Wider range of output formats (e.g., HTML, XML, Tagged PDF)

Cons of pdfminer.six

Steeper learning curve and more complex API
Slower performance for simple text extraction tasks
Less user-friendly documentation and examples

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_text_to_fp
from io import StringIO

output_string = StringIO()
with open('example.pdf', 'rb') as fin:
    extract_text_to_fp(fin, output_string)
text = output_string.getvalue().strip()

pdfplumber:

import pdfplumber

with pdfplumber.open('example.pdf') as pdf:
    text = ''
    for page in pdf.pages:
        text += page.extract_text()

pdfplumber offers a more straightforward API for basic text extraction, making it easier to use for simple tasks. However, pdfminer.six provides more control and flexibility for complex PDF parsing scenarios, albeit with a more verbose syntax.

pypdf

9,443

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Pros of pypdf

Lightweight and focused on basic PDF operations like merging, splitting, and extracting text
Faster processing for simple tasks due to its streamlined design
More mature project with a larger user base and longer development history

Cons of pypdf

Limited capabilities for complex text extraction and layout analysis
Less accurate in handling complex PDF structures or heavily formatted documents
Fewer advanced features for data extraction and analysis

Code Comparison

pypdf:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)

pdfplumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

Both libraries offer similar basic functionality for text extraction, but pdfplumber provides more advanced features for complex documents and layout analysis. pypdf is better suited for simple PDF operations, while pdfplumber excels in detailed text and data extraction from PDFs with complex structures or formatting.

tabula

7,198

Tabula is a tool for liberating data tables trapped inside PDF files

Pros of Tabula

User-friendly GUI for non-programmers
Supports multiple output formats (CSV, TSV, JSON)
Can handle complex table structures effectively

Cons of Tabula

Limited to extracting tabular data only
Requires Java runtime environment
Less flexible for programmatic integration

Code Comparison

PDFPlumber:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

Tabula (using tabula-py wrapper):

import tabula

tables = tabula.read_pdf("document.pdf", pages="all")

PDFPlumber offers more granular control over PDF parsing and extraction, while Tabula focuses specifically on table extraction with a simpler API. PDFPlumber allows for more advanced text and layout analysis, making it suitable for a wider range of PDF processing tasks beyond just table extraction.

README

pdfplumber

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer.six.

Currently tested on Python 3.8, 3.9, 3.10, 3.11.

Translations of this document are available in: Chinese (by @hbh112233abc).

To report a bug or request a feature, please file an issue. To ask a question or request assistance with a specific PDF, please use the discussions forum.

Installation
Command line interface
Python library
Visual debugging
Extracting text
Extracting tables
Extracting form values
Demonstrations
Comparison to other libraries
Acknowledgments / Contributors
Contributing

Installation

pip install pdfplumber

Command line interface

Basic example

curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf
pdfplumber background-checks.pdf > background-checks.csv

The output will be a CSV containing info about every character, line, and rectangle in the PDF.

Options

Argument	Description
`--format [format]`	`csv`, `json`, or `text`. The `csv` and `json` formats return information about each object. Of those two, the `json` format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes. The `text` option returns a plain-text representation of the PDF, using `Page.extract_text(layout=True)`.
`--pages [list of pages]`	A space-delimited, `1`-indexed list of pages or hyphenated page ranges. E.g., `1, 11-15`, which would return data for pages 1, 11, 12, 13, 14, and 15.
`--types [list of object types to extract]`	Choices are `char`, `rect`, `line`, `curve`, `image`, `annot`, et cetera. Defaults to all available.
`--laparams`	A JSON-formatted string (e.g., `'{"detect_vertical": true}'`) to pass to `pdfplumber.open(..., laparams=...)`.
`--precision [integer]`	The number of decimal places to round floating-point numbers. Defaults to no rounding.
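
As an illustration of the `--pages` format described above, here is a minimal sketch of expanding page tokens into individual page numbers. `expand_pages` is a hypothetical helper, not pdfplumber's own parser, and it assumes the shell has already split the argument into separate tokens.

```python
def expand_pages(specs):
    """Expand 1-indexed page tokens such as ["1", "11-15"] into page numbers."""
    pages = []
    for spec in specs:
        spec = spec.strip(", ")  # tolerate stray commas or spaces around a token
        if "-" in spec:
            start, end = (int(part) for part in spec.split("-"))
            pages.extend(range(start, end + 1))  # hyphenated ranges are inclusive
        else:
            pages.append(int(spec))
    return pages


print(expand_pages(["1", "11-15"]))
```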

Python library

Basic example

import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])

Loading a PDF

To start working with a PDF, call pdfplumber.open(x), where x can be a:

path to your PDF file
file object, loaded as bytes
file-like object, loaded as bytes

The open method returns an instance of the pdfplumber.PDF class.

To load a password-protected PDF, pass the password keyword argument, e.g., pdfplumber.open("file.pdf", password = "test").

To set layout analysis parameters to pdfminer.six's layout engine, pass the laparams keyword argument, e.g., pdfplumber.open("file.pdf", laparams = { "line_overlap": 0.7 }).

To pre-normalize Unicode text, pass unicode_norm=..., where ... is one of the four Unicode normalization forms: "NFC", "NFD", "NFKC", or "NFKD".
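
To see what the four normalization forms do to text, the sketch below uses Python's standard `unicodedata` module on two sample strings. It only illustrates the forms themselves; `normalize_all` is a hypothetical helper and says nothing about how pdfplumber applies `unicode_norm` internally.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(text):
    """Return the text normalized under each of the four Unicode forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


ligature = "\ufb01"      # the single-character "fi" ligature
decomposed = "e\u0301"   # "e" followed by a combining acute accent

for name, sample in (("ligature", ligature), ("decomposed", decomposed)):
    for form, result in normalize_all(sample).items():
        print(name, form, [hex(ord(ch)) for ch in result])
```

The compatibility forms (`NFKC`, `NFKD`) split the ligature into separate letters, while the canonical forms (`NFC`, `NFD`) leave it intact and only compose or decompose accents.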

Invalid metadata values are treated as a warning by default. If that is not intended, pass strict_metadata=True to the open method and pdfplumber.open will raise an exception if it is unable to parse the metadata.

The `pdfplumber.PDF` class

The top-level pdfplumber.PDF class represents a single PDF and has two main properties:

Property	Description
`.metadata`	A dictionary of metadata key/value pairs, drawn from the PDF's `Info` trailers. Typically includes "CreationDate," "ModDate," "Producer," et cetera.
`.pages`	A list containing one `pdfplumber.Page` instance per page loaded.

... and also has the following method:

Method	Description
`.close()`	Calling this method calls `Page.close()` on each page, and also closes the file stream (except in cases when the stream is external, i.e., already opened and passed directly to `pdfplumber`).

The `pdfplumber.Page` class

The pdfplumber.Page class is at the core of pdfplumber. Most things you'll do with pdfplumber will revolve around this class. It has these main properties:

Property	Description
`.page_number`	The sequential page number, starting with `1` for the first page, `2` for the second, and so on.
`.width`	The page's width.
`.height`	The page's height.
`.objects` / `.chars` / `.lines` / `.rects` / `.curves` / `.images`	Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page. For more detail, see "Objects" below.

... and these main methods:

Method	Description
`.crop(bounding_box, relative=False, strict=True)`	Returns a version of the page cropped to the bounding box, which should be expressed as a 4-tuple with the values `(x0, top, x1, bottom)`. Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. If `relative=True`, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See Issue #245 for a visual example and explanation.) When `strict=True` (the default), the crop's bounding box must fall entirely within the page's bounding box.
`.within_bbox(bounding_box, relative=False, strict=True)`	Similar to `.crop`, but only retains objects that fall entirely within the bounding box.
`.outside_bbox(bounding_box, relative=False, strict=True)`	Similar to `.crop` and `.within_bbox`, but only retains objects that fall entirely outside the bounding box.
`.filter(test_function)`	Returns a version of the page with only the `.objects` for which `test_function(obj)` returns `True`.
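
The difference between `.crop`, `.within_bbox`, and `.outside_bbox` can be sketched on plain dictionaries. This is a simplified illustration of the retention rules described in the table, not pdfplumber's implementation; the helper names are hypothetical, and the treatment of objects that merely touch the box edge is an assumption.

```python
def overlaps(obj, bbox):
    """True if the object falls at least partly within the bounding box."""
    x0, top, x1, bottom = bbox
    return obj["x0"] < x1 and obj["x1"] > x0 and obj["top"] < bottom and obj["bottom"] > top


def is_within(obj, bbox):
    """True if the object falls entirely within the bounding box."""
    x0, top, x1, bottom = bbox
    return obj["x0"] >= x0 and obj["x1"] <= x1 and obj["top"] >= top and obj["bottom"] <= bottom


def clip(obj, bbox):
    """Slice the object's dimensions to fit the bounding box."""
    x0, top, x1, bottom = bbox
    return {**obj, "x0": max(obj["x0"], x0), "x1": min(obj["x1"], x1),
            "top": max(obj["top"], top), "bottom": min(obj["bottom"], bottom)}


def crop(objects, bbox):
    return [clip(o, bbox) for o in objects if overlaps(o, bbox)]


def within_bbox(objects, bbox):
    return [o for o in objects if is_within(o, bbox)]


def outside_bbox(objects, bbox):
    return [o for o in objects if not overlaps(o, bbox)]


inside = {"name": "inside", "x0": 10, "top": 10, "x1": 20, "bottom": 20}
straddling = {"name": "straddling", "x0": 40, "top": 10, "x1": 60, "bottom": 20}
outside = {"name": "outside", "x0": 70, "top": 70, "x1": 80, "bottom": 80}
objects = [inside, straddling, outside]
box = (0, 0, 50, 50)

print([o["name"] for o in crop(objects, box)])
print([o["name"] for o in within_bbox(objects, box)])
print([o["name"] for o in outside_bbox(objects, box)])
```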

... and also has the following method:

Method	Description
`.close()`	By default, `Page` objects cache their layout and object information to avoid having to reprocess it. When parsing large PDFs, however, these cached properties can require a lot of memory. You can use this method to flush the cache and release the memory.
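
For large PDFs, one way to apply this is to release each page's cache as soon as its text has been read. The following is a minimal sketch; `iter_page_text` is a hypothetical helper, and the path passed to it is assumed to point at an existing PDF.

```python
def iter_page_text(path):
    """Yield (page_number, text) pairs, flushing each page's cache after use."""
    import pdfplumber  # imported lazily so the helper can be defined without the library

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            page.close()  # release the cached layout and objects for this page
            yield page.page_number, text
```

A caller would iterate over it, for example `for number, text in iter_page_text("path/to/file.pdf"): ...`, so that only one page's objects need to be cached at a time.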

Additional methods are described in the sections below:

Visual debugging
Extracting text
Extracting tables

Objects

Each instance of pdfplumber.PDF and pdfplumber.Page provides access to several types of PDF objects, all derived from pdfminer.six PDF parsing. The following properties each return a Python list of the matching objects:

.chars, each representing a single text character.
.lines, each representing a single 1-dimensional line.
.rects, each representing a single 2-dimensional rectangle.
.curves, each representing any series of connected points that pdfminer.six does not recognize as a line or rectangle.
.images, each representing an image.
.annots, each representing a single PDF annotation (cf. Section 8.4 of the official PDF specification for details).
.hyperlinks, each representing a single PDF annotation of the subtype Link and having a URI action attribute.

Each object is represented as a simple Python dict, with the following properties:

`char` properties

Property	Description
`page_number`	Page number on which this character was found.
`text`	E.g., "z", or "Z" or " ".
`fontname`	Name of the character's font face.
`size`	Font size.
`adv`	Equal to text width * the font size * scaling factor.
`upright`	Whether the character is upright.
`height`	Height of the character.
`width`	Width of the character.
`x0`	Distance of left side of character from left side of page.
`x1`	Distance of right side of character from left side of page.
`y0`	Distance of bottom of character from bottom of page.
`y1`	Distance of top of character from bottom of page.
`top`	Distance of top of character from top of page.
`bottom`	Distance of bottom of the character from top of page.
`doctop`	Distance of top of character from top of document.
`matrix`	The "current transformation matrix" for this character. (See below for details.)
`mcid`	The marked content section ID for this character if any (otherwise `None`). Experimental attribute.
`tag`	The marked content section tag for this character if any (otherwise `None`). Experimental attribute.
`ncs`	TKTK
`stroking_pattern`	TKTK
`non_stroking_pattern`	TKTK
`stroking_color`	The color of the character's outline (i.e., stroke). See docs/colors.md for details.
`non_stroking_color`	The character's interior color. See docs/colors.md for details.
`object_type`	"char"
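
The table lists two vertical coordinate systems: `y0`/`y1` are measured from the bottom of the page, while `top`/`bottom`/`doctop` are measured from the top. The sketch below shows how they relate, assuming an unrotated page whose origin is at the bottom-left and assuming `doctop` adds the combined height of the preceding pages. `to_top_based` and its parameters are hypothetical names.

```python
def to_top_based(y0, y1, page_height, pages_above_height=0.0):
    """Convert bottom-origin y0/y1 into top-origin top/bottom/doctop values."""
    top = page_height - y1       # the object's top edge is its highest point
    bottom = page_height - y0    # the object's bottom edge is its lowest point
    doctop = pages_above_height + top
    return {"top": top, "bottom": bottom, "doctop": doctop}


# A character near the top of a 792-point-tall page
print(to_top_based(y0=700, y1=712, page_height=792))
# The same position on the second page of a document with equal page heights
print(to_top_based(y0=700, y1=712, page_height=792, pages_above_height=792))
```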

Note: A character's matrix property represents the "current transformation matrix," as described in Section 4.2.2 of the PDF Reference (6th Ed.). The matrix controls the character's scale, skew, and positional translation. Rotation is a combination of scale and skew, but in most cases can be considered equal to the x-axis skew. The pdfplumber.ctm submodule defines a class, CTM, that assists with these calculations. For instance:

from pdfplumber.ctm import CTM
my_char = pdf.pages[0].chars[3]
my_char_ctm = CTM(*my_char["matrix"])
my_char_rotation = my_char_ctm.skew_x
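
For intuition about how a rotation angle can be read from such a six-value matrix, here is a standalone sketch using a textbook decomposition of a 2D affine matrix `(a, b, c, d, e, f)`. It is not the `CTM` class's implementation and may differ from what `skew_x` reports; `rotation_degrees` and `rotation_matrix` are hypothetical helpers.

```python
import math


def rotation_degrees(matrix):
    """Estimate the rotation encoded in an affine matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    return math.degrees(math.atan2(b, a))


def rotation_matrix(degrees):
    """Build a pure-rotation matrix with no scaling or translation."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r), -math.sin(r), math.cos(r), 0.0, 0.0)


print(rotation_degrees((1, 0, 0, 1, 0, 0)))      # identity matrix: no rotation
print(rotation_degrees(rotation_matrix(90)))     # a quarter turn
print(rotation_degrees((2, 0, 0, 2, 5, 5)))      # uniform scale plus translation: no rotation
```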

`line` properties

Property	Description
`page_number`	Page number on which this line was found.
`height`	Height of line.
`width`	Width of line.
`x0`	Distance of left-side extremity from left side of page.
`x1`	Distance of right-side extremity from left side of page.
`y0`	Distance of bottom extremity from bottom of page.
`y1`	Distance of top extremity from bottom of page.
`top`	Distance of top of line from top of page.
`bottom`	Distance of bottom of the line from top of page.
`doctop`	Distance of top of line from top of document.
`linewidth`	Thickness of line.
`stroking_color`	The color of the line. See docs/colors.md for details.
`non_stroking_color`	The non-stroking color specified for the line's path. See docs/colors.md for details.
`mcid`	The marked content section ID for this line if any (otherwise `None`). Experimental attribute.
`tag`	The marked content section tag for this line if any (otherwise `None`). Experimental attribute.
`object_type`	"line"

`rect` properties

Property	Description
`page_number`	Page number on which this rectangle was found.
`height`	Height of rectangle.
`width`	Width of rectangle.
`x0`	Distance of left side of rectangle from left side of page.
`x1`	Distance of right side of rectangle from left side of page.
`y0`	Distance of bottom of rectangle from bottom of page.
`y1`	Distance of top of rectangle from bottom of page.
`top`	Distance of top of rectangle from top of page.
`bottom`	Distance of bottom of the rectangle from top of page.
`doctop`	Distance of top of rectangle from top of document.
`linewidth`	Thickness of line.
`stroking_color`	The color of the rectangle's outline. See docs/colors.md for details.
`non_stroking_color`	The rectangle's fill color. See docs/colors.md for details.
`mcid`	The marked content section ID for this rect if any (otherwise `None`). Experimental attribute.
`tag`	The marked content section tag for this rect if any (otherwise `None`). Experimental attribute.
`object_type`	"rect"

`curve` properties

Property	Description
`page_number`	Page number on which this curve was found.
`pts`	A list of `(x, top)` tuples indicating the points on the curve.
`path`	A list of `(cmd, (x, top))` tuples describing the full path description, including (for example) control points used in Bezier curves.
`height`	Height of curve's bounding box.
`width`	Width of curve's bounding box.
`x0`	Distance of curve's left-most point from left side of page.
`x1`	Distance of curve's right-most point from left side of the page.
`y0`	Distance of curve's lowest point from bottom of page.
`y1`	Distance of curve's highest point from bottom of page.
`top`	Distance of curve's highest point from top of page.
`bottom`	Distance of curve's lowest point from top of page.
`doctop`	Distance of curve's highest point from top of document.
`linewidth`	Thickness of line.
`fill`	Whether the shape defined by the curve's path is filled.
`stroking_color`	The color of the curve's outline. See docs/colors.md for details.
`non_stroking_color`	The curve's fill color. See docs/colors.md for details.
`dash`	A `([dash_array], dash_phase)` tuple describing the curve's dash style. See Table 4.6 of the PDF specification for details.
`mcid`	The marked content section ID for this curve if any (otherwise `None`). Experimental attribute.
`tag`	The marked content section tag for this curve if any (otherwise `None`). Experimental attribute.
`object_type`	"curve"

Derived properties

Additionally, both pdfplumber.PDF and pdfplumber.Page provide access to several derived lists of objects: .rect_edges (which decomposes each rectangle into its four lines), .curve_edges (which does the same for curve objects), and .edges (which combines .rect_edges, .curve_edges, and .lines).
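
To make the idea behind `.rect_edges` concrete, the sketch below decomposes one rectangle dictionary into its four edges. `decompose_rect` is a hypothetical helper rather than pdfplumber's own function, and the `"orientation"` key is an assumption added for illustration.

```python
def decompose_rect(rect):
    """Split a rect dict into four line-like dicts: top, bottom, left, right."""
    x0, x1, top, bottom = rect["x0"], rect["x1"], rect["top"], rect["bottom"]
    return [
        {"x0": x0, "x1": x1, "top": top, "bottom": top, "orientation": "h"},        # top edge
        {"x0": x0, "x1": x1, "top": bottom, "bottom": bottom, "orientation": "h"},  # bottom edge
        {"x0": x0, "x1": x0, "top": top, "bottom": bottom, "orientation": "v"},     # left edge
        {"x0": x1, "x1": x1, "top": top, "bottom": bottom, "orientation": "v"},     # right edge
    ]


rect = {"x0": 10, "x1": 110, "top": 20, "bottom": 70}
for edge in decompose_rect(rect):
    print(edge)
```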

`image` properties

Note: Although the positioning and characteristics of image objects are available via pdfplumber, this library does not provide direct support for reconstructing image content. For that, please see this suggestion.

Property	Description
`page_number`	Page number on which the image was found.
`height`	Height of the image.
`width`	Width of the image.
`x0`	Distance of left side of the image from left side of page.
`x1`	Distance of right side of the image from left side of page.
`y0`	Distance of bottom of the image from bottom of page.
`y1`	Distance of top of the image from bottom of page.
`top`	Distance of top of the image from top of page.
`bottom`	Distance of bottom of the image from top of page.
`doctop`	Distance of top of the image from top of document.
`srcsize`	The image original dimensions, as a `(width, height)` tuple.
`colorspace`	Color domain of the image (e.g., RGB).
`bits`	The number of bits per color component; e.g., 8 corresponds to 256 possible values (0 to 255) for each color component (R, G, and B in an RGB color space).
`stream`	Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.
`imagemask`	A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."
`name`	"The name by which this image XObject is referenced in the XObject subdictionary of the current resource dictionary."
`mcid`	The marked content section ID for this image if any (otherwise `None`). Experimental attribute.
`tag`	The marked content section tag for this image if any (otherwise `None`). Experimental attribute.
`object_type`	"image"

Obtaining higher-level layout objects via `pdfminer.six`

If you pass the pdfminer.six-handling laparams parameter to pdfplumber.open(...), then each page's .objects dictionary will also contain pdfminer.six's higher-level layout objects, such as "textboxhorizontal".

Visual debugging

pdfplumber's visual debugging tools can be helpful in understanding the structure of a PDF and the objects that have been extracted from it.

Creating a `PageImage` with `.to_image()`

To turn any page (including cropped pages) into a PageImage object, call my_page.to_image(). You can optionally pass one of the following keyword arguments:

resolution: The desired number of pixels per inch. Default: 72. Type: int.
width: The desired image width in pixels. Default: unset, determined by resolution. Type: int.
height: The desired image height in pixels. Default: unset, determined by resolution. Type: int.
antialias: Whether to use antialiasing when creating the image. Setting to True creates images with less-jagged text and graphics, but with larger file sizes. Default: False. Type: bool.
force_mediabox: Use the page's .mediabox dimensions, rather than the .cropbox dimensions. Default: False. Type: bool.

For instance:

im = my_pdf.pages[0].to_image(resolution=150)
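
Because PDF dimensions are expressed in points (72 per inch), the `resolution` argument determines the pixel size of the rendered image. The sketch below shows the arithmetic only; `pixel_size` is a hypothetical helper, and the exact rounding pdfplumber applies is an assumption.

```python
def pixel_size(width_pt, height_pt, resolution=72):
    """Approximate pixel dimensions of a page rendered at the given pixels per inch."""
    scale = resolution / 72  # PDF points are 1/72 of an inch
    return round(width_pt * scale), round(height_pt * scale)


# A US Letter page (612 x 792 points) at the default and at resolution=150
print(pixel_size(612, 792))
print(pixel_size(612, 792, resolution=150))
```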

From a script or REPL, im.show() will open the image in your local image viewer. But PageImage objects also play nicely with Jupyter notebooks; they automatically render as cell outputs. For example:

Visual debugging in Jupyter

Note: .to_image(...) works as expected with Page.crop(...)/CroppedPage instances, but is unable to incorporate changes made via Page.filter(...)/FilteredPage instances.

Basic `PageImage` methods

Method	Description
`im.reset()`	Clears anything you've drawn so far.
`im.copy()`	Copies the image to a new `PageImage` object.
`im.show()`	Opens the image in your local image viewer.
`im.save(path_or_fileobject, format="PNG", quantize=True, colors=256, bits=8)`	Saves the annotated image as a PNG file. The default arguments quantize the image to a palette of 256 colors, saving the PNG with 8-bit color depth. You can disable quantization by passing `quantize=False` or adjust the size of the color palette by passing `colors=N`.

Drawing methods

You can pass explicit coordinates or any pdfplumber PDF object (e.g., char, line, rect) to these methods.

Single-object method	Bulk method	Description
`im.draw_line(line, stroke={color}, stroke_width=1)`	`im.draw_lines(list_of_lines, **kwargs)`	Draws a line from a `line`, `curve`, or a 2-tuple of 2-tuples (e.g., `((x, y), (x, y))`).
`im.draw_vline(location, stroke={color}, stroke_width=1)`	`im.draw_vlines(list_of_locations, **kwargs)`	Draws a vertical line at the x-coordinate indicated by `location`.
`im.draw_hline(location, stroke={color}, stroke_width=1)`	`im.draw_hlines(list_of_locations, **kwargs)`	Draws a horizontal line at the y-coordinate indicated by `location`.
`im.draw_rect(bbox_or_obj, fill={color}, stroke={color}, stroke_width=1)`	`im.draw_rects(list_of_rects, **kwargs)`	Draws a rectangle from a `rect`, `char`, etc., or 4-tuple bounding box.
`im.draw_circle(center_or_obj, radius=5, fill={color}, stroke={color})`	`im.draw_circles(list_of_circles, **kwargs)`	Draws a circle at `(x, y)` coordinate or at the center of a `char`, `rect`, etc.

Note: The methods above are built on Pillow's ImageDraw methods, but the parameters have been tweaked for consistency with SVG's fill/stroke/stroke_width nomenclature.

Visually debugging the table-finder

im.debug_tablefinder(table_settings={}) will return a version of the PageImage with the detected lines (in red), intersections (circles), and tables (light blue) overlaid.

Extracting text

pdfplumber can extract text from any given page (including cropped and derived pages). It can also attempt to preserve the layout of that text, as well as to identify the coordinates of words and search queries. Page objects can call the following text-extraction methods:

Method	Description
`.extract_text(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, layout=False, x_density=7.25, y_density=13, line_dir_render=None, char_dir_render=None, **kwargs)`	Collates all of the page's character objects into a single string. When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`. When `layout=True` (experimental feature): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. Passing `line_dir_render="ttb"/"btt"/"ltr"/"rtl"` and/or `char_dir_render="ttb"/"btt"/"ltr"/"rtl"` will output the lines/characters in a different direction than the default. All remaining `**kwargs` are passed to `.extract_words(...)` (see below), the first step in calculating the layout.
`.extract_text_simple(x_tolerance=3, y_tolerance=3)`	A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.
`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir="ttb", char_dir="ltr", line_dir_rotated="ttb", char_dir_rotated="ltr", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True, return_chars=False)`	Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` and where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are "ttb" (top-to-bottom), "btt" (bottom-to-top), "ltr" (left-to-right), and "rtl" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those attributes, and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at punctuations specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., split_at_punctuation='!"&\'()*+,.:;<=>?@[]^`{|}~'. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`). Passing `return_chars=True` will add, to each word dictionary, a list of its constituent characters, as a list in the `"chars"` field.
`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`	Experimental feature that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.
`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`	Experimental feature that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `"groups"` and `"chars"` to the return dicts). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `kwargs` are those you would pass to `.extract_text(layout=True, ...)`. Note: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page.
`.dedupe_chars(tolerance=1, extra_attrs=("fontname", "size"))`	Returns a version of the page with duplicate chars removed, where duplicates are chars sharing the same text, positioning (within `tolerance` x/y), and `extra_attrs` as other characters. (See Issue #71 to understand the motivation.)
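
The tolerance rule described for `.extract_words(...)` can be sketched in plain Python. This is only an illustration for upright, left-to-right text, not pdfplumber's actual implementation: the function name `group_chars_into_words` and the `sample_chars` data are hypothetical, and the sketch assumes the chars are already sorted in reading order.

```python
def group_chars_into_words(chars, x_tolerance=3, y_tolerance=3):
    """Group upright chars (dicts with text, x0, x1, doctop) into words.

    Hypothetical helper: assumes `chars` is already in reading order.
    """
    groups = []
    current = []
    for char in chars:
        if char["text"].isspace():
            # Blank chars separate words (the keep_blank_chars=False behavior).
            if current:
                groups.append(current)
                current = []
            continue
        if current:
            prev = current[-1]
            same_line = abs(char["doctop"] - prev["doctop"]) <= y_tolerance
            close_enough = char["x0"] - prev["x1"] <= x_tolerance
            if not (same_line and close_enough):
                groups.append(current)
                current = []
        current.append(char)
    if current:
        groups.append(current)
    # Each word carries its text and the extent of its chars.
    return [
        {
            "text": "".join(c["text"] for c in group),
            "x0": min(c["x0"] for c in group),
            "x1": max(c["x1"] for c in group),
            "doctop": min(c["doctop"] for c in group),
        }
        for group in groups
    ]


# Made-up chars: "Hi" and "yo" on one line, separated by a 10-point gap.
sample_chars = [
    {"text": "H", "x0": 0, "x1": 5, "doctop": 10},
    {"text": "i", "x0": 5, "x1": 10, "doctop": 10},
    {"text": "y", "x0": 20, "x1": 25, "doctop": 10},
    {"text": "o", "x0": 25, "x1": 30, "doctop": 10},
]
```

With the default `x_tolerance=3` the gap splits the chars into two words; raising `x_tolerance` to 10 merges them into one, which is why tuning this value matters for tightly or loosely spaced text.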

Extracting tables

pdfplumber's approach to table detection borrows heavily from Anssi Nurminen's master's thesis, and is inspired by Tabula. It works like this:

For any given PDF page, find the lines that are (a) explicitly defined and/or (b) implied by the alignment of words on the page.
Merge overlapping, or nearly-overlapping, lines.
Find the intersections of all those lines.
Find the most granular set of rectangles (i.e., cells) that use these intersections as their vertices.
Group contiguous cells into tables.
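
The intersection and cell-finding steps above can be sketched in a few lines of Python. This is a simplified illustration, not pdfplumber's code: the names `find_intersections` and `find_cells` and the edge tuples are assumptions made for the example, and the sketch only checks that all four corners of a cell are intersections (it skips merging lines and grouping cells into tables).

```python
from itertools import product


def find_intersections(v_edges, h_edges, tolerance=3):
    """Return the set of (x, y) points where vertical and horizontal edges meet.

    v_edges are (x, top, bottom) tuples; h_edges are (y, x0, x1) tuples.
    """
    points = set()
    for (x, top, bottom), (y, x0, x1) in product(v_edges, h_edges):
        crosses_horizontally = x0 - tolerance <= x <= x1 + tolerance
        crosses_vertically = top - tolerance <= y <= bottom + tolerance
        if crosses_horizontally and crosses_vertically:
            points.add((x, y))
    return points


def find_cells(points):
    """Return the smallest rectangles (x0, top, x1, bottom) with all corners in `points`."""
    cells = []
    for x0, top in sorted(points):
        # Candidate corners to the right of, and below, this top-left corner.
        rights = sorted(x for (x, y) in points if y == top and x > x0)
        belows = sorted(y for (x, y) in points if x == x0 and y > top)
        cell = None
        for bottom in belows:
            for x1 in rights:
                if (x1, bottom) in points:
                    cell = (x0, top, x1, bottom)
                    break
            if cell:
                break
        if cell:
            cells.append(cell)
    return cells


# A made-up 2x2 grid: three vertical and three horizontal rules.
v_edges = [(0, 0, 40), (50, 0, 40), (100, 0, 40)]
h_edges = [(0, 0, 100), (20, 0, 100), (40, 0, 100)]
grid_cells = find_cells(find_intersections(v_edges, h_edges))
```

For this grid, the nine intersections yield four cells, one per quadrant.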

Table-extraction methods

pdfplumber.Page objects can call the following table methods:

Method	Description
`.find_tables(table_settings={})`	Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.
`.find_table(table_settings={})`	Similar to `.find_tables(...)`, but returns the largest table on the page, as a `Table` object. If multiple tables have the same size (as measured by the number of cells), this method returns the table closest to the top of the page.
`.extract_tables(table_settings={})`	Returns the text extracted from all tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.
`.extract_table(table_settings={})`	Returns the text extracted from the largest table on the page (see `.find_table(...)` above), represented as a list of lists, with the structure `row -> cell`.
`.debug_tablefinder(table_settings={})`	Returns an instance of the `TableFinder` class, with access to the `.edges`, `.intersections`, `.cells`, and `.tables` properties.

For example:

pdf = pdfplumber.open("path/to/my.pdf")
page = pdf.pages[0]
page.extract_table()

Click here for a more detailed example.

Table-extraction settings

By default, extract_tables uses the page's vertical and horizontal lines (or rectangle edges) as cell-separators. But the method is highly customizable via the table_settings argument. The possible settings, and their defaults:

{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "text_*": …, # See below
}

Setting	Description
`"vertical_strategy"`	Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below.
`"horizontal_strategy"`	Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below.
`"explicit_vertical_lines"`	A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers (indicating the `x` coordinate of a line the full height of the page) or `line`/`rect`/`curve` objects.
`"explicit_horizontal_lines"`	A list of horizontal lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers (indicating the `y` coordinate of a line the full width of the page) or `line`/`rect`/`curve` objects.
`"snap_tolerance"`, `"snap_x_tolerance"`, `"snap_y_tolerance"`	Parallel lines within `snap_tolerance` points will be "snapped" to the same horizontal or vertical position.
`"join_tolerance"`, `"join_x_tolerance"`, `"join_y_tolerance"`	Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment.
`"edge_min_length"`	Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.
`"min_words_vertical"`	When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment.
`"min_words_horizontal"`	When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment.
`"intersection_tolerance"`, `"intersection_x_tolerance"`, `"intersection_y_tolerance"`	When combining edges into cells, orthogonal edges must be within `intersection_tolerance` points to be considered intersecting.
`"text_*"`	All settings prefixed with `text_` are then used when extracting text from each discovered table. All possible arguments to `Page.extract_text(...)` are also valid here.
`"text_x_tolerance"`, `"text_y_tolerance"`	These `text_`-prefixed settings also apply to the table-identification algorithm when the `text` strategy is used. I.e., when that algorithm searches for words, it will expect the individual letters in each word to be no more than `text_x_tolerance`/`text_y_tolerance` points apart.
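
To make the snap and join tolerances more concrete, here is a rough sketch of both ideas on bare numbers. The helper names `snap_positions` and `join_segments` are hypothetical, and pdfplumber's own implementation may differ in detail. The sketch assumes snapping moves nearby parallel lines to their average position, and that joining merges collinear segments whose ends lie within the tolerance.

```python
def snap_positions(positions, tolerance=3):
    """Map each position to the mean of its cluster of nearby positions.

    Positions are clustered when each is within `tolerance` of its neighbor.
    """
    if not positions:
        return {}
    ordered = sorted(set(positions))
    clusters = [[ordered[0]]]
    for position in ordered[1:]:
        if position - clusters[-1][-1] <= tolerance:
            clusters[-1].append(position)
        else:
            clusters.append([position])
    return {p: sum(cluster) / len(cluster) for cluster in clusters for p in cluster}


def join_segments(segments, tolerance=3):
    """Merge (start, end) segments on the same line whose ends are within `tolerance`."""
    joined = []
    for start, end in sorted(segments):
        if joined and start - joined[-1][1] <= tolerance:
            # Close enough to the previous segment: extend it.
            joined[-1] = (joined[-1][0], max(joined[-1][1], end))
        else:
            joined.append((start, end))
    return joined


# Three nearly aligned vertical lines snap together; the fourth stays put.
snapped = snap_positions([100, 101, 102, 200])

# A 2-point gap is bridged; a 10-point gap is not.
merged = join_segments([(0, 50), (52, 100), (110, 150)])
```

Larger tolerances make the table-finder more forgiving of sloppy rules, at the cost of possibly merging lines that were meant to be distinct.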

Table-extraction strategies

Both vertical_strategy and horizontal_strategy accept the following options:

Strategy	Description
`"lines"`	Use the page's graphical lines, including the sides of rectangle objects, as the borders of potential table-cells.
`"lines_strict"`	Use the page's graphical lines, but not the sides of rectangle objects, as the borders of potential table-cells.
`"text"`	For `vertical_strategy`: Deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For `horizontal_strategy`, the same but using the tops of words.
`"explicit"`	Only use the lines explicitly defined in `explicit_vertical_lines` / `explicit_horizontal_lines`.
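
As an illustration of choosing a strategy, the sketch below targets a table that has no ruling lines by switching both strategies to `"text"`. The constant `TEXT_TABLE_SETTINGS` and the helper `extract_borderless_table` are hypothetical names, and the tolerance values are simply the defaults listed above; treat it as a starting point for experimentation rather than recommended settings.

```python
# Settings for tables whose columns and rows are implied by word alignment only.
TEXT_TABLE_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
}


def extract_borderless_table(page, overrides=None):
    """Run extract_table with text-based strategies on a pdfplumber Page.

    `overrides` is an optional dict of settings merged over the defaults above.
    """
    settings = {**TEXT_TABLE_SETTINGS, **(overrides or {})}
    return page.extract_table(table_settings=settings)


# Typical use (requires pdfplumber and a real file):
#
#     import pdfplumber
#     with pdfplumber.open("path/to/my.pdf") as pdf:
#         rows = extract_borderless_table(pdf.pages[0], {"snap_tolerance": 5})
```

Mixing strategies is also possible, for example `"lines"` for one direction and `"text"` for the other, when a table has rules in only one direction.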

Notes

Often it's helpful to crop a page, via Page.crop(bounding_box), before trying to extract the table.
Table extraction for pdfplumber was radically redesigned for v0.5.0, and introduced breaking changes.
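
The cropping tip can be wrapped in a small helper, sketched here. The name `extract_table_in_region` is hypothetical, and the example assumes the bounding box is given as `(x0, top, x1, bottom)` in page coordinates.

```python
def extract_table_in_region(page, bbox, table_settings=None):
    """Crop a pdfplumber Page to `bbox`, then extract the largest table inside it.

    `bbox` is assumed to be (x0, top, x1, bottom).
    """
    cropped = page.crop(bbox)
    return cropped.extract_table(table_settings=table_settings or {})


# Typical use (requires pdfplumber and a real file):
#
#     import pdfplumber
#     with pdfplumber.open("path/to/my.pdf") as pdf:
#         rows = extract_table_in_region(pdf.pages[0], (0, 100, 600, 400))
```

Cropping first keeps headers, footers, and unrelated rules out of the table-finder's input, which often matters more than fine-tuning tolerances.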

Extracting form values

Sometimes PDF files can contain forms that include inputs that people can fill out and save. While values in form fields appear like other text in a PDF file, form data is handled differently. If you want the gory details, see page 671 of this specification.

pdfplumber doesn't have an interface for working with form data, but you can access it using pdfplumber's wrappers around pdfminer.

For example, this snippet will retrieve form field names and values and store them in a list.

import pdfplumber
from pdfplumber.utils.pdfinternals import resolve_and_decode, resolve

pdf = pdfplumber.open("document_with_form.pdf")

def parse_field_helper(form_data, field, prefix=None):
    """Append any PDF AcroForm field/value pairs in `field` to the provided `form_data` list.

    If `field` has child fields, those are parsed recursively.
    """
    resolved_field = field.resolve()
    # Build a dotted name from the parent prefix and this field's "T" entry, skipping empty parts.
    field_name = '.'.join(filter(lambda x: x, [prefix, resolve_and_decode(resolved_field.get("T"))]))
    if "Kids" in resolved_field:
        for kid_field in resolved_field["Kids"]:
            parse_field_helper(form_data, kid_field, prefix=field_name)
    if "T" in resolved_field or "TU" in resolved_field:
        # "T" is the field name, but it's sometimes absent.
        # "TU" is the "alternate field name" and is often more human-readable.
        # Your PDF may have one, the other, or both.
        alternate_field_name = resolve_and_decode(resolved_field.get("TU")) if resolved_field.get("TU") else None
        field_value = resolve_and_decode(resolved_field["V"]) if 'V' in resolved_field else None
        form_data.append([field_name, alternate_field_name, field_value])


form_data = []
fields = resolve(resolve(pdf.doc.catalog["AcroForm"])["Fields"])
for field in fields:
    parse_field_helper(form_data, field)

Once you run this script, form_data is a list containing a three-element list (field name, alternate field name, value) for each form element. For instance, a PDF form with a city and state field might produce this:

[
 ['STATE.0', 'enter STATE', 'CA'],
 ['section 2  accident infoRmation.1.0',
  'enter city of accident',
  'SAN FRANCISCO']
]
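
If a dictionary is more convenient than a list of lists, the result can be reshaped afterwards. The helper below is a hypothetical addition, not part of pdfplumber; it reuses the sample output above and assumes field names are unique (later duplicates would overwrite earlier ones).

```python
# Sample output in the same shape as the form_data list shown above.
form_data = [
    ["STATE.0", "enter STATE", "CA"],
    ["section 2  accident infoRmation.1.0", "enter city of accident", "SAN FRANCISCO"],
]


def form_data_to_dict(form_data, use_alternate_names=False):
    """Map each field's name (or alternate name, when requested and present) to its value."""
    result = {}
    for field_name, alternate_name, value in form_data:
        key = alternate_name if use_alternate_names and alternate_name else field_name
        result[key] = value
    return result


by_name = form_data_to_dict(form_data)
by_label = form_data_to_dict(form_data, use_alternate_names=True)
```

Here `by_name` is keyed by the dotted field names, while `by_label` is keyed by the more readable alternate names where they exist.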

Thanks to @jeremybmerrill for helping to maintain the form-parsing code above.

Demonstrations

Using extract_table on a California Worker Adjustment and Retraining Notification (WARN) report. Demonstrates basic visual debugging and table extraction.
Using extract_table on the FBI's National Instant Criminal Background Check System PDFs. Demonstrates how to use visual debugging to find optimal table extraction settings. Also demonstrates Page.crop(...) and Page.extract_text(...).
Inspecting and visualizing curve objects.
Extracting fixed-width data from a San Jose PD firearm search report, an example of using Page.extract_text(...).

Comparison to other libraries

Several other Python libraries help users to extract information from PDFs. As a broad overview, pdfplumber distinguishes itself from other PDF processing libraries by combining these features:

Easy access to detailed information about each PDF object
Higher-level, customizable methods for extracting text and tables
Tightly integrated visual debugging
Other useful utility functions, such as filtering objects via a crop-box

It's also helpful to know what features pdfplumber does not provide:

PDF generation
PDF modification
Optical character recognition (OCR)
Strong support for extracting tables from OCR'ed documents

Specific comparisons

pdfminer.six provides the foundation for pdfplumber. It primarily focuses on parsing PDFs, analyzing PDF layouts and object positioning, and extracting text. It does not provide tools for table extraction or visual debugging. License: MIT.
PyPDF2 is a pure-Python library "capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files." It can extract page text, but does not provide easy access to shape objects (rectangles, lines, etc.), table-extraction, or visual debugging tools. License: BSD.
pymupdf is substantially faster than pdfminer.six (and thus also pdfplumber) and can generate and modify PDFs, but the library requires installation of non-Python software (MuPDF). It also does not enable easy access to shape objects (rectangles, lines, etc.), and does not provide table-extraction or visual debugging tools. License: AGPL.
camelot, tabula-py, and pdftables all focus primarily on extracting tables. In some cases, they may be better suited to the particular tables you are trying to extract. License: MIT (camelot), MIT (tabula-py), BSD (pdftables).

Acknowledgments / Contributors

Many thanks to the following users who've contributed ideas, features, and fixes:

Contributing

Pull requests are welcome, but please submit a proposal issue first, as the library is in active development.

Current maintainers:
