pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Top Related Projects

PyMuPDF

7,705

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

camelot

3,333

A Python library to extract tabular data from PDFs

tabula

7,078

Tabula is a tool for liberating data tables trapped inside PDF files

Quick Overview

PDFMiner is a Python library for extracting text and metadata from PDF files. It provides tools for parsing PDF documents, analyzing their structure, and extracting content, making it useful for various text processing and data extraction tasks involving PDFs.

Pros

Highly customizable and flexible, allowing fine-grained control over PDF parsing and extraction
Supports various PDF features, including encrypted PDFs and different text encodings
Can extract not just text, but also images and other elements from PDF files
Open-source; the original project is largely dormant, but its fork pdfminer.six is actively maintained

Cons

Can be slower compared to some other PDF extraction libraries
Requires some understanding of PDF structure for advanced usage
Documentation could be more comprehensive and user-friendly
May have occasional issues with complex PDF layouts or certain PDF features

Code Examples

Extracting text from a PDF file:

from pdfminer.high_level import extract_text

text = extract_text('sample.pdf')
print(text)

Extracting text from specific pages:

from pdfminer.high_level import extract_text_to_fp
from io import StringIO

output = StringIO()
with open('sample.pdf', 'rb') as file:
    extract_text_to_fp(file, output, page_numbers=[0, 2])  # Extract from first and third pages
    text = output.getvalue().strip()
print(text)
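
The page_numbers argument in the example above is zero-based, while page numbers as people usually write them are one-based. A minimal sketch of a conversion helper follows; the function name and the spec syntax (commas plus inclusive ranges such as "3-5") are assumptions made for illustration, not part of pdfminer:

```python
def to_zero_based_pages(spec):
    """Convert a one-based page spec such as "1,3-5" to sorted zero-based indices.

    The spec syntax is a hypothetical convention for this sketch;
    pdfminer itself does not parse it.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift by one: human page 1 is index 0.
        pages.update(range(start - 1, end))
    return sorted(pages)


print(to_zero_based_pages("1,3"))  # [0, 2]
```

The resulting list could then be passed as page_numbers to extract_text_to_fp.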

Extracting metadata from a PDF:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

with open('sample.pdf', 'rb') as file:
    parser = PDFParser(file)
    doc = PDFDocument(parser)
    # doc.info is a list of dictionaries and may be empty
    metadata = doc.info[0] if doc.info else {}
    print(metadata)
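
Values in the metadata dictionary typically arrive as raw bytes rather than str. The sketch below shows one way to decode them, assuming values are either UTF-16BE with a byte order mark or single-byte text. Latin-1 is used here only as an approximation of PDFDocEncoding, and the helper names and sample dictionary are hypothetical:

```python
def decode_pdf_string(value):
    """Decode one metadata value to str; non-bytes values are returned unchanged."""
    if not isinstance(value, bytes):
        return value
    if value.startswith(b"\xfe\xff"):
        # A leading byte order mark indicates UTF-16BE text.
        return value[2:].decode("utf-16-be", errors="replace")
    # Assumption: treat everything else as Latin-1, which only
    # approximates PDFDocEncoding.
    return value.decode("latin-1")


def decode_metadata(info):
    """Decode every value of a metadata dictionary such as doc.info[0]."""
    return {key: decode_pdf_string(val) for key, val in info.items()}


# Hypothetical sample shaped like a PDF info dictionary.
sample = {"Title": b"\xfe\xff\x00P\x00D\x00F", "Producer": b"ExampleWriter 1.0"}
print(decode_metadata(sample))
```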

Getting Started

To get started with PDFMiner, first install the maintained fork, pdfminer.six, using pip (the examples here rely on its high-level API):

pip install pdfminer.six

Then, you can use it in your Python script:

from pdfminer.high_level import extract_text

# Extract text from a PDF file
text = extract_text('path/to/your/pdf/file.pdf')
print(text)

This basic example extracts all text from the specified PDF file. For more advanced usage, refer to the library's documentation and examples.

Competitor Comparisons

PyMuPDF

7,705

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Pros of PyMuPDF

Faster performance, especially for large PDF files
More comprehensive feature set, including PDF editing and creation
Better documentation and active community support

Cons of PyMuPDF

Larger library size and more dependencies
Steeper learning curve for beginners
Dual-licensed under the AGPL and a commercial license, so use in closed-source products requires the commercial license

Code Comparison

PyMuPDF:

import fitz
doc = fitz.open("example.pdf")
page = doc[0]
text = page.get_text()

PDFMiner:

from pdfminer.high_level import extract_text
text = extract_text("example.pdf")

PyMuPDF offers more granular control over PDF processing, while PDFMiner provides a simpler interface for basic text extraction. PyMuPDF's approach allows for more advanced operations, such as accessing individual pages and performing specific actions on them. PDFMiner's high-level function is more straightforward for simple text extraction tasks but may be less flexible for complex PDF manipulations.

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

Python 3 compatibility, ensuring support for modern Python versions
Active maintenance and regular updates
Improved performance and bug fixes compared to the original pdfminer

Cons of pdfminer.six

Potential compatibility issues with legacy Python 2 code
Some API changes may require updates to existing scripts

Code Comparison

pdfminer:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

# ... (rest of the code)

pdfminer.six:

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams
from io import StringIO

# ... (rest of the code)

The main difference in code usage is that pdfminer.six provides higher-level functions like extract_text_to_fp, simplifying the extraction process compared to the original pdfminer, which requires more manual setup of interpreters and converters.

Both libraries share similar core functionality for PDF text extraction, but pdfminer.six offers a more streamlined API and better support for modern Python environments.
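
For reference, the manual setup elided in the first snippet typically looks something like the sketch below. It is written against pdfminer.six and is not taken from either project's source; the function name is hypothetical, and the original pdfminer may need a binary output buffer instead of StringIO:

```python
from io import StringIO


def extract_text_low_level(path):
    """Extract text by wiring the interpreter and converter together by hand."""
    # Imported lazily so this sketch can be loaded without pdfminer.six installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    output = StringIO()
    resource_manager = PDFResourceManager()
    # The converter receives rendered layout objects and writes them as text.
    device = TextConverter(resource_manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    with open(path, "rb") as pdf_file:
        for page in PDFPage.get_pages(pdf_file):
            interpreter.process_page(page)
    device.close()
    return output.getvalue()
```

Calling extract_text_low_level('sample.pdf') should give roughly the same result as the high-level extract_text, at the cost of the extra wiring.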

camelot

3,333

A Python library to extract tabular data from PDFs

Pros of Camelot

Specialized for extracting tables from PDFs
Provides both stream-based and lattice-based extraction methods
Offers a user-friendly API and command-line interface

Cons of Camelot

Limited to table extraction, not suitable for general PDF text extraction
May struggle with complex or poorly formatted PDFs
Requires additional dependencies (e.g., Ghostscript) for full functionality

Code Comparison

Camelot

import camelot

tables = camelot.read_pdf('example.pdf')
df = tables[0].df

PDFMiner

from pdfminer.high_level import extract_text

text = extract_text('example.pdf')

Summary

Camelot is specifically designed for extracting tables from PDFs, offering specialized methods and a user-friendly interface. However, it's limited to table extraction and may require additional setup. PDFMiner, on the other hand, is a more general-purpose PDF text extraction tool with broader capabilities but may require more manual processing for table extraction. The choice between the two depends on the specific requirements of your project, particularly whether you need focused table extraction or more general PDF text extraction capabilities.

tabula

7,078

Tabula is a tool for liberating data tables trapped inside PDF files

Pros of Tabula

Specifically designed for extracting tables from PDFs
User-friendly GUI for non-technical users
Supports multiple output formats (CSV, TSV, JSON)

Cons of Tabula

Limited to table extraction, less versatile for general PDF parsing
Requires Java runtime environment
Less suitable for large-scale, automated processing

Code Comparison

PDFMiner (Python):

from pdfminer.high_level import extract_text

text = extract_text('document.pdf')
print(text)

Tabula (Java):

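import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

// The statements below belong inside a method that handles IOException
ObjectExtractor oe = new ObjectExtractor(PDDocument.load(new File("document.pdf")));
Page page = oe.extract(1);  // page numbers are 1-based
List<Table> tables = new SpreadsheetExtractionAlgorithm().extract(page);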

PDFMiner offers a more straightforward approach for general text extraction, while Tabula provides specialized methods for table extraction. PDFMiner is more flexible for various PDF parsing tasks, but Tabula excels in accurately identifying and extracting tabular data from PDFs.

README

PDFMiner

PDFMiner is a text extraction tool for PDF documents.

Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but this project is largely dormant. For the active project, check out its fork pdfminer.six.

Features:

Pure Python (3.6 or above).
Supports PDF-1.7. (well, almost)
Obtains the exact location of text as well as other layout information (fonts, etc.).
Performs automatic layout analysis.
Can convert PDF into other formats (HTML/XML).
Can extract an outline (TOC).
Can extract tagged contents.
Supports basic encryption (RC4 and AES).
Supports various font types (Type1, TrueType, Type3, and CID).
Supports CJK languages and vertical writing scripts.
Has an extensible PDF parser that can be used for other purposes.

How to Use:

> pip install pdfminer
> pdf2txt.py samples/simple1.pdf

Command Line Syntax:

pdf2txt.py

pdf2txt.py extracts all text that is rendered programmatically. It also extracts the location, font name, font size, and writing direction (horizontal or vertical) of each text segment. It does not recognize text in images. A password must be provided for restricted PDF documents.

> pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
             [-O output_dir] [-c encoding] [-s scale] [-R rotation]
             [-Y normal|loose|exact] [-p pagenos] [-m maxpages]
             [-S] [-C] [-n] [-A] [-V]
             [-M char_margin] [-L line_margin] [-W word_margin]
             [-F boxes_flow] [-d]
             input.pdf ...

-P password : PDF password.
-o output : Output file name.
-t text|html|xml|tag : Output type. (default: automatically inferred from the output file name.)
-O output_dir : Output directory for extracted images.
-c encoding : Output encoding. (default: utf-8)
-s scale : Output scale.
-R rotation : Rotates the page by the given number of degrees.
-Y normal|loose|exact : Specifies the layout mode. (only for HTML output.)
-p pagenos : Processes certain pages only.
-m maxpages : Limits the maximum number of pages to process.
-S : Strips control characters.
-C : Disables resource caching.
-n : Disables layout analysis.
-A : Applies layout analysis for all texts including figures.
-V : Automatically detects vertical writing.
-M char_margin : Specifies the char margin.
-W word_margin : Specifies the word margin.
-L line_margin : Specifies the line margin.
-F boxes_flow : Specifies the box flow ratio.
-d : Turns on Debug output.
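
To illustrate how these options combine, here is a minimal sketch that assembles a pdf2txt.py command line from Python. The helper name is hypothetical, only a few of the flags listed above are covered, and the comma-separated format for -p is an assumption, since the option list does not spell it out:

```python
import shlex


def build_pdf2txt_command(input_pdf, output=None, output_type=None,
                          pagenos=None, password=None):
    """Return an argument list for pdf2txt.py using the flags documented above."""
    cmd = ["pdf2txt.py"]
    if password is not None:
        cmd += ["-P", password]
    if output is not None:
        cmd += ["-o", output]
    if output_type is not None:
        if output_type not in ("text", "html", "xml", "tag"):
            raise ValueError(f"unsupported output type: {output_type!r}")
        cmd += ["-t", output_type]
    if pagenos:
        # Assumption: -p takes a comma-separated list of page numbers.
        cmd += ["-p", ",".join(str(n) for n in pagenos)]
    cmd.append(input_pdf)
    return cmd


cmd = build_pdf2txt_command("samples/simple1.pdf", output="simple1.html",
                            output_type="html", pagenos=[1, 3])
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) when pdf2txt.py is on the PATH.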

dumppdf.py

dumppdf.py is used for debugging PDFs. It dumps all the internal contents in pseudo-XML format.

> dumppdf.py [-P password] [-a] [-p pageid] [-i objid]
             [-o output] [-r|-b|-t] [-T] [-O directory] [-d]
             input.pdf ...

-P password : PDF password.
-a : Extracts all objects.
-p pageid : Extracts a Page object.
-i objid : Extracts a certain object.
-o output : Output file name.
-r : Raw mode. Dumps the raw compressed/encoded streams.
-b : Binary mode. Dumps the uncompressed/decoded streams.
-t : Text mode. Dumps the streams in text format.
-T : Tagged mode. Dumps the tagged contents.
-O output_dir : Output directory for extracted streams.

TODO

Replace STRICT variable with something better.
Improve the debugging functions.
Use logging module instead of sys.stderr.
Proper test cases.
PEP-8 and PEP-257 conformance.
Better documentation.
Crypto stream filter support.
