pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

Quick Overview

PDFMiner.six is a Python library for extracting text, images, and other information from PDF files. It is a community-maintained fork of the original PDFMiner project and runs on Python 3 (the README asks for Python 3.8 or newer). The library provides tools for parsing PDF documents, analyzing their structure, and extracting content.

Pros

Written entirely in Python; runs on Python 3.8 or newer
Offers detailed control over PDF parsing and extraction
Can extract various elements including text, images, and metadata
Actively maintained with regular updates

Cons

Slower performance compared to some other PDF libraries
Steeper learning curve for beginners
Documentation could be more comprehensive
May require additional processing for complex PDF layouts

Code Examples

Extracting text from a PDF:

from pdfminer.high_level import extract_text

text = extract_text('sample.pdf')
print(text)

Converting PDF to HTML:

from io import StringIO

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

output_string = StringIO()
with open('sample.pdf', 'rb') as fin:
    # codec=None because the output goes to a text stream, not a binary one.
    extract_text_to_fp(fin, output_string, laparams=LAParams(),
                       output_type='html', codec=None)
print(output_string.getvalue())

Extracting images from a PDF:

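from pdfminer.high_level import extract_pages
from pdfminer.layout import LTContainer, LTImage

def iter_images(container):
    # Images are nested inside LTFigure containers, so search recursively.
    for element in container:
        if isinstance(element, LTImage):
            yield element
        elif isinstance(element, LTContainer):
            yield from iter_images(element)

for page_layout in extract_pages("sample.pdf"):
    for image in iter_images(page_layout):
        print(f"Found image on page {page_layout.pageid}: {image}")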

Getting Started

To get started with PDFMiner.six, follow these steps:

Install the library using pip:
```
pip install pdfminer.six
```

Import the necessary modules in your Python script:

from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LAParams

Use the library to extract text from a PDF:

text = extract_text('path/to/your/pdf/file.pdf', laparams=LAParams())
print(text)

This will extract the text content from the specified PDF file and print it to the console. You can further customize the extraction process by adjusting the LAParams or using other functions provided by the library.

Competitor Comparisons

PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Pros of PyMuPDF

Faster performance, especially for large PDF files
More comprehensive PDF manipulation capabilities (e.g., editing, merging)
Better support for complex PDF structures and annotations

Cons of PyMuPDF

Larger library size and more dependencies
Steeper learning curve due to more extensive API
Dual-licensed under the AGPL and a commercial license; use in closed-source products requires the commercial license

Code Comparison

PyMuPDF:

import fitz
doc = fitz.open("example.pdf")
page = doc[0]
text = page.get_text()

pdfminer.six:

from pdfminer.high_level import extract_text
text = extract_text("example.pdf")

The PyMuPDF snippet works page by page through a document object, while pdfminer.six's high-level extract_text returns the text of the whole file in a single call. PyMuPDF's approach suits advanced operations such as editing and merging, while pdfminer.six's one-call API can be advantageous for straightforward text extraction.

Both libraries have their strengths, and the choice between them depends on the specific requirements of your project, such as performance needs, desired features, and licensing considerations.

pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Pros of pypdf

Simpler API and easier to use for basic PDF operations
Faster performance for common tasks like reading and writing PDFs
Better documentation and more active community support

Cons of pypdf

Less powerful for complex PDF parsing and analysis
Limited support for extracting text with precise formatting
Fewer advanced features for handling PDF structure and metadata

Code Comparison

pypdf:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()

pdfminer.six:

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")

While both libraries can extract text from PDFs, pypdf offers a more object-oriented approach with separate reader and page objects. pdfminer.six provides a higher-level function for direct text extraction, which can be more convenient for simple use cases but less flexible for complex operations.

pikepdf

A Python library for reading and writing PDF, powered by QPDF

Pros of pikepdf

High-performance PDF manipulation using C++ library QPDF
Supports creation, editing, and merging of PDF files
Provides low-level access to PDF structures

Cons of pikepdf

Steeper learning curve for complex PDF operations
No built-in text extraction, which is where pdfminer.six is the better fit
Built on compiled C++ extensions, so installing from source needs a C++ toolchain

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_text

text = extract_text('document.pdf')
print(text)

pikepdf:

import pikepdf

# pikepdf has no text-extraction method; it works on the PDF's structure
# (pages, objects, metadata) instead.
with pikepdf.open('document.pdf') as pdf:
    print(len(pdf.pages))

Summary

pdfminer.six excels in text extraction and analysis, making it ideal for content-focused tasks. pikepdf, built on QPDF, offers robust PDF manipulation capabilities, including creation and editing, but leaves text extraction to other tools. The choice between them depends on whether the primary focus is on text extraction (pdfminer.six) or general PDF manipulation (pikepdf).

README

pdfminer.six

We fathom PDF

Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, with a focus on getting and analyzing text data. Pdfminer.six extracts the text of a page directly from the source code of the PDF. It can also be used to get the exact location, font, or color of the text.

It is built in a modular way, so that each component of pdfminer.six can be replaced easily. You can implement your own interpreter or rendering device that uses the power of pdfminer.six for purposes other than text analysis.

Check out the full documentation on Read the Docs.

Features

Written entirely in Python.
Parse, analyze, and convert PDF documents.
Extract content as text, images, html or hOCR.
PDF-1.7 specification support (well, almost).
CJK languages and vertical writing scripts support.
Various font types (Type1, TrueType, Type3, and CID) support.
Support for extracting images (JPG, JBIG2, Bitmaps).
Support for various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode).
Support for RC4 and AES encryption.
Support for AcroForm interactive form extraction.
Table of contents extraction.
Tagged contents extraction.
Automatic layout analysis.

How to use

Install Python 3.8 or newer.
Install pdfminer.six.
```
pip install pdfminer.six
```
(Optionally) install extra dependencies for extracting images.
```
pip install 'pdfminer.six[image]'
```
Use the command-line interface to extract text from a PDF.
```
pdf2txt.py example.pdf
```

Or use it with Python.

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)

Contributing

Be sure to read the contribution guidelines.

Acknowledgement

This repository includes code from pyHanko; the original license has been included here.
