Top Related Projects
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
Community maintained fork of pdfminer - we fathom PDF
A Python library to extract tabular data from PDFs
Tabula is a tool for liberating data tables trapped inside PDF files
Quick Overview
PDFMiner is a Python library for extracting text and metadata from PDF files. It provides tools for parsing PDF documents, analyzing their structure, and extracting content, making it useful for various text processing and data extraction tasks involving PDFs.
Pros
- Highly customizable and flexible, allowing fine-grained control over PDF parsing and extraction
- Supports various PDF features, including encrypted PDFs and different text encodings
- Can extract not just text, but also images and other elements from PDF files
- Open-source; active development has moved to the pdfminer.six fork
Cons
- Can be slower compared to some other PDF extraction libraries
- Requires some understanding of PDF structure for advanced usage
- Documentation could be more comprehensive and user-friendly
- May have occasional issues with complex PDF layouts or certain PDF features
Code Examples
- Extracting text from a PDF file:
from pdfminer.high_level import extract_text
text = extract_text('sample.pdf')
print(text)
- Extracting text from specific pages:
from pdfminer.high_level import extract_text_to_fp
from io import StringIO
output = StringIO()
with open('sample.pdf', 'rb') as file:
    extract_text_to_fp(file, output, page_numbers=[0, 2])  # Extract from first and third pages
text = output.getvalue().strip()
print(text)
- Extracting metadata from a PDF:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
with open('sample.pdf', 'rb') as file:
    parser = PDFParser(file)
    doc = PDFDocument(parser)
    metadata = doc.info[0]
print(metadata)
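The values in `doc.info[0]` typically come back as raw bytes rather than Python strings. The following is a minimal sketch of making them readable, assuming the usual PDF convention that a text string is either UTF-16BE with a leading byte-order mark or a single-byte encoding (approximated here with Latin-1). `decode_pdf_string` and `readable_metadata` are hypothetical helper names, not part of PDFMiner.

```python
def decode_pdf_string(value):
    """Decode a PDF text string; non-bytes values are passed through str()."""
    if not isinstance(value, bytes):
        return str(value)
    if value.startswith(b"\xfe\xff"):
        # A leading byte-order mark signals UTF-16BE text.
        return value[2:].decode("utf-16-be", errors="replace")
    # Assumption: Latin-1 is used as a stand-in for the single-byte PDF encoding.
    return value.decode("latin-1")


def readable_metadata(info):
    """Return a copy of a metadata dict with every value decoded to str."""
    return {key: decode_pdf_string(val) for key, val in info.items()}


# Hand-written sample shaped like an entry of doc.info (not taken from a real file).
sample = {"Title": b"\xfe\xff\x00H\x00i", "Producer": b"Example 1.0"}
print(readable_metadata(sample))  # {'Title': 'Hi', 'Producer': 'Example 1.0'}
```

With a real document, `readable_metadata(doc.info[0])` would be called in place of the sample dictionary.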
Getting Started
To get started, install pdfminer.six, the maintained fork that provides the pdfminer.high_level module used in these examples:
pip install pdfminer.six
Then, you can use it in your Python script:
from pdfminer.high_level import extract_text
# Extract text from a PDF file
text = extract_text('path/to/your/pdf/file.pdf')
print(text)
This basic example extracts all text from the specified PDF file. For more advanced usage, refer to the library's documentation and examples.
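When only part of a document is needed, `extract_text` in pdfminer.six accepts a `page_numbers` argument holding zero-based page indices. The sketch below converts a human-style page specification such as `1,3-5` into that form. `parse_page_spec` and `extract_selected_text` are hypothetical helpers written for illustration, and the spec syntax is an assumption of this example rather than something the library defines.

```python
def parse_page_spec(spec):
    """Turn '1,3-5' (1-based, inclusive) into sorted zero-based page indices."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift to zero-based indices; range() excludes its upper bound.
        pages.update(range(start - 1, end))
    return sorted(pages)


def extract_selected_text(pdf_path, spec):
    """Extract text from the pages named in spec (requires pdfminer.six)."""
    # Imported here so parse_page_spec works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path, page_numbers=parse_page_spec(spec))


print(parse_page_spec("1,3-5"))  # [0, 2, 3, 4]
```

Calling `extract_selected_text('path/to/your/pdf/file.pdf', '1,3-5')` would then return the text of the first, third, fourth, and fifth pages.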
Competitor Comparisons
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
Pros of PyMuPDF
- Faster performance, especially for large PDF files
- More comprehensive feature set, including PDF editing and creation
- Better documentation and active community support
Cons of PyMuPDF
- Larger library size and more dependencies
- Steeper learning curve for beginners
- Dual-licensed under the AGPL and a commercial license, so closed-source commercial use requires a paid license
Code Comparison
PyMuPDF:
import fitz
doc = fitz.open("example.pdf")
page = doc[0]
text = page.get_text()
PDFMiner:
from pdfminer.high_level import extract_text
text = extract_text("example.pdf")
In this comparison, PyMuPDF opens the document and works page by page, which makes it easy to run specific operations on an individual page. PDFMiner's high-level extract_text function returns the text of the whole document in a single call, which is more direct for simple extraction; page-level control is available through its lower-level API, at the cost of more setup.
Community maintained fork of pdfminer - we fathom PDF
Pros of pdfminer.six
- Python 3 compatibility, ensuring support for modern Python versions
- Active maintenance and regular updates
- Improved performance and bug fixes compared to the original pdfminer
Cons of pdfminer.six
- Potential compatibility issues with legacy Python 2 code
- Some API changes may require updates to existing scripts
Code Comparison
pdfminer:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
# ... (rest of the code)
pdfminer.six:
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams
from io import StringIO
# ... (rest of the code)
The main difference in code usage is that pdfminer.six provides higher-level functions like extract_text_to_fp, simplifying the extraction process compared to the original pdfminer, which requires more manual setup of interpreters and converters.
Both libraries share similar core functionality for PDF text extraction, but pdfminer.six offers a more streamlined API and better support for modern Python environments.
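To make the "manual setup" concrete, here is a sketch of how the resource manager, converter, and page interpreter are commonly wired together. It assumes pdfminer.six is installed and is not a reconstruction of the elided code above; `extract_text_low_level` is a hypothetical function name.

```python
from io import StringIO


def extract_text_low_level(pdf_path):
    """Extract all text from a PDF using the interpreter/converter pipeline."""
    # Imported inside the function so this module loads without pdfminer.six.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    output = StringIO()
    resource_manager = PDFResourceManager()  # caches shared resources such as fonts
    device = TextConverter(resource_manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)

    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            # Each page is rendered into the device, which writes text to output.
            interpreter.process_page(page)

    device.close()
    return output.getvalue()
```

The high-level functions wrap essentially this sequence, so the low-level form is mainly useful when a custom converter or per-page handling is needed.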
A Python library to extract tabular data from PDFs
Pros of Camelot
- Specialized for extracting tables from PDFs
- Provides both stream-based and lattice-based extraction methods
- Offers a user-friendly API and command-line interface
Cons of Camelot
- Limited to table extraction, not suitable for general PDF text extraction
- May struggle with complex or poorly formatted PDFs
- Requires additional dependencies (e.g., Ghostscript) for full functionality
Code Comparison
Camelot
import camelot
tables = camelot.read_pdf('example.pdf')
df = tables[0].df
PDFMiner
from pdfminer.high_level import extract_text
text = extract_text('example.pdf')
Summary
Camelot is specifically designed for extracting tables from PDFs, offering specialized methods and a user-friendly interface. However, it's limited to table extraction and may require additional setup. PDFMiner, on the other hand, is a more general-purpose PDF text extraction tool with broader capabilities but may require more manual processing for table extraction. The choice between the two depends on the specific requirements of your project, particularly whether you need focused table extraction or more general PDF text extraction capabilities.
Tabula is a tool for liberating data tables trapped inside PDF files
Pros of Tabula
- Specifically designed for extracting tables from PDFs
- User-friendly GUI for non-technical users
- Supports multiple output formats (CSV, TSV, JSON)
Cons of Tabula
- Limited to table extraction, less versatile for general PDF parsing
- Requires Java runtime environment
- Less suitable for large-scale, automated processing
Code Comparison
PDFMiner (Python):
from pdfminer.high_level import extract_text
text = extract_text('document.pdf')
print(text)
Tabula (Java):
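import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;
ObjectExtractor oe = new ObjectExtractor(PDDocument.load(new File("document.pdf")));
Page page = oe.extract(1);
List<Table> tables = new SpreadsheetExtractionAlgorithm().extract(page);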
PDFMiner offers a more straightforward approach for general text extraction, while Tabula provides specialized methods for table extraction. PDFMiner is more flexible for various PDF parsing tasks, but Tabula excels in accurately identifying and extracting tabular data from PDFs.
README
PDFMiner
PDFMiner is a text extraction tool for PDF documents.
Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but this project is largely dormant. For the active project, check out its fork pdfminer.six.
Features:
- Pure Python (3.6 or above).
- Supports PDF-1.7. (well, almost)
- Obtains the exact location of text as well as other layout information (fonts, etc.).
- Performs automatic layout analysis.
- Can convert PDF into other formats (HTML/XML).
- Can extract an outline (TOC).
- Can extract tagged contents.
- Supports basic encryption (RC4 and AES).
- Supports various font types (Type1, TrueType, Type3, and CID).
- Supports CJK languages and vertical writing scripts.
- Has an extensible PDF parser that can be used for other purposes.
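The layout information mentioned in the feature list can be read from text boxes and their bounding boxes. The sketch below assumes the `extract_pages` function and `LTTextContainer` class from the pdfminer.six fork; `sort_reading_order` and `iter_page_boxes` are hypothetical helpers, and the top-to-bottom, left-to-right ordering is a simplification that ignores multi-column layouts.

```python
def sort_reading_order(boxes):
    """Sort (x0, y0, x1, y1, text) tuples top-to-bottom, then left-to-right.

    PDF coordinates place the origin at the bottom-left of the page,
    so a larger y1 means the box sits higher up.
    """
    return sorted(boxes, key=lambda box: (-box[3], box[0]))


def iter_page_boxes(pdf_path):
    """Yield, for each page, its text boxes in simple reading order."""
    # Imported here so sort_reading_order works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(pdf_path):
        boxes = [
            (*element.bbox, element.get_text().strip())
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ]
        yield sort_reading_order(boxes)


# Hand-written boxes standing in for real layout output.
demo = [
    (72, 600, 300, 620, "body"),
    (72, 700, 300, 720, "title"),
    (320, 600, 500, 620, "sidebar"),
]
for box in sort_reading_order(demo):
    print(box[4])  # title, body, sidebar
```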
How to Use:
> pip install pdfminer
> pdf2txt.py samples/simple1.pdf
Command Line Syntax:
pdf2txt.py
pdf2txt.py extracts all text that is rendered programmatically, along with the location, font name, font size, and writing direction (horizontal or vertical) of each text segment. It does not recognize text in images. A password needs to be provided for restricted PDF documents.
> pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
[-O output_dir] [-c encoding] [-s scale] [-R rotation]
[-Y normal|loose|exact] [-p pagenos] [-m maxpages]
[-S] [-C] [-n] [-A] [-V]
[-M char_margin] [-L line_margin] [-W word_margin]
[-F boxes_flow] [-d]
input.pdf ...
-P password
: PDF password.
-o output
: Output file name.
-t text|html|xml|tag
: Output type. (default: automatically inferred from the output file name.)
-O output_dir
: Output directory for extracted images.
-c encoding
: Output encoding. (default: utf-8)
-s scale
: Output scale.
-R rotation
: Rotates the page in degrees.
-Y normal|loose|exact
: Specifies the layout mode. (only for HTML output.)
-p pagenos
: Processes certain pages only.
-m maxpages
: Limits the maximum number of pages to process.
-S
: Strips control characters.
-C
: Disables resource caching.
-n
: Disables layout analysis.
-A
: Applies layout analysis for all texts including figures.
-V
: Automatically detects vertical writing.
-M char_margin
: Specifies the char margin.
-W word_margin
: Specifies the word margin.
-L line_margin
: Specifies the line margin.
-F boxes_flow
: Specifies the box flow ratio.
-d
: Turns on Debug output.
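For scripted use, the options above can be assembled into a command line from Python. This is a minimal sketch that covers only a handful of the flags and assumes `pdf2txt.py` is executable and on the PATH; `build_pdf2txt_args` and `run_pdf2txt` are hypothetical helper names.

```python
import subprocess

OUTPUT_TYPES = {"text", "html", "xml", "tag"}


def build_pdf2txt_args(input_pdf, output=None, password=None,
                       output_type=None, maxpages=None, layout=True):
    """Build the argument list for a pdf2txt.py call."""
    args = ["pdf2txt.py"]
    if password is not None:
        args += ["-P", password]
    if output is not None:
        args += ["-o", output]
    if output_type is not None:
        if output_type not in OUTPUT_TYPES:
            raise ValueError(f"unsupported output type: {output_type!r}")
        args += ["-t", output_type]
    if maxpages is not None:
        args += ["-m", str(maxpages)]
    if not layout:
        args.append("-n")  # -n disables layout analysis
    args.append(input_pdf)
    return args


def run_pdf2txt(input_pdf, **options):
    """Run pdf2txt.py and return whatever it writes to standard output."""
    result = subprocess.run(
        build_pdf2txt_args(input_pdf, **options),
        check=True, capture_output=True, text=True,
    )
    return result.stdout


print(build_pdf2txt_args("samples/simple1.pdf", output="out.html", output_type="html"))
# ['pdf2txt.py', '-o', 'out.html', '-t', 'html', 'samples/simple1.pdf']
```

Passing the arguments as a list avoids shell quoting problems with file names or passwords that contain spaces.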
dumppdf.py
dumppdf.py is used for debugging PDFs. It dumps all the internal contents in pseudo-XML format.
> dumppdf.py [-P password] [-a] [-p pageid] [-i objid]
[-o output] [-r|-b|-t] [-T] [-O directory] [-d]
input.pdf ...
-P password
: PDF password.
-a
: Extracts all objects.
-p pageid
: Extracts a Page object.
-i objid
: Extracts a certain object.
-o output
: Output file name.
-r
: Raw mode. Dumps the raw compressed/encoded streams.
-b
: Binary mode. Dumps the uncompressed/decoded streams.
-t
: Text mode. Dumps the streams in text format.
-T
: Tagged mode. Dumps the tagged contents.
-O output_dir
: Output directory for extracted streams.
TODO
- Replace STRICT variable with something better.
- Improve the debugging functions.
- Use logging module instead of sys.stderr.
- Proper test cases.
- PEP-8 and PEP-257 conformance.
- Better documentation.
- Crypto stream filter support.