Top Related Projects
A Python library to extract tabular data from PDFs
Tabula is a tool for liberating data tables trapped inside PDF files
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
Quick Overview
Camelot is a Python library designed to extract tables from PDF files. It provides two methods for table extraction: stream and lattice, which can handle different types of table structures in PDFs. Camelot aims to make it easy for users to extract tabular data from PDFs with high accuracy.
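The two flavors suit different layouts: lattice relies on ruling lines between cells, while stream relies on whitespace between them. When the layout is not known in advance, one option is to try lattice first and fall back to stream if nothing is found. The sketch below illustrates that idea; the helper name `read_tables_with_fallback` and its injectable `reader` parameter are assumptions made for illustration, not part of Camelot.

```python
from typing import Any, Callable, Optional, Tuple


def read_tables_with_fallback(
    path: str,
    pages: str = "1",
    reader: Optional[Callable[..., Any]] = None,
) -> Tuple[str, Any]:
    """Try the lattice flavor first, then fall back to stream.

    `reader` is a hypothetical hook that defaults to camelot.read_pdf;
    it is injectable so the fallback logic can be exercised without a PDF.
    """
    if reader is None:
        import camelot  # imported lazily so the helper has no hard dependency

        reader = camelot.read_pdf

    tables = None
    for flavor in ("lattice", "stream"):
        tables = reader(path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            # Stop at the first flavor that finds at least one table.
            return flavor, tables
    # Neither flavor found a table; return the (empty) stream result.
    return "stream", tables


# Typical use, assuming camelot is installed and example.pdf exists:
# flavor, tables = read_tables_with_fallback("example.pdf", pages="1")
```

Finding a table is a weak signal on its own, so in practice the result would usually be checked against the parsing report as well.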
Pros
- Highly accurate table extraction, especially for well-structured PDFs
- Supports both stream and lattice-based extraction methods
- Provides options for fine-tuning extraction parameters
- Offers output in various formats (CSV, JSON, HTML, etc.)
Cons
- May struggle with complex or poorly formatted PDFs
- Requires external dependencies (Ghostscript) for certain functionalities
- Can be slower compared to some other PDF extraction tools
- Works only with text-based PDFs; scanned documents and images are not supported
Code Examples
- Basic table extraction:
import camelot
tables = camelot.read_pdf('example.pdf')
print(f"Total tables extracted: {len(tables)}")
print(tables[0].df) # Print the first table as a pandas DataFrame
- Extracting tables from specific pages:
tables = camelot.read_pdf('example.pdf', pages='1,3-5')
for table in tables:
    print(f"Table on page {table.page}")
    print(table.df)
- Using the lattice method with custom parameters:
tables = camelot.read_pdf('example.pdf', flavor='lattice', line_scale=40, process_background=True)
tables[0].to_csv('output.csv') # Save the first table as CSV
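The `pages` argument in the second example is a comma-separated string in which ranges are written with a hyphen, as in '1,3-5'. When page numbers come from elsewhere as a list of integers, a small helper can build that string. This is a sketch only; `to_pages_spec` is a hypothetical name, not a Camelot function.

```python
from typing import Iterable


def to_pages_spec(page_numbers: Iterable[int]) -> str:
    """Collapse page numbers into a spec such as '1,3-5'."""
    pages = sorted(set(page_numbers))
    if not pages:
        raise ValueError("at least one page number is required")

    ranges = []
    start = prev = pages[0]
    for page in pages[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        ranges.append((start, prev))
        start = prev = page
    ranges.append((start, prev))

    # Single pages are written as "3", runs as "3-5".
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(to_pages_spec([5, 1, 3, 4]))  # 1,3-5

# Typical use, assuming camelot is installed:
# tables = camelot.read_pdf("example.pdf", pages=to_pages_spec([5, 1, 3, 4]))
```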
Getting Started
To get started with Camelot, follow these steps:
- Install Camelot and its dependencies:
  pip install camelot-py[cv]
- Install Ghostscript (if not already installed):
  - On macOS:
    brew install ghostscript
  - On Ubuntu:
    apt-get install ghostscript
  - On Windows: Download from the Ghostscript website
- Use Camelot in your Python script:
  import camelot
  tables = camelot.read_pdf('your_pdf_file.pdf')
  print(tables[0].df) # Print the first extracted table
For more advanced usage and options, refer to the Camelot documentation.
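Because Ghostscript is installed separately, it can be useful to confirm it is discoverable before running an extraction. The sketch below only looks for a Ghostscript executable on `PATH` under its common names; this is an assumption about how to check, and Camelot itself may locate Ghostscript by a different mechanism, so treat a miss as a hint rather than a verdict.

```python
import shutil
from typing import Optional

# Executable names Ghostscript is commonly installed under:
# "gs" on macOS/Linux, "gswin64c" / "gswin32c" on Windows.
CANDIDATES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript() -> Optional[str]:
    """Return the path of the first Ghostscript executable found on PATH."""
    for name in CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


gs_path = find_ghostscript()
print(gs_path or "Ghostscript not found on PATH")
```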
Competitor Comparisons
A Python library to extract tabular data from PDFs
Pros of Camelot
- More actively maintained with recent updates and releases
- Supports both Python 2 and Python 3
- Includes additional features like stream processing and flavor detection
Cons of Camelot
- May have a steeper learning curve due to additional features
- Potentially slower performance for simple table extraction tasks
- Requires more dependencies, which could increase setup complexity
Code Comparison
Camelot:
import camelot
tables = camelot.read_pdf('example.pdf')
tables[0].to_csv('output.csv')
atlanhq/camelot:
import camelot
tables = camelot.read_pdf('example.pdf')
tables[0].to_csv('output.csv')
Both repositories expose the same basic interface for extracting tables from PDFs: atlanhq/camelot is the project's original repository, and camelot-dev/camelot is where development continues, so the read_pdf entry point and the to_csv export shown here are shared.
The practical differences lie in maintenance and in changes made after the move, not in the import statement or the call used to read the PDF.
When choosing between the two, consider your specific needs, such as Python version compatibility, required features, and performance requirements.
Tabula is a tool for liberating data tables trapped inside PDF files
Pros of Tabula
- User-friendly GUI for non-programmers
- Supports multiple output formats (CSV, TSV, JSON)
- Can be used as a command-line tool or Java library
Cons of Tabula
- Limited to extracting tables from PDFs
- Less accurate on complex or poorly formatted tables
- Fewer advanced features compared to Camelot
Code Comparison
Tabula (Java):
PDDocument document = PDDocument.load(new File("input.pdf"));
ObjectExtractor oe = new ObjectExtractor(document);
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
Page page = oe.extract(1);
List<Table> tables = sea.extract(page);
Camelot (Python):
import camelot
tables = camelot.read_pdf("input.pdf")
tables[0].to_csv("output.csv")
Camelot offers a more concise API for table extraction, while Tabula requires more setup code. Camelot also provides additional features like automatic table detection and multiple extraction methods, making it more versatile for complex PDFs. However, Tabula's Java implementation may be preferred in certain environments or for integration with existing Java projects.
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
Pros of pdftabextract
- Specialized in extracting tables from scanned PDFs
- Includes advanced image processing techniques for table detection
- Offers flexibility in handling complex table structures
Cons of pdftabextract
- Less actively maintained compared to Camelot
- Limited documentation and examples
- Steeper learning curve for beginners
Code Comparison
pdftabextract:
from pdftabextract import imgproc, extract, common
# Load image and process
img = imgproc.load_image("example.png")
processed_img = imgproc.process_image(img)
# Extract tables
tables = extract.extract_tables(processed_img)
Camelot:
import camelot
# Read PDF and extract tables
tables = camelot.read_pdf("example.pdf")
# Convert to DataFrame
df = tables[0].df
Both libraries aim to extract tables from PDFs, but they differ in their approach and use cases. pdftabextract is more focused on scanned PDFs and image processing, while Camelot offers a more straightforward API for general PDF table extraction. Camelot is generally easier to use and has better documentation, making it more suitable for beginners. However, pdftabextract may be more effective for complex scanned documents where advanced image processing is required.
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
Pros of pdfplumber
- More versatile, capable of extracting text, images, and other elements from PDFs
- Better handling of complex PDF layouts and formatting
- Provides detailed information about text positioning and styling
Cons of pdfplumber
- Slower performance compared to Camelot, especially for large PDFs
- May require more manual configuration for optimal results
- Less specialized for table extraction tasks
Code Comparison
pdfplumber:
import pdfplumber
with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()
Camelot:
import camelot
tables = camelot.read_pdf("example.pdf")
table = tables[0].df
Both libraries offer straightforward ways to extract tables from PDFs, but Camelot's approach is more concise and specifically tailored for table extraction. pdfplumber provides a more comprehensive set of tools for working with various PDF elements, which can be beneficial for complex documents but may require additional code for table-specific tasks.
Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
Pros of tabula-py
- Simpler installation of the Python package itself, since it is a thin Python wrapper around tabula-java
- Lighter weight and potentially faster for basic table extraction tasks
- Better integration with Java-based environments due to its use of Tabula
Cons of tabula-py
- Less feature-rich compared to Camelot, with fewer table detection and extraction options
- May struggle with complex PDF layouts or non-standard table formats
- Requires Java to be installed on the system, which can be a limitation in some environments
Code Comparison
tabula-py:
import tabula
# Read PDF file
tables = tabula.read_pdf("example.pdf", pages="all")
# Convert to DataFrame
df = tables[0]
Camelot:
import camelot
# Read PDF file
tables = camelot.read_pdf("example.pdf", pages="all")
# Access first table
table = tables[0]
df = table.df
Both libraries offer similar basic functionality for extracting tables from PDFs. However, Camelot provides more advanced options for table detection and extraction, making it more suitable for complex PDF layouts. tabula-py, being a wrapper for Tabula, offers a simpler interface but may be limited in handling intricate table structures.
README
Camelot: PDF Table Extraction for Humans
Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!
Note: You can also check out Excalibur, which is a web interface for Camelot!
Here's how you can extract tables from PDF files. Check out the PDF used in this example here.
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings | | | |
|---|---|---|---|---|---|---|
| | | | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |
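The per-table export methods named in the session above (`to_csv`, `to_json`, `to_excel`, `to_html`) can be chosen from the output file's extension. The following is a minimal sketch of that dispatch; the `export_table` helper and the extension mapping (including `.xlsx` for `to_excel`) are assumptions made for illustration, not part of Camelot.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the per-table export
# methods named in the README session (to_csv, to_json, to_excel, to_html).
EXPORTERS = {
    ".csv": "to_csv",
    ".json": "to_json",
    ".xlsx": "to_excel",
    ".html": "to_html",
}


def export_table(table, output_path: str) -> str:
    """Call the export method matching output_path's extension.

    Returns the name of the method that was called.
    """
    suffix = Path(output_path).suffix.lower()
    try:
        method_name = EXPORTERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported output extension: {suffix!r}") from None
    getattr(table, method_name)(output_path)
    return method_name


# Typical use, assuming camelot is installed and foo.pdf exists:
# tables = camelot.read_pdf("foo.pdf")
# export_table(tables[0], "foo.json")
```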
There's a command-line interface too!
Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
Why Camelot?
- You are in control: unlike other libraries and tools, which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
- Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
- Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
- Export to multiple formats, including JSON, Excel, HTML and SQLite.
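Discarding bad tables by metric can be as simple as filtering on each table's `parsing_report`, whose `accuracy` and `whitespace` keys appear in the session above. The sketch below shows one way to do that; the function name `keep_good_tables` and the threshold values are assumptions chosen for illustration and would need tuning per document set. Stand-in objects are used in place of real Camelot tables so the example runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Keep tables whose parsing report meets both (hypothetical) thresholds."""
    kept = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Stand-ins that mimic only the `parsing_report` attribute of a table.
sample = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 67.0, "order": 2, "page": 1}),
]

good = keep_good_tables(sample)
print(len(good))  # 1

# Typical use, assuming camelot is installed:
# good = keep_good_tables(camelot.read_pdf("foo.pdf", pages="all"))
```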
See comparison with other PDF table extraction libraries and tools.
Installation
Using conda
The easiest way to install Camelot is to install it with conda, which is a package manager and environment management system for the Anaconda distribution.
$ conda install -c conda-forge camelot-py
Using pip
After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:
$ pip install camelot-py[cv]
From the source code
After installing the dependencies, clone the repo using:
$ git clone https://www.github.com/camelot-dev/camelot
and install Camelot using pip:
$ cd camelot
$ pip install ".[cv]"
Documentation
Great documentation is available at http://camelot-py.readthedocs.io/.
Development
The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.
Source code
You can check the latest sources with:
$ git clone https://www.github.com/camelot-dev/camelot
Setting up a development environment
You can install the development dependencies easily, using pip:
$ pip install camelot-py[dev]
Testing
After installation, you can run tests using:
$ python setup.py test
Versioning
Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.
License
This project is licensed under the MIT License, see the LICENSE file for details.