Top Related Projects
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
Tabula is a tool for liberating data tables trapped inside PDF files
Camelot: PDF Table Extraction for Humans
Community maintained fork of pdfminer - we fathom PDF
pdfrw is a pure Python library that reads and writes PDFs
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
Quick Overview
Camelot is a Python library designed to extract tabular data from PDF files. It provides two methods for table extraction: stream and lattice, which can handle different types of table structures. Camelot aims to simplify the process of extracting tables from PDFs, a task that is often challenging due to the complex nature of PDF documents.
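As a rough illustration of how the two flavors complement each other, the sketch below tries `lattice` first (which relies on ruled lines between cells) and falls back to `stream` (which relies on whitespace between text) when nothing is found. The helper name `read_with_fallback` and the injected `reader` argument are assumptions made for this example and are not part of Camelot; in practice `reader` would be `camelot.read_pdf`.

```python
def read_with_fallback(pdf_path, reader, pages="1"):
    """Try the lattice flavor first, then stream if no tables were found.

    `reader` is expected to behave like camelot.read_pdf. It is passed in
    (a choice made for this sketch) so the logic can be tried without a PDF.
    """
    tables = reader(pdf_path, pages=pages, flavor="lattice")
    if len(tables) == 0:
        # No ruled tables detected, so retry with whitespace-based parsing.
        tables = reader(pdf_path, pages=pages, flavor="stream")
    return tables


# Typical use, assuming camelot-py is installed and example.pdf exists:
#   import camelot
#   tables = read_with_fallback("example.pdf", camelot.read_pdf, pages="1")
#   print(f"Tables found: {len(tables)}")
```

Whether a fallback like this is appropriate depends on the documents: a page with no tables at all will still be passed to `stream`, which may report spurious results.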
Pros
- Highly accurate table extraction, especially for well-structured tables
- Supports both stream and lattice-based extraction methods
- Provides options for fine-tuning extraction parameters
- Outputs data in various formats, including CSV, Excel, and JSON
Cons
- May struggle with complex or poorly formatted tables
- Requires external dependencies (Ghostscript) for certain functionalities
- Limited to table extraction, not suitable for other types of PDF data
- Performance can be slow for large or complex PDFs
Code Examples
- Basic table extraction:
import camelot
tables = camelot.read_pdf('example.pdf')
print(f"Total tables extracted: {len(tables)}")
print(tables[0].df) # Print the first table as a pandas DataFrame
- Extracting tables from specific pages:
tables = camelot.read_pdf('example.pdf', pages='1,3-5')
for i, table in enumerate(tables):
    table.to_csv(f'table_{i}.csv')
- Using lattice mode for tables with borders:
tables = camelot.read_pdf('example.pdf', flavor='lattice')
tables[0].to_excel('output.xlsx')
- Adjusting extraction parameters:
tables = camelot.read_pdf('example.pdf', flavor='stream',
                          table_areas=['0,700,800,100'],
                          columns=['150,250,350,450,550'])
print(tables[0].parsing_report)  # accuracy, whitespace, order, page
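The `pages` string used in the examples above combines single pages and ranges. The following sketch shows how such a string can be read; `expand_pages` is a hypothetical helper written for illustration and is not Camelot's own parser. It assumes the forms `'1,3-5'`, `'2-end'`, and `'all'`.

```python
def expand_pages(spec, page_count):
    """Expand a page string such as '1,3-5' or '2-end' into page numbers."""
    if spec == "all":
        return list(range(1, page_count + 1))
    pages = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            # 'end' stands for the last page of the document.
            last = page_count if end == "end" else int(end)
            pages.extend(range(int(start), last + 1))
        else:
            pages.append(int(part))
    # Drop duplicates and return the pages in ascending order.
    return sorted(set(pages))


print(expand_pages("1,3-5", page_count=10))
```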
Getting Started
To get started with Camelot, follow these steps:
- Install Camelot and its dependencies:
pip install "camelot-py[cv]"
- Install Ghostscript (required for certain functionalities):
  - On macOS:
    brew install ghostscript
  - On Windows: Download from the Ghostscript website
  - On Linux: Use your distribution's package manager
- Basic usage:
import camelot
tables = camelot.read_pdf('your_pdf_file.pdf')
tables[0].to_csv('output.csv')
For more advanced usage and configuration options, refer to the official Camelot documentation.
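Before running an extraction, it can help to confirm that Ghostscript is reachable. The sketch below only looks for an executable on `PATH`; the candidate names (`gs`, `gswin64c`, `gswin32c`) are an assumption about common installs, and finding an executable is a hint rather than a guarantee that Camelot can load Ghostscript.

```python
import shutil

# Executable names commonly used by Ghostscript builds. This list is an
# assumption for the sketch and may not cover every installation.
CANDIDATES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(candidates=CANDIDATES):
    """Return the path of the first Ghostscript executable on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


location = find_ghostscript()
print(location or "Ghostscript not found on PATH")
```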
Competitor Comparisons
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
Pros of pdfplumber
- More flexible in handling complex PDF layouts and structures
- Better at extracting text with precise positioning information
- Supports extraction of images and other non-text elements from PDFs
Cons of pdfplumber
- Generally slower performance compared to Camelot
- May require more manual configuration for optimal results
- Less specialized for tabular data extraction
Code Comparison
pdfplumber:
import pdfplumber
with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    table = page.extract_table()
Camelot:
import camelot
tables = camelot.read_pdf("example.pdf")
df = tables[0].df
Both libraries offer Python-based solutions for PDF data extraction, but they have different strengths. pdfplumber excels in handling complex layouts and providing detailed positioning information, while Camelot is more focused on efficient table extraction. The choice between them depends on the specific requirements of your project and the nature of the PDFs you're working with.
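One practical difference is the shape of the result: pdfplumber's `extract_table()` returns a list of rows (each a list of cell strings, possibly `None`), whereas Camelot hands back a pandas DataFrame. The sketch below shows one way to turn such rows into records keyed by the header row. `rows_to_records` is a hypothetical helper, and the sample rows are made up for the example.

```python
def rows_to_records(rows):
    """Turn [[header...], [cell...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    # Treat the first row as the header; empty cells become empty strings.
    header = [(cell or "").strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Sample data standing in for the output of page.extract_table().
sample = [["Cycle Name", "KI (1/km)"], ["2012_2", "3.30"], ["2145_1", None]]
print(rows_to_records(sample))
```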
Tabula is a tool for liberating data tables trapped inside PDF files
Pros of Tabula
- User-friendly GUI for non-technical users
- Supports multiple output formats (CSV, TSV, JSON)
- Can be used as a standalone application without coding knowledge
Cons of Tabula
- Limited to extracting tables from PDFs
- Less flexibility in handling complex table structures
- Fewer options for fine-tuning extraction parameters
Code Comparison
Tabula (Java):
PDDocument document = PDDocument.load(new File("input.pdf"));
ObjectExtractor oe = new ObjectExtractor(document);
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
Page page = oe.extract(1);
List<Table> tables = sea.extract(page);
Camelot (Python):
import camelot
tables = camelot.read_pdf("input.pdf", pages="1")
tables[0].to_csv("output.csv")
Camelot offers a more concise Python API for table extraction, while Tabula's Java implementation requires more setup code. Camelot provides greater control over extraction parameters and supports both stream and lattice table extraction methods, making it more versatile for complex table structures. However, Tabula's GUI makes it more accessible for users without programming experience.
Camelot: PDF Table Extraction for Humans
Pros of Camelot (atlanhq)
- More recent updates and active development
- Broader functionality, including support for image-based table extraction
- Larger community and more frequent releases
Cons of Camelot (atlanhq)
- Potentially more complex to use due to additional features
- May have higher system requirements for image processing capabilities
Code Comparison
Camelot (camelot-dev):
import camelot
tables = camelot.read_pdf('example.pdf')
df = tables[0].df
Camelot (atlanhq):
from camelot.io import read_pdf
tables = read_pdf('example.pdf', flavor='lattice')
df = tables[0].df
The basic usage is similar, but Camelot (atlanhq) offers additional options:
tables = read_pdf('example.pdf', pages='1-3', flavor='stream')
Both libraries provide similar core functionality for extracting tables from PDFs. However, Camelot (atlanhq) has expanded its capabilities to include image-based extraction and offers more customization options. The trade-off is potentially increased complexity for simple use cases. Users should consider their specific needs when choosing between the two, with Camelot (atlanhq) being more suitable for complex or diverse table extraction tasks.
Community maintained fork of pdfminer - we fathom PDF
Pros of pdfminer.six
- More flexible and low-level PDF parsing capabilities
- Supports a wider range of PDF features and structures
- Can be used for various PDF-related tasks beyond just table extraction
Cons of pdfminer.six
- Requires more manual work and coding to extract structured data
- Less user-friendly for beginners or those seeking quick table extraction
- May require additional processing to obtain clean, tabular data
Code Comparison
pdfminer.six:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("document.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())
Camelot:
import camelot
tables = camelot.read_pdf("document.pdf")
print(tables[0].df) # Print first table as pandas DataFrame
The code comparison shows that pdfminer.six requires more manual parsing and processing of PDF elements, while Camelot provides a higher-level API for direct table extraction and conversion to pandas DataFrames. This illustrates the trade-off between flexibility and ease of use between the two libraries.
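To give a sense of the extra processing involved, the sketch below groups positioned text fragments into rows by their vertical coordinate. It is a simplified heuristic written for illustration, not how Camelot or pdfminer.six detect tables; `group_into_rows` and the tolerance value are assumptions. With pdfminer.six, the input tuples could be built from each text container's `x0`, `y0`, and `get_text()`.

```python
def group_into_rows(boxes, y_tolerance=2.0):
    """Group (x0, y0, text) tuples into rows of text ordered left to right.

    A box joins the current row when its y0 is within `y_tolerance` of the
    first box in that row. Rows are returned top to bottom, since PDF
    y-coordinates grow upward.
    """
    rows = []  # each entry: [anchor_y, [(x0, text), ...]]
    for x0, y0, text in sorted(boxes, key=lambda box: -box[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append([y0, [(x0, text)]])
    # Sort the cells of each row by x0 and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Made-up coordinates standing in for pdfminer.six layout elements.
boxes = [(100, 700.5, "B"), (10, 700, "A"), (10, 680, "C"), (100, 681, "D")]
print(group_into_rows(boxes))
```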
pdfrw is a pure Python library that reads and writes PDFs
Pros of pdfrw
- Lightweight and focused on PDF manipulation
- Pure Python implementation, no external dependencies
- Suitable for both reading and writing PDF files
Cons of pdfrw
- Limited functionality for extracting tabular data
- Less active development and community support
- Lacks advanced features for complex PDF parsing
Code Comparison
pdfrw:
from pdfrw import PdfReader, PdfWriter
reader = PdfReader('input.pdf')
writer = PdfWriter()
writer.addpage(reader.pages[0])
writer.write('output.pdf')
Camelot:
import camelot
tables = camelot.read_pdf('input.pdf')
tables[0].to_csv('output.csv')
Summary
pdfrw is a lightweight, pure Python library for basic PDF manipulation, while Camelot is specifically designed for extracting tabular data from PDFs. pdfrw offers more general PDF handling capabilities but lacks advanced features for complex data extraction. Camelot, on the other hand, excels at table extraction but has a narrower focus. The choice between the two depends on the specific requirements of your project, with pdfrw being more suitable for general PDF operations and Camelot for targeted table extraction tasks.
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
Pros of pdfminer
- More flexible and customizable for extracting text and metadata from PDFs
- Supports a wider range of PDF features and structures
- Can be used as a library or command-line tool
Cons of pdfminer
- Requires more coding knowledge and effort to extract tabular data
- Less user-friendly for non-programmers
- May require additional processing to clean and structure extracted data
Code Comparison
pdfminer:
from pdfminer.high_level import extract_text
text = extract_text('example.pdf')
print(text)
Camelot:
import camelot
tables = camelot.read_pdf('example.pdf')
print(tables[0].df)
pdfminer focuses on extracting raw text content, while Camelot is specifically designed for extracting tabular data from PDFs. pdfminer provides more low-level control over the extraction process, but requires more code to handle complex structures. Camelot offers a simpler API for table extraction, making it easier to use for specific tabular data needs.
README
Camelot: PDF Table Extraction for Humans
Camelot is a Python library that can help you extract tables from PDFs!
Note: You can also check out Excalibur, the web interface to Camelot!
Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings: Improved Speed | Percent Fuel Savings: Decreased Accel | Percent Fuel Savings: Eliminate Stops | Percent Fuel Savings: Decreased Idle |
---|---|---|---|---|---|---|
2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |
Camelot also comes packaged with a command-line interface!
Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
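A quick programmatic version of that check is to extract the text of the first page and see whether anything comes back. The sketch below is a heuristic under stated assumptions: `looks_text_based` and its `min_chars` threshold are invented for the example, and the commented usage assumes pdfminer.six is installed.

```python
def looks_text_based(extracted_text, min_chars=20):
    """Heuristic: a page with a text layer yields some non-whitespace characters."""
    visible = "".join(extracted_text.split())
    return len(visible) >= min_chars


# Usage with pdfminer.six (assumed installed):
#   from pdfminer.high_level import extract_text
#   first_page = extract_text("foo.pdf", maxpages=1)
#   print(looks_text_based(first_page))
```

A scanned page that has been through OCR may pass this check even though the recognized text is unreliable, so the result is a hint rather than a guarantee.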
You can check out some frequently asked questions here.
Why Camelot?
- Configurability: Camelot gives you control over the table extraction process with tweakable settings.
- Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
- Output: Each table is extracted into a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows. You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML, Markdown, and Sqlite.
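The metrics mentioned above are available on each table's `parsing_report`. A minimal sketch of filtering on them follows; the helper name `keep_good_tables` and the threshold values are assumptions chosen for illustration, not recommendations from the Camelot project.

```python
def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Return the tables whose parsing_report passes both thresholds."""
    kept = []
    for table in tables:
        report = table.parsing_report
        if (report["accuracy"] >= min_accuracy
                and report["whitespace"] <= max_whitespace):
            kept.append(table)
    return kept


# Typical use, assuming camelot-py is installed and foo.pdf exists:
#   import camelot
#   good = keep_good_tables(camelot.read_pdf("foo.pdf"))
#   for table in good:
#       print(table.parsing_report)
```

Suitable thresholds vary with the documents being processed, so it is worth inspecting a few reports by hand before settling on values.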
See comparison with similar libraries and tools.
Support the development
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.
Installation
Using conda
The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.
$ conda install -c conda-forge camelot-py
Using pip
After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:
$ pip install "camelot-py[base]"
From the source code
After installing the dependencies, clone the repo using:
$ git clone https://www.github.com/camelot-dev/camelot
and install Camelot using pip:
$ cd camelot
$ pip install ".[base]"
Documentation
The documentation is available at http://camelot-py.readthedocs.io/.
Wrappers
- camelot-php provides a PHP wrapper on Camelot.
Contributing
The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.
Versioning
Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.
License
This project is licensed under the MIT License, see the LICENSE file for details.