camelot

Camelot: PDF Table Extraction for Humans


Top Related Projects

camelot

3,333

A Python library to extract tabular data from PDFs

tabula

7,078

Tabula is a tool for liberating data tables trapped inside PDF files

pdftabextract

2,244

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

pdfplumber

7,889

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

tabula-py

2,259

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

Quick Overview

Camelot is a Python library for extracting tables from text-based PDF files. It provides two extraction methods, selected with the flavor argument: lattice, which relies on the ruling lines drawn between cells, and stream, which infers columns from the whitespace between text. Each extracted table is exposed as a pandas DataFrame.

Pros

Highly accurate table extraction, especially for well-structured PDFs
Supports both stream and lattice-based extraction methods
Provides options for fine-tuning extraction parameters
Offers output in various formats (CSV, JSON, HTML, etc.)

Cons

May struggle with complex or poorly formatted PDFs
Requires external dependencies (Ghostscript and Tk) to be installed before a pip install
Can be slower compared to some other PDF extraction tools
No support for scanned PDFs or images; only text-based PDFs can be processed

Code Examples

Basic table extraction:

import camelot

tables = camelot.read_pdf('example.pdf')
print(f"Total tables extracted: {len(tables)}")
print(tables[0].df)  # Print the first table as a pandas DataFrame

Extracting tables from specific pages:

tables = camelot.read_pdf('example.pdf', pages='1,3-5')
for table in tables:
    print(f"Table on page {table.page}")
    print(table.df)
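
The pages argument takes a comma-separated string of page numbers and ranges, as in '1,3-5' above. When the page numbers come from code rather than being typed by hand, a small helper can build that string. This is a minimal sketch; pages_arg is a hypothetical name and is not part of Camelot.

```python
def pages_arg(page_numbers):
    """Collapse page numbers into a string such as '1,3-5'.

    Hypothetical helper, not part of Camelot: it only formats the
    string that is later passed as read_pdf(..., pages=...).
    """
    pages = sorted(set(page_numbers))
    if not pages:
        raise ValueError("at least one page number is required")

    ranges = []
    start = prev = pages[0]
    for page in pages[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        # The run ended; record it and start a new one.
        ranges.append((start, prev))
        start = prev = page
    ranges.append((start, prev))

    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# Intended use (requires Camelot and a real PDF):
# tables = camelot.read_pdf('example.pdf', pages=pages_arg([1, 3, 4, 5]))
```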

Using the lattice method with custom parameters:

tables = camelot.read_pdf('example.pdf', flavor='lattice', line_scale=40, process_background=True)
tables[0].to_csv('output.csv')  # Save the first table as CSV
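
When it is not known in advance whether a PDF's tables have ruling lines, one possible pattern is to try the lattice flavor first and fall back to stream if nothing is found. The sketch below assumes that an unsuccessful lattice pass returns an empty table list; read_tables_with_fallback is a hypothetical helper, and the reader function is passed in as an argument so the logic can be exercised without a PDF.

```python
def read_tables_with_fallback(path, read_pdf, pages="1"):
    """Try the lattice flavor first, then stream if no tables were found.

    `read_pdf` is expected to behave like camelot.read_pdf: it accepts
    `pages` and `flavor` keyword arguments and returns a sized
    collection of tables. This helper is illustrative, not Camelot API.
    """
    tables = read_pdf(path, pages=pages, flavor="lattice")
    if len(tables) == 0:
        # No ruled tables detected; retry using whitespace-based detection.
        tables = read_pdf(path, pages=pages, flavor="stream")
    return tables


# Intended use (requires Camelot and a real PDF):
# import camelot
# tables = read_tables_with_fallback('example.pdf', camelot.read_pdf)
```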

Getting Started

To get started with Camelot, follow these steps:

Install Camelot and its dependencies:
```
pip install "camelot-py[cv]"
```
Install Ghostscript (if not already installed):
- On macOS: brew install ghostscript
- On Ubuntu: apt-get install ghostscript
- On Windows: Download from the Ghostscript website

Use Camelot in your Python script:

import camelot

tables = camelot.read_pdf('your_pdf_file.pdf')
print(tables[0].df)  # Print the first extracted table

For more advanced usage and options, refer to the Camelot documentation.

Competitor Comparisons

camelot

3,333

A Python library to extract tabular data from PDFs

Pros of Camelot

More actively maintained with recent updates and releases
Supports both Python 2 and Python 3
Includes additional features like stream processing and flavor detection

Cons of Camelot

May have a steeper learning curve due to additional features
Potentially slower performance for simple table extraction tasks
Requires more dependencies, which could increase setup complexity

Code Comparison

Camelot:

import camelot

tables = camelot.read_pdf('example.pdf')
tables[0].to_csv('output.csv')

atlanhq/camelot:

import camelot

tables = camelot.read_pdf('example.pdf')
tables[0].to_csv('output.csv')

atlanhq/camelot is the project's original repository, and camelot-dev/camelot continues the same codebase, so the basic API is identical: both expose camelot.read_pdf, which returns a list of tables that can be exported with to_csv.

When choosing between the two, consider your specific needs, such as Python version compatibility, how recently each repository was updated, and which release is published to your package manager.

tabula

7,078

Tabula is a tool for liberating data tables trapped inside PDF files

Pros of Tabula

User-friendly GUI for non-programmers
Supports multiple output formats (CSV, TSV, JSON)
Can be used as a command-line tool or Java library

Cons of Tabula

Limited to extracting tables from PDFs
Less accurate on complex or poorly formatted tables
Fewer advanced features compared to Camelot

Code Comparison

Tabula (Java):

PDDocument document = PDDocument.load(new File("input.pdf"));
ObjectExtractor oe = new ObjectExtractor(document);
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
Page page = oe.extract(1);
List<Table> tables = sea.extract(page);

Camelot (Python):

import camelot

tables = camelot.read_pdf("input.pdf")
tables[0].to_csv("output.csv")

Camelot offers a more concise API for table extraction, while Tabula requires more setup code. Camelot also provides additional features like automatic table detection and multiple extraction methods, making it more versatile for complex PDFs. However, Tabula's Java implementation may be preferred in certain environments or for integration with existing Java projects.

pdftabextract

2,244

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

Pros of pdftabextract

Specialized in extracting tables from scanned PDFs
Includes advanced image processing techniques for table detection
Offers flexibility in handling complex table structures

Cons of pdftabextract

Less actively maintained compared to Camelot
Limited documentation and examples
Steeper learning curve for beginners

Code Comparison

pdftabextract:

from pdftabextract import imgproc, extract, common
# Load image and process
img = imgproc.load_image("example.png")
processed_img = imgproc.process_image(img)
# Extract tables
tables = extract.extract_tables(processed_img)

Camelot:

import camelot
# Read PDF and extract tables
tables = camelot.read_pdf("example.pdf")
# Convert to DataFrame
df = tables[0].df

Both libraries aim to extract tables from PDFs, but they differ in their approach and use cases. pdftabextract is more focused on scanned PDFs and image processing, while Camelot offers a more straightforward API for general PDF table extraction. Camelot is generally easier to use and has better documentation, making it more suitable for beginners. However, pdftabextract may be more effective for complex scanned documents where advanced image processing is required.

pdfplumber

7,889

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Pros of pdfplumber

More versatile, capable of extracting text, images, and other elements from PDFs
Better handling of complex PDF layouts and formatting
Provides detailed information about text positioning and styling

Cons of pdfplumber

Slower performance compared to Camelot, especially for large PDFs
May require more manual configuration for optimal results
Less specialized for table extraction tasks

Code Comparison

pdfplumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

Camelot:

import camelot

tables = camelot.read_pdf("example.pdf")
table = tables[0].df

Both libraries offer straightforward ways to extract tables from PDFs, but Camelot's approach is more concise and specifically tailored for table extraction. pdfplumber provides a more comprehensive set of tools for working with various PDF elements, which can be beneficial for complex documents but may require additional code for table-specific tasks.

tabula-py

2,259

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

Pros of tabula-py

Simple pip installation of the Python package itself, which is a thin wrapper around tabula-java
Lighter weight and potentially faster for basic table extraction tasks
Better integration with Java-based environments due to its use of Tabula

Cons of tabula-py

Less feature-rich compared to Camelot, with fewer table detection and extraction options
May struggle with complex PDF layouts or non-standard table formats
Requires Java to be installed on the system, which can be a limitation in some environments

Code Comparison

tabula-py:

import tabula

# Read PDF file
tables = tabula.read_pdf("example.pdf", pages="all")

# Convert to DataFrame
df = tables[0]

Camelot:

import camelot

# Read PDF file
tables = camelot.read_pdf("example.pdf", pages="all")

# Access first table
table = tables[0]
df = table.df

Both libraries offer similar basic functionality for extracting tables from PDFs. However, Camelot provides more advanced options for table detection and extraction, making it more suitable for complex PDF layouts. tabula-py, being a wrapper for Tabula, offers a simpler interface but may be limited in handling intricate table structures.


README

Camelot: PDF Table Extraction for Humans

Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!

Note: You can also check out Excalibur, which is a web interface for Camelot!

Here's how you can extract tables from PDF files. Check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_sqlite
>>> tables[0].df # get a pandas DataFrame!

Cycle Name	KI (1/km)	Distance (mi)	Percent Fuel Savings
			Improved Speed	Decreased Accel	Eliminate Stops	Decreased Idle
2012_2	3.30	1.3	5.9%	9.5%	29.2%	17.4%
2145_1	0.68	11.2	2.4%	0.1%	9.5%	2.7%
4234_1	0.59	58.7	8.5%	1.3%	8.5%	3.3%
2032_2	0.17	57.8	21.7%	0.3%	2.7%	1.2%
4171_1	0.07	173.9	58.1%	1.6%	2.1%	0.5%
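
In the DataFrame above the header spans two rows: "Percent Fuel Savings" sits over four sub-columns. A possible way to flatten such a header into one label per column is sketched below. It assumes the spanning label appears only in the first cell it covers and that the other covered cells are empty strings; merge_header_rows is a hypothetical helper, not part of Camelot.

```python
def merge_header_rows(top, bottom, sep=" / "):
    """Combine a two-row header into one label per column.

    Empty cells in `top` inherit the most recent non-empty label to
    their left, which handles a label that spans several columns.
    Hypothetical helper for illustration only.
    """
    merged = []
    last_top = ""
    for upper, lower in zip(top, bottom):
        upper, lower = upper.strip(), lower.strip()
        if upper:
            last_top = upper
        # Join the (possibly inherited) top label with the sub-label.
        parts = [part for part in (last_top, lower) if part]
        merged.append(sep.join(parts))
    return merged


top = ["Cycle Name", "KI (1/km)", "Distance (mi)", "Percent Fuel Savings", "", "", ""]
bottom = ["", "", "", "Improved Speed", "Decreased Accel", "Eliminate Stops", "Decreased Idle"]
columns = merge_header_rows(top, bottom)

# Intended use with a Camelot table (requires Camelot and pandas):
# rows = tables[0].df.values.tolist()
# columns = merge_header_rows(rows[0], rows[1])
```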

There's a command-line interface too!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Why Camelot?

You are in control: Unlike other libraries and tools, which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
Export to multiple formats, including JSON, Excel, HTML and SQLite.
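
Discarding bad tables by metric can be sketched as a filter over each table's parsing_report, which carries the accuracy and whitespace values shown in the example above. The thresholds here are arbitrary illustrative defaults, and filter_tables is a hypothetical helper rather than a Camelot function.

```python
def filter_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Keep tables whose parsing_report passes both thresholds.

    Each table is expected to expose a `parsing_report` dict with
    'accuracy' and 'whitespace' keys, as in the README example.
    The default thresholds are assumptions, not recommended values.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        accurate_enough = report["accuracy"] >= min_accuracy
        dense_enough = report["whitespace"] <= max_whitespace
        if accurate_enough and dense_enough:
            kept.append(table)
    return kept


# Intended use (requires Camelot and a real PDF):
# import camelot
# good = filter_tables(camelot.read_pdf('foo.pdf', pages='all'))
```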

See comparison with other PDF table extraction libraries and tools.

Installation

Using conda

The easiest way to install Camelot is to install it with conda, which is a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:

$ pip install "camelot-py[cv]"

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[cv]"

Documentation

Great documentation is available at http://camelot-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check the latest sources with:

$ git clone https://www.github.com/camelot-dev/camelot

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install "camelot-py[dev]"

Testing

After installation, you can run tests using:

$ python setup.py test

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License, see the LICENSE file for details.
