camelot

A Python library to extract tabular data from PDFs

Top Related Projects

pdfplumber

7,889

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

tabula

7,078

Tabula is a tool for liberating data tables trapped inside PDF files

camelot

3,691

Camelot: PDF Table Extraction for Humans

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

pdfrw

1,896

pdfrw is a pure Python library that reads and writes PDFs

pdfminer

5,293

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Quick Overview

Camelot is a Python library for extracting tabular data from PDF files. It provides two extraction methods (flavors): lattice, for tables whose cells are separated by ruling lines, and stream, for tables whose columns are separated only by whitespace. Camelot aims to simplify table extraction from PDFs, a task that is often difficult because PDFs store positioned text rather than explicit table structure.
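
One way to work with the two flavors is to try one and fall back to the other when no tables are found. The sketch below assumes that "lattice first, then stream" is a reasonable order for your documents; the helper name `read_with_fallback` and its injectable `reader` argument are illustrative and not part of Camelot.

```python
def read_with_fallback(path, reader=None, flavors=("lattice", "stream"), **kwargs):
    """Try each flavor in order; return (flavor, tables) for the first non-empty result."""
    if reader is None:
        # Imported lazily so the helper can be exercised without Camelot installed.
        import camelot
        reader = camelot.read_pdf
    tables = []
    for flavor in flavors:
        tables = reader(path, flavor=flavor, **kwargs)
        if len(tables) > 0:
            return flavor, tables
    # Nothing was found with any flavor: report the last one tried and its empty result.
    return flavors[-1], tables
```

Calling `read_with_fallback('example.pdf', pages='1')` would forward `pages` to `camelot.read_pdf` unchanged. Passing a different `reader` makes the helper easy to test without a PDF.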

Pros

Highly accurate table extraction, especially for well-structured tables
Supports both stream and lattice-based extraction methods
Provides options for fine-tuning extraction parameters
Outputs data in various formats, including CSV, Excel, and JSON

Cons

May struggle with complex or poorly formatted tables
Requires external dependencies (Ghostscript) for certain functionalities
Limited to table extraction, not suitable for other types of PDF data
Performance can be slow for large or complex PDFs
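
For large documents, one option is to process pages in batches so that each call handles a bounded number of pages and progress can be reported between calls. This is a minimal sketch, assuming the total page count is already known; `page_chunks` is a hypothetical helper, and whether batching actually reduces total run time depends on the document.

```python
def page_chunks(total_pages, chunk_size):
    """Yield page-range strings such as '1-10', '11-20' for use as a `pages` argument."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A single-page chunk is written as '21', not '21-21'.
        yield str(start) if start == end else f"{start}-{end}"


print(list(page_chunks(23, 10)))  # ['1-10', '11-20', '21-23']
```

Each string can then be passed to `camelot.read_pdf(..., pages=spec)` in a loop.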

Code Examples

Basic table extraction:

import camelot

tables = camelot.read_pdf('example.pdf')
print(f"Total tables extracted: {len(tables)}")
print(tables[0].df)  # Print the first table as a pandas DataFrame

Extracting tables from specific pages:

tables = camelot.read_pdf('example.pdf', pages='1,3-5')
for i, table in enumerate(tables):
    table.to_csv(f'table_{i}.csv')
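
To make the `pages` string concrete, the sketch below expands a specification such as `'1,3-5'` into explicit page numbers. It is an illustration of the syntax only, not Camelot's own parser; `expand_pages` is a hypothetical name, and the `end` and `all` keywords that Camelot also accepts are deliberately left out because they need the document's page count.

```python
def expand_pages(spec):
    """Expand a page string like '1,3-5' into a sorted list of page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as '3-5' includes both endpoints.
            first, last = (int(p) for p in part.split("-"))
            pages.extend(range(first, last + 1))
        else:
            pages.append(int(part))
    return sorted(set(pages))


print(expand_pages("1,3-5"))  # [1, 3, 4, 5]
```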

Using lattice mode for tables with borders:

tables = camelot.read_pdf('example.pdf', flavor='lattice')
tables[0].to_excel('output.xlsx')

Adjusting extraction parameters:

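# table_areas: "x1,y1,x2,y2" (top-left, bottom-right) in PDF coordinates
# columns: x-coordinates of the column separators, used with stream
tables = camelot.read_pdf('example.pdf', flavor='stream',
                          table_areas=['0,700,800,100'],
                          columns=['150,250,350,450,550'])
print(tables[0].parsing_report)  # accuracy, whitespace, order, page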
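
The coordinate strings are easy to get wrong because PDF coordinates start at the bottom-left corner of the page, so the top edge of an area has the larger y value. The sketch below builds and validates these strings; `area_string` and `columns_string` are hypothetical helpers, not Camelot functions.

```python
def area_string(left, top, right, bottom):
    """Format a table area as 'x1,y1,x2,y2' (top-left corner, then bottom-right)."""
    if not (left < right and bottom < top):
        raise ValueError("expected left < right and bottom < top (PDF y grows upward)")
    return f"{left},{top},{right},{bottom}"


def columns_string(xs):
    """Format column separator x-coordinates as a comma-separated string, left to right."""
    return ",".join(str(x) for x in sorted(xs))


print(area_string(0, 700, 800, 100))           # 0,700,800,100
print(columns_string([350, 150, 250]))         # 150,250,350
```

The results can be passed as `table_areas=[area_string(...)]` and `columns=[columns_string(...)]`.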

Getting Started

To get started with Camelot, follow these steps:

Install Camelot and its dependencies:
```
pip install "camelot-py[base]"
```
Install Ghostscript (required for certain functionalities):
- On macOS: brew install ghostscript
- On Windows: Download from the Ghostscript website
- On Linux: Use your distribution's package manager
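
Before running an extraction, it can help to confirm that a Ghostscript executable is reachable. The sketch below only searches `PATH` with the standard library; `find_ghostscript` is a hypothetical helper, and Camelot may locate Ghostscript differently, so treat a negative result as a hint rather than proof.

```python
import shutil


def find_ghostscript():
    """Return the path of a Ghostscript executable found on PATH, or None."""
    # 'gs' is the usual name on macOS and Linux; the gswin* names are used on Windows.
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_ghostscript() or "Ghostscript not found on PATH")
```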

Basic usage:

import camelot

tables = camelot.read_pdf('your_pdf_file.pdf')
tables[0].to_csv('output.csv')

For more advanced usage and configuration options, refer to the official Camelot documentation.

Competitor Comparisons

pdfplumber

Pros of pdfplumber

More flexible in handling complex PDF layouts and structures
Better at extracting text with precise positioning information
Supports extraction of images and other non-text elements from PDFs

Cons of pdfplumber

Generally slower performance compared to Camelot
May require more manual configuration for optimal results
Less specialized for tabular data extraction

Code Comparison

pdfplumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    table = page.extract_table()

Camelot:

import camelot

tables = camelot.read_pdf("example.pdf")
df = tables[0].df

Both libraries offer Python-based solutions for PDF data extraction, but they have different strengths. pdfplumber excels in handling complex layouts and providing detailed positioning information, while Camelot is more focused on efficient table extraction. The choice between them depends on the specific requirements of your project and the nature of the PDFs you're working with.
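
One practical difference is the return type: Camelot hands back a pandas DataFrame per table, while pdfplumber's `extract_table()` returns a plain list of rows. A minimal sketch of writing such rows to CSV with the standard library follows; it assumes empty cells may come back as `None`, and `rows_to_csv` is a hypothetical helper.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (lists of str or None) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Write missing cells as empty strings so every row keeps its width.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Sample rows in the shape extract_table() returns; values are illustrative.
sample = [["Cycle Name", "KI (1/km)"], ["2012_2", "3.30"], ["2145_1", None]]
print(rows_to_csv(sample))
```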

tabula

Pros of Tabula

User-friendly GUI for non-technical users
Supports multiple output formats (CSV, TSV, JSON)
Can be used as a standalone application without coding knowledge

Cons of Tabula

Limited to extracting tables from PDFs
Less flexibility in handling complex table structures
Fewer options for fine-tuning extraction parameters

Code Comparison

Tabula (Java):

PDDocument document = PDDocument.load(new File("input.pdf"));
ObjectExtractor oe = new ObjectExtractor(document);
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
Page page = oe.extract(1);
List<Table> tables = sea.extract(page);

Camelot (Python):

import camelot
tables = camelot.read_pdf("input.pdf", pages="1")
tables[0].to_csv("output.csv")

Camelot offers a more concise Python API for table extraction, while Tabula's Java implementation requires more setup code. Camelot provides greater control over extraction parameters and supports both stream and lattice table extraction methods, making it more versatile for complex table structures. However, Tabula's GUI makes it more accessible for users without programming experience.

camelot

Pros of Camelot (atlanhq)

More recent updates and active development
Same `read_pdf` API and the same lattice and stream flavors, so existing code carries over
Larger community and more frequent releases

Cons of Camelot (atlanhq)

Potentially more complex to use due to additional features
May have higher system requirements for image processing capabilities

Code Comparison

Camelot (camelot-dev):

import camelot
tables = camelot.read_pdf('example.pdf')
df = tables[0].df

Camelot (atlanhq):

from camelot.io import read_pdf
tables = read_pdf('example.pdf', flavor='lattice')
df = tables[0].df

The basic usage is the same, and both accept the same `pages` and `flavor` arguments:

tables = read_pdf('example.pdf', pages='1-3', flavor='stream')

Both libraries provide the same core functionality for extracting tables from PDFs through the same `read_pdf` entry point. Neither extracts tables from scanned documents: as the README notes, Camelot only works with text-based PDFs. When choosing between the two, compare how recently each repository has been updated rather than expecting feature differences.

pdfminer.six

Pros of pdfminer.six

More flexible and low-level PDF parsing capabilities
Supports a wider range of PDF features and structures
Can be used for various PDF-related tasks beyond just table extraction

Cons of pdfminer.six

Requires more manual work and coding to extract structured data
Less user-friendly for beginners or those seeking quick table extraction
May require additional processing to obtain clean, tabular data

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages("document.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

Camelot:

import camelot

tables = camelot.read_pdf("document.pdf")
print(tables[0].df)  # Print first table as pandas DataFrame

The code comparison shows that pdfminer.six requires more manual parsing and processing of PDF elements, while Camelot provides a higher-level API for direct table extraction and conversion to pandas DataFrames. This illustrates the trade-off between flexibility and ease of use between the two libraries.
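
To give a sense of the manual work involved, the sketch below groups positioned text fragments into rows by their vertical position. It is a simplified illustration, not the algorithm used by Camelot or pdfminer.six; the `(x, y, text)` tuples are assumed to have been collected beforehand (for example from each layout element's `bbox`), and `group_into_rows` is a hypothetical helper.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows of text, ordered top to bottom.

    A fragment joins the current row when its y value is within y_tolerance
    of that row's first fragment; cells in a row are ordered left to right.
    """
    rows = []  # each item: (anchor_y, [(x, text), ...])
    # PDF y grows upward, so sort by descending y to read from the top.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


fragments = [(50, 700, "a"), (150, 700.5, "b"), (50, 680, "c"), (150, 679, "d")]
print(group_into_rows(fragments))  # [['a', 'b'], ['c', 'd']]
```

Real documents need more than this (column alignment, multi-line cells, merged headers), which is the work a table-focused library takes on.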

pdfrw

Pros of pdfrw

Lightweight and focused on PDF manipulation
Pure Python implementation, no external dependencies
Suitable for both reading and writing PDF files

Cons of pdfrw

Limited functionality for extracting tabular data
Less active development and community support
Lacks advanced features for complex PDF parsing

Code Comparison

pdfrw:

from pdfrw import PdfReader, PdfWriter

reader = PdfReader('input.pdf')
writer = PdfWriter()
writer.addpage(reader.pages[0])
writer.write('output.pdf')

Camelot:

import camelot

tables = camelot.read_pdf('input.pdf')
tables[0].to_csv('output.csv')

Summary

pdfrw is a lightweight, pure Python library for basic PDF manipulation, while Camelot is specifically designed for extracting tabular data from PDFs. pdfrw offers more general PDF handling capabilities but lacks advanced features for complex data extraction. Camelot, on the other hand, excels at table extraction but has a narrower focus. The choice between the two depends on the specific requirements of your project, with pdfrw being more suitable for general PDF operations and Camelot for targeted table extraction tasks.

pdfminer

Pros of pdfminer

More flexible and customizable for extracting text and metadata from PDFs
Supports a wider range of PDF features and structures
Can be used as a library or command-line tool

Cons of pdfminer

Requires more coding knowledge and effort to extract tabular data
Less user-friendly for non-programmers
May require additional processing to clean and structure extracted data

Code Comparison

pdfminer:

from pdfminer.high_level import extract_text

text = extract_text('example.pdf')
print(text)

Camelot:

import camelot

tables = camelot.read_pdf('example.pdf')
print(tables[0].df)

pdfminer focuses on extracting raw text content, while Camelot is specifically designed for extracting tabular data from PDFs. pdfminer provides more low-level control over the extraction process, but requires more code to handle complex structures. Camelot offers a simpler API for table extraction, making it easier to use for specific tabular data needs.

README

Camelot: PDF Table Extraction for Humans

Camelot is a Python library that can help you extract tables from PDFs.

Extract tables from PDFs in just a few lines of code:

Try it yourself in our interactive quickstart notebook.

Or check out a simple example using this pdf.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!

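| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings | | | |
| --- | --- | --- | --- | --- | --- | --- |
| | | | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |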

Camelot also comes packaged with a command-line interface!

Refer to the QuickStart Guide to quickly get started with Camelot, extract tables from PDFs and explore some basic options.

Tip: Visit the parser-comparison-notebook to get an overview of all the packed parsers and their features.

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

You can check out some frequently asked questions here.

Why Camelot?

Configurability: Camelot gives you control over the table extraction process with tweakable settings.
Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
Output: Each table is extracted into a pandas DataFrame, which integrates directly into ETL and data analysis workflows. You can also export tables to multiple formats, including CSV, JSON, Excel, HTML, Markdown, and SQLite.
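
The metrics point can be sketched as a simple filter over each table's `parsing_report`. The thresholds below are illustrative assumptions, not recommended values, and `is_good` is a hypothetical helper; the first report is the one shown in the example above, while the second is made up to represent a poor extraction.

```python
def is_good(report, min_accuracy=90.0, max_whitespace=50.0):
    """Return True when a parsing report meets both thresholds."""
    return report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace


reports = [
    {"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1},  # from the example above
    {"accuracy": 61.5, "whitespace": 70.0, "order": 2, "page": 1},    # made-up poor result
]
kept = [r for r in reports if is_good(r)]
print(len(kept))  # 1
```

With real results, the same check would be applied as `[t for t in tables if is_good(t.parsing_report)]`.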

See comparison with similar libraries and tools.

Installation

Using conda

The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.

conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:

pip install "camelot-py[base]"

From the source code

After installing the dependencies, clone the repo using:

git clone https://github.com/camelot-dev/camelot.git

and install using pip:

cd camelot
pip install "."

Documentation

The documentation is available at http://camelot-py.readthedocs.io/.

Wrappers

camelot-php provides a PHP wrapper on Camelot.

Related projects

camelot-sharp provides a C sharp implementation of Camelot.

Contributing

The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out the releases page.

License

This project is licensed under the MIT License, see the LICENSE file for details.
