atlanhq/camelot

Camelot: PDF Table Extraction for Humans

Top Related Projects

  • Camelot (camelot-dev/camelot; 3,058): A Python library to extract tabular data from PDFs
  • Tabula (6,831): Tabula is a tool for liberating data tables trapped inside PDF files
  • pdftabextract: A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
  • pdfplumber: Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
  • tabula-py: Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

Quick Overview

Camelot is a Python library designed to extract tables from PDF files. It provides two methods for table extraction: stream and lattice, which can handle different types of table structures in PDFs. Camelot aims to make it easy for users to extract tabular data from PDFs with high accuracy.
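
As a minimal sketch of how the two flavors might be combined: lattice is tried first (it relies on ruled lines between cells), and stream (which relies on whitespace between columns) is used only if lattice finds nothing. The helper name extract_with_fallback and its injectable read_pdf parameter are illustrative assumptions, not part of Camelot; only camelot.read_pdf and its flavor argument come from the library.

```python
def extract_with_fallback(path, read_pdf=None, **kwargs):
    """Try the lattice flavor first, then fall back to stream.

    `read_pdf` is injectable so the logic can be exercised without a PDF;
    by default it resolves to camelot.read_pdf (hypothetical helper).
    """
    if read_pdf is None:
        import camelot  # imported lazily; requires camelot-py to be installed
        read_pdf = camelot.read_pdf

    for flavor in ("lattice", "stream"):
        tables = read_pdf(path, flavor=flavor, **kwargs)
        if len(tables) > 0:
            # Report which flavor produced the result alongside the tables.
            return flavor, tables
    return None, []
```

Any extra keyword arguments are passed to both attempts, so flavor-specific options such as line_scale should be left out when using this helper.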

Pros

  • Highly accurate table extraction, especially for well-structured PDFs
  • Supports both stream and lattice-based extraction methods
  • Provides options for fine-tuning extraction parameters
  • Offers output in various formats (CSV, JSON, HTML, etc.)

Cons

  • May struggle with complex or poorly formatted PDFs
  • Requires external dependencies (Ghostscript) for certain functionalities
  • Can be slower compared to some other PDF extraction tools
  • Works only with text-based PDFs; scanned documents and images are not supported

Code Examples

  1. Basic table extraction:

    import camelot

    tables = camelot.read_pdf('example.pdf')
    print(f"Total tables extracted: {len(tables)}")
    print(tables[0].df)  # Print the first table as a pandas DataFrame

  2. Extracting tables from specific pages:

    import camelot

    # pages takes comma-separated page numbers and ranges
    tables = camelot.read_pdf('example.pdf', pages='1,3-5')
    for table in tables:
        print(f"Table on page {table.page}")
        print(table.df)

  3. Using the lattice method with custom parameters:

    import camelot

    tables = camelot.read_pdf('example.pdf', flavor='lattice', line_scale=40, process_background=True)
    tables[0].to_csv('output.csv')  # Save the first table as CSV
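
The pages argument shown above is a plain string of comma-separated page numbers and ranges. When page numbers come from elsewhere in a program, a small helper can build that string. The function name pages_arg is hypothetical and is not provided by Camelot; this is only a sketch of the string format.

```python
def pages_arg(page_numbers):
    """Collapse page numbers into a string such as '1,3-5' (hypothetical helper)."""
    pages = sorted(set(page_numbers))
    if not pages:
        raise ValueError("at least one page number is required")

    parts = []
    start = prev = pages[0]
    for page in pages[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        # The run ended: emit it as 'n' or 'a-b' and start a new one.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)
```

The result can be passed straight through, for example camelot.read_pdf('example.pdf', pages=pages_arg([1, 3, 4, 5])).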

Getting Started

To get started with Camelot, follow these steps:

  1. Install Camelot and its dependencies:

    pip install "camelot-py[cv]"
    
  2. Install Ghostscript (if not already installed):

    • On macOS: brew install ghostscript
    • On Ubuntu: apt-get install ghostscript
    • On Windows: Download from the Ghostscript website
  3. Use Camelot in your Python script:

    import camelot
    
    tables = camelot.read_pdf('your_pdf_file.pdf')
    print(tables[0].df)  # Print the first extracted table
    

For more advanced usage and options, refer to the Camelot documentation.
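
Before running the steps above, it can help to confirm that Ghostscript is actually installed. The sketch below only checks whether a Ghostscript executable is on PATH. The executable names (gs on macOS and Linux, gswin64c or gswin32c on Windows) are the usual ones, but how Camelot itself locates Ghostscript is not stated here, so treat this as a quick sanity check and not as Camelot's own lookup. The function name is hypothetical.

```python
import shutil


def find_ghostscript_executable():
    """Return the path of a Ghostscript executable on PATH, or None.

    Hypothetical helper: it checks the common executable names only.
    """
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_ghostscript_executable() or "Ghostscript not found on PATH")
```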

Competitor Comparisons

A Python library to extract tabular data from PDFs

Pros of Camelot

  • More actively maintained with recent updates and releases
  • Supports both Python 2 and Python 3
  • Includes additional features like stream processing and flavor detection

Cons of Camelot

  • May have a steeper learning curve due to additional features
  • Potentially slower performance for simple table extraction tasks
  • Requires more dependencies, which could increase setup complexity

Code Comparison

Camelot:

import camelot

tables = camelot.read_pdf('example.pdf')
tables[0].to_csv('output.csv')

atlanhq/camelot:

import camelot

tables = camelot.read_pdf('example.pdf')
tables[0].to_csv('output.csv')

Both repositories expose the same entry point. The atlanhq/camelot README, reproduced at the end of this page, calls camelot.read_pdf and exports with to_csv exactly as in the Camelot snippet above, and its installation instructions clone from camelot-dev/camelot.

For basic extraction the two snippets are therefore identical; the differences between the repositories lie in maintenance activity and release history, not in the basic API.

When choosing between the two, consider your specific needs, such as Python version compatibility, required features, and performance requirements.

Tabula is a tool for liberating data tables trapped inside PDF files

Pros of Tabula

  • User-friendly GUI for non-programmers
  • Supports multiple output formats (CSV, TSV, JSON)
  • Can be used as a command-line tool or Java library

Cons of Tabula

  • Limited to extracting tables from PDFs
  • Less accurate on complex or poorly formatted tables
  • Fewer advanced features compared to Camelot

Code Comparison

Tabula (Java):

PDDocument document = PDDocument.load(new File("input.pdf"));
ObjectExtractor oe = new ObjectExtractor(document);
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
Page page = oe.extract(1);
List<Table> tables = sea.extract(page);

Camelot (Python):

import camelot

tables = camelot.read_pdf("input.pdf")
tables[0].to_csv("output.csv")

Camelot offers a more concise API for table extraction, while Tabula requires more setup code. Camelot also provides additional features like automatic table detection and multiple extraction methods, making it more versatile for complex PDFs. However, Tabula's Java implementation may be preferred in certain environments or for integration with existing Java projects.

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

Pros of pdftabextract

  • Specialized in extracting tables from scanned PDFs
  • Includes advanced image processing techniques for table detection
  • Offers flexibility in handling complex table structures

Cons of pdftabextract

  • Less actively maintained compared to Camelot
  • Limited documentation and examples
  • Steeper learning curve for beginners

Code Comparison

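pdftabextract (schematic outline only; the function names below are illustrative and should be checked against the pdftabextract documentation before use):

from pdftabextract import imgproc, extract, common
# Load the scanned page image and preprocess it
img = imgproc.load_image("example.png")
processed_img = imgproc.process_image(img)
# Extract tables from the processed image
tables = extract.extract_tables(processed_img)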

Camelot:

import camelot
# Read PDF and extract tables
tables = camelot.read_pdf("example.pdf")
# Convert to DataFrame
df = tables[0].df

Both libraries aim to extract tables from PDFs, but they differ in their approach and use cases. pdftabextract is more focused on scanned PDFs and image processing, while Camelot offers a more straightforward API for general PDF table extraction. Camelot is generally easier to use and has better documentation, making it more suitable for beginners. However, pdftabextract may be more effective for complex scanned documents where advanced image processing is required.

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Pros of pdfplumber

  • More versatile, capable of extracting text, images, and other elements from PDFs
  • Better handling of complex PDF layouts and formatting
  • Provides detailed information about text positioning and styling

Cons of pdfplumber

  • Slower performance compared to Camelot, especially for large PDFs
  • May require more manual configuration for optimal results
  • Less specialized for table extraction tasks

Code Comparison

pdfplumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

Camelot:

import camelot

tables = camelot.read_pdf("example.pdf")
table = tables[0].df

Both libraries offer straightforward ways to extract tables from PDFs, but Camelot's approach is more concise and specifically tailored for table extraction. pdfplumber provides a more comprehensive set of tools for working with various PDF elements, which can be beneficial for complex documents but may require additional code for table-specific tasks.
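
One practical difference between the two snippets is the return type: Camelot hands back a pandas DataFrame through .df, while pdfplumber's extract_table() returns plain rows (a list of lists, where cells can be None). A minimal sketch of turning such rows into header-keyed records follows. The helper name rows_to_records is hypothetical, and it assumes the first row is the header, which will not hold for every table.

```python
def rows_to_records(rows):
    """Turn [[header...], [cell...], ...] into a list of dicts keyed by header.

    Hypothetical helper; assumes the first row holds the column names.
    """
    if not rows:
        # Covers both an empty table and a missing one (None).
        return []

    header = [(cell or "").strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        # Missing cells (None) become empty strings.
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records
```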

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

Pros of tabula-py

  • Simpler Python-side installation, as it's a thin Python wrapper around tabula-java (a Java runtime is still required)
  • Lighter weight and potentially faster for basic table extraction tasks
  • Better integration with Java-based environments due to its use of Tabula

Cons of tabula-py

  • Less feature-rich compared to Camelot, with fewer table detection and extraction options
  • May struggle with complex PDF layouts or non-standard table formats
  • Requires Java to be installed on the system, which can be a limitation in some environments

Code Comparison

tabula-py:

import tabula

# Read PDF file; the result is a list of pandas DataFrames
tables = tabula.read_pdf("example.pdf", pages="all")

# Each element is already a DataFrame
df = tables[0]

Camelot:

import camelot

# Read PDF file
tables = camelot.read_pdf("example.pdf", pages="all")

# Access first table
table = tables[0]
df = table.df

Both libraries offer similar basic functionality for extracting tables from PDFs. However, Camelot provides more advanced options for table detection and extraction, making it more suitable for complex PDF layouts. tabula-py, being a wrapper for Tabula, offers a simpler interface but may be limited in handling intricate table structures.
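
Because the two snippets return different container types (tabula-py items are DataFrames, Camelot items wrap a DataFrame in .df), code that accepts output from either library needs a small normalization step. The sketch below is one way to do that. The helper name to_dataframes is hypothetical, and it checks the type name instead of importing pandas so that it stays dependency-free.

```python
def to_dataframes(result):
    """Return plain DataFrames from either library's result (hypothetical helper)."""
    frames = []
    for item in result:
        # A DataFrame with a column literally named "df" would also answer
        # `.df`, so check the type name before unwrapping.
        if type(item).__name__ == "DataFrame":
            frames.append(item)
        else:
            frames.append(item.df)
    return frames
```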

README

Camelot: PDF Table Extraction for Humans

Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!

Note: You can also check out Excalibur, which is a web interface for Camelot!


Here's how you can extract tables from PDF files. Check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
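| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings |                 |                 |                |
|------------|-----------|---------------|----------------------|-----------------|-----------------|----------------|
|            |           |               | Improved Speed       | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2     | 3.30      | 1.3           | 5.9%                 | 9.5%            | 29.2%           | 17.4%          |
| 2145_1     | 0.68      | 11.2          | 2.4%                 | 0.1%            | 9.5%            | 2.7%           |
| 4234_1     | 0.59      | 58.7          | 8.5%                 | 1.3%            | 8.5%            | 3.3%           |
| 2032_2     | 0.17      | 57.8          | 21.7%                | 0.3%            | 2.7%            | 1.2%           |
| 4171_1     | 0.07      | 173.9         | 58.1%                | 1.6%            | 2.1%            | 0.5%           |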
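
The session above names one export method per format (to_csv, to_json, to_excel, to_html, to_sqlite). As a sketch, the method can be chosen from the output file's extension. The helper export_table and the extension-to-method mapping are assumptions made for illustration; only the method names come from the session.

```python
from pathlib import Path

# Assumed mapping from file extension to the per-table export method.
_EXPORTERS = {
    ".csv": "to_csv",
    ".json": "to_json",
    ".xlsx": "to_excel",
    ".html": "to_html",
    ".sqlite": "to_sqlite",
}


def export_table(table, path):
    """Export one table using the method that matches the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        method_name = _EXPORTERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported export format: {suffix!r}") from None
    # Look the method up on the table and call it with the target path.
    getattr(table, method_name)(str(path))
    return method_name
```

With a table from the session, export_table(tables[0], 'foo.json') would call tables[0].to_json('foo.json').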

There's a command-line interface too!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Why Camelot?

  • You are in control: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel, HTML and SQLite.
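
The second point in the list above can be sketched as a filter over each table's parsing_report, using the accuracy and whitespace keys shown in the earlier session. The function names and the threshold values (90 and 30) are illustrative assumptions, not recommendations from the Camelot project.

```python
def is_good_table(report, min_accuracy=90.0, max_whitespace=30.0):
    """Decide from a parsing report whether a table is worth keeping.

    The thresholds are placeholder values; tune them for your documents.
    """
    return (
        report["accuracy"] >= min_accuracy
        and report["whitespace"] <= max_whitespace
    )


def filter_tables(tables, **thresholds):
    """Keep only the tables whose parsing_report passes is_good_table."""
    return [t for t in tables if is_good_table(t.parsing_report, **thresholds)]
```

For the report shown earlier (accuracy 99.02, whitespace 12.24), is_good_table returns True with these defaults.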

See comparison with other PDF table extraction libraries and tools.

Installation

Using conda

The easiest way to install Camelot is to install it with conda, which is a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:

$ pip install "camelot-py[cv]"

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[cv]"

Documentation

Great documentation is available at http://camelot-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check the latest sources with:

$ git clone https://www.github.com/camelot-dev/camelot

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install "camelot-py[dev]"

Testing

After installation, you can run tests using:

$ python setup.py test

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License, see the LICENSE file for details.