camelot-dev/camelot

A Python library to extract tabular data from PDFs

Quick Overview

Camelot is a Python library designed to extract tabular data from PDF files. It provides two methods for table extraction: stream and lattice, which can handle different types of table structures. Camelot aims to simplify the process of extracting tables from PDFs, a task that is often challenging due to the complex nature of PDF documents.
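
As a rough illustration of how the two flavors might be combined, the sketch below tries lattice first and falls back to stream when no tables are found. The helper name read_tables and the fallback policy are assumptions made for this example and are not part of Camelot; only camelot.read_pdf and its flavor argument come from the library. A stand-in reader is used so the policy can be followed without a real PDF.

```python
def read_tables(path, reader=None, flavors=("lattice", "stream")):
    """Try each flavor in order and return (flavor, tables) for the first hit.

    `reader` defaults to camelot.read_pdf. It is injectable so the policy can
    be exercised without a PDF. Hypothetical helper, not Camelot API.
    """
    if reader is None:
        import camelot  # imported lazily so the helper loads without Camelot

        reader = camelot.read_pdf

    for flavor in flavors:
        tables = reader(path, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, []


# Stand-in reader for demonstration: pretends only stream finds a table.
def fake_reader(path, flavor):
    return ["table-0"] if flavor == "stream" else []


flavor_used, found = read_tables("example.pdf", reader=fake_reader)
print(flavor_used, len(found))
```

With a real file, calling read_tables('example.pdf') without the reader argument would go through camelot.read_pdf. Whether lattice-then-stream is the right order depends on the documents at hand.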

Pros

  • Highly accurate table extraction, especially for well-structured tables
  • Supports both stream and lattice-based extraction methods
  • Provides options for fine-tuning extraction parameters
  • Outputs data in various formats, including CSV, Excel, and JSON
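
One way the export formats could be tied together is a small dispatcher that picks the export method from the output file's suffix. The method names (to_csv, to_json, to_excel, to_html, to_markdown, to_sqlite) are the ones listed in the Camelot README; the suffix mapping, the export_table helper, and the RecordingTable stand-in are assumptions for this sketch.

```python
from pathlib import Path

# Suffix -> table method name. The suffix choices are an assumption.
EXPORTERS = {
    ".csv": "to_csv",
    ".json": "to_json",
    ".xlsx": "to_excel",
    ".html": "to_html",
    ".md": "to_markdown",
    ".sqlite": "to_sqlite",
}


def export_table(table, path):
    """Write `table` to `path`, choosing the export method from the suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXPORTERS:
        raise ValueError(f"unsupported output format: {suffix!r}")
    method_name = EXPORTERS[suffix]
    getattr(table, method_name)(str(path))
    return method_name


class RecordingTable:
    """Stand-in for a Camelot table that records which exporter was called."""

    def __init__(self):
        self.calls = []

    def to_csv(self, path):
        self.calls.append(("to_csv", path))

    def to_json(self, path):
        self.calls.append(("to_json", path))


demo = RecordingTable()
export_table(demo, "report.csv")
export_table(demo, "report.JSON")
print(demo.calls)
```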

Cons

  • May struggle with complex or poorly formatted tables
  • Requires external dependencies (Ghostscript) for certain functionalities
  • Limited to table extraction, not suitable for other types of PDF data
  • Performance can be slow for large or complex PDFs

Code Examples

  1. Basic table extraction:

    import camelot

    tables = camelot.read_pdf('example.pdf')
    print(f"Total tables extracted: {len(tables)}")
    print(tables[0].df)  # Print the first table as a pandas DataFrame

  2. Extracting tables from specific pages:

    import camelot

    # Pages 1, 3, 4, and 5; each table is written to its own CSV file
    tables = camelot.read_pdf('example.pdf', pages='1,3-5')
    for i, table in enumerate(tables):
        table.to_csv(f'table_{i}.csv')

  3. Using lattice mode for tables with borders:

    import camelot

    tables = camelot.read_pdf('example.pdf', flavor='lattice')
    tables[0].to_excel('output.xlsx')

  4. Adjusting extraction parameters:

    import camelot

    # table_areas: 'x1,y1,x2,y2' with (x1, y1) top-left and (x2, y2)
    # bottom-right in PDF coordinates; columns: x positions of column separators
    tables = camelot.read_pdf('example.pdf', flavor='stream',
                              table_areas=['0,700,800,100'],
                              columns=['150,250,350,450,550'])
    print(tables[0].parsing_report)  # accuracy, whitespace, order, page
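
The pages argument used above takes a comma-separated string with optional ranges. When page numbers come from elsewhere in a program as a list, a small helper can build that string. The function name pages_spec is hypothetical and not part of Camelot; it is only a sketch of the string format.

```python
def pages_spec(page_numbers):
    """Collapse page numbers into a string such as '1,3-5'."""
    pages = sorted(set(page_numbers))
    if not pages:
        raise ValueError("at least one page number is required")

    parts = []
    start = prev = pages[0]
    for page in pages[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        # Run ended: emit it as a single page or as 'start-end'.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


print(pages_spec([5, 1, 3, 4]))  # 1,3-5
```

The result can be passed directly, for example camelot.read_pdf('example.pdf', pages=pages_spec([1, 3, 4, 5])).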

Getting Started

To get started with Camelot, follow these steps:

  1. Install Camelot and its dependencies:

    pip install "camelot-py[base]"
    
  2. Install Ghostscript (required for certain functionalities):

    • On macOS: brew install ghostscript
    • On Windows: Download from the Ghostscript website
    • On Linux: Use your distribution's package manager
  3. Basic usage:

    import camelot
    
    tables = camelot.read_pdf('your_pdf_file.pdf')
    tables[0].to_csv('output.csv')
    

For more advanced usage and configuration options, refer to the official Camelot documentation.
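
Before the first run, it can help to confirm that the Ghostscript install from step 2 is visible on the PATH. The sketch below is a quick heuristic, not a description of how Camelot itself locates Ghostscript. The executable names are the ones commonly used by Ghostscript installs, and find_ghostscript is a hypothetical helper.

```python
import shutil

# Executable names commonly used by Ghostscript installs:
# `gs` on macOS/Linux, `gswin64c` / `gswin32c` on Windows.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(which=shutil.which):
    """Return the path of the first Ghostscript executable found, else None."""
    for name in GHOSTSCRIPT_NAMES:
        path = which(name)
        if path:
            return path
    return None


location = find_ghostscript()
print(location or "Ghostscript not found on PATH")
```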

Competitor Comparisons

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Pros of pdfplumber

  • More flexible in handling complex PDF layouts and structures
  • Better at extracting text with precise positioning information
  • Supports extraction of images and other non-text elements from PDFs

Cons of pdfplumber

  • Generally slower performance compared to Camelot
  • May require more manual configuration for optimal results
  • Less specialized for tabular data extraction

Code Comparison

pdfplumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    table = page.extract_table()

Camelot:

import camelot

tables = camelot.read_pdf("example.pdf")
df = tables[0].df

Both libraries offer Python-based solutions for PDF data extraction, but they have different strengths. pdfplumber excels in handling complex layouts and providing detailed positioning information, while Camelot is more focused on efficient table extraction. The choice between them depends on the specific requirements of your project and the nature of the PDFs you're working with.
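
One practical difference in the snippets above is the shape of the result: pdfplumber's extract_table returns a plain list of rows (with None for empty cells, and None when no table is found), while Camelot hands back a pandas DataFrame. As a minimal sketch, the helper below turns such a list of rows into CSV text using only the standard library. The function name rows_to_csv and the sample rows are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows (lists of str or None) to CSV text; None becomes ''."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows or []:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Made-up rows in the shape extract_table() produces.
sample_rows = [["Name", "Qty"], ["Widget", "4"], ["Gadget", None]]
print(rows_to_csv(sample_rows))
```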

Tabula is a tool for liberating data tables trapped inside PDF files

Pros of Tabula

  • User-friendly GUI for non-technical users
  • Supports multiple output formats (CSV, TSV, JSON)
  • Can be used as a standalone application without coding knowledge

Cons of Tabula

  • Limited to extracting tables from PDFs
  • Less flexibility in handling complex table structures
  • Fewer options for fine-tuning extraction parameters

Code Comparison

Tabula (Java):

PDDocument document = PDDocument.load(new File("input.pdf"));
ObjectExtractor oe = new ObjectExtractor(document);
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
Page page = oe.extract(1);
List<Table> tables = sea.extract(page);

Camelot (Python):

import camelot
tables = camelot.read_pdf("input.pdf", pages="1")
tables[0].to_csv("output.csv")

Camelot offers a more concise Python API for table extraction, while Tabula's Java implementation requires more setup code. Camelot provides greater control over extraction parameters and supports both stream and lattice table extraction methods, making it more versatile for complex table structures. However, Tabula's GUI makes it more accessible for users without programming experience.

Camelot: PDF Table Extraction for Humans

Pros of Camelot (atlanhq)

  • Original home of the project, so older tutorials, issues, and links often point to it
  • Exposes the same read_pdf API with the same stream and lattice flavors

Cons of Camelot (atlanhq)

  • Development has moved to camelot-dev, so fixes and new releases land there instead
  • Like camelot-dev, it works only with text-based PDFs, not scanned documents

Code Comparison

Camelot (camelot-dev):

import camelot
tables = camelot.read_pdf('example.pdf')
df = tables[0].df

Camelot (atlanhq):

from camelot.io import read_pdf
tables = read_pdf('example.pdf', flavor='lattice')
df = tables[0].df

The basic usage is the same. camelot.read_pdf is the function defined in camelot.io, so both imports resolve to the same call and accept the same options, for example:

tables = read_pdf('example.pdf', pages='1-3', flavor='stream')

Both repositories provide the same core functionality for extracting tables from text-based PDFs, and neither handles scanned, image-only documents. atlanhq/camelot is where the project started; development later moved to camelot-dev/camelot, so new projects should generally install and track camelot-dev.

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

  • More flexible and low-level PDF parsing capabilities
  • Supports a wider range of PDF features and structures
  • Can be used for various PDF-related tasks beyond just table extraction

Cons of pdfminer.six

  • Requires more manual work and coding to extract structured data
  • Less user-friendly for beginners or those seeking quick table extraction
  • May require additional processing to obtain clean, tabular data

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages("document.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

Camelot:

import camelot

tables = camelot.read_pdf("document.pdf")
print(tables[0].df)  # Print first table as pandas DataFrame

The code comparison shows that pdfminer.six requires more manual parsing and processing of PDF elements, while Camelot provides a higher-level API for direct table extraction and conversion to pandas DataFrames. This illustrates the trade-off between flexibility and ease of use between the two libraries.
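
To give a sense of the manual work involved, the sketch below groups positioned text fragments into table-like rows by their vertical position. This is roughly the kind of post-processing needed after collecting text elements and their coordinates (for example from each element's bbox in pdfminer.six). The tuple layout, the tolerance value, and the function name group_into_rows are assumptions for this example, and the sample coordinates are made up.

```python
def group_into_rows(items, y_tolerance=2.0):
    """Group (x0, y0, text) tuples into rows of text, top to bottom.

    An item joins the current row when its y0 is within `y_tolerance`
    points of the row's first item; cells are then ordered by x0.
    """
    rows = []
    # PDF y coordinates grow upward, so sort descending to read top-down.
    for x0, y0, text in sorted(items, key=lambda item: (-item[1], item[0])):
        if rows and abs(rows[-1]["y"] - y0) <= y_tolerance:
            rows[-1]["cells"].append((x0, text))
        else:
            rows.append({"y": y0, "cells": [(x0, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Made-up fragments: two rows with slightly uneven baselines.
items = [
    (50.0, 700.0, "Name"),
    (200.0, 700.5, "Qty"),
    (50.0, 680.0, "Widget"),
    (200.0, 679.2, "4"),
]
print(group_into_rows(items))  # [['Name', 'Qty'], ['Widget', '4']]
```

Real documents also need column alignment, merged cells, and multi-line cells to be handled, which is the part Camelot takes care of.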

pdfrw is a pure Python library that reads and writes PDFs

Pros of pdfrw

  • Lightweight and focused on PDF manipulation
  • Pure Python implementation, no external dependencies
  • Suitable for both reading and writing PDF files

Cons of pdfrw

  • Limited functionality for extracting tabular data
  • Less active development and community support
  • Lacks advanced features for complex PDF parsing

Code Comparison

pdfrw:

from pdfrw import PdfReader, PdfWriter

reader = PdfReader('input.pdf')
writer = PdfWriter()
writer.addpage(reader.pages[0])
writer.write('output.pdf')

Camelot:

import camelot

tables = camelot.read_pdf('input.pdf')
tables[0].to_csv('output.csv')

Summary

pdfrw is a lightweight, pure Python library for basic PDF manipulation, while Camelot is specifically designed for extracting tabular data from PDFs. pdfrw offers more general PDF handling capabilities but lacks advanced features for complex data extraction. Camelot, on the other hand, excels at table extraction but has a narrower focus. The choice between the two depends on the specific requirements of your project, with pdfrw being more suitable for general PDF operations and Camelot for targeted table extraction tasks.

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Pros of pdfminer

  • More flexible and customizable for extracting text and metadata from PDFs
  • Supports a wider range of PDF features and structures
  • Can be used as a library or command-line tool

Cons of pdfminer

  • Requires more coding knowledge and effort to extract tabular data
  • Less user-friendly for non-programmers
  • May require additional processing to clean and structure extracted data

Code Comparison

pdfminer (the high-level API shown here ships with pdfminer.six):

from pdfminer.high_level import extract_text

text = extract_text('example.pdf')
print(text)

Camelot:

import camelot

tables = camelot.read_pdf('example.pdf')
print(tables[0].df)

pdfminer focuses on extracting raw text content, while Camelot is specifically designed for extracting tabular data from PDFs. pdfminer provides more low-level control over the extraction process, but requires more code to handle complex structures. Camelot offers a simpler API for table extraction, making it easier to use for specific tabular data needs.

README

Camelot: PDF Table Extraction for Humans

Camelot is a Python library that can help you extract tables from PDFs!

Note: You can also check out Excalibur, the web interface to Camelot!


Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
Cycle Name  KI (1/km)  Distance (mi)  Percent Fuel Savings
                                      Improved Speed  Decreased Accel  Eliminate Stops  Decreased Idle
2012_2      3.30       1.3            5.9%            9.5%             29.2%            17.4%
2145_1      0.68       11.2           2.4%            0.1%             9.5%             2.7%
4234_1      0.59       58.7           8.5%            1.3%             8.5%             3.3%
2032_2      0.17       57.8           21.7%           0.3%             2.7%             1.2%
4171_1      0.07       173.9          58.1%           1.6%             2.1%             0.5%

Camelot also comes packaged with a command-line interface!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
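
The click-and-drag test can be approximated in code by checking whether any text can be extracted from the first page. The sketch below uses extract_text from pdfminer.six for that. The is_text_based helper, the min_chars threshold, and the injectable extract argument are assumptions for this example, and the check is only a heuristic (a scanned PDF with an OCR text layer would also pass).

```python
def is_text_based(pdf_path, min_chars=20, extract=None):
    """Heuristic: treat a PDF as text-based if its first page yields text."""
    if extract is None:
        # pdfminer.six; imported lazily so the helper loads without it.
        from pdfminer.high_level import extract_text

        def extract(path):
            return extract_text(path, maxpages=1)

    text = extract(pdf_path) or ""
    return len(text.strip()) >= min_chars


# Stand-in extractors, so the logic can be followed without real files.
print(is_text_based("scan.pdf", extract=lambda path: ""))
print(is_text_based("report.pdf", extract=lambda path: "Cycle Name KI (1/km) Distance (mi)"))
```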

You can check out some frequently asked questions here.

Why Camelot?

  • Configurability: Camelot gives you control over the table extraction process with tweakable settings.
  • Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
  • Output: Each table is extracted into a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows. You can also export tables to multiple formats, which include CSV, JSON, Excel, HTML, Markdown, and Sqlite.
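
The metrics point can be sketched as a filter over each table's parsing_report, whose accuracy and whitespace keys appear in the example output earlier in this README. The thresholds and the keep_good_tables helper are assumptions chosen for illustration, and simple stand-in objects are used in place of real Camelot tables.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return the tables whose parsing report passes both thresholds."""
    kept = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Stand-ins for Camelot tables; only parsing_report is needed here.
candidates = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 80.0, "order": 2, "page": 1}),
]
good = keep_good_tables(candidates)
print(len(good))
```

With real data, the same function can be applied to the result of camelot.read_pdf, since the returned list of tables is iterable. Suitable thresholds depend on the documents being processed.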

See comparison with similar libraries and tools.

Support the development

If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.

Installation

Using conda

The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:

$ pip install "camelot-py[base]"

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[base]"

Documentation

The documentation is available at http://camelot-py.readthedocs.io/.

Wrappers

Contributing

The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License, see the LICENSE file for details.