tabula-py

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

Top Related Projects

camelot

3,333

A Python library to extract tabular data from PDFs

pdfplumber

7,889

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

pdfminer

5,293

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

tabula

7,078

Tabula is a tool for liberating data tables trapped inside PDF files

PyMuPDF

7,041

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Quick Overview

tabula-py is a Python wrapper for tabula-java, a Java library and command-line tool for extracting tables from PDF files. It reads tables from PDFs directly into pandas DataFrames, which makes tabular data embedded in PDF documents easier to work with.

Pros

Simplifies the process of extracting tables from PDFs in Python
Integrates well with pandas, allowing direct conversion to DataFrames
Supports both local and remote PDF files
Offers various options for customizing table extraction

Cons

Requires Java to be installed on the system
May struggle with complex or poorly formatted PDF tables
Performance can be slow for large PDFs or many tables
Limited support for certain PDF formats or layouts
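
For large PDFs, one option is to read the document a few pages at a time so that results arrive incrementally. The sketch below is only an illustration: the helper names `page_chunks` and `read_pdf_in_chunks` are hypothetical, the page count must be supplied by the caller, and chunking is not guaranteed to reduce total run time, since every call starts its own extraction.

```python
def page_chunks(total_pages: int, chunk_size: int) -> list[str]:
    """Split pages 1..total_pages into range strings such as "1-10"."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    chunks = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A single trailing page is written as "21" rather than "21-21"
        chunks.append(str(start) if start == end else f"{start}-{end}")
    return chunks


def read_pdf_in_chunks(path: str, total_pages: int, chunk_size: int = 10):
    """Yield the list of DataFrames extracted from each chunk of pages."""
    import tabula  # imported lazily so page_chunks works without Java

    for pages in page_chunks(total_pages, chunk_size):
        yield tabula.read_pdf(path, pages=pages)
```

Each yielded item is the usual list of DataFrames for that page range, so a caller can save or inspect partial results before the whole file has been processed.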

Code Examples

Basic table extraction:

import tabula

# Read every table in the PDF; read_pdf returns a list of DataFrames
dfs = tabula.read_pdf("path/to/pdf/file.pdf", pages="all")

# Print the first table, if any were found
if dfs:
    print(dfs[0])

Extracting tables with specific options:

import tabula

# Extract tables with custom options
# area is [top, left, bottom, right], measured in points from the top-left corner
df = tabula.read_pdf("path/to/pdf/file.pdf",
                     pages="1-3",
                     multiple_tables=True,
                     guess=False,
                     area=[20, 20, 580, 770])

# Print all extracted tables
for table in df:
    print(table)
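
The `area` option takes its values in PDF points, ordered top, left, bottom, right. If a region was measured in inches (for example with a ruler tool in a PDF viewer), a small conversion helper can help avoid mixing up units. This is a sketch under that assumption; `area_from_inches` is a hypothetical name, not part of tabula-py.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def area_from_inches(top, left, bottom, right):
    """Convert a box given in inches to [top, left, bottom, right] in points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom must exceed top and right must exceed left")
    return [round(value * POINTS_PER_INCH, 2) for value in (top, left, bottom, right)]


# Example: a region starting 1 inch from the top and 0.5 inch from the left
area = area_from_inches(1, 0.5, 10, 8)
```

The resulting list can be passed as `area=area` to `tabula.read_pdf`.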

Converting PDF tables to CSV:

import tabula

# Convert PDF tables to CSV
tabula.convert_into("path/to/pdf/file.pdf", "output.csv", output_format="csv", pages="all")
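
Besides CSV, `convert_into` can also write TSV and JSON. The following sketch picks the `output_format` from the output file's extension; the helper names `output_format_for` and `convert_pdf` are hypothetical and only wrap the `convert_into` call shown above.

```python
from pathlib import Path

# Extensions mapped to the output formats tabula-py documents
_FORMATS = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def output_format_for(output_path):
    """Return "csv", "tsv" or "json" based on the file extension."""
    suffix = Path(output_path).suffix.lower()
    try:
        return _FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported output extension: {suffix!r}") from None


def convert_pdf(pdf_path, output_path, pages="all"):
    """Convert a PDF, choosing the format from output_path's extension."""
    import tabula  # imported lazily so the helper above needs no Java

    tabula.convert_into(pdf_path, output_path,
                        output_format=output_format_for(output_path),
                        pages=pages)
```

For example, `convert_pdf("path/to/pdf/file.pdf", "output.json")` would request JSON output.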

Getting Started

To get started with tabula-py:

Install Java if not already installed on your system.
Install tabula-py using pip:

pip install tabula-py

Import and use in your Python script:

import tabula

# Read tables from a PDF file
df = tabula.read_pdf("path/to/your/pdf/file.pdf")

# Work with the extracted data
print(df[0])  # Print the first table

Note: Make sure to replace "path/to/your/pdf/file.pdf" with the actual path to your PDF file.
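
Because tabula-py depends on Java, it can be useful to check that a Java launcher is on the PATH before calling it. The sketch below uses only the Python standard library; `find_java` and `require_java` are hypothetical helper names, not part of tabula-py.

```python
import shutil
import subprocess


def find_java(executable="java"):
    """Return the full path of the Java launcher on PATH, or None."""
    return shutil.which(executable)


def require_java(executable="java"):
    """Return Java's version banner, or raise if Java is not on PATH."""
    java_path = find_java(executable)
    if java_path is None:
        raise RuntimeError(
            f"{executable!r} not found on PATH; install a Java runtime first"
        )
    # `java -version` prints its banner to stderr rather than stdout
    result = subprocess.run([java_path, "-version"],
                            capture_output=True, text=True, check=False)
    return (result.stderr or result.stdout).strip()
```

Calling `require_java()` at the start of a script gives a clearer error than a failure raised later from inside the extraction call.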

Competitor Comparisons

camelot

Pros of Camelot

More accurate table extraction, especially for complex layouts
Supports both stream and lattice-based extraction methods
Built-in table analysis and manipulation features

Cons of Camelot

Slower processing speed compared to Tabula-py
More complex setup and dependencies
Limited to Python, while Tabula-py can leverage Java libraries

Code Comparison

Tabula-py:

import tabula

df = tabula.read_pdf("file.pdf", pages="all")
tabula.convert_into("file.pdf", "output.csv", output_format="csv", pages="all")

Camelot:

import camelot

tables = camelot.read_pdf("file.pdf", pages="all")
tables[0].to_csv("output.csv")
tables[0].df  # Access extracted data as a pandas DataFrame

Both libraries aim to extract tables from PDF files, but Camelot offers more advanced features and control over the extraction process. Tabula-py provides a simpler interface and faster processing, making it suitable for straightforward table extraction tasks. Camelot excels in handling complex layouts and offers built-in analysis tools, but requires more setup and has a steeper learning curve.

pdfplumber

Pros of pdfplumber

Pure Python implementation, no Java dependency
More flexible for extracting various types of data (text, tables, images)
Better handling of complex PDF layouts

Cons of pdfplumber

Generally slower performance compared to tabula-py
May require more manual configuration for table extraction

Code Comparison

pdfplumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

tabula-py:

import tabula

tables = tabula.read_pdf("example.pdf", pages=1)  # list of DataFrames, one per table

Both libraries aim to extract data from PDFs, but pdfplumber offers more flexibility at the cost of performance, while tabula-py is faster but may be less accurate for complex layouts. pdfplumber is better suited for projects requiring detailed PDF analysis, while tabula-py excels in quick table extraction from simple PDFs. The choice between them depends on the specific requirements of your project, such as processing speed, accuracy, and the complexity of the PDFs you're working with.
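
One practical difference is the return type: tabula-py returns pandas DataFrames, while pdfplumber's `extract_table()` returns a list of rows (lists of cell strings, where a cell may be `None`), or `None` if no table is found. The sketch below turns such rows into dictionaries, assuming the first row is the header; `rows_to_records` is a hypothetical helper.

```python
def rows_to_records(rows):
    """Treat the first row as the header and return one dict per data row."""
    if not rows:  # covers both None (no table found) and an empty list
        return []
    header = [(cell or "").strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        # Empty cells arrive as None; normalise them to empty strings
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


sample = [["Name", "Qty"], ["apple", "3"], ["pear", None]]
records = rows_to_records(sample)
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(records)` if a DataFrame is needed.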

pdfminer.six

Pros of pdfminer.six

More flexible and customizable for extracting various types of content from PDFs
Better support for complex PDF structures and layouts
Can extract text, images, and metadata from PDFs

Cons of pdfminer.six

Steeper learning curve and more complex API
May require more code to extract tabular data specifically
Slower performance for large PDFs compared to tabula-py

Code Comparison

tabula-py:

import tabula

df = tabula.read_pdf("example.pdf", pages="all")
print(df)

pdfminer.six:

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)

Summary

tabula-py is specifically designed for extracting tabular data from PDFs, making it easier to use for this specific task. It's faster and more straightforward for table extraction but limited in other PDF processing capabilities.

pdfminer.six offers more comprehensive PDF processing features, including text, image, and metadata extraction. It's more versatile but requires more setup and coding for specific tasks like table extraction.

Choose tabula-py for quick and easy table extraction, and pdfminer.six for more complex PDF processing needs or when dealing with PDFs that have varied content and structures.
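
To illustrate the extra coding that plain text extraction implies, the sketch below splits text into columns. It assumes that each table row ends up on a single line with cells separated by two or more spaces, which pdfminer.six does not guarantee; `split_columns` is a hypothetical helper, and real documents often need layout-aware logic instead.

```python
import re

# Two or more whitespace characters are treated as a column gap (an assumption)
_COLUMN_GAP = re.compile(r"\s{2,}")


def split_columns(text):
    """Split each non-empty line of text into a list of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(_COLUMN_GAP.split(line))
    return rows


sample_text = "Name   Qty\napple  3\n\npear   12\n"
rows = split_columns(sample_text)
```

In practice the `text` argument would come from `extract_text("example.pdf")`, and cells that themselves contain double spaces would be split incorrectly.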

pdfminer

Pros of pdfminer

Pure Python implementation, no external dependencies required
More flexible for extracting various types of content from PDFs (text, images, metadata)
Supports a wider range of PDF features and formats

Cons of pdfminer

Generally slower performance compared to tabula-py
Less specialized for table extraction, may require more custom coding for complex tables
Steeper learning curve for beginners

Code Comparison

pdfminer:

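# Note: extract_text is provided by pdfminer.six, which installs under the same
# "pdfminer" package name; the unmaintained original may not include it.
from pdfminer.high_level import extract_text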

text = extract_text('document.pdf')
print(text)

tabula-py:

import tabula

tables = tabula.read_pdf('document.pdf', pages='all')
print(tables)

pdfminer offers more granular control over PDF parsing, while tabula-py provides a simpler interface specifically for table extraction. pdfminer requires more code to extract tables, but offers greater flexibility for other PDF content. tabula-py is more straightforward for table extraction but may be limited for other PDF processing tasks.

tabula

Pros of tabula

Written in Java, offering potentially better performance for large-scale PDF processing
More mature project with a longer development history and larger community
Provides a command-line interface for easy integration into various workflows

Cons of tabula

Requires a Java runtime environment, which may not be available on all systems (tabula-py shares this requirement)
Less convenient for Python developers who prefer native Python libraries
May have a steeper learning curve for those unfamiliar with Java

Code comparison

tabula (Java):

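import java.util.List;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

// pdfDocument is an already-loaded PDFBox PDDocument
ObjectExtractor oe = new ObjectExtractor(pdfDocument);
Page page = oe.extract(1);
// Tables come from an extraction algorithm rather than from the Page itself
List<Table> tables = new SpreadsheetExtractionAlgorithm().extract(page);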

tabula-py (Python):

import tabula

tables = tabula.read_pdf("path/to/pdf", pages="1")

tabula-py provides a more straightforward Python interface, while tabula offers more granular control over the extraction process in Java. The Python version simplifies usage for those already working in Python environments, whereas the Java version may be more suitable for enterprise-level applications or when deeper customization is required.

PyMuPDF

Pros of PyMuPDF

Broader PDF manipulation capabilities beyond table extraction
Faster performance for general PDF operations
More comprehensive documentation and examples

Cons of PyMuPDF

Less specialized for table extraction from PDFs
May require more code to extract tables compared to tabula-py
Steeper learning curve for specific table extraction tasks

Code Comparison

tabula-py:

import tabula

# Extract table from PDF
df = tabula.read_pdf("input.pdf", pages="all")

PyMuPDF:

import fitz

doc = fitz.open("input.pdf")
page = doc[0]
tables = page.find_tables()  # returns a TableFinder
for table in tables.tables:
    print(table.extract())

Both libraries offer PDF table extraction capabilities, but tabula-py provides a more straightforward approach specifically for this task. PyMuPDF requires more code but offers greater flexibility for various PDF operations.

tabula-py is ideal for projects focused primarily on table extraction from PDFs, while PyMuPDF is better suited for more comprehensive PDF manipulation tasks that may include table extraction as one of many features.

README

tabula-py

tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV or a JSON file.

You can see the example notebook and try it on Google Colab. We also highly recommend reading the documentation, especially the FAQ section.

Requirements

Java 8+
Python 3.9+

OS

I have confirmed that it works on macOS and Ubuntu, and some people have confirmed that it works on Windows 10. See the documentation for detailed installation instructions for Windows 10.

Usage

Documentation (the FAQ is helpful if you run into an issue)
Example notebook on Google Colaboratory

Install

Ensure that a Java runtime is installed and that it is on your PATH.

pip install tabula-py

If you want faster execution with jpype, install with the jpype extra.

pip install tabula-py[jpype]

Example

tabula-py enables you to extract tables from a PDF into a DataFrame or JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.

import tabula

# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')

# Read remote pdf into list of DataFrame
dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')

See an example notebook for more details. I also recommend reading the tutorial article written by @aegis4048, and another tutorial written by @tdpetrou.
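
`convert_into_by_batch` handles a whole directory in one call. When per-file control is needed (for example, to keep going after one file fails), a hand-written loop over `convert_into` is one option. The sketch below is an illustration only, not a description of how `convert_into_by_batch` works internally; `plan_batch` and `convert_each` are hypothetical helpers.

```python
from pathlib import Path


def plan_batch(input_dir, output_format="csv"):
    """Pair each PDF in input_dir with an output path using the new extension."""
    pdfs = sorted(p for p in Path(input_dir).iterdir()
                  if p.suffix.lower() == ".pdf")
    return [(pdf, pdf.with_suffix("." + output_format)) for pdf in pdfs]


def convert_each(input_dir, output_format="csv", pages="all"):
    """Convert PDFs one by one, collecting failures instead of stopping."""
    import tabula  # imported lazily so plan_batch works without Java

    failures = {}
    for pdf, out in plan_batch(input_dir, output_format):
        try:
            tabula.convert_into(str(pdf), str(out),
                                output_format=output_format, pages=pages)
        except Exception as exc:  # broad on purpose: keep the batch going
            failures[pdf.name] = exc
    return failures
```

`convert_each("input_directory")` returns a dictionary that maps each failed file name to the exception it raised, so an empty dictionary means that every file converted.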

Contributing

Interested in helping out? I'd love to have your help!

You can help by:

Reporting a bug.
Adding or editing documentation.
Contributing code via a pull request. See also the contribution guidelines.
Writing a blog post or spreading the word about tabula-py to people who might benefit from using it.

Other ways to support

You can also support our continued work on tabula-py with a donation on GitHub Sponsors or Patreon.
