chezou/tabula-py

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

Top Related Projects

  • Camelot: A Python library to extract tabular data from PDFs
  • pdfplumber: Plumb a PDF for detailed information about each char, rectangle, line, et cetera, and easily extract text and tables.
  • pdfminer.six: Community maintained fork of pdfminer - we fathom PDF
  • pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six.
  • tabula: Tabula is a tool for liberating data tables trapped inside PDF files
  • PyMuPDF: PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Quick Overview

tabula-py is a Python wrapper for tabula-java, the Java library and command-line tool behind Tabula that extracts tables from PDF files. It reads tables from PDFs directly into pandas DataFrames, which makes tabular data embedded in PDF documents easier to work with.

Pros

  • Simplifies the process of extracting tables from PDFs in Python
  • Integrates well with pandas, allowing direct conversion to DataFrames
  • Supports both local and remote PDF files
  • Offers various options for customizing table extraction

Cons

  • Requires Java to be installed on the system
  • May struggle with complex or poorly formatted PDF tables
  • Performance can be slow for large PDFs or many tables
  • Limited support for certain PDF formats or layouts

Code Examples

  1. Basic table extraction:
import tabula

# Read every table in the PDF; read_pdf returns a list of DataFrames
dfs = tabula.read_pdf("path/to/pdf/file.pdf", pages="all")

# Print the first DataFrame (if at least one table was found)
print(dfs[0])
  2. Extracting tables with specific options:
import tabula

# Extract tables from pages 1-3, restricted to a fixed region of each page
dfs = tabula.read_pdf("path/to/pdf/file.pdf",
                      pages="1-3",
                      multiple_tables=True,
                      guess=False,
                      area=[20, 20, 580, 770])  # top, left, bottom, right

# Print all extracted tables
for table in dfs:
    print(table)
  3. Converting PDF tables to CSV:
import tabula

# Convert PDF tables to CSV
tabula.convert_into("path/to/pdf/file.pdf", "output.csv", output_format="csv", pages="all")
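
The `area` option in the second example is given as `[top, left, bottom, right]`, which tabula-py interprets in PDF points by default (72 points per inch). As a minimal sketch, the hypothetical helper `area_from_inches` below converts a region measured in inches into that form. The helper name and the sample measurements are illustrative and not part of tabula-py.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def area_from_inches(top, left, bottom, right):
    """Return [top, left, bottom, right] in PDF points (hypothetical helper)."""
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return [edge * POINTS_PER_INCH for edge in (top, left, bottom, right)]


# A region starting 1 inch from the top and 0.5 inch from the left edge.
area = area_from_inches(top=1, left=0.5, bottom=10, right=8)
print(area)  # [72, 36.0, 720, 576]
```

The resulting list could then be passed as `area=area` together with `guess=False`, as in the second example.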

Getting Started

To get started with tabula-py:

  1. Install Java if not already installed on your system.
  2. Install tabula-py using pip:
pip install tabula-py
  3. Import and use in your Python script:
import tabula

# Read tables from a PDF file; the result is a list of DataFrames
dfs = tabula.read_pdf("path/to/your/pdf/file.pdf")

# Work with the extracted data
print(dfs[0])  # Print the first table

Note: Make sure to replace "path/to/your/pdf/file.pdf" with the actual path to your PDF file.
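
Because step 1 depends on a working Java installation, it can help to confirm that `java` is reachable before calling tabula-py. The following is a minimal sketch using only the Python standard library; the function names `find_java` and `java_version_banner` are made up for illustration.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the first line printed by `java -version`."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    # `java -version` normally writes its banner to stderr, not stdout.
    output = result.stderr or result.stdout
    lines = output.splitlines()
    return lines[0] if lines else ""


java = find_java()
if java is None:
    print("Java not found on PATH; install a Java runtime before using tabula-py.")
else:
    print(java_version_banner(java))
```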

Competitor Comparisons

A Python library to extract tabular data from PDFs

Pros of Camelot

  • More accurate table extraction, especially for complex layouts
  • Supports both stream and lattice-based extraction methods
  • Built-in table analysis and manipulation features

Cons of Camelot

  • Slower processing speed compared to Tabula-py
  • More complex setup and dependencies
  • Limited to Python, while Tabula-py can leverage Java libraries

Code Comparison

Tabula-py:

import tabula

df = tabula.read_pdf("file.pdf", pages="all")
tabula.convert_into("file.pdf", "output.csv", output_format="csv", pages="all")

Camelot:

import camelot

tables = camelot.read_pdf("file.pdf", pages="all")
tables[0].to_csv("output.csv")
tables[0].df  # Access extracted data as a pandas DataFrame

Both libraries aim to extract tables from PDF files, but Camelot offers more advanced features and control over the extraction process. Tabula-py provides a simpler interface and faster processing, making it suitable for straightforward table extraction tasks. Camelot excels in handling complex layouts and offers built-in analysis tools, but requires more setup and has a steeper learning curve.

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Pros of pdfplumber

  • Pure Python implementation, no Java dependency
  • More flexible for extracting various types of data (text, tables, images)
  • Better handling of complex PDF layouts

Cons of pdfplumber

  • Generally slower performance compared to tabula-py
  • May require more manual configuration for table extraction

Code Comparison

pdfplumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

tabula-py:

import tabula

tables = tabula.read_pdf("example.pdf", pages=1)  # list of DataFrames

Both libraries aim to extract data from PDFs, but pdfplumber offers more flexibility at the cost of performance, while tabula-py is faster but may be less accurate for complex layouts. pdfplumber is better suited for projects requiring detailed PDF analysis, while tabula-py excels in quick table extraction from simple PDFs. The choice between them depends on the specific requirements of your project, such as processing speed, accuracy, and the complexity of the PDFs you're working with.
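
One practical difference in the comparison above is the return type: tabula-py yields pandas DataFrames, while pdfplumber's `extract_table()` yields a plain list of rows (or `None` when no table is found), with `None` for empty cells. As a rough sketch, the hypothetical helper below turns such a list into records keyed by the header row. It assumes the first row is the header, which will not hold for every PDF.

```python
def table_to_records(table):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header."""
    if not table:
        return []

    def clean(cell):
        # Empty cells are represented as None; normalise them to "".
        return "" if cell is None else cell.strip()

    header, *rows = table
    keys = [clean(cell) for cell in header]
    return [dict(zip(keys, (clean(cell) for cell in row))) for row in rows]


# Sample data shaped like the output of page.extract_table() (made up for illustration).
sample = [["Name", "Qty"], ["Widget", "3"], ["Gadget", None]]
print(table_to_records(sample))
# [{'Name': 'Widget', 'Qty': '3'}, {'Name': 'Gadget', 'Qty': ''}]
```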

Community maintained fork of pdfminer - we fathom PDF

Pros of pdfminer.six

  • More flexible and customizable for extracting various types of content from PDFs
  • Better support for complex PDF structures and layouts
  • Can extract text, images, and metadata from PDFs

Cons of pdfminer.six

  • Steeper learning curve and more complex API
  • May require more code to extract tabular data specifically
  • Slower performance for large PDFs compared to tabula-py

Code Comparison

tabula-py:

import tabula

dfs = tabula.read_pdf("example.pdf", pages="all")  # list of DataFrames
print(dfs)

pdfminer.six:

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)

Summary

tabula-py is specifically designed for extracting tabular data from PDFs, making it easier to use for this specific task. It's faster and more straightforward for table extraction but limited in other PDF processing capabilities.

pdfminer.six offers more comprehensive PDF processing features, including text, image, and metadata extraction. It's more versatile but requires more setup and coding for specific tasks like table extraction.

Choose tabula-py for quick and easy table extraction, and pdfminer.six for more complex PDF processing needs or when dealing with PDFs that have varied content and structures.
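
To illustrate the extra code that table extraction needs on top of plain text, here is a minimal sketch that splits whitespace-aligned text into rows and columns. It assumes columns are separated by two or more spaces; text returned by `extract_text` is not guaranteed to preserve such alignment, so treat this as an illustration rather than a general solution. The sample text is made up.

```python
import re

# Assumption: columns are separated by runs of two or more whitespace characters.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_text_table(text):
    """Split aligned plain text into a list of rows, each a list of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            rows.append(COLUMN_GAP.split(line))
    return rows


sample_text = """
Item      Qty   Price
Blue pen  10    1.50
Notebook  2     4.00
"""
print(split_text_table(sample_text))
# [['Item', 'Qty', 'Price'], ['Blue pen', '10', '1.50'], ['Notebook', '2', '4.00']]
```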

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Pros of pdfminer

  • Pure Python implementation, no external dependencies required
  • More flexible for extracting various types of content from PDFs (text, images, metadata)
  • Supports a wider range of PDF features and formats

Cons of pdfminer

  • Generally slower performance compared to tabula-py
  • Less specialized for table extraction, may require more custom coding for complex tables
  • Steeper learning curve for beginners

Code Comparison

pdfminer:

from pdfminer.high_level import extract_text

text = extract_text('document.pdf')
print(text)

tabula-py:

import tabula

tables = tabula.read_pdf('document.pdf', pages='all')
print(tables)

pdfminer offers more granular control over PDF parsing, while tabula-py provides a simpler interface specifically for table extraction. pdfminer requires more code to extract tables, but offers greater flexibility for other PDF content. tabula-py is more straightforward for table extraction but may be limited for other PDF processing tasks.

Tabula is a tool for liberating data tables trapped inside PDF files

Pros of tabula

  • Written in Java, offering potentially better performance for large-scale PDF processing
  • More mature project with a longer development history and larger community
  • Provides a command-line interface for easy integration into various workflows

Cons of tabula

  • Requires Java runtime environment, which may not be available on all systems
  • Less convenient for Python developers who prefer native Python libraries
  • May have a steeper learning curve for those unfamiliar with Java

Code comparison

tabula (Java):

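import java.util.List;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

// pdfDocument is an already-loaded PDFBox PDDocument
ObjectExtractor oe = new ObjectExtractor(pdfDocument);
Page page = oe.extract(1);  // page numbers start at 1
List<Table> tables = new SpreadsheetExtractionAlgorithm().extract(page);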

tabula-py (Python):

import tabula

tables = tabula.read_pdf("path/to/pdf", pages="1")

tabula-py provides a more straightforward Python interface, while tabula offers more granular control over the extraction process in Java. The Python version simplifies usage for those already working in Python environments, whereas the Java version may be more suitable for enterprise-level applications or when deeper customization is required.
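
For the command-line route mentioned above, a minimal sketch of assembling a tabula-java invocation from Python might look like this. The jar file name and the PDF and CSV paths are placeholders, and the sketch only builds and prints the command rather than running it. It uses the `--pages`, `--format` and `--outfile` options of the tabula-java CLI.

```python
import shlex


def tabula_java_command(jar_path, pdf_path, out_path, pages="all", fmt="CSV"):
    """Build the argument list for a tabula-java CLI call (hypothetical helper)."""
    return [
        "java", "-jar", jar_path,
        "--pages", pages,
        "--format", fmt,
        "--outfile", out_path,
        pdf_path,
    ]


# "tabula.jar" is a placeholder; real release jars carry a version in their name.
cmd = tabula_java_command("tabula.jar", "report.pdf", "report.csv")
print(shlex.join(cmd))
# java -jar tabula.jar --pages all --format CSV --outfile report.csv report.pdf
```

The list form can be handed to `subprocess.run(cmd, check=True)` once the jar path points at a real tabula-java build.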

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Pros of PyMuPDF

  • Broader PDF manipulation capabilities beyond table extraction
  • Faster performance for general PDF operations
  • More comprehensive documentation and examples

Cons of PyMuPDF

  • Less specialized for table extraction from PDFs
  • May require more code to extract tables compared to tabula-py
  • Steeper learning curve for specific table extraction tasks

Code Comparison

tabula-py:

import tabula

# Extract table from PDF
df = tabula.read_pdf("input.pdf", pages="all")

PyMuPDF:

import fitz

doc = fitz.open("input.pdf")
page = doc[0]
tables = page.find_tables()  # returns a TableFinder
for table in tables.tables:
    print(table.extract())  # each table as a list of rows

Both libraries offer PDF table extraction capabilities, but tabula-py provides a more straightforward approach specifically for this task. PyMuPDF requires more code but offers greater flexibility for various PDF operations.

tabula-py is ideal for projects focused primarily on table extraction from PDFs, while PyMuPDF is better suited for more comprehensive PDF manipulation tasks that may include table extraction as one of many features.

README

tabula-py

tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV or a JSON file.

You can see the example notebook and try it on Google Colab, or we highly recommend reading our documentation, especially the FAQ section.

Requirements

  • Java 8+
  • Python 3.8+

OS

I have confirmed that it works on macOS and Ubuntu, and some users have confirmed that it works on Windows 10. See also the documentation for detailed installation instructions for Windows 10.

Usage

Install

Ensure you have a Java runtime and set the PATH for it.

pip install tabula-py

If you want to leverage faster execution with jpype, install with jpype extra.

pip install tabula-py[jpype]
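
To check whether the optional extra is present in the current environment, a small sketch like the following can help. It assumes the extra installs a module importable as `jpype`, and the function name `jpype_installed` is made up for illustration.

```python
import importlib.util


def jpype_installed():
    """Return True if a module named `jpype` can be imported."""
    return importlib.util.find_spec("jpype") is not None


print("jpype available:", jpype_installed())
```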

Example

tabula-py enables you to extract tables from a PDF into a DataFrame or JSON. It can also extract tables from a PDF and save them as a CSV, TSV, or JSON file.

import tabula

# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')

# Read remote pdf into list of DataFrame
dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')

See an example notebook for more details. I also recommend reading the tutorial article written by @aegis4048, and another tutorial written by @tdpetrou.

Contributing

Interested in helping out? I'd love to have your help!

You can help by:

  • Reporting a bug.
  • Adding or editing documentation.
  • Contributing code via a Pull Request. See also the contribution guide.
  • Writing a blog post or spreading the word about tabula-py to people who might benefit from using it.

Other ways to support

You can also support our continued work on tabula-py with a donation on GitHub Sponsors or Patreon.