Top Related Projects
- Camelot: A Python library to extract tabular data from PDFs
- pdfplumber: Plumb a PDF for detailed information about each char, rectangle, line, et cetera, and easily extract text and tables.
- pdfminer.six: Community maintained fork of pdfminer - we fathom PDF
- pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six.
- tabula: Tabula is a tool for liberating data tables trapped inside PDF files
- PyMuPDF: PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
Quick Overview
tabula-py is a Python wrapper for tabula-java, the Java library and command-line tool behind Tabula that extracts tables from PDF files. It reads tables from PDFs directly into pandas DataFrames, which makes tabular data embedded in PDF documents easier to work with.
Pros
- Simplifies the process of extracting tables from PDFs in Python
- Integrates well with pandas, allowing direct conversion to DataFrames
- Supports both local and remote PDF files
- Offers various options for customizing table extraction
Cons
- Requires a Java runtime (Java 8 or later) to be installed on the system
- May struggle with complex or poorly formatted PDF tables
- Performance can be slow for large PDFs or many tables
- Limited support for certain PDF formats or layouts
Code Examples
- Basic table extraction:
import tabula
# read_pdf returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf("path/to/pdf/file.pdf", pages="all")
# Print the first table (several may be found)
print(dfs[0])
- Extracting tables with specific options:
import tabula
# Extract tables with custom options
# area is [top, left, bottom, right], measured in PDF points
dfs = tabula.read_pdf(
    "path/to/pdf/file.pdf",
    pages="1-3",
    multiple_tables=True,
    guess=False,
    area=[20, 20, 580, 770],
)
# Print all extracted tables
for table in dfs:
    print(table)
- Converting PDF tables to CSV:
import tabula
# Convert PDF tables to CSV
tabula.convert_into("path/to/pdf/file.pdf", "output.csv", output_format="csv", pages="all")
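The area option in the second example is easy to get wrong, because tabula-py expects the coordinates in the order top, left, bottom, right, measured in PDF points. The helper below is a hypothetical convenience (make_area is not part of tabula-py) that checks the ordering before the list is passed to read_pdf. Treat it as a minimal sketch rather than a required step.

```python
def make_area(top, left, bottom, right):
    """Return [top, left, bottom, right] after basic sanity checks.

    Hypothetical helper, not part of tabula-py. Values are PDF points.
    """
    if min(top, left, bottom, right) < 0:
        raise ValueError("coordinates must be non-negative")
    if top >= bottom or left >= right:
        raise ValueError("expected top < bottom and left < right")
    return [top, left, bottom, right]


# Same region as in the example above
area = make_area(20, 20, 580, 770)
print(area)
# The list can then be passed along, e.g.
# tabula.read_pdf("path/to/pdf/file.pdf", pages="1-3", guess=False, area=area)
```

Validating up front gives a clear error message instead of an empty or oddly cropped result when two coordinates are swapped.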
Getting Started
To get started with tabula-py:
- Install Java if not already installed on your system.
- Install tabula-py using pip:
pip install tabula-py
- Import and use in your Python script:
import tabula
# Read tables from the first page (the default); pass pages="all" to scan every page
dfs = tabula.read_pdf("path/to/your/pdf/file.pdf")
# read_pdf returns a list of DataFrames; work with the first one
print(dfs[0])  # Print the first table
Note: Make sure to replace "path/to/your/pdf/file.pdf" with the actual path to your PDF file.
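Because read_pdf returns a list, indexing the result with [0] raises an IndexError when no table is detected on the requested pages. A defensive accessor along the following lines avoids that. The name first_table is hypothetical, and the sketch works on any list, so placeholder values stand in for real DataFrames here.

```python
def first_table(tables, default=None):
    """Return the first item of a list of tables, or `default` if the list is empty.

    Hypothetical helper; `tables` is assumed to be the list returned by read_pdf.
    """
    return tables[0] if tables else default


# Placeholder lists stand in for the result of tabula.read_pdf(...)
print(first_table([]))
print(first_table(["table-1", "table-2"]))
```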
Competitor Comparisons
A Python library to extract tabular data from PDFs
Pros of Camelot
- More accurate table extraction, especially for complex layouts
- Supports both stream and lattice-based extraction methods
- Built-in table analysis and manipulation features
Cons of Camelot
- Slower processing speed compared to Tabula-py
- More complex setup and dependencies
- Python-only, whereas tabula-py builds on the Java-based tabula-java engine
Code Comparison
Tabula-py:
import tabula
df = tabula.read_pdf("file.pdf", pages="all")
tabula.convert_into("file.pdf", "output.csv", output_format="csv", pages="all")
Camelot:
import camelot
tables = camelot.read_pdf("file.pdf", pages="all")
tables[0].to_csv("output.csv")
tables[0].df # Access extracted data as a pandas DataFrame
Both libraries aim to extract tables from PDF files, but Camelot offers more advanced features and control over the extraction process. Tabula-py provides a simpler interface and faster processing, making it suitable for straightforward table extraction tasks. Camelot excels in handling complex layouts and offers built-in analysis tools, but requires more setup and has a steeper learning curve.
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
Pros of pdfplumber
- Pure Python implementation, no Java dependency
- More flexible for extracting various types of data (text, tables, images)
- Better handling of complex PDF layouts
Cons of pdfplumber
- Generally slower performance compared to tabula-py
- May require more manual configuration for table extraction
Code Comparison
pdfplumber:
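import pdfplumber
with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    # extract_table() returns the largest table on the page as a list of rows
    table = page.extract_table()
tabula-py:
import tabula
# read_pdf returns a list of DataFrames, one per table found on page 1
tables = tabula.read_pdf("example.pdf", pages=1)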
Both libraries aim to extract data from PDFs, but pdfplumber offers more flexibility at the cost of performance, while tabula-py is faster but may be less accurate for complex layouts. pdfplumber is better suited for projects requiring detailed PDF analysis, while tabula-py excels in quick table extraction from simple PDFs. The choice between them depends on the specific requirements of your project, such as processing speed, accuracy, and the complexity of the PDFs you're working with.
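One practical difference between the two snippets above is the return type: tabula-py hands back pandas DataFrames, while pdfplumber's extract_table() returns plain nested lists in which empty cells are None. The sketch below uses a hypothetical rows_to_records helper and made-up sample rows to show one way of turning such a list into header-keyed dictionaries. It assumes the first row is the header, which will not hold for every PDF.

```python
def rows_to_records(rows):
    """Turn a list of rows (first row = header) into a list of dicts.

    Hypothetical helper; assumes the nested-list shape that
    pdfplumber's extract_table() produces.
    """
    if not rows:
        return []
    header, *body = rows
    # Fall back to positional names when a header cell is missing
    keys = [cell.strip() if cell else f"column_{i}" for i, cell in enumerate(header)]
    # Replace None (an empty cell) with an empty string
    return [
        dict(zip(keys, ("" if cell is None else cell for cell in row)))
        for row in body
    ]


# Made-up rows, not taken from a real PDF
sample = [["Name", "Qty"], ["Widget", "3"], ["Gadget", None]]
print(rows_to_records(sample))
```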
Community maintained fork of pdfminer - we fathom PDF
Pros of pdfminer.six
- More flexible and customizable for extracting various types of content from PDFs
- Better support for complex PDF structures and layouts
- Can extract text, images, and metadata from PDFs
Cons of pdfminer.six
- Steeper learning curve and more complex API
- May require more code to extract tabular data specifically
- Slower performance for large PDFs compared to tabula-py
Code Comparison
tabula-py:
import tabula
df = tabula.read_pdf("example.pdf", pages="all")
print(df)
pdfminer.six:
from pdfminer.high_level import extract_text
text = extract_text("example.pdf")
print(text)
Summary
tabula-py is specifically designed for extracting tabular data from PDFs, making it easier to use for this specific task. It's faster and more straightforward for table extraction but limited in other PDF processing capabilities.
pdfminer.six offers more comprehensive PDF processing features, including text, image, and metadata extraction. It's more versatile but requires more setup and coding for specific tasks like table extraction.
Choose tabula-py for quick and easy table extraction, and pdfminer.six for more complex PDF processing needs or when dealing with PDFs that have varied content and structures.
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
Pros of pdfminer
- Pure Python implementation, no external dependencies required
- More flexible for extracting various types of content from PDFs (text, images, metadata)
- Supports a wider range of PDF features and formats
Cons of pdfminer
- Generally slower performance compared to tabula-py
- Less specialized for table extraction, may require more custom coding for complex tables
- Steeper learning curve for beginners
Code Comparison
pdfminer:
from pdfminer.high_level import extract_text
text = extract_text('document.pdf')
print(text)
tabula-py:
import tabula
tables = tabula.read_pdf('document.pdf', pages='all')
print(tables)
pdfminer offers more granular control over PDF parsing, while tabula-py provides a simpler interface specifically for table extraction. pdfminer requires more code to extract tables, but offers greater flexibility for other PDF content. tabula-py is more straightforward for table extraction but may be limited for other PDF processing tasks.
Tabula is a tool for liberating data tables trapped inside PDF files
Pros of tabula
- Written in Java, offering potentially better performance for large-scale PDF processing
- More mature project with a longer development history and larger community
- Provides a command-line interface for easy integration into various workflows
Cons of tabula
- Requires Java runtime environment, which may not be available on all systems
- Less convenient for Python developers who prefer native Python libraries
- May have a steeper learning curve for those unfamiliar with Java
Code Comparison
tabula (Java):
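import java.util.List;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;
// pdfDocument is assumed to be an already opened PDFBox PDDocument
ObjectExtractor oe = new ObjectExtractor(pdfDocument);
Page page = oe.extract(1);
// Tables are produced by an extraction algorithm applied to the page
List<Table> tables = new SpreadsheetExtractionAlgorithm().extract(page);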
tabula-py (Python):
import tabula
tables = tabula.read_pdf("path/to/pdf", pages="1")
tabula-py provides a more straightforward Python interface, while tabula offers more granular control over the extraction process in Java. The Python version simplifies usage for those already working in Python environments, whereas the Java version may be more suitable for enterprise-level applications or when deeper customization is required.
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
Pros of PyMuPDF
- Broader PDF manipulation capabilities beyond table extraction
- Faster performance for general PDF operations
- More comprehensive documentation and examples
Cons of PyMuPDF
- Less specialized for table extraction from PDFs
- May require more code to extract tables compared to tabula-py
- Steeper learning curve for specific table extraction tasks
Code Comparison
tabula-py:
import tabula
# Extract table from PDF
df = tabula.read_pdf("input.pdf", pages="all")
PyMuPDF:
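import fitz
doc = fitz.open("input.pdf")
page = doc[0]
# find_tables() detects the tables on this page only
tables = page.find_tables()
for table in tables:
    # extract() returns the table content as a list of rows
    print(table.extract())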
Both libraries offer PDF table extraction capabilities, but tabula-py provides a more straightforward approach specifically for this task. PyMuPDF requires more code but offers greater flexibility for various PDF operations.
tabula-py is ideal for projects focused primarily on table extraction from PDFs, while PyMuPDF is better suited for more comprehensive PDF manipulation tasks that may include table extraction as one of many features.
README
tabula-py
tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF.
You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV or a JSON file.
You can see the example notebook and try it on Google Colab. We also highly recommend reading the documentation, especially the FAQ section.
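Since convert_into takes the format as a separate output_format argument, a small lookup from the output file's extension keeps the two in sync. This is a sketch only: output_format_for is a hypothetical helper, and the three formats are the ones named above (CSV, TSV and JSON).

```python
from pathlib import Path

# Extensions mapped to the output_format values named in the README
SUPPORTED = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def output_format_for(output_path):
    """Return the output_format string matching the file extension.

    Hypothetical helper, not part of tabula-py.
    """
    suffix = Path(output_path).suffix.lower()
    try:
        return SUPPORTED[suffix]
    except KeyError:
        raise ValueError(f"unsupported output extension: {suffix!r}") from None


print(output_format_for("report.TSV"))
# Possible use:
# tabula.convert_into("test.pdf", "output.json",
#                     output_format=output_format_for("output.json"), pages="all")
```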
Requirements
- Java 8+
- Python 3.8+
OS
I have confirmed that it works on macOS and Ubuntu, and some users report that it also works on Windows 10. See the documentation for detailed installation instructions for Windows 10.
Usage
- Documentation
- FAQ (helpful if you run into an issue)
- Example notebook on Google Colaboratory
Install
Ensure you have a Java runtime and set the PATH for it.
pip install tabula-py
If you want faster execution with jpype, install with the jpype extra.
pip install "tabula-py[jpype]"
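The requirement to have a Java runtime on the PATH can be checked from Python before the first call to tabula-py. The sketch below uses only the standard library. The function names are hypothetical, and the check only confirms that a java executable is reachable, not that its version is 8 or later.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the first line printed by `java -version` (usually sent to stderr)."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    lines = (result.stderr or result.stdout).splitlines()
    return lines[0] if lines else ""


java = find_java()
if java is None:
    print("No `java` on PATH; install a Java runtime before using tabula-py.")
else:
    print(f"Found {java}: {java_version_banner(java)}")
```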
Example
tabula-py enables you to extract tables from a PDF into a DataFrame, or a JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.
import tabula
# Read a PDF into a list of DataFrames
dfs = tabula.read_pdf("test.pdf", pages='all')
# Read a remote PDF into a list of DataFrames
dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
# Convert a PDF into a CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')
# Convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')
See an example notebook for more details. I also recommend reading the tutorial article written by @aegis4048, and another tutorial written by @tdpetrou.
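When a directory holds many PDFs, it can help to convert them one at a time so that a single unreadable file does not stop the run. The following is a sketch of that alternative built on convert_into. It does not describe how convert_into_by_batch works internally, list_pdfs and convert_each are hypothetical names, and writing each output next to its input is an assumption made for the example.

```python
from pathlib import Path


def list_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_each(directory, output_format="csv"):
    """Convert every PDF in `directory`; return {file name: exception} for failures."""
    import tabula  # imported lazily; needs tabula-py and a Java runtime

    failures = {}
    for pdf in list_pdfs(directory):
        # Assumption for this sketch: output goes next to the input file
        out = pdf.with_suffix("." + output_format)
        try:
            tabula.convert_into(str(pdf), str(out), output_format=output_format, pages="all")
        except Exception as exc:  # keep going and report at the end
            failures[pdf.name] = exc
    return failures


# Possible use:
# failures = convert_each("input_directory")
# for name, error in failures.items():
#     print(f"{name}: {error}")
```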
Contributing
Interested in helping out? I'd love to have your help!
You can help by:
- Reporting a bug.
- Adding or editing documentation.
- Contributing code via a Pull Request. See also the contribution guidelines.
- Writing a blog post or spreading the word about tabula-py to people who might benefit from using it.
Contributors
- @lahoffm
- @jakekara
- @lcd1232
- @kirkholloway
- @CurtLH
- @nikhilgk
- @krassowski
- @alexandreio
- @rmnevesLH
- @red-bin
- @Gallaecio
- @bpben
- @Bueddl
- @cjotade
- @codeboy5
- @manohar-voggu
- @deveshSingh06
- @grfeller
- @djbrown
- @swar
- @mvoggu
- @tdpetrou
Other ways to support
You can also support our continued work on tabula-py with a donation on GitHub Sponsors or Patreon.