tabula

Tabula is a tool for liberating data tables trapped inside PDF files

7,078

669

541

Top Related Projects

camelot

3,333

A Python library to extract tabular data from PDFs

pdfminer.six

6,549

Community maintained fork of pdfminer - we fathom PDF

pdfminer

5,293

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PyMuPDF

7,705

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

pdfplumber

7,889

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Quick Overview

Tabula is an open-source tool for extracting data from PDF tables. It allows users to select tables in PDF files and export them into CSV, TSV, and other formats. Tabula is designed to be user-friendly and can be used through a web interface or command-line interface.

Pros

Easy-to-use web interface for non-technical users
Supports multiple output formats (CSV, TSV, JSON)
Free and open-source
Available as a desktop application for Windows, macOS, and Linux

Cons

Limited to extracting data from tables in PDFs
May struggle with complex or poorly formatted PDFs
Requires Java to run
Not suitable for batch processing large numbers of PDFs without additional scripting

Code Examples

As Tabula is primarily a desktop application and command-line tool, there are no direct code examples for using it as a library. However, you can use Tabula in conjunction with other programming languages through its command-line interface.

Getting Started

To get started with Tabula:

Download the latest release from the Tabula GitHub releases page.
Install Java if not already installed on your system.
Run the Tabula application:
- On Windows: Double-click the tabula.exe file.
- On macOS: Double-click the Tabula.app file.
- On Linux: Run java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -jar tabula.jar in the terminal.
Open your web browser and go to http://localhost:8080 to access the Tabula interface.
Upload a PDF file, select the tables you want to extract, and choose your preferred output format.

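For command-line usage, use the tabula-java JAR (the extraction library and command-line tool that powers Tabula) rather than the desktop application's tabula.jar. In the command below, tabula-java.jar is a placeholder for the JAR file you downloaded:

java -jar tabula-java.jar -p 1-10 -a 269.875,12.75,790.5,561 -f CSV input.pdf

This command extracts tables from pages 1-10 of input.pdf, restricted to the area given by -a (top, left, bottom, right coordinates, in points), and outputs the result in CSV format.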
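
One way to drive this command from another language is to build the argument list in code and hand it to a subprocess. The following is only a sketch: `build_tabula_command`, `extract_tables`, and the JAR path are hypothetical names, and it assumes the JAR is the tabula-java command-line JAR. It uses only the `-p`, `-a`, and `-f` options shown above.

```python
import subprocess
from typing import List, Optional, Sequence


def build_tabula_command(
    jar_path: str,
    pdf_path: str,
    pages: str = "1-10",
    area: Optional[Sequence[float]] = None,
    output_format: str = "CSV",
) -> List[str]:
    """Build the argument list for the command shown above.

    jar_path is a placeholder for wherever the command-line JAR lives.
    area, if given, holds the four numbers passed to -a, in the same
    order as in the example command.
    """
    cmd = ["java", "-jar", jar_path, "-p", pages]
    if area is not None:
        if len(area) != 4:
            raise ValueError("area needs exactly four numbers")
        # Join the coordinates into one comma-separated argument.
        cmd += ["-a", ",".join(str(value) for value in area)]
    cmd += ["-f", output_format, pdf_path]
    return cmd


def extract_tables(jar_path: str, pdf_path: str, **options) -> str:
    """Run the command and return whatever it prints to standard output."""
    cmd = build_tabula_command(jar_path, pdf_path, **options)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when a PDF path contains spaces.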

Competitor Comparisons

camelot

Pros of Camelot

More accurate table extraction, especially for complex layouts
Supports both stream and lattice-based extraction methods
Provides additional features like splitting and merging tables

Cons of Camelot

Slower processing speed compared to Tabula
Requires additional dependencies (e.g., Ghostscript)
May have compatibility issues with certain PDF formats

Code Comparison

Tabula (Java):

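// Requires tabula-java and PDFBox on the classpath; PDDocument.load is the PDFBox 2.x API.
PDDocument document = PDDocument.load(new File("input.pdf"));
ObjectExtractor extractor = new ObjectExtractor(document);
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
Page page = extractor.extract(1); // tabula-java page numbers start at 1
List<Table> tables = sea.extract(page);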

Camelot (Python):

import camelot

tables = camelot.read_pdf("input.pdf", pages="1")
df = tables[0].df

Both libraries aim to extract tables from PDF files, but they differ in their implementation and features. Tabula is written in Java and focuses on simplicity, while Camelot is a Python library that offers more advanced extraction capabilities. Camelot provides better accuracy for complex layouts and supports multiple extraction methods, but it may be slower and require additional setup compared to Tabula. The code examples demonstrate the basic usage of each library, highlighting the differences in language and API design.

pdfminer.six

Pros of pdfminer.six

More comprehensive PDF parsing capabilities, handling various PDF elements beyond just tables
Provides lower-level access to PDF structures, allowing for more customized extraction
Written in Python, making it easier to integrate with other Python-based data processing pipelines

Cons of pdfminer.six

Requires more programming knowledge and effort to extract tabular data specifically
Less user-friendly for non-programmers or those seeking quick table extraction
May require additional processing to clean and structure extracted data

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_text_to_fp
from io import StringIO

output_string = StringIO()
with open('document.pdf', 'rb') as fin:
    extract_text_to_fp(fin, output_string)
print(output_string.getvalue())

Tabula:

import technology.tabula.CommandLineApp;

public class ExtractTable {
    public static void main(String[] args) {
        CommandLineApp.main(new String[]{"input.pdf", "-o", "output.csv"});
    }
}

pdfminer

Pros of pdfminer

More comprehensive PDF parsing capabilities, handling various PDF elements beyond just tables
Provides lower-level access to PDF structure, allowing for more customized extraction
Supports both Python 2 and Python 3

Cons of pdfminer

Requires more programming knowledge to use effectively
Less user-friendly for non-technical users
May require additional processing to extract structured table data

Code Comparison

pdfminer:

from pdfminer.high_level import extract_text_to_fp
with open('output.txt', 'wb') as output_file:
    with open('input.pdf', 'rb') as input_file:
        extract_text_to_fp(input_file, output_file)

Tabula:

import technology.tabula.CommandLineApp;

public class ExtractTable {
    public static void main(String[] args) {
        CommandLineApp.main(new String[]{"input.pdf", "-o", "output.csv"});
    }
}

pdfminer offers more flexibility but requires more code for specific tasks, while Tabula provides a simpler interface focused on table extraction.

PyMuPDF

Pros of PyMuPDF

More comprehensive PDF manipulation capabilities beyond table extraction
Faster processing speed for large PDF files
Supports a wider range of PDF-related tasks, including rendering and editing

Cons of PyMuPDF

Less specialized for table extraction compared to Tabula
May require more setup and configuration for specific table extraction tasks
Steeper learning curve due to its broader feature set

Code Comparison

PyMuPDF table extraction:

import fitz
doc = fitz.open("example.pdf")
page = doc[0]
tables = page.find_tables()
for table in tables:
    print(table.extract())

Tabula table extraction (via the tabula-py Python bindings):

import tabula
tables = tabula.read_pdf("example.pdf", pages="all")
for table in tables:
    print(table)

PyMuPDF offers more flexibility and control over PDF processing, while Tabula provides a simpler, more focused approach to table extraction. PyMuPDF is better suited for projects requiring extensive PDF manipulation, whereas Tabula excels in straightforward table extraction tasks.

pdfplumber

Pros of pdfplumber

More comprehensive PDF parsing capabilities, including text extraction, image extraction, and table extraction
Provides detailed information about PDF elements, such as font details and positioning
Offers more granular control over parsing and extraction processes

Cons of pdfplumber

Slower performance compared to Tabula, especially for large PDFs
More complex setup and usage, requiring more coding knowledge
May require additional dependencies for certain functionalities

Code Comparison

pdfplumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

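Tabula (via the tabula-py Python bindings):

import tabula

# read_pdf returns a list of DataFrames, one per detected table
tables = tabula.read_pdf("example.pdf", pages=1)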

Both libraries aim to extract tabular data from PDFs, but pdfplumber offers more extensive PDF parsing capabilities at the cost of increased complexity and potentially slower performance. Tabula is more focused on table extraction and provides a simpler interface, making it easier to use for straightforward table extraction tasks. The choice between the two depends on the specific requirements of your project and the level of detail needed in PDF parsing.

README

Is tabula an active project?

Tabula is, and always has been, a volunteer-run project. We've occasionally had funding for specific features, but it's never been a commercial undertaking. At the moment, none of the original authors have the time to actively work on the project. The end-user application, hosted on this repo, is unlikely to see updates from us in the near future. tabula-java sees updates and occasional bug-fix releases from time to time.

Repo Note: The master branch is an in development version of Tabula. This may be substantially different from the latest releases of Tabula.

Tabula

Tabula helps you liberate data tables trapped inside PDF files.

Download from the official site
Read more about Tabula on OpenNews Source
Interested in using Tabula on the command-line? Check out tabula-java, a Java library and command-line interface for Tabula. (This is the extraction library that powers Tabula.)

Why Tabula?
Using Tabula
Known issues
Incorporating Tabula into your own project
Running Tabula from source (for developers)
- Building a packaged application version
Contributing
- Backers

Why Tabula?

If you've ever tried to do anything with data provided to you in PDFs, you know how painful this is -- you can't easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple web interface.

Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.

Security Concerns?: Tabula is designed with security in mind. Your PDF and the extracted data never touch the net -- when you use Tabula on your local machine, as long as your browser's URL bar says "localhost" or "127.0.0.1", all processing takes place on your local machine. Other than to retrieve a few badges and other static assets, there are two calls that are made from your browser to external machines; one fetches the list of latest Tabula versions from GitHub to alert you if Tabula has been updated, the other makes a call to a stats counter that helps us determine how often various versions of Tabula are being used. If this is a problem, the version check can be disabled by adding -Dtabula.disable_version_check=1 to the command line at startup, and the stats counter call can be disabled by adding -Dtabula.disable_notifications=1. Please note: If you are providing Tabula as a service using a reverse SSL proxy, users may notice a security warning due to our stats counter endpoint being hosted at a non-secure URL, so you may wish to disable the notifications in this scenario.
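
The two opt-out properties described above are ordinary Java system properties, so they go before `-jar` on the launch command. As a rough sketch (the function name `tabula_launch_command` is hypothetical; the other JVM flags are the ones this README uses when starting tabula.jar):

```python
from typing import List


def tabula_launch_command(
    jar_path: str = "tabula.jar",
    disable_version_check: bool = False,
    disable_notifications: bool = False,
) -> List[str]:
    """Assemble the java command line, adding the opt-out properties on request."""
    cmd = ["java", "-Dfile.encoding=utf-8", "-Xms256M", "-Xmx1024M"]
    if disable_version_check:
        # Skips the call that fetches the list of latest Tabula versions.
        cmd.append("-Dtabula.disable_version_check=1")
    if disable_notifications:
        # Skips the call to the stats counter.
        cmd.append("-Dtabula.disable_notifications=1")
    cmd += ["-jar", jar_path]
    return cmd


if __name__ == "__main__":
    print(" ".join(tabula_launch_command(
        disable_version_check=True,
        disable_notifications=True,
    )))
```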

Using Tabula

First, make sure you have a recent copy of Java installed. You can download Java here. Tabula requires a Java Runtime Environment compatible with Java 7 (i.e. Java 7, 8 or higher). If you have a problem, check Known Issues first, then report an issue.
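
If you want to script the Java check, a small helper can read the major version out of `java -version`. This is a sketch under assumptions: the function names are hypothetical, and the parsing assumes the common `version "1.8.0_292"` and `version "17.0.2"` output styles, which may not cover every JDK vendor.

```python
import re
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Older releases report themselves as 1.7, 1.8, and so on.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> Optional[int]:
    """Run `java -version` and parse it; None if java is missing or unparseable."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # The version banner is normally written to stderr.
    return parse_java_major(result.stderr + result.stdout)
```

A caller could then compare the result against 7, the minimum stated above, before trying to start Tabula.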

Windows

Download tabula-win.zip from the download site. Unzip the whole thing and open the tabula.exe file inside. A browser should automatically open to http://127.0.0.1:8080/ . If not, open your web browser of choice and visit that link.

To close Tabula, just go back to the console window and press "Control-C" (as if to copy).

Mac OS X

Download tabula-mac.zip from the download site. Unzip and open the Tabula app inside. A browser should automatically open to http://127.0.0.1:8080/ . If not, open your web browser of choice and visit that link.

To close Tabula, find the Tabula icon in your dock, right-click (or control-click) on it, and press "Quit".

Note: If you're running Mac OS X 10.8 or later, you might get an error like "Tabula is damaged and can't be opened." We're working on fixing this, but see Known issues below for a workaround.

Linux snap

Tabula is packaged as a snap package. If you have snap on your system, you can install it with:
```
sudo snap install tabula
```

Other platforms (e.g. Linux)

Download tabula-jar.zip from the download site and unzip it to the directory of your choice. Open a terminal window, and cd to inside the tabula directory you just unzipped. Then run:

java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -jar tabula.jar

Then manually navigate your browser to http://127.0.0.1:8080/ (new in Tabula 1.1; to go back to the old behavior that automatically launches your web browser, use the -Dtabula.openBrowser=true option).

Tabula binds to port 8080 by default. You can change it with the warbler.port option; for example, to use port 9999:

java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -Dwarbler.port=9999 -jar tabula.jar

Docker Compose quick start using Amazon Corretto image

Make a new directory e.g. tabulapdf and enter it.

mkdir -p /opt/docker/tabulapdf && cd /opt/docker/tabulapdf

Download tabula-jar package - for example version 1.2.1

wget https://github.com/tabulapdf/tabula/releases/download/v1.2.1/tabula-jar-1.2.1.zip

Verify the checksum (compare the output with the value on the release page):

sha256sum tabula-jar-1.2.1.zip

Then unzip it:

unzip tabula-jar-1.2.1.zip
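
On a system without `sha256sum`, the checksum step above could be done with a few lines of Python instead. This is a minimal sketch; the function names are hypothetical, and the published value must still be copied by hand from the release page.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: str, published_hex: str) -> bool:
    """Compare the file's digest with the value copied from the release page."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

For example, `matches_published_checksum("tabula-jar-1.2.1.zip", value_from_release_page)` should return True before you unzip the archive.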

Create a docker-compose.yml file in this directory and adjust it as needed:
```
### tabulapdf docker-compose.yml example ###
services:
  tabulapdf:
    image: amazoncorretto:17
    container_name: tabulapdf-app
    command: >
      java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -Dwarbler.port=8080 -Dtabula.openBrowser=false -jar /app/tabula.jar
    volumes:
      - ./tabula:/app
    ports:
      - "8080:8080"
```
Run the app with

docker compose up -d

The app will be exposed on port 8080 and can easily be paired with a reverse proxy such as Traefik.

If the program fails to run, double-check that you have Java installed and then try again.

Known issues

There are some bugs that we're aware of that we haven't managed to fix yet. If there's not a solution here or you need more help, please go ahead and report an issue.

Legacy Java Environment (SE 6) Is Required: (Mac): The Mac operating system recently changed how it packages the Java Runtime Environment. If you get this error, download Tabula's "large experimental" package. This package includes its own Java Runtime Environment and should work without this issue.
"Tabula is damaged and can't be opened" (Mac): If you're running Mac OS X 10.8 or later, GateKeeper may prevent you from opening the Tabula app. Please see this GateKeeper page for more information.
1. Right-click on Tabula.app and select Open from the context menu.
2. The system will tell you that the application is "from an unidentified developer" and ask you whether you want to open it. Click Open to allow the application to run. The system remembers this choice and won't prompt you again.
(If you continue to have issues, double-check the OS X GateKeeper documentation for more information.)

org.jruby.exceptions.RaiseException: (Encoding::CompatibilityError) incompatible character encodings: (Windows): Your Windows computer expects a type of encoding other than Unicode or Windows's English encoding. You can fix this by entering a few simple commands in the Command Prompt. (The commands won't affect anything besides Tabula.)
1. Open a Command Prompt.
2. Type cd and then the path to the directory that contains tabula.exe, e.g. cd C:\Users\Username\Downloads
3. Change that terminal's codepage to Unicode by typing: chcp 65001
4. Run Tabula by typing tabula.exe

A browser tab opens, but something other than Tabula loads there, or Tabula doesn't start: It's possible another program is using port 8080, which Tabula binds to by default. You can try closing the other program, or change the port Tabula uses by running Tabula from the terminal with the warbler.port property:

java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -Dwarbler.port=9999 -jar tabula.jar
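
To find out whether something is already listening on port 8080 before starting Tabula, a short probe like the one below can help. It is only a sketch: the function names and the candidate port list are hypothetical, and a refused connection is treated as "free", which is an approximation.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts a TCP connection on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 when something accepted the connection.
        return probe.connect_ex((host, port)) != 0


def first_free_port(candidates=(8080, 9999)) -> int:
    """Return the first candidate port that appears unused."""
    for port in candidates:
        if port_is_free(port):
            return port
    raise RuntimeError("none of the candidate ports are free")
```

The returned port can then be passed to Tabula through the warbler.port property shown above.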

Incorporating Tabula into your own project

Tabula is open-source, so we'd love for you to incorporate pieces of Tabula into your own projects. The "guts" of Tabula -- that is, the logic and heuristics that reconstruct tables from PDFs -- is contained in the tabula-java repo. There's a JAR file that you can easily incorporate into JVM languages like Java, Scala or Clojure and it includes a command-line tool for you to automate your extraction tasks. Visit that repo for more information on how to use tabula-java on the CLI and on how Tabula exports tabula-java scripts.
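
As a rough illustration of automating extraction with the command-line tool, the sketch below generates one tabula-java command per PDF in a directory. The function name `batch_commands`, the JAR filename, and the directory layout are hypothetical; `-f` and `-o` are tabula-java's output-format and output-file options.

```python
from pathlib import Path
from typing import Iterator, List


def batch_commands(jar_path: str, pdf_dir: str, out_dir: str) -> Iterator[List[str]]:
    """Yield one command per *.pdf in pdf_dir, writing <name>.csv into out_dir."""
    out = Path(out_dir)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        csv_path = out / (pdf.stem + ".csv")
        yield ["java", "-jar", jar_path, "-f", "CSV", "-o", str(csv_path), str(pdf)]


if __name__ == "__main__":
    import subprocess

    Path("csv").mkdir(exist_ok=True)
    for cmd in batch_commands("tabula-java.jar", "pdfs", "csv"):
        # Each PDF is processed in its own JVM; check=True stops on the first failure.
        subprocess.run(cmd, check=True)
```

Starting a new JVM per file is simple but slow for large batches; the bindings listed below may be a better fit in that case.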

Bindings:

Tabula has community-supported bindings for R, Node.js, and Python, plus deprecated JRuby bindings. If you end up writing bindings for another language, let us know and we'll add a link here.

tabulizer provides R bindings for tabula-java and is community-supported by @leeper.
tabula-js provides Node.js bindings for tabula-java; it is community-supported by @ezodude.
tabula-py provides Python bindings for tabula-java; it is community-supported by @chezou.
tabula-extractor DEPRECATED - Provides JRuby bindings for tabula-java

Running Tabula from source (for developers)

Download JRuby. You can install it from its website, or using tools like rvm or rbenv. Note that as of Tabula 1.1.0 (7875582becb2799b65586d5680782cafd399bb33), Tabula uses the JRuby 9000 series (i.e. JRuby 9.1.5.0).

Download Tabula and install the Ruby dependencies. (Note: if using rvm or rbenv, ensure that JRuby is being used.)

git clone https://github.com/tabulapdf/tabula.git
cd tabula

gem install bundler -v 1.17.3
bundle install
jruby -S jbundle install

Then, start the development server:

jruby -G -r jbundler -S rackup

(If you get encoding errors, set the JAVA_OPTS environment variable to -Dfile.encoding=utf-8)

The site instance should now be viewable at http://127.0.0.1:9292/ .

You can set a couple of options when running the server in this manner:

TABULA_DATA_DIR="/tmp/tabula" \
TABULA_DEBUG=1 \
jruby -G -r jbundler -S rackup

TABULA_DATA_DIR controls where uploaded data for Tabula is stored. By default, data is stored in the OS-dependent application data directory for the current user. (similar to: C:\Users\foo\AppData\Roaming\Tabula on Windows, ~/Library/Application Support/Tabula on Mac, ~/.tabula on Linux/UNIX)
TABULA_DEBUG prints out extra status data when PDF files are being processed. (false by default.)
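
The data directory defaults described above can be summarized in a few lines. This sketch mirrors the description only and is not Tabula's actual lookup code; the function name and its parameters are hypothetical, and the use of the APPDATA variable on Windows is an assumption.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def default_data_dir(
    platform: str = sys.platform,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Return the directory where Tabula data would be stored, per the text above."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    override = env.get("TABULA_DATA_DIR")
    if override:
        # An explicit TABULA_DATA_DIR always wins.
        return Path(override)
    if platform.startswith("win"):
        appdata = env.get("APPDATA")  # assumption: usually ...\AppData\Roaming
        base = Path(appdata) if appdata else home / "AppData" / "Roaming"
        return base / "Tabula"
    if platform == "darwin":
        return home / "Library" / "Application Support" / "Tabula"
    return home / ".tabula"
```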

Alternatively, running the server as a JAR file

Testing in this manner will be closer to testing the "packaged application" version of the app.

jruby -G -S rake war
java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -jar build/tabula.jar

If you intend to develop against an unreleased version of tabula-java, you need to install its JAR to your local Maven repository. From the directory that contains tabula-java source:

mvn install:install-file -Dfile=target/tabula-<version>-SNAPSHOT.jar -DgroupId=technology.tabula -DartifactId=tabula -Dversion=<version>-SNAPSHOT -Dpackaging=jar -DpomFile=pom.xml

Then, adjust the Jarfile accordingly.

Building a packaged application version

After performing the above steps ("Running Tabula from source"), you can compile Tabula into a standalone application:

Mac OS X

If you wish to share Tabula with other machines, you will need a codesigning certificate. Our distribution of Tabula uses a self-signed certificate, as noted above. See this section of build.xml for details. If you will only be running Tabula on the machine you are building it on, you may remove this entire block (lines 44-53).

To compile the app:

WEBSERVER_VERSION=9.4.31.v20200723 MAVEN_REPO=https://repo1.maven.org/maven2 rake macosx

This will result in a portable "tabula_mac.zip" archive (inside the build directory) for Mac OS X users.

Note that the Mac version bundles Java with the Tabula app. This results in a 98MB zip file, versus the 30MB zip file for other platforms, but allows users to run Tabula without having to worry about Java version incompatibilities.

Windows

You can build .exe files for the Windows target on any platform.

Download a 3.1.X (beta) copy of Launch4J.

Unzip it into the Tabula repo so that "launch4j" (with subdirectories "bin", etc.) is in the repository root.

(If you're building on 64-bit Linux, you may need to install 32-bit libraries. On Ubuntu, for example: sudo apt-get install lib32z1 lib32ncurses5)

Then:

WEBSERVER_VERSION=9.4.31.v20200723 MAVEN_REPO=https://repo1.maven.org/maven2 rake windows

This will result in a portable "tabula_win.zip" archive (inside the build directory) for Windows users.

If you have issues, you can try building manually. (These commands are for OS X/Linux and may need to be adjusted for Windows users.)

# (from the root directory of the repo)
WEBSERVER_VERSION=9.4.31.v20200723 MAVEN_REPO=https://repo1.maven.org/maven2 rake war
cd launch4j
ant -f ../build.xml windows

A "tabula.exe" file will be generated in "build/windows". To run, the exe file needs "tabula.jar" (contained in "build") in the same directory. You can create a .zip archive by doing:

# (from the root directory of the repo)
cd build/windows
mkdir tabula
cp tabula.exe ./tabula/
cp ../tabula.jar ./tabula/
zip -r9 tabula_win.zip tabula
rm -fr tabula

Contributing

Interested in helping out? We'd love to have your help!

You can help by:

Reporting a bug.
Adding or editing documentation.
Contributing code via a Pull Request from ideas or bugs listed in the Enhancements section of the issues. See CONTRIBUTING.md.
Spreading the word about Tabula to people who might be able to benefit from using it.

Backers

You can also support our continued work on Tabula with a one-time or monthly donation on OpenCollective. Organizations who use Tabula can also sponsor the project for acknowledgement on our official site and this README.

Tabula is made possible in part through the generosity of our users and through grants from the Knight Foundation and the Shuttleworth Foundation. Special thanks to all the users and organizations that support Tabula!