
WZBSocialScienceCenter / pdftabextract

A set of tools for extracting tables from PDF files, helping to do data mining on (OCR-processed) scanned documents.


Top Related Projects


Camelot: PDF Table Extraction for Humans


Tabula is a tool for liberating data tables trapped inside PDF files

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.


pdfrw is a pure Python library that reads and writes PDFs

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Quick Overview

pdftabextract is a Python library designed to extract tables from PDF files. It focuses on tables delimited by vertical and horizontal lines, and provides tools for analyzing and post-processing the extracted data. The library is particularly useful for handling complex table structures in scanned, OCR-processed documents.

Pros

  • Specialized in extracting tables with vertical and horizontal lines
  • Provides tools for post-processing and analyzing extracted table data
  • Supports custom page regions for targeted extraction
  • Offers visualization capabilities for debugging and verification

Cons

  • May struggle with tables that lack clear borders or have complex layouts
  • Limited support for tables spanning multiple pages
  • Requires some manual configuration for optimal results
  • Not actively maintained (last update was in 2017)

Code Examples

The snippets below follow the library's pdf2xml workflow described in the README; the function names come from the project's own examples, and all parameter values are illustrative and depend on your documents.

  1. Loading a pdf2xml file (created with pdftohtml -c -hidden -xml input.pdf output.xml):

from pdftabextract.common import read_xml, parse_pages

# parse the pdf2xml document into a dict of pages
xmltree, xmlroot = read_xml("output.xml")
pages = parse_pages(xmlroot)
  1. Detecting lines in a scanned page image:

import numpy as np
from pdftabextract import imgproc

# run Canny edge detection and a Hough transform on the page image;
# the thresholds below are document-dependent starting points
iproc = imgproc.ImageProc("page1.png")
lines = iproc.detect_lines(canny_kernel_size=3, canny_low_thresh=50,
                           canny_high_thresh=150, hough_rho_res=1,
                           hough_theta_res=np.pi / 500,
                           hough_votes_thresh=round(0.2 * iproc.img_w))
  1. Building a page grid and exporting the data to pandas:

from pdftabextract.extract import make_grid_from_positions, \
    fit_texts_into_grid, datatable_to_dataframe

# col_positions / row_positions come from the clustering step
grid = make_grid_from_positions(col_positions, row_positions)
table = fit_texts_into_grid(pages[1]['texts'], grid)
df = datatable_to_dataframe(table)

Getting Started

To get started with pdftabextract:

  1. Install the library:

    pip install pdftabextract
    
  2. Convert your PDF to pdf2xml format (requires the pdftohtml command from poppler-utils):

    pdftohtml -c -hidden -xml your_document.pdf your_document.xml
    
  3. Load and parse the result:

    from pdftabextract.common import read_xml, parse_pages
    
    # parse the pdf2xml document into a dict of pages,
    # keyed by page number
    xmltree, xmlroot = read_xml("your_document.xml")
    pages = parse_pages(xmlroot)
    
    for p_num, page in pages.items():
        print(p_num, len(page['texts']))
    

Note: pdftabextract requires Python 3; its dependencies (including NumPy, OpenCV, and pandas) are listed in requirements.txt and installed automatically by pip. You also need poppler-utils for the pdftohtml conversion step.

Competitor Comparisons


Camelot: PDF Table Extraction for Humans

Pros of Camelot

  • More actively maintained with regular updates
  • Supports both stream and lattice-based table extraction methods
  • Provides a user-friendly command-line interface

Cons of Camelot

  • Requires Python 3.6+ and additional system dependencies
  • May struggle with complex or poorly formatted PDFs
  • Limited support for non-tabular data extraction

Code Comparison

pdftabextract:

from pdftabextract.common import read_xml, parse_pages

# load a pdf2xml file created with pdftohtml
xmltree, xmlroot = read_xml("output.xml")
pages = parse_pages(xmlroot)

Camelot:

import camelot

# Extract tables from PDF
tables = camelot.read_pdf('example.pdf')
df = tables[0].df  # Get DataFrame of first table

Both libraries aim to extract tabular data from PDFs, but Camelot offers a more polished solution with better maintenance. pdftabextract targets scanned, OCR-processed documents and relies on image processing, while Camelot works on text-based PDFs using its lattice and stream parsing methods. Camelot's API is more intuitive and integrates directly with pandas DataFrames, making extracted data easier to work with. pdftabextract, however, remains an option when the tables exist only as scanned images.


Tabula is a tool for liberating data tables trapped inside PDF files

Pros of Tabula

  • User-friendly GUI for extracting tables from PDFs
  • Supports multiple output formats (CSV, TSV, JSON)
  • Active development and community support

Cons of Tabula

  • Limited to table extraction, less flexible for complex layouts
  • Requires Java runtime environment
  • May struggle with highly formatted or complex PDFs

Code Comparison

pdftabextract:

import numpy as np
from pdftabextract import imgproc

# detect table lines in a scanned page image
# (threshold values are document-dependent)
iproc = imgproc.ImageProc("example.png")
lines = iproc.detect_lines(canny_kernel_size=3, canny_low_thresh=50,
                           canny_high_thresh=150, hough_rho_res=1,
                           hough_theta_res=np.pi / 500,
                           hough_votes_thresh=round(0.2 * iproc.img_w))

Tabula:

import technology.tabula.ObjectExtractor;
import technology.tabula.Page;

ObjectExtractor oe = new ObjectExtractor(pdfDocument);
Page page = oe.extract(1);
List<Table> tables = page.getTables();

Both libraries offer methods to extract tables from PDFs, but pdftabextract focuses on image processing techniques, while Tabula uses Java-based PDF parsing. pdftabextract provides more low-level control over the extraction process, making it suitable for complex layouts. Tabula, on the other hand, offers a simpler API and GUI, making it more accessible for users without programming experience.

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Pros of pdfplumber

  • More comprehensive PDF parsing capabilities, including text extraction, image extraction, and table extraction
  • Active development with regular updates and bug fixes
  • Extensive documentation and examples for various use cases

Cons of pdfplumber

  • Slower performance for large PDFs compared to pdftabextract
  • May require more setup and configuration for specific table extraction tasks
  • Higher memory usage, especially when processing complex PDFs

Code comparison

pdftabextract:

from pdftabextract.extract import make_grid_from_positions, \
    fit_texts_into_grid, datatable_to_dataframe

# build a page grid from detected column/row positions,
# then fit the page's OCR text boxes into it
grid = make_grid_from_positions(col_positions, row_positions)
table = fit_texts_into_grid(page['texts'], grid)
df = datatable_to_dataframe(table)

pdfplumber:

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()
    print(table)

Both libraries offer table extraction capabilities, but pdfplumber provides a more straightforward API for general PDF parsing tasks. pdftabextract is more focused on table extraction specifically, which may be advantageous for certain use cases.


pdfrw is a pure Python library that reads and writes PDFs

Pros of pdfrw

  • More comprehensive PDF manipulation capabilities, including reading, writing, and modifying PDF files
  • Better suited for general PDF processing tasks beyond table extraction
  • Actively maintained with regular updates and contributions

Cons of pdfrw

  • Lacks specific functionality for table extraction from PDFs
  • May require more custom code to achieve table extraction compared to pdftabextract
  • Steeper learning curve for users primarily interested in table extraction

Code Comparison

pdfrw (general PDF manipulation):

from pdfrw import PdfReader, PdfWriter

pdf = PdfReader('input.pdf')
writer = PdfWriter()
writer.addpage(pdf.pages[0])
writer.write('output.pdf')

pdftabextract (table extraction):

from pdftabextract.common import read_xml, parse_pages
from pdftabextract.extract import fit_texts_into_grid

# parse a pdf2xml file and fit its text boxes
# into a previously computed page grid
xmltree, xmlroot = read_xml('input.xml')
pages = parse_pages(xmlroot)
table = fit_texts_into_grid(pages[1]['texts'], grid)

pdfrw offers more flexibility for general PDF operations, while pdftabextract provides a more straightforward approach for extracting tables from PDFs. The choice between the two depends on the specific requirements of your project and the scope of PDF manipulation needed.

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

Pros of pdfminer

  • More comprehensive PDF parsing capabilities, handling a wider range of PDF structures
  • Better support for text extraction and layout analysis
  • Larger community and more frequent updates

Cons of pdfminer

  • Less focused on table extraction specifically
  • May require more setup and configuration for table-specific tasks
  • Steeper learning curve for users primarily interested in table extraction

Code comparison

pdftabextract:

from pdftabextract.common import read_xml, parse_pages

# parse OCR text boxes from a pdf2xml file
xmltree, xmlroot = read_xml("document.xml")
pages = parse_pages(xmlroot)

pdfminer:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBoxHorizontal

# Extract text from PDF
for page_layout in extract_pages("document.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextBoxHorizontal):
            print(element.get_text())

pdftabextract is more focused on table extraction from images, while pdfminer provides broader PDF parsing capabilities, including text extraction. pdftabextract offers a simpler API for table extraction, whereas pdfminer requires more code to achieve similar results but offers greater flexibility for various PDF-related tasks.


README

pdftabextract - A set of tools for data mining (OCR-processed) PDFs

July 2016 / Feb. 2017, Markus Konrad (markus.konrad@wzb.eu / post@mkonrad.net), Berlin Social Science Center

This project is currently not maintained.

IMPORTANT INITIAL NOTES

From time to time I receive emails from people trying to extract tabular data from PDFs. I'm fine with that and I'm glad to help. However, some people think that pdftabextract is some kind of magic wand that automatically extracts the data they want by simply running one of the provided examples on their documents. In most cases, this won't work. I want to clear up a few things that you should consider before using this software and before writing an email to me:

  1. pdftabextract is not an OCR (optical character recognition) software. It requires scanned pages with OCR information, i.e. a "sandwich PDF" that contains both the scanned images and the recognized text. You need software like tesseract or ABBYY Finereader for OCR. In order to check if you have a "sandwich PDF", open your PDF and press "select all". This usually reveals the OCR-processed text information.
  2. pdftabextract is some kind of last resort when all other things fail for extracting tabular data from PDFs. Before trying this out, you should ask yourself the following questions:
  • Is there really no other way / no other format for which the data is available?
  • Can a special OCR software like ABBYY Finereader detect and extract the tables (you need to try this with a large sample of pages -- I found the table recognition in Finereader often unreliable)?
  • Is it possible to extract the recognized text as-is from the PDFs and parse it? Try using the pdftotext tool from poppler-utils, a package which is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts: pdftotext -layout yourdocument.pdf. This will create a file yourdocument.txt containing the recognized text (from the OCR) with a layout that hopefully resembles your tables. Often, this can be parsed directly (e.g. with a Python script using regular expressions). If it can't be parsed (e.g. if the columns are not well separated in the text, the tables on each page are too different to each other in order to come up with a common structure for parsing, the pages are too skewed or rotated) then pdftabextract is the right software for you.
  3. pdftabextract is a set of tools. As such, it contains functions that are suitable for certain documents but not for others, and many functions require you to set parameters that depend on the layout, scan quality, etc. of your documents. You can't just use the example scripts blindly with your data. You will need to adjust parameters so that they work well with your documents. Below are some hints and explanations regarding those tools and their parameters.
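As a concrete illustration of the pdftotext -layout approach suggested above: when columns come out separated by runs of whitespace, a few lines of Python with regular expressions are often enough. The sample line below is hypothetical:

```python
import re

# a line as it might appear in `pdftotext -layout` output (illustrative sample)
line = "Smith, John        1978      Berlin         42.5"

# split on runs of 2 or more spaces, which -layout uses to separate columns;
# single spaces inside a field ("Smith, John") are preserved
fields = re.split(r"\s{2,}", line.strip())
print(fields)  # ['Smith, John', '1978', 'Berlin', '42.5']
```

If this kind of parsing works reliably across your pages, you don't need pdftabextract at all.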

Introduction

This repository contains a set of tools written in Python 3 with the aim to extract tabular data from (OCR-processed) PDF files. Before these files can be processed they need to be converted to XML files in pdf2xml format. This is very simple -- see section below for instructions.

Module overview

After converting your PDFs (see below), you can view the extracted text boxes with the pdf2xml-viewer tool if you like. The pdf2xml format can be loaded and parsed with functions in the common submodule. Lines can be detected in the scanned images using the imgproc module. If the pages are skewed or rotated, this can be detected and fixed with methods from imgproc and functions in textboxes. Lines or text box positions can be clustered in order to detect table columns and rows using the clustering module. Once columns and rows have been detected, they can be converted to a page grid with the extract module and their contents can be extracted with fit_texts_into_grid in the same module. extract also allows you to export the data as a pandas DataFrame.
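The clustering step can be illustrated with a minimal, self-contained sketch of the break-distance idea: start a new cluster whenever the gap between neighbouring positions exceeds a threshold. (The library's clustering module provides helpers for this; the code below is a standalone illustration, not its API.)

```python
def cluster_1d(positions, break_dist):
    """Group 1-D positions into clusters wherever the gap between
    sorted neighbours exceeds break_dist."""
    positions = sorted(positions)
    clusters = [[positions[0]]]
    for p in positions[1:]:
        if p - clusters[-1][-1] > break_dist:
            clusters.append([p])   # gap too large: start a new cluster
        else:
            clusters[-1].append(p)
    return clusters

# x-positions of detected vertical lines; three column clusters emerge
xs = [101, 99, 100, 298, 302, 300, 501, 499]
cols = cluster_1d(xs, break_dist=50)
print([sum(c) / len(c) for c in cols])  # cluster centers: [100.0, 300.0, 500.0]
```

The cluster centers then serve as the column (or row) positions from which a page grid is built.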

If your scanned pages are double pages, you will need to pre-process them with splitpages.

Examples and tutorials

An extensive tutorial was posted here and is derived from the Jupyter Notebook contained in the examples. There are more use-cases and demonstrations in the examples directory.

Features

  • load and parse files in pdf2xml format (common module)
  • split scanned double pages (splitpages module)
  • detect lines in scanned pages via image processing (imgproc module)
  • detect page rotation or skew and fix it (imgproc and textboxes module)
  • detect clusters in detected lines or text box positions in order to find column and row positions (clustering module)
  • extract tabular data and convert it to pandas DataFrame (which allows export to CSV, Excel, etc.) (extract module)
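The page-grid idea behind the extract module can be sketched in a few self-contained lines (this is an illustration of the technique, not the library's API): given column and row boundary coordinates, each OCR text box is assigned to a cell by its position.

```python
from bisect import bisect_right

def fit_into_grid(texts, col_edges, row_edges):
    """Place text boxes into a row/column grid by the position of their
    top-left corner. col_edges/row_edges are sorted cell boundary
    coordinates; texts are (x, y, string) tuples."""
    n_rows, n_cols = len(row_edges) - 1, len(col_edges) - 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for x, y, s in texts:
        col = bisect_right(col_edges, x) - 1   # which column interval holds x
        row = bisect_right(row_edges, y) - 1   # which row interval holds y
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] = (grid[row][col] + " " + s).strip()
    return grid

# hypothetical text boxes from a two-column, two-row table
texts = [(10, 10, "name"), (110, 10, "year"), (10, 60, "Smith"), (110, 60, "1978")]
grid = fit_into_grid(texts, col_edges=[0, 100, 200], row_edges=[0, 50, 100])
print(grid)  # [['name', 'year'], ['Smith', '1978']]
```

A nested list like this maps directly onto a pandas DataFrame, which is what the extract module's export step produces.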

Installation

This package is available on PyPI and can be installed via pip: pip install pdftabextract

Requirements

The requirements are listed in requirements.txt and are installed automatically if you use pip.

Only Python 3 -- No Python 2 support.

Converting PDF files to XML files with pdf2xml format

You need to convert your PDFs using the poppler-utils, a package which is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts. From this package we need the command pdftohtml and can create an XML file in pdf2xml format in the following way using the Terminal:

pdftohtml -c -hidden -xml input.pdf output.xml

The arguments input.pdf and output.xml are your input PDF file and the XML file to be created in pdf2xml format, respectively. It is important to specify the -hidden parameter when you're dealing with OCR-processed ("sandwich") PDFs. You can furthermore add the parameters -f n and -l n to convert only a range of pages, where n is the first and last page, respectively.

Usage and examples

For usage and background information, please read my series of blog posts about data mining PDFs.

See the following images of the example input/output:

Original page

Generated (and skewed) pdf2xml file viewed with pdf2xml-viewer

Detected lines

Detected clusters of vertical lines (columns)

Generated page grid viewed in pdf2xml-viewer

Excerpt of the extracted data

License

Apache License 2.0. See LICENSE file.