PDFLayoutTextStripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful, for instance, for extracting the content of a table in a PDF file. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).


Top Related Projects

itext-java

2,118

iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.

pdfbox

2,857

Mirror of Apache PDFBox

qpdf

4,163

qpdf: A content-preserving PDF document transformer

pdf.js

51,137

PDF Reader in JavaScript

pdfparser

2,560

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.

Quick Overview

JonathanLink/PDFLayoutTextStripper is a Java library that extracts text from PDF documents while preserving the layout of the original page. It is a subclass of the PDFTextStripper class from the Apache PDFBox library, so it can be used wherever a PDFTextStripper is expected.

Pros

Layout-Preserving Extraction: The output text keeps the layout of the original PDF, which helps when extracting content from tables and forms.
Drop-in Subclass: Because the class extends PDFTextStripper, it is used through the same getText call as the standard PDFBox stripper.
Inherited Configuration: Setters inherited from PDFTextStripper, such as setStartPage and setEndPage, remain available.
Maven Availability: The library is published as a Maven artifact (io.github.jonathanlink:PDFLayoutTextStripper).

Cons

Dependency on PDFBox: The library requires Apache PDFBox (version 2.0.0 or later) and its dependencies, commons-logging and fontbox.
Limited Documentation: The README covers only installation and a single sample, so new users may also need to consult the PDFBox documentation.
Potential Performance Impact: The extra layout processing may add some overhead compared to the base PDFTextStripper; the README reports no measurements.
Specific to PDF Layout Extraction: The library focuses on layout-preserving text extraction and does not cover other PDF tasks such as creating or editing documents.

Code Examples

Here are a few code examples demonstrating the usage of the PDFLayoutTextStripper library:

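import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Load the PDF with Apache PDFBox 2.x and extract layout-preserving text
try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
    PDFTextStripper stripper = new PDFLayoutTextStripper();
    String text = stripper.getText(document);
    System.out.println(text);
}

This code loads example.pdf with PDFBox, creates a PDFLayoutTextStripper instance, and uses it to extract the text. This fragment assumes that PDFLayoutTextStripper is on the classpath and that the enclosing method handles IOException.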

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Restrict extraction to a page range using setters inherited from PDFTextStripper
try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
    PDFTextStripper stripper = new PDFLayoutTextStripper();
    stripper.setStartPage(1); // first page to extract (1-based)
    stripper.setEndPage(5);   // last page to extract (inclusive)
    String text = stripper.getText(document);
    System.out.println(text);
}

This example demonstrates how to customize the PDFLayoutTextStripper instance through setStartPage and setEndPage, which it inherits from PDFTextStripper.

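import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Extract text one page at a time by narrowing the page range to a single page
try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
    for (int pageNum = 1; pageNum <= document.getNumberOfPages(); pageNum++) {
        PDFTextStripper stripper = new PDFLayoutTextStripper();
        stripper.setStartPage(pageNum);
        stripper.setEndPage(pageNum);
        System.out.println("Page " + pageNum + ":\n" + stripper.getText(document) + "\n");
    }
}

This code shows how to extract text from a PDF file page by page by setting the start and end page to the same page number before each getText call.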

Getting Started

To get started with the PDFLayoutTextStripper library, follow these steps:

Add the dependency to your Maven project:

<dependency>
  <groupId>io.github.jonathanlink</groupId>
  <artifactId>PDFLayoutTextStripper</artifactId>
  <version>2.2.3</version>
</dependency>

Load a PDF with PDFBox and create a PDFLayoutTextStripper instance:

PDDocument document = PDDocument.load(new File("example.pdf"));
PDFTextStripper stripper = new PDFLayoutTextStripper();

Extract text from the PDF file:

String text = stripper.getText(document);
System.out.println(text);

Customize the text extraction process with the setters inherited from PDFTextStripper:

stripper.setStartPage(1);
stripper.setEndPage(5);

Extract text page by page, then close the document:

for (int pageNum = 1; pageNum <= document.getNumberOfPages(); pageNum++) {
    stripper.setStartPage(pageNum);
    stripper.setEndPage(pageNum);
    System.out.println(stripper.getText(document));
}
document.close();

Competitor Comparisons

itext-java

2,118

Pros of itext/itext-java

Comprehensive library for PDF manipulation, including text extraction, layout analysis, and more.
Widely used and well-documented, with a large community and extensive support.
Supports a wide range of PDF features and functionalities.

Cons of itext/itext-java

Relatively complex and can have a steep learning curve for beginners.
Dual-licensed under the AGPL and a commercial license, which may not be suitable for all use cases.
Can be resource-intensive for large or complex PDF documents.

Code Comparison

JonathanLink/PDFLayoutTextStripper

PDDocument document = PDDocument.load(new File("example.pdf"));
PDFLayoutTextStripper stripper = new PDFLayoutTextStripper();
String text = stripper.getText(document);
System.out.println(text);
document.close();

itext/itext-java

PdfDocument pdfDoc = new PdfDocument(new PdfReader("example.pdf"));
StringBuilder text = new StringBuilder();
for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) {
    text.append(PdfTextExtractor.getTextFromPage(pdfDoc.getPage(i)));
}
System.out.println(text.toString());
pdfDoc.close();

pdfbox

2,857

Mirror of Apache PDFBox

Pros of PDFBox

Comprehensive PDF manipulation capabilities, including text extraction, image extraction, and metadata manipulation.
Actively maintained and supported by the Apache Software Foundation, with a large community and extensive documentation.
Supports a wide range of PDF features and can handle complex PDF documents.

Cons of PDFBox

Larger and more complex than PDFLayoutTextStripper, which may be overkill for simple text extraction tasks.
Steeper learning curve due to the extensive API and feature set.
May have higher resource requirements (memory, CPU) for large PDF documents.

Code Comparison

PDFBox (extracting text from a PDF):

PDDocument document = PDDocument.load(new File("example.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
System.out.println(text);
document.close();

PDFLayoutTextStripper (extracting text from pages 1 to 5 of a PDF):

PDDocument document = PDDocument.load(new File("example.pdf"));
PDFLayoutTextStripper stripper = new PDFLayoutTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(5);
String text = stripper.getText(document);
System.out.println(text);
document.close();

qpdf

4,163

qpdf: A content-preserving PDF document transformer

Pros of qpdf

qpdf is a comprehensive PDF manipulation tool that can perform a wide range of operations, including splitting, merging, and optimizing PDF files.
The project has a large and active community, with regular updates and a well-documented API.
qpdf is written in C++ and is designed to be highly efficient and performant.

Cons of qpdf

The learning curve for qpdf may be steeper than for a more specialized tool like PDFLayoutTextStripper, as it offers a broader set of features.
qpdf does not extract text; it performs structural, content-preserving transformations, whereas layout-aware text extraction is the primary focus of PDFLayoutTextStripper.

Code Comparison

PDFLayoutTextStripper (Java):

public class PDFLayoutTextStripper extends PDFTextStripper {
    public PDFLayoutTextStripper() throws IOException {
        super();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        // Custom text extraction logic
    }
}

qpdf (C++):

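#include <qpdf/QPDF.hh>
#include <qpdf/QPDFWriter.hh>
#include <exception>
#include <iostream>

int main(int argc, char* argv[]) {
    if (argc != 3) {
        std::cerr << "usage: " << argv[0] << " in.pdf out.pdf" << std::endl;
        return 2;
    }
    try {
        QPDF pdf;
        pdf.processFile(argv[1]);        // read the input file
        QPDFWriter writer(pdf, argv[2]); // write a copy to the output file
        writer.write();
    } catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
        return 1;
    }
    return 0;
}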

pdf.js

51,137

PDF Reader in JavaScript

Pros of PDF.js

PDF.js is a widely-used, open-source library for rendering PDF documents in web browsers, with a large and active community.
It supports a wide range of PDF features and provides a comprehensive set of APIs for integrating PDF viewing functionality into web applications.
PDF.js is highly customizable and can be easily integrated into various web frameworks and platforms.

Cons of PDF.js

PDF.js is primarily a browser-based renderer with a larger codebase than PDFLayoutTextStripper, which can make it more complex to set up and maintain.
It targets rendering rather than layout-preserving plain-text output, so producing table-like text from a PDF takes additional work.

Code Comparison

PDFLayoutTextStripper:

public class PDFLayoutTextStripper extends PDFTextStripper {
    public PDFLayoutTextStripper() throws IOException {
        super();
    }

    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        // Custom text extraction logic
        System.out.println(string);
    }
}

PDF.js:

const pdfjsLib = window['pdfjs-dist/build/pdf'];

pdfjsLib.getDocument(url).promise.then((pdf) => {
  pdf.getPage(1).then((page) => {
    const scale = 1.5;
    const viewport = page.getViewport({ scale: scale });

    const canvas = document.createElement('canvas');
    const context = canvas.getContext('2d');
    canvas.height = viewport.height;
    canvas.width = viewport.width;

    const renderContext = {
      canvasContext: context,
      viewport: viewport
    };
    page.render(renderContext);
  });
});

pdfparser

2,560

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.

Pros of smalot/pdfparser

Supports a wider range of PDF features, including annotations, bookmarks, and metadata extraction.
Provides a more comprehensive set of methods for interacting with PDF documents.
Offers better error handling and exception management.

Cons of smalot/pdfparser

Slightly more complex to set up and configure compared to JonathanLink/PDFLayoutTextStripper.
May have a steeper learning curve for users who are primarily interested in text extraction.
Potentially slower performance for simple text extraction tasks.

Code Comparison

JonathanLink/PDFLayoutTextStripper:

PDFLayoutTextStripper stripper = new PDFLayoutTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(numPages);
String text = stripper.getText(document);

smalot/pdfparser:

$pdf = new \Smalot\PdfParser\Parser();
$document = $pdf->parseFile('example.pdf');
$text = $document->getText();


README

PDFLayoutTextStripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content from a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of the PDFTextStripper class (from the Apache PDFBox library).
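To make the idea of "keeping the layout" concrete, here is a minimal sketch of one way text fragments can be placed on a row of character cells according to their horizontal position. It is an illustration only, not the library's actual implementation: the Fragment type, the renderLine method, and the 4-point cell width are all assumptions made for this example, and the fragments are assumed to be sorted from left to right.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the code used by PDFLayoutTextStripper. */
class LayoutGridSketch {

    /** Hypothetical text fragment: its text and the x-coordinate (in points) where it starts. */
    static final class Fragment {
        final double x;
        final String text;

        Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /** Assumed width of one output character cell, in points. */
    static final double CHAR_WIDTH_PT = 4.0;

    /** Renders the fragments of one line (sorted by x) onto a row of character cells. */
    static String renderLine(List<Fragment> fragments) {
        StringBuilder line = new StringBuilder();
        for (Fragment f : fragments) {
            int column = (int) Math.round(f.x / CHAR_WIDTH_PT);
            // Pad with spaces until the fragment's target column is reached.
            while (line.length() < column) {
                line.append(' ');
            }
            line.append(f.text);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Fragment> row = Arrays.asList(
                new Fragment(0, "Item"),
                new Fragment(80, "Qty"),
                new Fragment(120, "Price"));
        System.out.println(renderLine(row));
    }
}
```

With these inputs, "Qty" starts at column 20 and "Price" at column 30, so columns that are aligned in the PDF stay aligned in the text output.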

Use cases

Data extraction from a table in a PDF file

Data extraction from a form in a PDF file
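Once the layout is preserved, a table row arrives as a single line in which the cells are separated by runs of spaces. The sketch below shows one simple way to turn such a line into cells. It assumes that cells are separated by at least two spaces and that no cell is empty; TableRowSplitter and splitCells are hypothetical names for this example and are not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a layout-preserved text line into table cells. */
class TableRowSplitter {

    /** Splits on runs of two or more spaces, so single spaces inside a cell are kept. */
    static List<String> splitCells(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split(" {2,}")) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // A made-up line in the style of layout-preserved output.
        String row = "  Line 12    Central Station    08:15  ";
        System.out.println(splitCells(row));
    }
}
```

For tables with empty cells or narrow gutters, slicing each line at fixed column offsets taken from the header row is likely to be more robust than splitting on whitespace.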

How to install

Maven

<dependency>
  <groupId>io.github.jonathanlink</groupId>
  <artifactId>PDFLayoutTextStripper</artifactId>
  <version>2.2.3</version>
</dependency>

Manual

Install Apache PDFBox manually (for example, v2.0.6) together with its two dependencies, commons-logging.jar and fontbox.

Warning: only PDFBox versions 2.0.0 and later are compatible with this version of PDFLayoutTextStripper.java.
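If you manage the JARs by hand, a small guard can catch an incompatible PDFBox version early. The following is a minimal sketch that compares a dotted version string against the 2.0.0 minimum stated above. PdfBoxVersionCheck and isAtLeast are hypothetical names, the version string is assumed to be purely numeric (for example "2.0.6"), and how you obtain it (for instance from the JAR file name) is left to you.

```java
/** Hypothetical helper: compares a numeric dotted version against a required minimum. */
class PdfBoxVersionCheck {

    /** Returns true if version is greater than or equal to major.minor.patch. */
    static boolean isAtLeast(String version, int major, int minor, int patch) {
        String[] parts = version.split("\\.");
        int[] required = {major, minor, patch};
        for (int i = 0; i < required.length; i++) {
            // Missing components are treated as 0, so "2" is read as 2.0.0.
            int actual = i < parts.length ? Integer.parseInt(parts[i]) : 0;
            if (actual != required[i]) {
                return actual > required[i];
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAtLeast("2.0.6", 2, 0, 0));
        System.out.println(isAtLeast("1.8.16", 2, 0, 0));
    }
}
```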

How to use on Linux/Mac

cd PDFLayoutTextStripper
javac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java
java -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test

How to use on Windows

The same as for Linux (see above) but replace : with ;
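The separator difference comes from the platform's classpath convention, which Java exposes as File.pathSeparator (":" on Linux and macOS, ";" on Windows). The sketch below builds the classpath string from the JAR paths used above so the same code works on either platform. ClasspathBuilder and join are hypothetical names for this example.

```java
import java.io.File;

/** Hypothetical helper: joins classpath entries with a given separator. */
class ClasspathBuilder {

    static String join(String separator, String... entries) {
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        // File.pathSeparator is ":" on Linux/macOS and ";" on Windows.
        String classpath = join(File.pathSeparator,
                ".",
                "/pathto/pdfbox-2.0.6.jar",
                "/pathto/commons-logging-1.2.jar",
                "/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar");
        System.out.println(classpath);
    }
}
```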

Sample code

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class test {
    public static void main(String[] args) {
        String string = null;
        try {
            // Parse the sample PDF with the PDFBox 2.x parser
            PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("./samples/bus.pdf"), "r"));
            pdfParser.parse();
            // try-with-resources closes the document once the text has been extracted
            try (PDDocument pdDocument = new PDDocument(pdfParser.getDocument())) {
                PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                string = pdfTextStripper.getText(pdDocument);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(string);
    }
}
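The sample prints the extracted text to standard output. To end up with an actual text file, the string returned by getText can be written to disk. The sketch below shows one way to do that with the standard library only; LayoutTextWriter, save, and load are hypothetical names, UTF-8 is an assumed encoding, and the sample string stands in for real getText output.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: stores layout-preserved text in a UTF-8 file and reads it back. */
class LayoutTextWriter {

    /** Writes the text unchanged, so the spaces that carry the layout are preserved. */
    static void save(String layoutText, Path target) {
        try {
            Files.write(target, layoutText.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file back as a single string. */
    static String load(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by pdfTextStripper.getText(pdDocument).
        String extracted = "Item      Qty\nApple       3\n";
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "bus.txt");
        save(extracted, target);
        System.out.print(load(target));
    }
}
```

View the resulting file in a monospaced font; the column alignment relies on every character having the same width.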

Contributors

Thanks to

Dmytro Zelinskyy for reporting an issue with its correction (v2.2.3)
Ho Ting Cheng for reporting an issue (v2.1)
James Sullivan for having updated the code to make it work with the latest version of PDFBox (v2.0)
