PDFLayoutTextStripper
Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).
Top Related Projects
iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.
Mirror of Apache PDFBox
qpdf: A content-preserving PDF document transformer
PDF Reader in JavaScript
PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
Quick Overview
JonathanLink/PDFLayoutTextStripper is a Java library that extracts text from PDF documents while preserving the layout of the original pages. It is a subclass of the PDFTextStripper class from the Apache PDFBox library, so it can be used wherever a PDFTextStripper is expected.
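To illustrate what keeping the layout can mean in practice, here is a minimal sketch that places text fragments into a character grid according to their horizontal position. It is not the library's actual implementation; the Fragment type, the CHAR_WIDTH_PT value, and the render method are all hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; this is not the library's actual implementation.
class LayoutGridSketch {
    // Assumed width of one output character, in PDF points (hypothetical value).
    static final double CHAR_WIDTH_PT = 4.0;

    // Hypothetical text fragment: left edge in points, zero-based line index, content.
    static final class Fragment {
        final double x;
        final int line;
        final String text;

        Fragment(double x, int line, String text) {
            this.x = x;
            this.line = line;
            this.text = text;
        }
    }

    // Renders fragments into plain text; fragments on a line must be ordered left to right.
    static String render(List<Fragment> fragments, int lineCount) {
        StringBuilder[] lines = new StringBuilder[lineCount];
        for (int i = 0; i < lineCount; i++) {
            lines[i] = new StringBuilder();
        }
        for (Fragment f : fragments) {
            int column = (int) Math.round(f.x / CHAR_WIDTH_PT);
            StringBuilder line = lines[f.line];
            // Pad with spaces up to the target column; overlapping text is simply appended.
            while (line.length() < column) {
                line.append(' ');
            }
            line.append(f.text);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lineCount; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(lines[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> fragments = Arrays.asList(
            new Fragment(0, 0, "Name"), new Fragment(40, 0, "Qty"),
            new Fragment(0, 1, "Apple"), new Fragment(40, 1, "3"));
        System.out.println(render(fragments, 2));
    }
}
```

Because each fragment is padded out to a column derived from its x coordinate, values that share a horizontal position in the PDF end up vertically aligned in the text output, which is what makes table columns recoverable afterwards.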
Pros
- Layout-Preserving Extraction: The extracted text keeps the layout of the original PDF, which makes it useful for pulling content out of tables and forms.
- Inherited Configuration: As a PDFTextStripper subclass, it inherits the parent class's settings, such as setStartPage and setEndPage for limiting the page range.
- Maven Availability: The library is published under the io.github.jonathanlink group ID, so it can be added as a regular Maven dependency.
- Compatibility with Apache PDFBox: The library is built on top of PDFBox, allowing users to leverage the existing functionality and ecosystem.
Cons
- Dependency on Apache PDFBox: The library relies on PDFBox (version 2.0.0 or later), which in turn requires commons-logging and fontbox.
- Limited Documentation: The project's documentation could be more comprehensive, making it challenging for new users to get started quickly.
- Potential Performance Impact: The extra layout processing may make extraction slower than with the base PDFTextStripper class.
- Specific to PDF Layout Extraction: The library is focused on PDF layout text extraction and may not be suitable for other PDF-related tasks.
Code Examples
Here are a few code examples demonstrating the usage of the PDFLayoutTextStripper library:
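import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
// Load the PDF (PDFBox 2.x API); try-with-resources closes it when done
try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
    // Create a PDFLayoutTextStripper instance
    PDFTextStripper stripper = new PDFLayoutTextStripper();
    // Extract the text while keeping the layout
    String text = stripper.getText(document);
    System.out.println(text);
}
This code loads example.pdf with PDFBox, creates a PDFLayoutTextStripper instance, and uses it to extract the text. It assumes the enclosing method declares throws IOException.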
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
    // Create a PDFLayoutTextStripper instance with custom settings
    PDFTextStripper stripper = new PDFLayoutTextStripper();
    // Page numbers are 1-based; both setters are inherited from PDFTextStripper
    stripper.setStartPage(1);
    stripper.setEndPage(5);
    // Extract text from the selected pages only
    String text = stripper.getText(document);
    System.out.println(text);
}
This example customizes the PDFLayoutTextStripper instance through settings inherited from PDFTextStripper, here restricting extraction to pages 1 through 5.
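import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
    PDFTextStripper stripper = new PDFLayoutTextStripper();
    // Extract text page by page by narrowing the page range to one page at a time
    for (int pageNum = 1; pageNum <= document.getNumberOfPages(); pageNum++) {
        stripper.setStartPage(pageNum);
        stripper.setEndPage(pageNum);
        System.out.println("Page " + pageNum + ":\n" + stripper.getText(document) + "\n");
    }
}
This code extracts text from a PDF file page by page by setting the start and end page to the same page number before each getText call.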
Getting Started
To get started with the PDFLayoutTextStripper library, follow these steps:
- Add the Maven dependency:
<dependency>
    <groupId>io.github.jonathanlink</groupId>
    <artifactId>PDFLayoutTextStripper</artifactId>
    <version>2.2.3</version>
</dependency>
- Load a PDF with PDFBox and create a PDFLayoutTextStripper instance:
// Call document.close() once you are done with the document
PDDocument document = PDDocument.load(new File("example.pdf"));
PDFTextStripper stripper = new PDFLayoutTextStripper();
- Extract text from the PDF file:
String text = stripper.getText(document);
System.out.println(text);
- Customize the text extraction process through the settings inherited from PDFTextStripper:
stripper.setStartPage(1);
stripper.setEndPage(5);
- Extract text page by page by setting the start page and the end page to the same page number before each call to getText.
Competitor Comparisons
iText for Java represents the next level of SDKs for developers that want to take advantage of the benefits PDF can bring. Equipped with a better document engine, high and low-level programming capabilities and the ability to create, edit and enhance PDF documents, iText can be a boon to nearly every workflow.
Pros of itext/itext-java
- Comprehensive library for PDF manipulation, including text extraction, layout analysis, and more.
- Widely used and well-documented, with a large community and extensive support.
- Supports a wide range of PDF features and functionalities.
Cons of itext/itext-java
- Relatively complex and can have a steep learning curve for beginners.
- Dual licensing model (AGPL or a commercial license), which may not be suitable for all use cases.
- Can be resource-intensive for large or complex PDF documents.
Code Comparison
JonathanLink/PDFLayoutTextStripper
PDDocument document = PDDocument.load(new File("example.pdf"));
PDFLayoutTextStripper stripper = new PDFLayoutTextStripper();
String text = stripper.getText(document);
System.out.println(text);
document.close();
itext/itext-java
// iText 7+ API (com.itextpdf.kernel.pdf and com.itextpdf.kernel.pdf.canvas.parser)
PdfDocument pdfDoc = new PdfDocument(new PdfReader("example.pdf"));
StringBuilder text = new StringBuilder();
for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) {
    text.append(PdfTextExtractor.getTextFromPage(pdfDoc.getPage(i)));
}
System.out.println(text.toString());
pdfDoc.close();
Mirror of Apache PDFBox
Pros of PDFBox
- Comprehensive PDF manipulation capabilities, including text extraction, image extraction, and metadata manipulation.
- Actively maintained and supported by the Apache Software Foundation, with a large community and extensive documentation.
- Supports a wide range of PDF features and can handle complex PDF documents.
Cons of PDFBox
- Larger and more complex than PDFLayoutTextStripper, which may be overkill for simple text extraction tasks.
- Steeper learning curve due to the extensive API and feature set.
- May have higher resource requirements (memory, CPU) for large PDF documents.
Code Comparison
PDFBox (extracting text from a PDF):
PDDocument document = PDDocument.load(new File("example.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
System.out.println(text);
document.close();
PDFLayoutTextStripper (extracting text from a PDF):
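PDDocument document = PDDocument.load(new File("example.pdf"));
PDFLayoutTextStripper stripper = new PDFLayoutTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(5);
// getText takes a PDDocument, not a File
String text = stripper.getText(document);
System.out.println(text);
document.close();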
qpdf: A content-preserving PDF document transformer
Pros of qpdf
- qpdf is a comprehensive PDF manipulation tool that can perform a wide range of operations, including splitting, merging, and optimizing PDF files.
- The project has a large and active community, with regular updates and a well-documented API.
- qpdf is written in C++ and is designed to be highly efficient and performant.
Cons of qpdf
- The learning curve for qpdf may be steeper than for a more specialized tool like PDFLayoutTextStripper, as it offers a broader set of features.
- qpdf may not be as well-suited for specific tasks like text extraction from PDF layouts, which is the primary focus of PDFLayoutTextStripper.
Code Comparison
PDFLayoutTextStripper (Java):
public class PDFLayoutTextStripper extends PDFTextStripper {
    public PDFLayoutTextStripper() throws IOException {
        super();
    }

    @Override
    protected void writeString(String string, TextPosition text) throws IOException {
        // Custom text extraction logic
    }
}
qpdf (C++):
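#include <iostream>
#include <qpdf/QPDF.hh>
#include <qpdf/QPDFWriter.hh>

int main(int argc, char* argv[]) {
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " in.pdf out.pdf" << std::endl;
        return 1;
    }
    try {
        QPDF pdf;
        pdf.processFile(argv[1]);
        // Writing goes through QPDFWriter rather than the QPDF object itself
        QPDFWriter writer(pdf, argv[2]);
        writer.write();
    } catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
        return 1;
    }
    return 0;
}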
PDF Reader in JavaScript
Pros of PDF.js
- PDF.js is a widely-used, open-source library for rendering PDF documents in web browsers, with a large and active community.
- It supports a wide range of PDF features and provides a comprehensive set of APIs for integrating PDF viewing functionality into web applications.
- PDF.js is highly customizable and can be easily integrated into various web frameworks and platforms.
Cons of PDF.js
- PDF.js may have a larger codebase and dependencies compared to PDFLayoutTextStripper, which could make it more complex to set up and maintain.
- The performance of PDF.js may be slightly lower than a more specialized tool like PDFLayoutTextStripper, especially for simple text extraction tasks.
Code Comparison
PDFLayoutTextStripper:
public class PDFLayoutTextStripper extends PDFTextStripper {
    public PDFLayoutTextStripper() throws IOException {
        super();
    }

    @Override
    protected void writeString(String string, TextPosition text) throws IOException {
        // Custom text extraction logic
        System.out.println(string);
    }
}
PDF.js:
const pdfjsLib = window['pdfjs-dist/build/pdf'];
pdfjsLib.getDocument(url).promise.then((pdf) => {
    pdf.getPage(1).then((page) => {
        const scale = 1.5;
        const viewport = page.getViewport({ scale: scale });
        const canvas = document.createElement('canvas');
        const context = canvas.getContext('2d');
        canvas.height = viewport.height;
        canvas.width = viewport.width;
        const renderContext = {
            canvasContext: context,
            viewport: viewport
        };
        page.render(renderContext);
    });
});
PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
Pros of smalot/pdfparser
- Supports a wider range of PDF features, including annotations, bookmarks, and metadata extraction.
- Provides a more comprehensive set of methods for interacting with PDF documents.
- Offers better error handling and exception management.
Cons of smalot/pdfparser
- Slightly more complex to set up and configure compared to JonathanLink/PDFLayoutTextStripper.
- May have a steeper learning curve for users who are primarily interested in text extraction.
- Potentially slower performance for simple text extraction tasks.
Code Comparison
JonathanLink/PDFLayoutTextStripper:
PDFLayoutTextStripper stripper = new PDFLayoutTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(numPages);
String text = stripper.getText(document);
smalot/pdfparser:
$pdf = new \Smalot\PdfParser\Parser();
$document = $pdf->parseFile('example.pdf');
$text = $document->getText();
README
PDFLayoutTextStripper
Converts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content from a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of PDFTextStripper class (from the Apache PDFBox library).
Use cases
Data extraction from a table in a PDF file
Data extraction from a form in a PDF file
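For the table use case, one possible post-processing step is to split each line of the layout-preserved text into cells wherever two or more spaces occur in a row. The sketch below assumes that columns are separated by at least two spaces and that cells never contain such a gap themselves; the TableRowSplitter class and its splitRows method are hypothetical names for illustration and are not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for post-processing layout-preserved text; not part of the library.
class TableRowSplitter {
    // Splits each non-blank line into cells on runs of two or more whitespace characters.
    static List<List<String>> splitRows(String layoutText) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : layoutText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between rows
            }
            rows.add(Arrays.asList(trimmed.split("\\s{2,}")));
        }
        return rows;
    }

    public static void main(String[] args) {
        String layoutText = "Route   Departure   Arrival\n\n12      08:15       09:02\n";
        for (List<String> row : splitRows(layoutText)) {
            System.out.println(row);
        }
    }
}
```

This approach loses empty cells, since a missing value just looks like a wider gap. When that matters, slicing each line at fixed character offsets taken from the header row is a more robust alternative.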
How to install
Maven
<dependency>
    <groupId>io.github.jonathanlink</groupId>
    <artifactId>PDFLayoutTextStripper</artifactId>
    <version>2.2.3</version>
</dependency>
Manual
- Install Apache PDFBox manually (for example v2.0.6) together with its two dependencies, commons-logging and fontbox.
Warning: only PDFBox versions 2.0.0 and later are compatible with this version of PDFLayoutTextStripper.java.
How to use on Linux/Mac
cd PDFLayoutTextStripper
javac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java
java -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test
How to use on Windows
Use the same commands as on Linux/Mac (see above), but replace the : classpath separator with ;
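The separator difference comes from the platform's path-list convention, which Java exposes as File.pathSeparator (":" on Linux/Mac, ";" on Windows). As a small sketch, the hypothetical ClasspathSketch helper below joins classpath entries with whichever separator is passed in; the jar paths are the placeholder paths from the commands above.

```java
import java.io.File;

// Hypothetical helper that builds a classpath string for the current platform.
class ClasspathSketch {
    // Joins the entries with the given separator (":" on Linux/Mac, ";" on Windows).
    static String classpath(String separator, String... entries) {
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        // File.pathSeparator resolves to the right separator for the running OS.
        System.out.println(classpath(File.pathSeparator,
            ".",
            "/pathto/pdfbox-2.0.6.jar",
            "/pathto/commons-logging-1.2.jar",
            "/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar"));
    }
}
```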
Sample code
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class test {
    public static void main(String[] args) {
        String string = null;
        try {
            // Parse the sample PDF with the PDFBox 2.x parser
            PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("./samples/bus.pdf"), "r"));
            pdfParser.parse();
            // try-with-resources closes the document once the text has been extracted
            try (PDDocument pdDocument = new PDDocument(pdfParser.getDocument())) {
                PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                string = pdfTextStripper.getText(pdDocument);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(string);
    }
}
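The sample above prints the extracted text to standard output. Since the stated goal is to convert a PDF into a text file, a natural follow-up is to save that string to disk. The sketch below uses only the standard java.nio.file API; the TextFileSketch class, its save and load methods, and the output file name are hypothetical and not part of the library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for saving extracted text; not part of the library.
class TextFileSketch {
    // Writes the text as UTF-8, replacing any existing file, and returns the path.
    static Path save(String text, Path target) {
        try {
            return Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file back as a UTF-8 string.
    static String load(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In real use, the string would come from pdfTextStripper.getText(pdDocument).
        String extracted = "Name      Qty\nApple     3\n";
        Path out = save(extracted, Paths.get("bus.txt"));
        System.out.print(load(out));
    }
}
```

Writing the bytes unchanged matters here: the column alignment depends on runs of spaces, so the text should not be trimmed or re-wrapped on the way to disk.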
Contributors
Thanks to
- Dmytro Zelinskyy for reporting an issue along with its fix (v2.2.3)
- Ho Ting Cheng for reporting an issue (v2.1)
- James Sullivan for having updated the code to make it work with the latest version of PDFBox (v2.0)