Top Related Projects
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Tesseract Open Source OCR Engine (main repository)
Tabula is a tool for liberating data tables trapped inside PDF files
A Python library to extract tabular data from PDFs
Community maintained fork of pdfminer - we fathom PDF
Quick Overview
Parsr is an open-source document understanding and data extraction tool developed by AXA Group. It transforms PDFs, office documents, and images into structured, enriched data, making it easier to analyze and process information from a variety of document formats.
Pros
- Supports multiple input formats (PDF, images, office documents)
- Provides a modular architecture for easy customization and extension
- Offers both CLI and API interfaces for integration into various workflows
- Includes advanced features like table extraction and document layout analysis
Cons
- Requires complex setup and dependencies
- May have performance issues with large or complex documents
- Limited support for handwritten text recognition
- Documentation could be more comprehensive for advanced use cases
Code Examples
- Basic usage with Node.js:
// Send a document to a running Parsr server and print the full JSON result
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';

client.send(file).then(result => {
  console.log(JSON.stringify(result, null, 2));
});
- Extracting text from a specific page:
// Parse only the first page and print the text of its paragraph elements
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';
const config = { pages: [1] };

client.send(file, config).then(result => {
  const text = result.pages[0].elements
    .filter(e => e.type === 'paragraph')
    .map(e => e.content)
    .join('\n');
  console.log(text);
});
- Extracting tables from a document:
// Enable table detection and collect every table element across all pages
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';
const config = { modules: { table_detection: { enabled: true } } };

client.send(file, config).then(result => {
  const tables = result.pages.flatMap(page =>
    page.elements.filter(e => e.type === 'table')
  );
  console.log(JSON.stringify(tables, null, 2));
});
Getting Started
- Install dependencies:
npm install parsr-client
- Start the Parsr server:
docker run -p 3001:3001 axarev/parsr
- Use the client in your Node.js application:
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';

client.send(file).then(result => {
  console.log(JSON.stringify(result, null, 2));
});
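If you would rather not use the Node.js client, the same flow can be driven directly against the HTTP API started in step 2. The sketch below is a minimal, illustrative Python example; the endpoint paths (/api/v1/document and /api/v1/json/<id>) and the polling logic are assumptions based on the Parsr API documentation, so verify them against the version you are running.

# Minimal sketch: drive the Parsr HTTP API with the requests library.
# Endpoint paths and response handling are assumptions based on the Parsr
# API documentation; check that documentation for the exact routes.
import time
import requests

BASE = 'http://localhost:3001/api/v1'

# Queue the document; the response body is assumed to contain the job id
with open('./path/to/document.pdf', 'rb') as f:
    job_id = requests.post(f'{BASE}/document', files={'file': f}).text

# Poll until the JSON result becomes available
while True:
    response = requests.get(f'{BASE}/json/{job_id}')
    if response.status_code == 200:
        break
    time.sleep(1)

# The result mirrors the structure used in the examples above:
# pages -> elements, each element carrying a type such as 'paragraph'
result = response.json()
paragraphs = [el for page in result['pages'] for el in page['elements']
              if el['type'] == 'paragraph']
print(len(paragraphs), 'paragraphs extracted')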
Competitor Comparisons
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Pros of OCRmyPDF
- Specializes in OCR for PDF files, making it more focused and potentially more efficient for this specific task
- Supports text recognition in multiple languages
- Integrates well with existing PDF workflows and tools
Cons of OCRmyPDF
- Limited to PDF files, while Parsr can handle various document formats
- Lacks advanced document structure analysis and data extraction features
- May require more manual intervention for complex document layouts
Code Comparison
OCRmyPDF:
import ocrmypdf
ocrmypdf.ocr('input.pdf', 'output.pdf', language='eng')
Parsr:
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');

client.send('input.pdf').then(result =>
  console.log(JSON.stringify(result, null, 2))
);
OCRmyPDF focuses on adding OCR layers to PDF files, while Parsr offers a more comprehensive document parsing solution with additional features for structure analysis and data extraction. OCRmyPDF is simpler to use for basic OCR tasks, whereas Parsr provides more flexibility for handling various document types and extracting structured data.
Tesseract Open Source OCR Engine (main repository)
Pros of Tesseract
- Mature and widely-used OCR engine with extensive language support
- Highly accurate for text recognition in various formats and languages
- Active community and continuous development
Cons of Tesseract
- Limited document structure analysis capabilities
- Requires pre-processing for optimal results with complex layouts
- Primarily focused on OCR, lacking advanced document parsing features
Code Comparison
Tesseract (Python):
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open('image.png'))
print(text)
Parsr (JavaScript):
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const config = { ... };

client.send('document.pdf', config).then(result => {
  console.log(result);
});
Tesseract excels in OCR tasks, offering a straightforward API for text extraction from images. Parsr, on the other hand, provides a more comprehensive document parsing solution, including layout analysis and data extraction capabilities.
While Tesseract focuses primarily on text recognition, Parsr offers additional features such as table detection, header/footer extraction, and document structure analysis. Tesseract is ideal for simple OCR tasks, whereas Parsr is better suited for complex document processing workflows that require more advanced parsing and analysis.
Tabula is a tool for liberating data tables trapped inside PDF files
Pros of Tabula
- Specifically designed for extracting tables from PDFs, making it highly effective for this task
- User-friendly GUI interface, making it accessible to non-technical users
- Supports multiple output formats, including CSV, TSV, and JSON
Cons of Tabula
- Limited to table extraction, lacking broader document parsing capabilities
- Requires Java Runtime Environment, which may be a barrier for some users
- Less actively maintained compared to Parsr, with fewer recent updates
Code Comparison
Tabula (Java):
PDDocument pdf = PDDocument.load(new File("document.pdf"));
ObjectExtractor extractor = new ObjectExtractor(pdf);
Page page = extractor.extract(1); // tabula-java pages are 1-indexed
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
List<Table> tables = sea.extract(page);
Parsr (JavaScript):
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const config = { ... };

client.send('document.pdf', config)
  .then(result => console.log(result));
While Tabula focuses on table extraction using Java, Parsr offers a more comprehensive document parsing solution in JavaScript, providing greater flexibility for various document types and structures.
A Python library to extract tabular data from PDFs
Pros of Camelot
- Python-based, making it easier for data scientists and analysts to integrate into existing workflows
- Supports both stream and lattice-based extraction methods, offering flexibility for different PDF layouts
- Extensive documentation and active community support
Cons of Camelot
- Limited to PDF files only, while Parsr supports multiple input formats
- May require additional dependencies (e.g., Ghostscript) for optimal performance
- Less comprehensive in terms of document cleaning and structuring compared to Parsr
Code Comparison
Camelot:
import camelot
tables = camelot.read_pdf('example.pdf')
df = tables[0].df
Parsr:
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');

client.send('example.pdf').then(result => {
  console.log(result);
});
Key Differences
- Language: Camelot is Python-based, while Parsr is primarily JavaScript/TypeScript
- Input formats: Camelot focuses on PDFs, Parsr supports multiple formats
- Functionality: Parsr offers more comprehensive document processing features
- Integration: Camelot is easier to integrate into data analysis workflows
- Community: Camelot has a larger user base and more active community
Both tools have their strengths, with Camelot excelling in PDF table extraction and Parsr offering broader document processing capabilities.
Community maintained fork of pdfminer - we fathom PDF
Pros of pdfminer.six
- Pure Python implementation, making it easier to install and use across different platforms
- More granular control over PDF parsing and extraction processes
- Supports modern Python 3 (the fork was originally created to bring Python 3 support to pdfminer)
Cons of pdfminer.six
- Less user-friendly for non-technical users compared to Parsr's GUI and API
- Requires more manual coding to achieve complex document processing tasks
- Limited built-in support for advanced document structure analysis
Code Comparison
pdfminer.six:
from pdfminer.high_level import extract_text
text = extract_text('document.pdf')
print(text)
Parsr:
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');

client.send('document.pdf').then(result => console.log(result));
pdfminer.six offers a more straightforward approach for basic text extraction, while Parsr provides a higher-level API with more built-in processing options. pdfminer.six gives developers more control over the extraction process, but requires more code for advanced tasks. Parsr simplifies complex document processing with its pre-built modules and configuration options, making it more suitable for users who need quick results without diving deep into PDF parsing intricacies.
README
Turn your documents into data!
Français | Portuguese | Spanish | 中文
- Parsr is a minimal-footprint document (image, pdf, docx, eml) cleaning, parsing and extraction toolchain which generates readily available, organized and usable data in JSON, Markdown (MD), CSV/Pandas DF or TXT formats.
- It provides analysts, data scientists and developers with a clean, structured and label-enriched information set for ready-to-use applications ranging from data entry and document analysis automation to archival and many others.
- Currently, Parsr can perform: document cleaning, hierarchy regeneration (words, lines, paragraphs), detection of headings, tables, lists, table of contents, page numbers, headers/footers, links, and others. Check out all the features.
Getting Started
Installation
-- The advanced installation guide is available here --
The quickest way to install and run the Parsr API is through the docker image:
docker pull axarev/parsr
If you also wish to install the GUI for sending documents and visualising results:
docker pull axarev/parsr-ui-localhost
Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.
Usage
-- The advanced usage guide is available here --
To run the API, issue:
docker run -p 3001:3001 axarev/parsr
which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
- To install the Python client for the Parsr API, issue:
pip install parsr-client
For a sample Jupyter notebook using the Python client, head over to the jupyter demo.
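As a rough illustration of a session with the Python client (the class and method names below are assumptions modelled on the jupyter demo and may differ between versions; the demo remains the authoritative reference):

# Illustrative sketch of the Python client workflow. ParsrClient,
# send_document, get_json and get_markdown are assumed from the jupyter
# demo; check the demo for the exact, current interface.
from parsr_client import ParsrClient

parsr = ParsrClient('localhost:3001')   # the API started with docker above

# Queue a document together with a JSON configuration file
parsr.send_document(
    file_path='./path/to/document.pdf',
    config_path='./defaultConfig.json',
    wait_till_finished=True,            # block until processing is complete
)

result = parsr.get_json()               # same structure as the API's JSON output
print(len(result['pages']), 'pages parsed')
print(parsr.get_markdown())             # Markdown rendition of the document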
- To use the GUI tool (the API needs to already be running), issue:
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
Then, access it through http://localhost:8080.
Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.
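For orientation, the configuration accepted by the API and the GUI is a JSON document that selects an extractor, an ordered list of cleaning and detection modules, and the output formats. The snippet below is only an illustrative sketch, written in Python for convenience; the key names and module names are assumptions modelled on the default configuration, and the Configuration documentation remains the authoritative reference.

# Illustrative sketch only: writes a Parsr-style configuration file.
# Key names and module names are assumptions based on the default
# configuration; consult the Configuration documentation for the real schema.
import json

config = {
    'version': 0.9,
    'extractor': {
        'pdf': 'pdfminer',       # PDF extraction backend
        'ocr': 'tesseract',      # OCR engine for scanned pages and images
        'language': ['eng'],
    },
    # Ordered pipeline of cleaning/detection modules (names illustrative)
    'cleaner': [
        'out-of-page-removal',
        'whitespace-removal',
        'table-detection',
        'header-footer-detection',
        'heading-detection',
        'list-detection',
        'link-detection',
        'hierarchy',
    ],
    'output': {
        'granularity': 'word',
        'formats': {'json': True, 'markdown': True, 'text': True, 'csv': True},
    },
}

with open('customConfig.json', 'w') as f:
    json.dump(config, f, indent=2)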
The API based usage and the command line usage are documented in the advanced usage guide.
Documentation
All documentation files can be found here.
Contribute
Please refer to the contribution guidelines.
Third Party Licenses
Licenses of the third-party libraries used as dependencies:
- QPDF: Apache http://qpdf.sourceforge.net
- ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
- Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
- PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
- Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
- Camelot: MIT https://github.com/camelot-dev/camelot
- MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
- Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc
License
Copyright 2020 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).