Top Related Projects
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Tesseract Open Source OCR Engine (main repository)
Tabula is a tool for liberating data tables trapped inside PDF files
A Python library to extract tabular data from PDFs
Community maintained fork of pdfminer - we fathom PDF
Quick Overview
Parsr is an open-source document understanding and data extraction tool developed by AXA Group. It transforms PDFs, office documents, and images into structured, enriched data, making it easier to analyze and process information from a variety of document formats.
Pros
- Supports multiple input formats (PDF, images, office documents)
- Provides a modular architecture for easy customization and extension
- Offers both CLI and API interfaces for integration into various workflows
- Includes advanced features like table extraction and document layout analysis
Cons
- Requires complex setup and dependencies
- May have performance issues with large or complex documents
- Limited support for handwritten text recognition
- Documentation could be more comprehensive for advanced use cases
Code Examples
- Basic usage with Node.js:
// Send a document to a running Parsr server and print the full JSON result
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';

client.send(file).then(result => {
  console.log(JSON.stringify(result, null, 2));
});
- Extracting text from a specific page:
// Parse only the first page and print the text of its paragraph elements
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';
const config = { pages: [1] };

client.send(file, config).then(result => {
  const text = result.pages[0].elements
    .filter(e => e.type === 'paragraph')
    .map(e => e.content)
    .join('\n');
  console.log(text);
});
- Extracting tables from a document:
// Enable table detection and collect every table element across all pages
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';
const config = { modules: { table_detection: { enabled: true } } };

client.send(file, config).then(result => {
  const tables = result.pages.flatMap(page =>
    page.elements.filter(e => e.type === 'table')
  );
  console.log(JSON.stringify(tables, null, 2));
});
Getting Started
- Install dependencies:
npm install parsr-client
- Start the Parsr server:
docker run -p 3001:3001 axarev/parsr
- Use the client in your Node.js application:
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';

client.send(file).then(result => {
  console.log(JSON.stringify(result, null, 2));
});
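If you would rather not use the Node.js client, the same flow can be driven directly against the HTTP API started in step 2. The sketch below is a minimal, illustrative Python example; the endpoint paths (/api/v1/document and /api/v1/json/<id>) and the polling logic are assumptions based on the Parsr API documentation, so verify them against the version you are running.

# Minimal sketch: drive the Parsr HTTP API with the requests library.
# Endpoint paths and response handling are assumptions based on the Parsr
# API documentation; check that documentation for the exact routes.
import time
import requests

BASE = 'http://localhost:3001/api/v1'

# Queue the document; the response body is assumed to contain the job id
with open('./path/to/document.pdf', 'rb') as f:
    job_id = requests.post(f'{BASE}/document', files={'file': f}).text

# Poll until the JSON result becomes available
while True:
    response = requests.get(f'{BASE}/json/{job_id}')
    if response.status_code == 200:
        break
    time.sleep(1)

# The result mirrors the structure used in the examples above:
# pages -> elements, each element carrying a type such as 'paragraph'
result = response.json()
paragraphs = [el for page in result['pages'] for el in page['elements']
              if el['type'] == 'paragraph']
print(len(paragraphs), 'paragraphs extracted')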
Competitor Comparisons
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Pros of OCRmyPDF
- Specializes in OCR for PDF files, making it more focused and potentially more efficient for this specific task
- Supports text recognition in multiple languages
- Integrates well with existing PDF workflows and tools
Cons of OCRmyPDF
- Limited to PDF files, while Parsr can handle various document formats
- Lacks advanced document structure analysis and data extraction features
- May require more manual intervention for complex document layouts
Code Comparison
OCRmyPDF:
import ocrmypdf
ocrmypdf.ocr('input.pdf', 'output.pdf', language='eng')
Parsr:
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');

client.send('input.pdf').then(result =>
  console.log(JSON.stringify(result, null, 2))
);
OCRmyPDF focuses on adding OCR layers to PDF files, while Parsr offers a more comprehensive document parsing solution with additional features for structure analysis and data extraction. OCRmyPDF is simpler to use for basic OCR tasks, whereas Parsr provides more flexibility for handling various document types and extracting structured data.
Tesseract Open Source OCR Engine (main repository)
Pros of Tesseract
- Mature and widely-used OCR engine with extensive language support
- Highly accurate for text recognition in various formats and languages
- Active community and continuous development
Cons of Tesseract
- Limited document structure analysis capabilities
- Requires pre-processing for optimal results with complex layouts
- Primarily focused on OCR, lacking advanced document parsing features
Code Comparison
Tesseract (Python):
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open('image.png'))
print(text)
Parsr (JavaScript):
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const config = { ... };

client.send('document.pdf', config).then(result => {
  console.log(result);
});
Tesseract excels in OCR tasks, offering a straightforward API for text extraction from images. Parsr, on the other hand, provides a more comprehensive document parsing solution, including layout analysis and data extraction capabilities.
While Tesseract focuses primarily on text recognition, Parsr offers additional features such as table detection, header/footer extraction, and document structure analysis. Tesseract is ideal for simple OCR tasks, whereas Parsr is better suited for complex document processing workflows that require more advanced parsing and analysis.
Tabula is a tool for liberating data tables trapped inside PDF files
Pros of Tabula
- Specifically designed for extracting tables from PDFs, making it highly effective for this task
- User-friendly GUI interface, making it accessible to non-technical users
- Supports multiple output formats, including CSV, TSV, and JSON
Cons of Tabula
- Limited to table extraction, lacking broader document parsing capabilities
- Requires Java Runtime Environment, which may be a barrier for some users
- Less actively maintained compared to Parsr, with fewer recent updates
Code Comparison
Tabula (Java):
PDDocument pdf = PDDocument.load(new File("document.pdf"));
ObjectExtractor extractor = new ObjectExtractor(pdf);
Page page = extractor.extract(1); // tabula-java pages are 1-indexed
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
List<Table> tables = sea.extract(page);
Parsr (JavaScript):
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');
const config = { ... };

client.send('document.pdf', config)
  .then(result => console.log(result));
While Tabula focuses on table extraction using Java, Parsr offers a more comprehensive document parsing solution in JavaScript, providing greater flexibility for various document types and structures.
A Python library to extract tabular data from PDFs
Pros of Camelot
- Python-based, making it easier for data scientists and analysts to integrate into existing workflows
- Supports both stream and lattice-based extraction methods, offering flexibility for different PDF layouts
- Extensive documentation and active community support
Cons of Camelot
- Limited to PDF files only, while Parsr supports multiple input formats
- May require additional dependencies (e.g., Ghostscript) for optimal performance
- Less comprehensive in terms of document cleaning and structuring compared to Parsr
Code Comparison
Camelot:
import camelot
tables = camelot.read_pdf('example.pdf')
df = tables[0].df
Parsr:
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');

client.send('example.pdf').then(result => {
  console.log(result);
});
Key Differences
- Language: Camelot is Python-based, while Parsr is primarily JavaScript/TypeScript
- Input formats: Camelot focuses on PDFs, Parsr supports multiple formats
- Functionality: Parsr offers more comprehensive document processing features
- Integration: Camelot is easier to integrate into data analysis workflows
- Community: Camelot has a larger user base and more active community
Both tools have their strengths, with Camelot excelling in PDF table extraction and Parsr offering broader document processing capabilities.
Community maintained fork of pdfminer - we fathom PDF
Pros of pdfminer.six
- Pure Python implementation, making it easier to install and use across different platforms
- More granular control over PDF parsing and extraction processes
- Supports modern Python 3 (the fork was originally created to bring Python 3 support to pdfminer)
Cons of pdfminer.six
- Less user-friendly for non-technical users compared to Parsr's GUI and API
- Requires more manual coding to achieve complex document processing tasks
- Limited built-in support for advanced document structure analysis
Code Comparison
pdfminer.six:
from pdfminer.high_level import extract_text
text = extract_text('document.pdf')
print(text)
Parsr:
const Parsr = require('parsr-client');
const client = new Parsr('http://localhost:3001');

client.send('document.pdf').then(result => console.log(result));
pdfminer.six offers a more straightforward approach for basic text extraction, while Parsr provides a higher-level API with more built-in processing options. pdfminer.six gives developers more control over the extraction process, but requires more code for advanced tasks. Parsr simplifies complex document processing with its pre-built modules and configuration options, making it more suitable for users who need quick results without diving deep into PDF parsing intricacies.
README
Turn your documents into data!
Français | Portuguese | Spanish | 中文
- Parsr is a minimal-footprint document (image, pdf, docx, eml) cleaning, parsing and extraction toolchain which generates readily available, organized and usable data in JSON, Markdown (MD), CSV/Pandas DF or TXT formats.
- It provides analysts, data scientists and developers with a clean, structured and label-enriched information set for ready-to-use applications ranging from data entry and document analysis automation to archival and many others.
- Currently, Parsr can perform: document cleaning, hierarchy regeneration (words, lines, paragraphs), detection of headings, tables, lists, table of contents, page numbers, headers/footers, links, and others. Check out all the features.
Getting Started
Installation
-- The advanced installation guide is available here --
The quickest way to install and run the Parsr API is through the docker image:
docker pull axarev/parsr
If you also wish to install the GUI for sending documents and visualising results:
docker pull axarev/parsr-ui-localhost
Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.
Usage
-- The advanced usage guide is available here --
To run the API, issue:
docker run -p 3001:3001 axarev/parsr
which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
- To install the Python client for the Parsr API, issue:
pip install parsr-client
For a sample Jupyter notebook using the Python client, head over to the jupyter demo.
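As a rough illustration of a session with the Python client (the class and method names below are assumptions modelled on the jupyter demo and may differ between versions; the demo remains the authoritative reference):

# Illustrative sketch of the Python client workflow. ParsrClient,
# send_document, get_json and get_markdown are assumed from the jupyter
# demo; check the demo for the exact, current interface.
from parsr_client import ParsrClient

parsr = ParsrClient('localhost:3001')   # the API started with docker above

# Queue a document together with a JSON configuration file
parsr.send_document(
    file_path='./path/to/document.pdf',
    config_path='./defaultConfig.json',
    wait_till_finished=True,            # block until processing is complete
)

result = parsr.get_json()               # same structure as the API's JSON output
print(len(result['pages']), 'pages parsed')
print(parsr.get_markdown())             # Markdown rendition of the document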
- To use the GUI tool (the API needs to already be running), issue:
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
Then, access it through http://localhost:8080.
Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.
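For orientation, the configuration accepted by the API and the GUI is a JSON document that selects an extractor, an ordered list of cleaning and detection modules, and the output formats. The snippet below is only an illustrative sketch, written in Python for convenience; the key names and module names are assumptions modelled on the default configuration, and the Configuration documentation remains the authoritative reference.

# Illustrative sketch only: writes a Parsr-style configuration file.
# Key names and module names are assumptions based on the default
# configuration; consult the Configuration documentation for the real schema.
import json

config = {
    'version': 0.9,
    'extractor': {
        'pdf': 'pdfminer',       # PDF extraction backend
        'ocr': 'tesseract',      # OCR engine for scanned pages and images
        'language': ['eng'],
    },
    # Ordered pipeline of cleaning/detection modules (names illustrative)
    'cleaner': [
        'out-of-page-removal',
        'whitespace-removal',
        'table-detection',
        'header-footer-detection',
        'heading-detection',
        'list-detection',
        'link-detection',
        'hierarchy',
    ],
    'output': {
        'granularity': 'word',
        'formats': {'json': True, 'markdown': True, 'text': True, 'csv': True},
    },
}

with open('customConfig.json', 'w') as f:
    json.dump(config, f, indent=2)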
The API based usage and the command line usage are documented in the advanced usage guide.
Documentation
All documentation files can be found here.
Contribute
Please refer to the contribution guidelines.
Third Party Licenses
Licenses of the third-party libraries used as dependencies:
- QPDF: Apache http://qpdf.sourceforge.net
- ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
- Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
- PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
- Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
- Camelot: MIT https://github.com/camelot-dev/camelot
- MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
- Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc
License
Copyright 2020 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).