
axa-group / Parsr

Transforms PDF, Documents and Images into Enriched Structured Data


Top Related Projects

  • OCRmyPDF (14,359 stars): adds an OCR text layer to scanned PDF files, allowing them to be searched
  • Tesseract (63,942 stars): Tesseract Open Source OCR Engine (main repository)
  • Tabula (6,831 stars): a tool for liberating data tables trapped inside PDF files
  • Camelot (3,058 stars): a Python library to extract tabular data from PDFs
  • pdfminer.six: community-maintained fork of pdfminer ("we fathom PDF")

Quick Overview

Parsr is an open-source document understanding and data extraction tool developed by AXA Group. It transforms PDF, documents, and images into structured and enriched data, making it easier to analyze and process information from various document formats.

Pros

  • Supports multiple input formats (PDF, images, office documents)
  • Provides a modular architecture for easy customization and extension
  • Offers both CLI and API interfaces for integration into various workflows
  • Includes advanced features like table extraction and document layout analysis

Cons

  • Requires complex setup and dependencies
  • May have performance issues with large or complex documents
  • Limited support for handwritten text recognition
  • Documentation could be more comprehensive for advanced use cases

Code Examples

  1. Basic usage with Node.js:
const Parsr = require('parsr-client');

const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';

client.send(file).then(result => {
  console.log(JSON.stringify(result, null, 2));
});
  2. Extracting text from a specific page:
const Parsr = require('parsr-client');

const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';
const config = { pages: [1] };

client.send(file, config).then(result => {
  const text = result.pages[0].elements
    .filter(e => e.type === 'paragraph')
    .map(e => e.content)
    .join('\n');
  console.log(text);
});
  3. Extracting tables from a document:
const Parsr = require('parsr-client');

const client = new Parsr('http://localhost:3001');
const file = './path/to/document.pdf';
const config = { modules: { table_detection: { enabled: true } } };

client.send(file, config).then(result => {
  const tables = result.pages.flatMap(page => 
    page.elements.filter(e => e.type === 'table')
  );
  console.log(JSON.stringify(tables, null, 2));
});
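
For readers working in Python, the filtering done in examples 2 and 3 can be sketched on a plain dictionary. This is only an illustration: it assumes the parsed result has the shape used in the snippets above (a `pages` list whose items hold an `elements` list with `type` and `content` keys), and the `sample` data is hand-written, not real Parsr output. The function names are hypothetical.

```python
def paragraphs_text(result, page_index=0):
    """Join the content of every paragraph element on one page."""
    elements = result["pages"][page_index]["elements"]
    return "\n".join(e["content"] for e in elements if e["type"] == "paragraph")


def all_tables(result):
    """Collect table elements from every page of the result."""
    return [
        element
        for page in result["pages"]
        for element in page["elements"]
        if element["type"] == "table"
    ]


# Hand-written sample in the assumed shape (not real Parsr output).
sample = {
    "pages": [
        {"elements": [
            {"type": "heading", "content": "Report"},
            {"type": "paragraph", "content": "First paragraph."},
            {"type": "table", "content": [["a", "b"], ["1", "2"]]},
            {"type": "paragraph", "content": "Second paragraph."},
        ]},
        {"elements": [{"type": "table", "content": [["c"], ["3"]]}]},
    ]
}

print(paragraphs_text(sample))
print(len(all_tables(sample)))
```

If the real output nests content differently, only the two key lookups inside the helpers should need to change.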

Getting Started

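  1. Install the client (these snippets assume a Node.js package named parsr-client; the README below documents the Python client, installed with pip install parsr-client):

    npm install parsr-client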
    
  2. Start the Parsr server:

    docker run -p 3001:3001 axarev/parsr
    
  3. Use the client in your Node.js application:

    const Parsr = require('parsr-client');
    const client = new Parsr('http://localhost:3001');
    
    const file = './path/to/document.pdf';
    client.send(file).then(result => {
      console.log(JSON.stringify(result, null, 2));
    });
    

Competitor Comparisons

Parsr vs OCRmyPDF

Pros of OCRmyPDF

  • Specializes in OCR for PDF files, making it more focused and potentially more efficient for this specific task
  • Supports text recognition in multiple languages
  • Integrates well with existing PDF workflows and tools

Cons of OCRmyPDF

  • Limited to PDF files, while Parsr can handle various document formats
  • Lacks advanced document structure analysis and data extraction features
  • May require more manual intervention for complex document layouts

Code Comparison

OCRmyPDF:

import ocrmypdf

ocrmypdf.ocr('input.pdf', 'output.pdf', language='eng')

Parsr:

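const Parsr = require('parsr-client');

// Same client interface as in the Code Examples section
const client = new Parsr('http://localhost:3001');
client.send('input.pdf', {
  output_format: 'json',
  document_type: 'pdf'
}).then(result => console.log(result));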

OCRmyPDF focuses on adding OCR layers to PDF files, while Parsr offers a more comprehensive document parsing solution with additional features for structure analysis and data extraction. OCRmyPDF is simpler to use for basic OCR tasks, whereas Parsr provides more flexibility for handling various document types and extracting structured data.

Parsr vs Tesseract

Pros of Tesseract

  • Mature and widely-used OCR engine with extensive language support
  • Highly accurate for text recognition in various formats and languages
  • Active community and continuous development

Cons of Tesseract

  • Limited document structure analysis capabilities
  • Requires pre-processing for optimal results with complex layouts
  • Primarily focused on OCR, lacking advanced document parsing features

Code Comparison

Tesseract (Python):

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('image.png'))
print(text)

Parsr (JavaScript):

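const Parsr = require('parsr-client');

// Same client interface as in the Code Examples section
const client = new Parsr('http://localhost:3001');
const config = {}; // module options go here
const file = 'document.pdf';

client.send(file, config).then(result => {
  console.log(result);
});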

Tesseract excels in OCR tasks, offering a straightforward API for text extraction from images. Parsr, on the other hand, provides a more comprehensive document parsing solution, including layout analysis and data extraction capabilities.

While Tesseract focuses primarily on text recognition, Parsr offers additional features such as table detection, header/footer extraction, and document structure analysis. Tesseract is ideal for simple OCR tasks, whereas Parsr is better suited for complex document processing workflows that require more advanced parsing and analysis.

Parsr vs Tabula

Pros of Tabula

  • Specifically designed for extracting tables from PDFs, making it highly effective for this task
  • User-friendly GUI, making it accessible to non-technical users
  • Supports multiple output formats, including CSV, TSV, and JSON

Cons of Tabula

  • Limited to table extraction, lacking broader document parsing capabilities
  • Requires Java Runtime Environment, which may be a barrier for some users
  • Less actively maintained compared to Parsr, with fewer recent updates

Code Comparison

Tabula (Java):

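PDDocument pdf = PDDocument.load(new File("document.pdf"));
// Tabula's Page comes from ObjectExtractor, not from PDDocument.getPage
ObjectExtractor extractor = new ObjectExtractor(pdf);
Page page = extractor.extract(1); // page numbers start at 1
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
List<Table> tables = sea.extract(page);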

Parsr (JavaScript):

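const Parsr = require('parsr-client');
// Same client interface as in the Code Examples section
const client = new Parsr('http://localhost:3001');
const config = { modules: { table_detection: { enabled: true } } };
client.send('document.pdf', config)
  .then(result => console.log(result));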

While Tabula focuses on table extraction using Java, Parsr offers a more comprehensive document parsing solution in JavaScript, providing greater flexibility for various document types and structures.

Parsr vs Camelot

Pros of Camelot

  • Python-based, making it easier for data scientists and analysts to integrate into existing workflows
  • Supports both stream and lattice-based extraction methods, offering flexibility for different PDF layouts
  • Extensive documentation and active community support

Cons of Camelot

  • Limited to PDF files only, while Parsr supports multiple input formats
  • May require additional dependencies (e.g., Ghostscript) for optimal performance
  • Less comprehensive in terms of document cleaning and structuring compared to Parsr

Code Comparison

Camelot:

import camelot
tables = camelot.read_pdf('example.pdf')
df = tables[0].df

Parsr:

const Parsr = require('parsr-client');
// Same client interface as in the Code Examples section
const client = new Parsr('http://localhost:3001');
client.send('example.pdf').then(result => {
  console.log(result);
});

Key Differences

  • Language: Camelot is Python-based, while Parsr is primarily JavaScript/TypeScript
  • Input formats: Camelot focuses on PDFs, Parsr supports multiple formats
  • Functionality: Parsr offers more comprehensive document processing features
  • Integration: Camelot is easier to integrate into data analysis workflows
  • Community: Camelot has a larger user base and more active community

Both tools have their strengths, with Camelot excelling in PDF table extraction and Parsr offering broader document processing capabilities.

Parsr vs pdfminer.six

Pros of pdfminer.six

  • Pure Python implementation, making it easier to install and use across different platforms
  • More granular control over PDF parsing and extraction processes
  • Began as a fork that ran on both Python 2 and Python 3; current releases support Python 3 only

Cons of pdfminer.six

  • Less user-friendly for non-technical users compared to Parsr's GUI and API
  • Requires more manual coding to achieve complex document processing tasks
  • Limited built-in support for advanced document structure analysis

Code Comparison

pdfminer.six:

from pdfminer.high_level import extract_text

text = extract_text('document.pdf')
print(text)

Parsr:

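const Parsr = require('parsr-client');

// Same client interface as in the Code Examples section
const client = new Parsr('http://localhost:3001');
client.send('document.pdf', { output_format: 'text' })
  .then(result => console.log(result));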

pdfminer.six offers a more straightforward approach for basic text extraction, while Parsr provides a higher-level API with more built-in processing options. pdfminer.six gives developers more control over the extraction process, but requires more code for advanced tasks. Parsr simplifies complex document processing with its pre-built modules and configuration options, making it more suitable for users who need quick results without diving deep into PDF parsing intricacies.


README


Turn your documents into data!


  • Parsr is a minimal-footprint document (image, PDF, DOCX, EML) cleaning, parsing and extraction toolchain that generates readily available, organized and usable data in JSON, Markdown (MD), CSV/Pandas DF or TXT formats.

  • It provides analysts, data scientists and developers with a clean, structured and label-enriched information set for ready-to-use applications, ranging from data entry and document analysis automation to archival and many others.

  • Currently, Parsr can perform: document cleaning, hierarchy regeneration (words, lines, paragraphs), detection of headings, tables, lists, table of contents, page numbers, headers/footers, links, and others. Check out all the features.
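
To give a feel for what "hierarchy regeneration" means at the word-to-line level, here is a toy sketch. It is not Parsr's actual algorithm: the `Word` fields, the `group_words_into_lines` helper and the `tolerance` value are all assumptions made for illustration. It groups word boxes into lines when their vertical centres nearly coincide, then orders each line left to right.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: float
    top: float
    height: float


def group_words_into_lines(words, tolerance=0.5):
    """Group words whose vertical centres fall within tolerance * height of a line's first word."""
    lines = []
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        center = word.top + word.height / 2
        for line in lines:
            ref = line[0]
            ref_center = ref.top + ref.height / 2
            if abs(center - ref_center) <= tolerance * ref.height:
                line.append(word)
                break
        else:
            # No existing line is close enough, so this word starts a new one.
            lines.append([word])
    # Within each line, read the words from left to right.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.left)) for line in lines]


words = [
    Word("world", 60, 10.5, 12),
    Word("Hello", 10, 10, 12),
    Word("Second", 10, 30, 12),
    Word("line", 70, 29.5, 12),
]
print(group_words_into_lines(words))
```

A real implementation also has to cope with columns, rotated text and varying font sizes, which this sketch ignores.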

Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
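
As a rough sketch of what a client does once the API is running, the snippet below builds (but does not send) a multipart upload using only the Python standard library. The endpoint path `/api/v1/document`, the form field names `file` and `config`, and the empty placeholder config are assumptions to be checked against the API documentation; `build_document_request` is a hypothetical helper, not part of any Parsr client.

```python
import json
import uuid
import urllib.request


def build_document_request(base_url, file_name, file_bytes, config):
    """Build a multipart/form-data POST carrying a document and a JSON config."""
    boundary = uuid.uuid4().hex
    parts = [
        # (field name, file name, content type, payload); field names are assumed
        ("file", file_name, "application/pdf", file_bytes),
        ("config", "config.json", "application/json", json.dumps(config).encode("utf-8")),
    ]
    body = b""
    for field, name, content_type, payload in parts:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{name}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        )
        body += header.encode("utf-8") + payload + b"\r\n"
    body += f"--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/document",  # assumed endpoint path
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Placeholder bytes and an empty config, for illustration only.
request = build_document_request(
    "http://localhost:3001", "document.pdf", b"%PDF-1.4 placeholder", {}
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen(request)` would submit it to a running server; the response format is described in the API documentation and is not assumed here.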

  1. To install the Python client for the Parsr API, issue:

    pip install parsr-client

    To try the Jupyter notebook that uses the Python client, head over to the Jupyter demo.

  2. To use the GUI tool (the API needs to already be running), issue:

    docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest

    Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

API-based usage and command-line usage are documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Licenses of Parsr's third-party dependencies:

  1. QPDF: Apache http://qpdf.sourceforge.net
  2. ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
  3. Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
  4. PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
  5. Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
  6. Camelot: MIT https://github.com/camelot-dev/camelot
  7. MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
  8. Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Copyright 2020 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).