mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

Top Related Projects

  • UniLM (19,492): Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
  • PaddleOCR (42,444): Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
  • EasyOCR (23,625): Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.
  • Tesseract (60,774): Tesseract Open Source OCR Engine (main repository)
  • deep-text-recognition-benchmark: Text recognition (optical character recognition) with deep learning methods, ICCV 2019

Quick Overview

docTR (Document Text Recognition) is an open-source library for OCR-related tasks. It provides a comprehensive set of tools for document analysis, including text detection, recognition, and layout analysis, using deep learning models. docTR aims to simplify the process of extracting text from PDFs and images.

Pros

  • Comprehensive OCR pipeline with both text detection and recognition capabilities
  • Supports multiple languages and document formats
  • Easy-to-use API with pre-trained models for quick implementation
  • Actively maintained and regularly updated

Cons

  • Requires significant computational resources for optimal performance
  • Limited documentation for advanced customization
  • Dependency on specific versions of deep learning frameworks may cause compatibility issues
  • Performance may vary depending on document quality and complexity

Code Examples

  1. Basic text detection and recognition:
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("path/to/your/image.jpg")
result = model(doc)
print(result.export())
  2. Extracting text from a specific page:
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_pdf("path/to/your/document.pdf")
page = doc[0]
result = model([page])
print(result.pages[0].export())
  3. Visualizing detection results:
import matplotlib.pyplot as plt

from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from doctr.utils.visualization import visualize_page

model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("path/to/your/image.jpg")
result = model(doc)
# visualize_page takes the exported page dict and the page image (requires matplotlib & mplcursors)
visualize_page(result.pages[0].export(), doc[0], interactive=True)
plt.show()

Getting Started

To get started with DocTR, follow these steps:

  1. Install the library:
pip install python-doctr
  2. Use the library in your Python script:
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load a pre-trained model
model = ocr_predictor(pretrained=True)

# Load your document
doc = DocumentFile.from_images("path/to/your/image.jpg")

# Perform OCR
result = model(doc)

# Print the extracted text (use result.export() for the full nested dict instead)
print(result.render())

This basic example demonstrates how to load a pre-trained model, process an image, and extract the text content.

Competitor Comparisons

Pros of UniLM

  • Broader scope, covering multiple NLP tasks beyond document understanding
  • Larger community and more frequent updates
  • Backed by Microsoft, potentially offering more resources and support

Cons of UniLM

  • More complex to set up and use for specific document processing tasks
  • Larger codebase, which may be overwhelming for simpler projects
  • Requires more computational resources due to its comprehensive nature

Code Comparison

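UniLM (TrOCR text recognition, one of the models from the UniLM repository, loaded here through Hugging Face transformers):

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# TrOCR recognizes a single cropped text line, not a full page
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("path/to/your/text_line.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]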

docTR (for document text recognition):

from doctr.models import ocr_predictor
from doctr.io import DocumentFile

model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("path/to/your/image.jpg")
result = model(doc)

Both repositories offer tools that can be applied to document processing. UniLM covers a wide range of pre-trained models for language and document understanding tasks, while docTR focuses specifically on document text recognition with a simpler, more targeted API.

Pros of PaddleOCR

  • More comprehensive OCR toolkit with support for multiple languages and scripts
  • Includes pre-trained models for various OCR tasks, including text detection and recognition
  • Offers end-to-end OCR pipeline with additional features like layout analysis and table recognition

Cons of PaddleOCR

  • Steeper learning curve due to its extensive features and PaddlePaddle framework
  • Larger codebase and dependencies, which may impact deployment and integration
  • Less focus on document understanding compared to docTR's document-centric approach

Code Comparison

PaddleOCR:

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en')
result = ocr.ocr('image.jpg')

docTR:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images('image.jpg')
result = model(doc)

Both libraries offer simple APIs for OCR tasks. PaddleOCR exposes language selection and angle classification directly in its constructor, while docTR focuses on document-specific features and has a more streamlined API for document processing.

Pros of EasyOCR

  • Supports a wide range of languages (80+)
  • Simple and user-friendly API
  • Includes pre-trained models for quick setup

Cons of EasyOCR

  • Less focus on document structure analysis
  • May have lower accuracy for complex layouts
  • Limited customization options for model training

Code Comparison

EasyOCR:

import easyocr
reader = easyocr.Reader(['en'])
result = reader.readtext('image.jpg')

docTR:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor
model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images('image.jpg')
result = model(doc)

EasyOCR provides a simpler API for basic OCR tasks, while docTR offers more advanced document analysis capabilities. EasyOCR is better suited for quick, multi-language OCR tasks, whereas docTR excels in handling complex document structures and layouts. The choice between the two depends on the specific requirements of the project, such as language support, document complexity, and the need for detailed structural analysis.

Pros of Tesseract

  • Mature and widely-used OCR engine with extensive language support
  • Highly accurate for printed text recognition
  • Supports multiple output formats (text, hOCR, PDF, etc.)

Cons of Tesseract

  • Slower processing speed compared to modern deep learning-based approaches
  • Less effective for handwritten text and complex document layouts
  • Requires more manual configuration and preprocessing for optimal results

Code Comparison

Tesseract (Python wrapper):

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('image.png'))
print(text)

docTR:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images('image.png')
result = model(doc)
print(result.export())

docTR is a more modern, deep learning-based OCR solution that may perform better on complex layouts and handwritten text. It provides an end-to-end approach with built-in preprocessing and layout analysis. Tesseract, while more established and widely used, may require additional steps for optimal results but offers broader language support and multiple output formats.

Pros of deep-text-recognition-benchmark

  • Comprehensive benchmark for various text recognition models
  • Includes multiple datasets and evaluation metrics
  • Provides a flexible framework for comparing different architectures

Cons of deep-text-recognition-benchmark

  • Less focus on end-to-end document processing
  • May require more setup and configuration for specific use cases
  • Limited built-in support for document layout analysis

Code Comparison

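deep-text-recognition-benchmark (excerpt; `opt`, `device`, `Model`, and `AttnLabelConverter` come from the repository's own training script):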

model = Model(opt)
converter = AttnLabelConverter(opt.character)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0).to(device)

docTR:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
doc = DocumentFile.from_images('path/to/your/image.jpg')
result = model(doc)

The deep-text-recognition-benchmark code focuses on model setup and training, while docTR provides a more streamlined approach for document OCR tasks. deep-text-recognition-benchmark offers greater flexibility in model architecture, but docTR simplifies applying OCR to documents with pre-trained models and built-in document analysis capabilities.

README

Optical Character Recognition made seamless & accessible to anyone, powered by TensorFlow 2 & PyTorch

What you can expect from this repository:

  • efficient ways to parse textual information (localize and identify each word) from your documents
  • guidance on how to integrate this in your current architecture

Quick Tour

Getting your pretrained model

End-to-End OCR is achieved in docTR using a two-stage approach: text detection (localizing words), then text recognition (identifying all characters in the word). As such, you can select the architecture used for text detection, and the one for text recognition, from the list of available implementations.

from doctr.models import ocr_predictor

model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)

Reading files

Documents can be interpreted from PDF or images:

from doctr.io import DocumentFile
# PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Image
single_img_doc = DocumentFile.from_images("path/to/your/img.jpg")
# Webpage (requires `weasyprint` to be installed)
webpage_doc = DocumentFile.from_url("https://www.yoursite.com")
# Multiple page images
multi_img_doc = DocumentFile.from_images(["path/to/page1.jpg", "path/to/page2.jpg"])
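
When document sources only become known at runtime, one option is to pick the constructor from the source string itself. The following is a minimal sketch and not part of docTR: `loader_name_for` is a hypothetical helper, and the set of image extensions is an assumption about the files you expect to handle.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed set of image extensions; adjust to the formats you actually handle.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def loader_name_for(source: str) -> str:
    """Return the name of the DocumentFile constructor suited to `source`."""
    if urlparse(source).scheme in ("http", "https"):
        return "from_url"
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "from_pdf"
    if suffix in IMAGE_SUFFIXES:
        return "from_images"
    raise ValueError(f"Unsupported document source: {source!r}")


for src in ("path/to/your/doc.pdf", "path/to/your/img.jpg", "https://www.yoursite.com"):
    print(src, "->", loader_name_for(src))
```

With docTR installed, the returned name could then be resolved with `getattr(DocumentFile, loader_name_for(src))(src)`.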

Putting it together

Let's use the default pretrained model for an example:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Analyze
result = model(doc)

Dealing with rotated documents

Should you use docTR on documents that include rotated pages, or pages with multiple box orientations, you have multiple options to handle it:

  • If you only use straight document pages with straight words (horizontal, same reading direction), consider passing assume_straight_pages=True to the ocr_predictor. It will directly fit straight boxes on your page and return straight boxes, which makes it the fastest option.

  • If you want the predictor to output straight boxes (no matter the orientation of your pages, the final localizations will be converted to straight boxes), you need to pass export_as_straight_boxes=True in the predictor. Otherwise, if assume_straight_pages=False, it will return rotated bounding boxes (potentially with an angle of 0°).

If both options are set to False, the predictor will always fit and return rotated boxes.
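
To make the difference between the two box types concrete: a rotated box can be described by its four corner points, while a straight box needs only two opposite corners. The sketch below shows the conversion conceptually, by taking the axis-aligned envelope of the four corners. It is an illustration only, not docTR's internal code; `to_straight_box` is a hypothetical helper, and the coordinates are assumed to be relative to the page size.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def to_straight_box(polygon: Sequence[Point]) -> Tuple[Point, Point]:
    """Return ((xmin, ymin), (xmax, ymax)) enclosing all corners of `polygon`."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys)), (max(xs), max(ys))


# A slightly tilted word box, given as four (x, y) corners.
rotated = [(0.10, 0.20), (0.30, 0.15), (0.32, 0.25), (0.12, 0.30)]
print(to_straight_box(rotated))
```

The envelope is always at least as large as the rotated box, which is why straight boxes on tilted text tend to include some background.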

To interpret your model's predictions, you can visualize them interactively as follows:

# Display the result (requires matplotlib & mplcursors to be installed)
result.show()

Or even rebuild the original document from its predictions:

import matplotlib.pyplot as plt

synthetic_pages = result.synthesize()
plt.imshow(synthetic_pages[0]); plt.axis('off'); plt.show()

The ocr_predictor returns a Document object with a nested structure (with Page, Block, Line, Word, Artefact). To get a better understanding of our document model, check our documentation.

You can also export them as a nested dict, more appropriate for JSON format:

json_output = result.export()
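
Since the nested dict mirrors the Document structure, plain loops are enough to pull out individual words. The sketch below assumes the export layout `pages -> blocks -> lines -> words`, with each word carrying `value`, `confidence`, and a relative `geometry` of the form `((xmin, ymin), (xmax, ymax))`, and each page carrying `dimensions` as `(height, width)`. Check these keys against the output of your installed version. `iter_words` and the sample dict are illustrative only.

```python
from typing import Any, Dict, Iterator, Tuple

PixelBox = Tuple[int, int, int, int]


def iter_words(export: Dict[str, Any]) -> Iterator[Tuple[int, str, float, PixelBox]]:
    """Yield (page_idx, text, confidence, pixel_box) for every word."""
    for page in export["pages"]:
        height, width = page["dimensions"]
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    (xmin, ymin), (xmax, ymax) = word["geometry"]
                    # Scale relative coordinates to pixel coordinates.
                    box = (round(xmin * width), round(ymin * height),
                           round(xmax * width), round(ymax * height))
                    yield page["page_idx"], word["value"], word["confidence"], box


# Hand-written stand-in for `result.export()`, used only to exercise the helper.
sample_export = {
    "pages": [{
        "page_idx": 0,
        "dimensions": (1000, 800),
        "blocks": [{
            "lines": [{
                "words": [
                    {"value": "Hello", "confidence": 0.99, "geometry": ((0.1, 0.1), (0.3, 0.15))},
                    {"value": "world", "confidence": 0.87, "geometry": ((0.35, 0.1), (0.5, 0.15))},
                ],
            }],
        }],
    }],
}

words = list(iter_words(sample_export))
print(" ".join(text for _, text, _, _ in words))
```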

Use the KIE predictor

The KIE predictor is a more flexible predictor compared to OCR as your detection model can detect multiple classes in a document. For example, you can have a detection model to detect just dates and addresses in a document.

The KIE predictor makes it possible to use a detector with multiple classes together with a recognition model, with the whole pipeline already set up for you.

from doctr.io import DocumentFile
from doctr.models import kie_predictor

# Model
model = kie_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Analyze
result = model(doc)

predictions = result.pages[0].predictions
for class_name in predictions.keys():
    list_predictions = predictions[class_name]
    for prediction in list_predictions:
        print(f"Prediction for {class_name}: {prediction}")

The KIE predictor returns its results per page as a dictionary: each key is a class name, and its value is the list of predictions for that class.
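
A common follow-up is to keep only the confident predictions for each class. The sketch below assumes each prediction object exposes `value` and `confidence` attributes. The `SimpleNamespace` objects stand in for real predictions, and `filter_predictions`, the class names, and the 0.5 threshold are all hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def filter_predictions(predictions: Dict[str, List[Any]], min_confidence: float = 0.5) -> Dict[str, List[str]]:
    """Map each class name to the text of its predictions at or above the threshold."""
    return {
        class_name: [p.value for p in preds if p.confidence >= min_confidence]
        for class_name, preds in predictions.items()
    }


# Stand-in for `result.pages[0].predictions`.
fake_predictions = {
    "date": [SimpleNamespace(value="2023-09-01", confidence=0.93)],
    "address": [
        SimpleNamespace(value="12 Main St", confidence=0.81),
        SimpleNamespace(value="I2 Maln 5t", confidence=0.22),
    ],
}
print(filter_predictions(fake_predictions))
```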

If you are looking for support from the Mindee team

Installation

Prerequisites

Python 3.9 (or higher) and pip are required to install docTR.

Latest release

You can then install the latest release of the package from PyPI as follows:

pip install python-doctr

Warning: Please note that the basic installation is not standalone, as it does not provide a deep learning framework, which is required for the package to run.

We try to keep framework-specific dependencies to a minimum. You can install framework-specific builds as follows:

# for TensorFlow
pip install "python-doctr[tf]"
# for PyTorch
pip install "python-doctr[torch]"
# optional dependencies for visualization, html, and contrib modules can be installed as follows:
pip install "python-doctr[torch,viz,html,contrib]"
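
Because the base package ships without a deep learning framework, it can help to check which backend is importable before loading a model. This sketch relies only on the standard library; `available_backends` is a hypothetical helper and not part of docTR.

```python
from importlib.util import find_spec
from typing import List


def available_backends() -> List[str]:
    """Return which of the two supported frameworks can be imported here."""
    candidates = {"tensorflow": "TensorFlow", "torch": "PyTorch"}
    return [label for module, label in candidates.items() if find_spec(module) is not None]


backends = available_backends()
if backends:
    print("Found:", ", ".join(backends))
else:
    print('No framework found; try: pip install "python-doctr[torch]"')
```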

For MacBooks with M1 chip, you will need some additional packages or specific versions:

Developer mode

Alternatively, you can install it from source, which will require you to install Git. First clone the project repository:

git clone https://github.com/mindee/doctr.git
pip install -e doctr/.

Again, if you prefer to avoid the risk of missing dependencies, you can install the TensorFlow or the PyTorch build:

# for TensorFlow (quoted so that shells such as zsh do not expand the brackets)
pip install -e "doctr/.[tf]"
# for PyTorch
pip install -e "doctr/.[torch]"

Model architectures

Credits where it's due: this repository is implementing, among others, architectures from published research papers.

Text Detection

Text Recognition

More goodies

Documentation

The full package documentation is available here for detailed specifications.

Demo app

A minimal demo app is provided for you to play with our end-to-end OCR models!

Live demo

Courtesy of Hugging Face, docTR now has a fully deployed version available on Spaces! Check it out on Hugging Face Spaces.

Running it locally

If you prefer to use it locally, there is an extra dependency (Streamlit) that is required.

TensorFlow version

pip install -r demo/tf-requirements.txt

Then run your app in your default browser with:

USE_TF=1 streamlit run demo/app.py

PyTorch version

pip install -r demo/pt-requirements.txt

Then run your app in your default browser with:

USE_TORCH=1 streamlit run demo/app.py

TensorFlow.js

Would you rather run everything in your web browser instead of having your demo run Python? Check out our TensorFlow.js demo to get started!

Docker container

We offer Docker container support for easy testing and deployment.

Using GPU with docTR Docker Images

The docTR Docker images are GPU-ready and based on CUDA 11.8. However, to use GPU support with these Docker images, please ensure that Docker is configured to use your GPU.

To verify and configure GPU support for Docker, please follow the instructions provided in the NVIDIA Container Toolkit Installation Guide.

Once Docker is configured to use GPUs, you can run docTR Docker containers with GPU support:

docker run -it --gpus all ghcr.io/mindee/doctr:tf-py3.8.18-gpu-2023-09 bash

Available Tags

The Docker images for docTR follow a specific tag nomenclature: <framework>-py<python_version>-<system>-<doctr_version|YYYY-MM>. Here's a breakdown of the tag structure:

  • <framework>: tf (TensorFlow) or torch (PyTorch).
  • <python_version>: 3.8.18, 3.9.18, or 3.10.13.
  • <system>: cpu or gpu
  • <doctr_version>: a tag >= v0.7.1
  • <YYYY-MM>: e.g. 2023-09

Here are examples of different image tags:

Tag                           Description
tf-py3.8.18-cpu-v0.7.1        TensorFlow with Python 3.8.18, CPU only, docTR v0.7.1.
torch-py3.9.18-gpu-2023-09    PyTorch with Python 3.9.18, GPU support, monthly build from 2023-09.
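
The tag nomenclature above is regular enough to validate programmatically, for example before pulling an image in a deployment script. The following sketch encodes the documented structure as a regular expression; `parse_tag` and the `build` field name are hypothetical, and the pattern assumes three-part Python and docTR version numbers as in the examples.

```python
import re
from typing import Dict

# <framework>-py<python_version>-<system>-<doctr_version|YYYY-MM>
TAG_PATTERN = re.compile(
    r"^(?P<framework>tf|torch)"
    r"-py(?P<python_version>\d+\.\d+\.\d+)"
    r"-(?P<system>cpu|gpu)"
    r"-(?P<build>v\d+\.\d+\.\d+|\d{4}-\d{2})$"
)


def parse_tag(tag: str) -> Dict[str, str]:
    """Split a docTR Docker image tag into its named components."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"Tag does not follow the expected nomenclature: {tag!r}")
    return match.groupdict()


print(parse_tag("tf-py3.8.18-cpu-v0.7.1"))
print(parse_tag("torch-py3.9.18-gpu-2023-09"))
```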

Building Docker Images Locally

You can also build docTR Docker images locally on your computer.

docker build -t doctr .

You can specify custom Python versions and docTR versions using build arguments. For example, to build a docTR image with TensorFlow, Python version 3.9.10, and docTR version v0.7.0, run the following command:

docker build -t doctr --build-arg FRAMEWORK=tf --build-arg PYTHON_VERSION=3.9.10 --build-arg DOCTR_VERSION=v0.7.0 .

Example script

An example script is provided for a simple document analysis of a PDF or image file:

python scripts/analyze.py path/to/your/doc.pdf

All script arguments can be checked using python scripts/analyze.py --help

Minimal API integration

Looking to integrate docTR into your API? Here is a template to get you started with a fully working API using the wonderful FastAPI framework.

Deploy your API locally

Specific dependencies are required to run the API template, which you can install as follows:

cd api/
pip install poetry
make lock
pip install -r requirements.txt

You can now run your API locally:

uvicorn --reload --workers 1 --host 0.0.0.0 --port=8002 --app-dir api/ app.main:app

Alternatively, you can run the same server in a Docker container:

PORT=8002 docker-compose up -d --build

What you have deployed

Your API should now be running locally on your port 8002. Access your automatically-built documentation at http://localhost:8002/redoc and enjoy your four functional routes ("/detection", "/recognition", "/ocr", "/kie"). Here is an example with Python to send a request to the OCR route:

import requests

params = {"det_arch": "db_resnet50", "reco_arch": "crnn_vgg16_bn"}

with open('/path/to/your/doc.jpg', 'rb') as f:
    files = [  # application/pdf, image/jpeg, image/png supported
        ("files", ("doc.jpg", f.read(), "image/jpeg")),
    ]
# Same port as the server started above
print(requests.post("http://localhost:8002/ocr", params=params, files=files).json())
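
To send several documents in one request, the `files` list can be built programmatically. The sketch below guesses each MIME type with the standard library and rejects anything outside the three types mentioned in the comment above. `mime_for` and `build_files` are hypothetical helpers, and the sketch assumes the route accepts repeated `files` fields, as the list in the example suggests.

```python
import mimetypes
from pathlib import Path
from typing import List, Tuple

# MIME types listed as supported in the request example.
SUPPORTED = {"application/pdf", "image/jpeg", "image/png"}


def mime_for(filename: str) -> str:
    """Return the MIME type for `filename`, or raise if it is not supported."""
    mime, _ = mimetypes.guess_type(filename)
    if mime not in SUPPORTED:
        raise ValueError(f"Unsupported file type for {filename}: {mime}")
    return mime


def build_files(paths: List[str]) -> List[Tuple[str, Tuple[str, bytes, str]]]:
    """Build the multipart `files` argument for requests.post from local paths."""
    files = []
    for path in map(Path, paths):
        files.append(("files", (path.name, path.read_bytes(), mime_for(path.name))))
    return files


print(mime_for("doc.jpg"))
```

The result of `build_files(["/path/to/a.pdf", "/path/to/b.png"])` can then be passed as `files=` to `requests.post` in place of the hand-written list.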

Example notebooks

Looking for more illustrations of docTR features? You might want to check the Jupyter notebooks designed to give you a broader overview.

Citation

If you wish to cite this project, feel free to use this BibTeX reference:

@misc{doctr2021,
    title={docTR: Document Text Recognition},
    author={Mindee},
    year={2021},
    publisher = {GitHub},
    howpublished = {\url{https://github.com/mindee/doctr}}
}

Contributing

If you scrolled down to this section, you most likely appreciate open source. Do you feel like extending the range of our supported characters? Or perhaps submitting a paper implementation? Or contributing in any other way?

You're in luck, we compiled a short guide (cf. CONTRIBUTING) for you to easily do so!

License

Distributed under the Apache 2.0 License. See LICENSE for more information.