tessdata

Trained models with fast variant of the "best" LSTM models + legacy models

Top Related Projects

tesseract.js

36,446

Pure Javascript OCR for more than 100 Languages 📖🎉🖥

PaddleOCR

48,742

Awesome multilingual OCR toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded and IoT devices)

EasyOCR

26,473

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

DUP-ocropy

3,451

Python-based tools for document analysis and OCR

tesseract

3,600

Tesseract Open Source OCR Engine (main repository)

Quick Overview

The tesseract-ocr/tessdata repository is a collection of trained models for the Tesseract OCR engine. It contains language data files for various languages and scripts, enabling Tesseract to recognize and extract text from images in multiple languages. This repository is an essential component for users of the Tesseract OCR system.

Pros

Supports a wide range of languages and scripts
Regularly updated with new and improved language models
Open-source and freely available for use
Works with Tesseract 4.0.0 and newer; a separate 3.04 tree covers Tesseract 3.04 and 3.05

Cons

Large file sizes for some language models
May require additional processing or training for specific use cases
Performance can vary depending on the quality of input images
Limited support for some less common languages or scripts

Getting Started

To use the language data files from this repository with Tesseract OCR:

Clone or download the repository:

```
git clone https://github.com/tesseract-ocr/tessdata.git
```

Install Tesseract OCR on your system (if not already installed).
Set the TESSDATA_PREFIX environment variable to point to the directory containing the language data files:
```
export TESSDATA_PREFIX=/path/to/tessdata
```
Use Tesseract OCR with the desired language model:
```
tesseract input_image.png output -l eng
```

Replace eng with the appropriate language code for your needs.
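
The steps above can also be scripted. The following is a minimal sketch, not an official tool: the helper names `missing_languages` and `build_command` are hypothetical. It relies on two documented Tesseract behaviours: each language is stored as `<code>.traineddata`, and several languages can be combined with `+` (for example `eng+deu`). The `--tessdata-dir` option is used as an alternative to setting TESSDATA_PREFIX.

```python
import os
from pathlib import Path


def missing_languages(tessdata_dir, langs):
    """Return the codes in `langs` (e.g. "eng+deu") with no .traineddata file."""
    root = Path(tessdata_dir)
    return [code for code in langs.split("+")
            if not (root / f"{code}.traineddata").is_file()]


def build_command(image, output_base, langs, tessdata_dir=None):
    """Build the tesseract argument list; helper name is illustrative only."""
    cmd = ["tesseract", str(image), str(output_base), "-l", langs]
    if tessdata_dir is not None:
        # --tessdata-dir points Tesseract at a specific model directory.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


if __name__ == "__main__":
    # /path/to/tessdata is the placeholder used in the steps above.
    data_dir = os.environ.get("TESSDATA_PREFIX", "/path/to/tessdata")
    print("missing models:", missing_languages(data_dir, "eng"))
    print(" ".join(build_command("input_image.png", "output", "eng", data_dir)))
```

Checking for the model file first gives a clearer error than letting Tesseract fail when it cannot load the requested language.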

Competitor Comparisons

tesseract.js

36,446

Pure Javascript OCR for more than 100 Languages 📖🎉🖥

Pros of tesseract.js

Browser-based: Runs directly in web browsers without server-side processing
Easy integration: Simple to incorporate into web applications
Versatile: Supports multiple languages and image formats

Cons of tesseract.js

Performance: Generally slower than native Tesseract OCR
Accuracy: May have slightly lower accuracy compared to the native version
File size: Larger bundle size due to including the entire OCR engine

Code Comparison

tesseract.js:

```
import Tesseract from 'tesseract.js';

Tesseract.recognize('image.jpg', 'eng')
  .then(({ data: { text } }) => {
    console.log(text);
  });
```

tessdata (using Python wrapper):

```
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('image.jpg'), lang='eng')
print(text)
```
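
The Python call above only works if Tesseract can find `eng.traineddata`. As a hedged sketch, the helper below passes a tessdata directory through pytesseract's `config` argument, which forwards extra options to the tesseract command line. The function names `tessdata_config` and `ocr_image` are hypothetical, and the quoting assumes a POSIX-style shell split of the config string.

```python
import shlex


def tessdata_config(tessdata_dir):
    """Build the extra-options string that pytesseract hands to the tesseract CLI."""
    return f"--tessdata-dir {shlex.quote(str(tessdata_dir))}"


def ocr_image(path, lang="eng", tessdata_dir=None):
    """OCR one image, optionally with an explicit tessdata directory."""
    # Imported lazily so tessdata_config() works without pytesseract installed.
    import pytesseract
    from PIL import Image

    config = tessdata_config(tessdata_dir) if tessdata_dir else ""
    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang, config=config)
```

For example, `ocr_image('image.jpg', lang='eng', tessdata_dir='/path/to/tessdata')` would read the models from a clone of this repository instead of the system default location.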

Summary

tesseract.js is ideal for web-based OCR applications, offering easy integration and browser compatibility. However, it may sacrifice some performance and accuracy compared to the native Tesseract OCR (tessdata). The native version is better suited for server-side or desktop applications where performance and accuracy are critical. Choose based on your specific use case and deployment requirements.

PaddleOCR

48,742

Pros of PaddleOCR

More comprehensive OCR solution with end-to-end capabilities
Supports multiple languages and recognition scenarios
Actively maintained with frequent updates and improvements

Cons of PaddleOCR

Steeper learning curve due to its complexity
Requires more computational resources for training and inference
Less widespread adoption compared to Tesseract

Code Comparison

PaddleOCR:

```
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en')
result = ocr.ocr('image.jpg')
print(result)
```

tessdata:

```
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('image.jpg'))
print(text)
```

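PaddleOCR offers a more comprehensive pipeline with built-in text detection and angle classification (enabled here by use_angle_cls=True), while tessdata (used with Tesseract) provides a simpler interface for basic OCR tasks. Neither example detects the language automatically: PaddleOCR is told lang='en', and pytesseract falls back to eng when no lang argument is given. PaddleOCR's example handles detection, orientation, and recognition in a single call, whereas Tesseract needs the matching language model installed and often benefits from separate image preprocessing.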

Both repositories serve different needs: PaddleOCR is better suited for complex OCR tasks and research, while tessdata is more appropriate for straightforward text extraction from images with lower resource requirements.

EasyOCR

26,473

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

Pros of EasyOCR

Supports 80+ languages, including non-Latin scripts
Easy to use with a simple Python interface
GPU acceleration for faster processing

Cons of EasyOCR

Larger model size, requiring more storage and memory
May be slower for simple OCR tasks on CPU
Less mature project with potentially fewer community contributions

Code Comparison

EasyOCR:

```
import easyocr
reader = easyocr.Reader(['en'])
result = reader.readtext('image.jpg')
```

tessdata (using Tesseract OCR):

```
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open('image.jpg'))
```

Summary

EasyOCR offers a user-friendly interface and supports a wide range of languages, making it suitable for multilingual OCR tasks. It also provides GPU acceleration for improved performance. However, it has a larger model size and may be slower for simple tasks on CPU.

tessdata, used with Tesseract OCR, is a more established project with a smaller footprint. It's generally faster for basic OCR tasks but may require more setup and configuration for optimal results, especially with non-Latin scripts.

The choice between the two depends on the specific requirements of your project, such as language support, processing speed, and ease of use.

DUP-ocropy

3,451

Python-based tools for document analysis and OCR

Pros of DUP-ocropy

More flexible and customizable for specific OCR tasks
Better suited for historical document recognition
Supports a wider range of image preprocessing techniques

Cons of DUP-ocropy

Less actively maintained compared to tessdata
Smaller community and fewer available resources
May require more technical expertise to implement effectively

Code Comparison

DUP-ocropy:

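```
# ocropy is driven through command-line scripts rather than a single Python call.
# Step 1: binarize the page image into the working directory "book".
ocropus-nlbin input.png -o book
# Step 2: segment the binarized page into text lines.
ocropus-gpageseg 'book/????.bin.png'
# Step 3: recognize each line; en-default.pyrnn.gz is the default English
# model named in the ocropy README and must be downloaded first.
ocropus-rpred -m models/en-default.pyrnn.gz 'book/????/??????.bin.png'
```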

tessdata:

```
import pytesseract
from PIL import Image
image = Image.open("input.png")
text = pytesseract.image_to_string(image)
```

DUP-ocropy offers more granular control over the OCR process, allowing for customization at each step. tessdata, on the other hand, provides a simpler, more straightforward approach with fewer lines of code required for basic OCR tasks.

tesseract

3,600

Tesseract Open Source OCR Engine (main repository)

Pros of tesseract

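Provides the OCR engine itself (library and command-line tool), which tessdata does not include
Holds source code rather than large binary model files
Required in every Tesseract workflow; tessdata files are only useful alongside an engine

Cons of tesseract

Does not include the trained language models; these must be obtained separately, for example from tessdata
Must be built from source or installed from a package before the tesseract command is available
The copy compared here is the UB-Mannheim fork, so it may differ from the upstream tesseract-ocr/tesseract repository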

Code Comparison

tessdata:

```
git clone https://github.com/tesseract-ocr/tessdata.git
tesseract image.png output -l eng
```

tesseract:

```
git clone https://github.com/UB-Mannheim/tesseract.git
tesseract image.png output -l eng
```

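The tesseract command line is the same in both cases, but the two repositories play different roles. The tesseract repository holds the engine source, which must be built or installed before the tesseract command exists. tessdata supplies the .traineddata files that the command loads through -l. Cloning the engine source alone does not provide any language models.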

Summary

tessdata is the official repository for Tesseract OCR language data. tesseract, shown here as the UB-Mannheim fork, holds the OCR engine source rather than language models. The two are used together, not interchangeably: the engine performs recognition, and tessdata supplies the trained models it loads. The practical choice is therefore which model set and which languages to install alongside the engine, based on language requirements and storage considerations.

README

tessdata

These language data files only work with Tesseract 4.0.0 and newer versions. They are based on the sources in tesseract-ocr/langdata on GitHub. (still to be updated for 4.0.0 - 20180322)

These files contain models for the legacy Tesseract engine (--oem 0) as well as for the new LSTM neural-net-based engine (--oem 1).

The LSTM models (--oem 1) in these files have been updated to the integerized versions of tessdata_best on GitHub. So, they should be faster but probably a little less accurate than tessdata_best.

tessdata_fast on GitHub provides an alternate set of integerized LSTM models which have been built with a smaller network. tessdata_fast files are the ones packaged for Debian and Ubuntu.
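
The trade-off between the three model sets can be sketched as a small lookup. This is an illustration under stated assumptions, not project tooling: the function names and the `prefer` labels are hypothetical, and the URL pattern for tessdata_best and tessdata_fast is assumed to mirror the tessdata URL.

```python
def pick_repository(need_legacy_engine=False, prefer="balance"):
    """Suggest a model repository; the 'prefer' labels are illustrative only."""
    if need_legacy_engine:
        # The README only mentions legacy (--oem 0) models for tessdata.
        return "tessdata"
    choices = {
        "accuracy": "tessdata_best",  # source models for the integerized versions
        "speed": "tessdata_fast",     # integerized, built with a smaller network
        "balance": "tessdata",        # integerized versions of tessdata_best
    }
    return choices[prefer]


def repository_url(name):
    """Assumed URL pattern under the tesseract-ocr GitHub organisation."""
    return f"https://github.com/tesseract-ocr/{name}"


if __name__ == "__main__":
    repo = pick_repository(prefer="speed")
    print(repo, repository_url(repo))
```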

The legacy tesseract models (--oem 0) have been removed for Indic and Arabic script language files.
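
The engine choice is made per run with the `--oem` flag. The sketch below builds and runs such a command; the names `ENGINE_MODES`, `ocr_command`, and `run_ocr` are hypothetical. It does not check whether a given language file still contains a legacy model, so `--oem 0` can be expected to fail for the Indic and Arabic script files mentioned above.

```python
import shutil
import subprocess

# Engine modes named in the README: 0 = legacy engine, 1 = LSTM engine.
ENGINE_MODES = {"legacy": 0, "lstm": 1}


def ocr_command(image, output_base, lang, engine="lstm"):
    """Build a tesseract argument list that selects the engine explicitly."""
    return ["tesseract", image, output_base, "-l", lang,
            "--oem", str(ENGINE_MODES[engine])]


def run_ocr(image, output_base, lang, engine="lstm"):
    """Run tesseract; raises if the binary is missing or the run fails."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(ocr_command(image, output_base, lang, engine), check=True)
```

For example, `run_ocr("input_image.png", "output", "eng", engine="legacy")` would request the legacy engine for English.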

tessdata for 3.04 or 3.05

Get language data files for Tesseract 3.04 or 3.05 from the 3.04 tree.

More information and a complete list of all languages are available in the Tesseract wiki.

All data in the repository are licensed under the Apache-2.0 License, see file LICENSE.
