tesseract-ocr/tessdata

Trained models with fast variant of the "best" LSTM models + legacy models

Top Related Projects

  • tesseract.js: Pure Javascript OCR for more than 100 Languages 📖🎉🖥
  • PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
  • EasyOCR: Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
  • DUP-ocropy: Python-based tools for document analysis and OCR
  • tesseract: Tesseract Open Source OCR Engine (main repository)

Quick Overview

The tesseract-ocr/tessdata repository is a collection of trained models for the Tesseract OCR engine. It contains language data files for various languages and scripts, enabling Tesseract to recognize and extract text from images in multiple languages. This repository is an essential component for users of the Tesseract OCR system.

Pros

  • Supports a wide range of languages and scripts
  • Regularly updated with new and improved language models
  • Open-source and freely available for use
  • Works with Tesseract 4.0.0 and newer; data for 3.04 and 3.05 is available from the separate 3.04 tree

Cons

  • Large file sizes for some language models
  • May require additional processing or training for specific use cases
  • Performance can vary depending on the quality of input images
  • Limited support for some less common languages or scripts

Getting Started

To use the language data files from this repository with Tesseract OCR:

  1. Clone or download the repository:

    git clone https://github.com/tesseract-ocr/tessdata.git
    
  2. Install Tesseract OCR on your system (if not already installed).

  3. Set the TESSDATA_PREFIX environment variable to point to the directory containing the language data files:

    export TESSDATA_PREFIX=/path/to/tessdata
    
  4. Use Tesseract OCR with the desired language model:

    tesseract input_image.png output -l eng
    

Replace eng with the appropriate language code for your needs.
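
The steps above can also be driven from a script. The following is a minimal sketch, assuming the tesseract binary is on your PATH and the repository has been cloned locally. The helper names build_tesseract_command and run_ocr are hypothetical and not part of Tesseract. Several language codes can be combined with "+", for example eng+deu.

```python
import os
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Return the argument list for a tesseract CLI call.

    `languages` is a list of language codes such as ["eng"] or ["eng", "deu"];
    Tesseract accepts several codes joined with "+".
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages, tessdata_dir):
    """Run tesseract with TESSDATA_PREFIX pointing at the cloned tessdata directory."""
    # Copy the current environment and override only TESSDATA_PREFIX.
    env = dict(os.environ, TESSDATA_PREFIX=tessdata_dir)
    command = build_tesseract_command(image_path, output_base, languages)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(command, env=env, check=True)
    # tesseract writes plain-text output to <output_base>.txt by default.
    return output_base + ".txt"
```

A call such as run_ocr("input_image.png", "output", ["eng"], "/path/to/tessdata") mirrors steps 3 and 4 above; the paths are placeholders.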

Competitor Comparisons

Pure Javascript OCR for more than 100 Languages 📖🎉🖥

Pros of tesseract.js

  • Browser-based: Runs directly in web browsers without server-side processing
  • Easy integration: Simple to incorporate into web applications
  • Versatile: Supports multiple languages and image formats

Cons of tesseract.js

  • Performance: Generally slower than native Tesseract OCR
  • Accuracy: May have slightly lower accuracy compared to the native version
  • File size: Larger bundle size due to including the entire OCR engine

Code Comparison

tesseract.js:

import Tesseract from 'tesseract.js';

Tesseract.recognize('image.jpg', 'eng')
  .then(({ data: { text } }) => {
    console.log(text);
  });

tessdata (using Python wrapper):

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('image.jpg'), lang='eng')
print(text)

Summary

tesseract.js is ideal for web-based OCR applications, offering easy integration and browser compatibility. However, it may sacrifice some performance and accuracy compared to native Tesseract OCR running with tessdata models. The native engine is better suited to server-side or desktop applications where performance and accuracy are critical. Choose based on your use case and deployment requirements.

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

Pros of PaddleOCR

  • More comprehensive OCR solution with end-to-end capabilities
  • Supports multiple languages and recognition scenarios
  • Actively maintained with frequent updates and improvements

Cons of PaddleOCR

  • Steeper learning curve due to its complexity
  • Requires more computational resources for training and inference
  • Less widespread adoption compared to Tesseract

Code Comparison

PaddleOCR:

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en')
result = ocr.ocr('image.jpg')
print(result)

tessdata:

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('image.jpg'))
print(text)

PaddleOCR offers a more comprehensive solution with built-in text detection and angle classification, while tessdata (used with Tesseract) provides a simpler interface for basic OCR tasks. The PaddleOCR example handles several aspects of OCR in a single call, whereas Tesseract relies on separately installed language models and may need image preprocessing for good results.

The two projects serve different needs: PaddleOCR is better suited for complex OCR tasks and research, while tessdata is more appropriate for straightforward text extraction from images with lower resource requirements.

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

Pros of EasyOCR

  • Supports 80+ languages, including non-Latin scripts
  • Easy to use with a simple Python interface
  • GPU acceleration for faster processing

Cons of EasyOCR

  • Larger model size, requiring more storage and memory
  • May be slower for simple OCR tasks on CPU
  • Less mature project with potentially fewer community contributions

Code Comparison

EasyOCR:

import easyocr
reader = easyocr.Reader(['en'])
result = reader.readtext('image.jpg')

tessdata (using Tesseract OCR):

import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open('image.jpg'))

Summary

EasyOCR offers a user-friendly interface and supports a wide range of languages, making it suitable for multilingual OCR tasks. It also provides GPU acceleration for improved performance. However, it has a larger model size and may be slower for simple tasks on CPU.

tessdata, used with Tesseract OCR, is a more established project with a smaller footprint. It's generally faster for basic OCR tasks but may require more setup and configuration for optimal results, especially with non-Latin scripts.

The choice between the two depends on the specific requirements of your project, such as language support, processing speed, and ease of use.

Python-based tools for document analysis and OCR

Pros of DUP-ocropy

  • More flexible and customizable for specific OCR tasks
  • Better suited for historical document recognition
  • Supports a wider range of image preprocessing techniques

Cons of DUP-ocropy

  • Less actively maintained compared to tessdata
  • Smaller community and fewer available resources
  • May require more technical expertise to implement effectively

Code Comparison

DUP-ocropy:

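# ocropy is driven through command-line tools, one per pipeline stage
# (the recognition model file must be downloaded separately)
ocropus-nlbin input.png -o book          # binarize the page image
ocropus-gpageseg 'book/????.bin.png'     # segment the page into text lines
ocropus-rpred -m en-default.pyrnn.gz 'book/????/??????.bin.png'  # recognize each line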

tessdata:

import pytesseract
from PIL import Image
image = Image.open("input.png")
text = pytesseract.image_to_string(image)

DUP-ocropy offers more granular control over the OCR process, allowing for customization at each step. tessdata, on the other hand, provides a simpler, more straightforward approach with fewer lines of code required for basic OCR tasks.

Tesseract Open Source OCR Engine (main repository)

Pros of tesseract

  • Contains the OCR engine itself (library and command-line program), which the tessdata files need in order to be used
  • Officially maintained by the Tesseract OCR project
  • Works with any of the model sets: tessdata, tessdata_best, or tessdata_fast

Cons of tesseract

  • Does not bundle the trained language models; they are obtained separately, for example from tessdata
  • Must be built or installed before use, whereas data files are simply copied into place
  • Not an alternative to tessdata: the two repositories are complementary rather than competing

Code Comparison

tessdata:

git clone https://github.com/tesseract-ocr/tessdata.git
tesseract image.png output -l eng

tesseract:

git clone https://github.com/tesseract-ocr/tesseract.git
# build and install the engine, then run it with language data from tessdata
tesseract image.png output -l eng

The tesseract command is identical in both cases because the two repositories are used together: tesseract provides the OCR engine, and tessdata provides the language data and models that the engine loads.

Summary

tessdata is the official repository for Tesseract OCR language data, offering models for a wide range of languages. tesseract is the main repository for the OCR engine, maintained by the same project. The two are complementary rather than interchangeable: install the engine first, then add the language files you need from tessdata, or from tessdata_best or tessdata_fast if you prefer to trade speed for accuracy or the reverse.

README

tessdata

These language data files only work with Tesseract 4.0.0 and newer versions. They are based on the sources in tesseract-ocr/langdata on GitHub. (still to be updated for 4.0.0 - 20180322)

These files contain models for the legacy Tesseract engine (--oem 0) as well as the new LSTM neural-network-based engine (--oem 1).

The LSTM models (--oem 1) in these files have been updated to the integerized versions of tessdata_best on GitHub. So, they should be faster but probably a little less accurate than tessdata_best.
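
The --oem values mentioned above select the engine at run time. The following is a small sketch of how a script might map readable names to those values. The names OEM_MODES and oem_arguments are hypothetical, and modes 2 and 3 are taken from the tesseract command-line help rather than from this README.

```python
# Engine modes accepted by tesseract's --oem option.
OEM_MODES = {
    "legacy": 0,       # legacy engine only
    "lstm": 1,         # LSTM neural net engine only
    "legacy+lstm": 2,  # both engines combined
    "default": 3,      # chosen based on what the installed data supports
}


def oem_arguments(engine):
    """Return the CLI arguments that select the given engine mode."""
    try:
        return ["--oem", str(OEM_MODES[engine])]
    except KeyError:
        raise ValueError(f"unknown engine mode: {engine!r}") from None


# Example: request the LSTM engine for an English page (file names are placeholders).
command = ["tesseract", "input_image.png", "output", "-l", "eng"] + oem_arguments("lstm")
```

Because the legacy models have been removed from some language files in this repository, requesting "legacy" for those languages can be expected to fail; "lstm" is the safer choice there.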

tessdata_fast on GitHub provides an alternate set of integerized LSTM models which have been built with a smaller network. tessdata_fast files are the ones packaged for Debian and Ubuntu.

The legacy tesseract models (--oem 0) have been removed for Indic and Arabic script language files.

tessdata for 3.04 or 3.05

Get language data files for Tesseract 3.04 or 3.05 from the 3.04 tree.

More information and a complete list of all languages is available in the Tesseract wiki.
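
To see which languages are present in a local copy of the data, one option is to list the .traineddata files in the directory. This is a minimal sketch; the function name available_languages is hypothetical, and the installed tesseract binary offers a similar listing through its --list-langs option.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the sorted language codes of the .traineddata files in tessdata_dir."""
    # "eng.traineddata" -> "eng", "chi_sim.traineddata" -> "chi_sim"
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))
```

Each returned code can be passed to tesseract with the -l option.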

All data in the repository are licensed under the Apache-2.0 License, see file LICENSE.