tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥

34,608

2,193


Top Related Projects

DUP-ocropy

3,411

Python-based tools for document analysis and OCR

tesseract

3,019

Tesseract Open Source OCR Engine (main repository)

opencv

77,862

Open Source Computer Vision Library

EasyOCR

23,625

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

Quick Overview

Tesseract.js is a JavaScript library that wraps a WebAssembly port of the popular Tesseract OCR engine. It lets developers add optical character recognition (OCR) to web applications, extracting text from images directly in the browser or in Node.js.

Pros

Cross-platform compatibility (works in browsers and Node.js)
Easy integration with web applications
Supports more than 100 languages and can load custom-trained language data
No server-side processing required, enhancing privacy and reducing server load

Cons

Performance may be slower compared to native Tesseract implementations
Language data must be downloaded on first use, which adds to initial load time (v5 reduced the default file sizes substantially)
Limited accuracy for complex layouts or low-quality images
May consume significant memory and CPU resources for large images

Code Examples

Basic usage to recognize text from an image:

import Tesseract from 'tesseract.js';

// Arguments: image source, language code, options.
Tesseract.recognize(
  'https://example.com/image.jpg',
  'eng',
  { logger: m => console.log(m) } // log progress messages as they arrive
).then(({ data: { text } }) => {
  console.log(text);
});
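
The logger in the call above prints each raw progress message. As a minimal sketch, the hypothetical helper below (`formatProgress` is not part of Tesseract.js) condenses a message into one readable line. It assumes messages carry a `status` string and a `progress` number between 0 and 1, and falls back to defaults when either is missing.

```javascript
// Turn a logger message into a short, readable line.
// Assumption: messages look like { status: string, progress: number in 0..1 }.
function formatProgress(message) {
  const status = message && message.status ? message.status : 'working';
  const fraction =
    message && typeof message.progress === 'number' ? message.progress : 0;
  // Clamp to 0..1 so an unexpected value never prints as 170% or -5%.
  const percent = Math.round(Math.min(Math.max(fraction, 0), 1) * 100);
  return `${status}: ${percent}%`;
}
```

It can be passed as `{ logger: m => console.log(formatProgress(m)) }` in place of the raw logger.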

Recognizing text from an image element already on the page (browser only):

import Tesseract from 'tesseract.js';

const image = document.getElementById('myImage');

Tesseract.recognize(image, 'eng')
  .then(({ data: { text } }) => {
    console.log('Recognized text:', text);
  })
  .catch(error => {
    console.error('Error:', error);
  });

Using a worker to perform OCR in the background:

import { createWorker } from 'tesseract.js';

const worker = await createWorker('eng');
const { data: { text } } = await worker.recognize('path/to/image.png');
console.log(text);
await worker.terminate();

Getting Started

To use Tesseract.js in your project, follow these steps:

Install the library:
```
npm install tesseract.js
```

Import and use in your JavaScript code:

import Tesseract from 'tesseract.js';

Tesseract.recognize(
  'https://example.com/image.jpg',
  'eng'
).then(({ data: { text } }) => {
  console.log('Recognized text:', text);
});

For more advanced usage and configuration options, refer to the official documentation at https://github.com/naptha/tesseract.js

Competitor Comparisons

DUP-ocropy

3,411

Python-based tools for document analysis and OCR

Pros of ocropy

More comprehensive OCR toolkit with advanced image processing capabilities
Supports training custom OCR models for specialized use cases
Better suited for large-scale, high-volume OCR tasks

Cons of ocropy

Steeper learning curve and more complex setup process
Less active development and community support
Requires more computational resources for processing

Code Comparison

ocropy:

from ocrolib import psegmentation, ocrolib
image = ocrolib.read_image_gray(image_path)
binary = ocrolib.binarize_sauvola(image)
segmentation = psegmentation.segment(binary)

tesseract.js:

const Tesseract = require('tesseract.js');

Tesseract.recognize(
  'path/to/image.jpg',
  'eng',
  { logger: m => console.log(m) }
).then(({ data: { text } }) => {
  console.log(text);
})

ocropy offers more granular control over the OCR process, allowing for custom segmentation and binarization. tesseract.js provides a simpler, more streamlined API for quick OCR tasks, making it easier to integrate into web applications. While ocropy is better suited for advanced OCR projects requiring customization, tesseract.js is ideal for simpler use cases and rapid development.

tesseract

3,019

Tesseract Open Source OCR Engine (main repository)

Pros of tesseract

Native C++ implementation, potentially offering better performance for complex OCR tasks
Supports a wider range of image formats and preprocessing options
More extensive language support and training data available

Cons of tesseract

Requires compilation and system-level installation, which can be complex
Larger footprint and more dependencies compared to the lightweight JavaScript implementation
May be overkill for simple OCR tasks or web-based applications

Code Comparison

tesseract:

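// `image` is assumed to be a Leptonica Pix* loaded elsewhere
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
api->Init(NULL, "eng");
api->SetImage(image);
char* outText = api->GetUTF8Text();
printf("OCR output:\n%s", outText);
// Release the engine and the returned text buffer
api->End();
delete api;
delete[] outText;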

tesseract.js:

const { createWorker } = require('tesseract.js');

(async () => {
  // Since v5 the language is passed to createWorker; the separate
  // loadLanguage/initialize calls are no longer needed.
  const worker = await createWorker('eng');
  const { data: { text } } = await worker.recognize('image.png');
  console.log(text);
  await worker.terminate();
})();

The tesseract repository provides a native C++ implementation, while tesseract.js offers a JavaScript wrapper for browser and Node.js environments. tesseract may be more suitable for desktop applications or high-performance server-side processing, while tesseract.js is ideal for web-based OCR tasks and easier integration into JavaScript projects.

opencv

77,862

Open Source Computer Vision Library

Pros of OpenCV

Comprehensive computer vision library with a wide range of functionalities
High performance and optimized for real-time applications
Supports multiple programming languages (C++, Python, Java)

Cons of OpenCV

Steeper learning curve due to its extensive feature set
Larger library size and potentially higher resource usage
Installation can be complex, especially for certain platforms

Code Comparison

Tesseract.js (OCR example):

Tesseract.recognize('image.png', 'eng')
  .then(({ data: { text } }) => {
    console.log(text);
  });

OpenCV (Image processing example):

import cv2

img = cv2.imread('image.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)
cv2.imshow('Edges', edges)
cv2.waitKey(0)

Summary

Tesseract.js is focused on OCR (Optical Character Recognition) and is easier to use for text extraction from images. It's lightweight and runs in the browser. OpenCV, on the other hand, is a comprehensive computer vision library with a broader range of functionalities, including image processing, object detection, and machine learning. OpenCV offers better performance for complex tasks but requires more setup and has a steeper learning curve.

EasyOCR

23,625

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

Pros of EasyOCR

Supports over 80 languages, including non-Latin scripts
Offers both GPU and CPU support for faster processing
Provides a more user-friendly API with simpler integration

Cons of EasyOCR

Larger file size and more dependencies compared to Tesseract.js
May require more setup and configuration for certain use cases
Less extensive documentation and community support

Code Comparison

EasyOCR:

import easyocr
reader = easyocr.Reader(['en'])
result = reader.readtext('image.jpg')

Tesseract.js:

const Tesseract = require('tesseract.js');
Tesseract.recognize('image.jpg', 'eng')
  .then(({ data: { text } }) => {
    console.log(text);
  });

Both libraries offer straightforward APIs for OCR tasks, but EasyOCR's Python-based approach may be more familiar to data scientists and machine learning practitioners. Tesseract.js, being JavaScript-based, is often easier to integrate into web applications and Node.js projects.

EasyOCR's multi-language support and GPU acceleration make it suitable for more diverse and performance-critical applications, while Tesseract.js's lightweight nature and browser compatibility give it an edge in web-based scenarios and projects with minimal dependencies.

README

Tesseract.js is a JavaScript library that gets words in almost any language out of images. (Demo)

Tesseract.js wraps a WebAssembly port of the Tesseract OCR Engine. It works in the browser using webpack, ESM, or plain script tags with a CDN, and on the server with Node.js. After you install it, using it is as simple as:

import { createWorker } from 'tesseract.js';

(async () => {
  const worker = await createWorker('eng');
  const ret = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(ret.data.text);
  await worker.terminate();
})();

When recognizing multiple images, users should create a worker once, run worker.recognize for each image, and then run worker.terminate() once at the end (rather than running the above snippet for every image).
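
A minimal sketch of that pattern follows. The helper name `recognizeAll` is hypothetical, and `createWorker` is passed in as an argument (for example, the function imported from `tesseract.js`) so the sketch does not assume a particular import style.

```javascript
// Recognize several images with a single worker.
// `createWorker` is injected, e.g. the one imported from 'tesseract.js'.
async function recognizeAll(createWorker, images, lang = 'eng') {
  const worker = await createWorker(lang);
  const texts = [];
  try {
    for (const image of images) {
      const { data } = await worker.recognize(image);
      texts.push(data.text);
    }
  } finally {
    // Terminate exactly once, even if one of the images fails.
    await worker.terminate();
  }
  return texts;
}
```

The `try`/`finally` is a design choice, not a library requirement: it keeps a failed image from leaving the worker running.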

Installation

Tesseract.js works with a <script> tag via local copy or CDN, with webpack via npm and on Node.js with npm/yarn.

CDN

<!-- v5 -->
<script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script>

After including the script the Tesseract variable will be globally available and a worker can be created using Tesseract.createWorker.

Alternatively, an ESM build (used with import syntax) can be found at https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.esm.min.js.
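
For the script-tag setup, a sketch like the following could run once the CDN script has loaded. The function name `recognizeWithGlobal` is hypothetical; the `tesseract` parameter defaults to the global `Tesseract` variable described above.

```javascript
// Browser-side helper for the <script> tag setup.
// `tesseract` defaults to the global that the CDN script defines.
async function recognizeWithGlobal(image, tesseract = globalThis.Tesseract) {
  if (!tesseract) {
    throw new Error('Tesseract global not found; is the script tag loaded?');
  }
  const worker = await tesseract.createWorker('eng');
  try {
    const { data } = await worker.recognize(image);
    return data.text;
  } finally {
    await worker.terminate();
  }
}
```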

Node.js

Requires Node.js v14 or higher

# For latest version
npm install tesseract.js
yarn add tesseract.js

# For old versions
npm install tesseract.js@3.0.3
yarn add tesseract.js@3.0.3

Documentation

Community Projects and Examples

The following are examples and projects built by the community using Tesseract.js. Officially supported examples are found in the examples directory.

Projects

- Scribe OCR: web application for scanning documents (images and PDFs)
  - Site at scribeocr.com, repo at github.com/scribeocr/scribeocr
- Chrome Extension (with Manifest V3): https://github.com/Tshetrim/Image-To-Text-OCR-extension-for-ChatGPT

Examples

- Converting PDF to text: https://github.com/racosa/pdf2text-ocr
- Use blocks output to generate granular data [word/symbol level]: https://github.com/Kishlay-notabot/tesseract-bbox-examples
- Electron: https://github.com/Balearica/tesseract.js-electron
- Typescript: https://github.com/Balearica/tesseract.js-typescript

If you have a project or example repo that uses Tesseract.js, feel free to add it to this list using a pull request. Examples submitted should be well documented such that new users can run them; projects should be functional and actively maintained.

Major changes in v5

Version 5 changes are documented in this issue. Highlights are below.

- Significantly smaller files by default (54% smaller for English, 73% smaller for Chinese)
  - This results in a ~50% reduction in runtime for first-time users (who do not have the files cached yet)
- Significantly lower memory usage
- Compatible with iOS 17 (using default settings)
- Breaking changes:
  - createWorker arguments changed
    - Setting non-default language and OEM now happens in createWorker
      - E.g. createWorker("chi_sim", 1)
  - worker.initialize and worker.loadLanguage functions now do nothing and can be deleted from code
  - See this issue for full list
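
As a rough before/after sketch of the createWorker change, using the `chi_sim` and OEM values from the example above: the wrapper name `createChineseWorker` is hypothetical, and `createWorker` is passed in rather than imported.

```javascript
// Before v5 (the two extra calls now do nothing and can be deleted):
//   const worker = await createWorker();
//   await worker.loadLanguage('chi_sim');
//   await worker.initialize('chi_sim');

// v5: language and OEM are createWorker arguments.
// `createWorker` is injected, e.g. the one imported from 'tesseract.js'.
async function createChineseWorker(createWorker) {
  return createWorker('chi_sim', 1);
}
```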

Upgrading from v2 to v5? See this guide.

Major changes in v4

Version 4 includes many new features and bug fixes; see this issue for a full list. Several highlights are below.

- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
- Processed images (rotated, grayscale, binary) can now be retrieved
- Improved support for parallel processing (schedulers)
- Breaking changes:
  - createWorker is now async
  - getPDF function replaced by pdf recognize option
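
To illustrate the scheduler-based parallel processing mentioned above, here is a hedged sketch. The helper name `recognizeInParallel` and the default of two workers are assumptions; `createWorker` and `createScheduler` are passed in (both are exported by `tesseract.js`), and the worker is created v5-style with the language as an argument.

```javascript
// Spread recognition jobs over several workers using a scheduler.
// `createWorker` and `createScheduler` are injected, e.g. from 'tesseract.js'.
async function recognizeInParallel(
  { createWorker, createScheduler },
  images,
  workerCount = 2,
) {
  const scheduler = createScheduler();
  try {
    for (let i = 0; i < workerCount; i += 1) {
      scheduler.addWorker(await createWorker('eng'));
    }
    // Each job is queued and handed to the next idle worker.
    const results = await Promise.all(
      images.map((image) => scheduler.addJob('recognize', image)),
    );
    return results.map((result) => result.data.text);
  } finally {
    // Terminating the scheduler also terminates the workers added to it.
    await scheduler.terminate();
  }
}
```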

Major changes in v3

- Significantly faster performance
  - Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the example images
- Upgrade to Tesseract v5.1.0 (using emscripten 3.1.18)
- Added SIMD-enabled build for supported devices
- Added support:
  - Node.js version 18
- Removed support:
  - ASM.js version, any other old versions of Tesseract.js-core (<3.0.0)
  - Node.js versions 10 and 12

Contributing

Development

To run a development copy of Tesseract.js do the following:

# First we clone the repository
git clone https://github.com/naptha/tesseract.js.git
cd tesseract.js

# Then we install the dependencies
npm install

# And finally we start the development server
npm start

The development server will be available at http://localhost:3000/examples/browser/basic-efficient.html in your favorite browser. It will automatically rebuild tesseract.min.js and worker.min.js when you change files in the src folder.

Online Setup with a single Click

You can use Gitpod (a free online IDE similar to VS Code) for contributing. With a single click it launches a ready-to-code workspace with the build and start scripts already running, and within a few seconds the dev server is up so you can start contributing right away.