OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Top Related Projects

tesseract

68,574

Tesseract Open Source OCR Engine (main repository)

rembg

20,619

Rembg is a tool to remove images background

pytesseract

6,231

A Python wrapper for Google Tesseract

EasyOCR

27,439

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

Quick Overview

OCRmyPDF is an open-source command-line tool and Python library that adds an OCR text layer to scanned PDF files. It uses the Tesseract OCR engine to recognize text in page images and embeds the results back into the PDF, making the document searchable and selectable while preserving the original layout.

Pros

Preserves the original PDF layout and metadata
Supports multiple languages through Tesseract language packs; the languages to look for are specified with the -l option
Offers various optimization options, including image processing and PDF compression
Can be used as both a command-line tool and a Python library

Cons

Requires installation of several dependencies, which can be complex for some users
OCR accuracy depends on the quality of the input images and Tesseract's capabilities
Processing large or complex PDFs can be time-consuming

Code Examples

Basic OCR processing of a PDF file:

import ocrmypdf

ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

OCR with language specification and optimization:

import ocrmypdf

ocrmypdf.ocr('input.pdf', 'output.pdf', language=['eng', 'fra'], optimize=3, skip_text=True)

Using OCRmyPDF as a library with custom options:

import ocrmypdf

# Keyword arguments mirror the command-line options of the same name.
options = {
    'deskew': True,             # straighten crooked pages
    'clean': True,              # clean pages before OCR (needs unpaper installed)
    'optimize': 2,              # higher optimization level
    'output_type': 'pdf',       # plain PDF output
    'sidecar': 'ocr_text.txt',  # also write the recognized text to this file
}

ocrmypdf.ocr('input.pdf', 'output.pdf', **options)

Getting Started

To get started with OCRmyPDF, follow these steps:

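Install OCRmyPDF with pip. This installs the Python package only; Tesseract OCR and Ghostscript are external programs and must be installed separately: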
```
pip install ocrmypdf
```
Basic usage from the command line:
```
ocrmypdf input.pdf output.pdf
```

Basic usage in Python:

import ocrmypdf

ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

For more advanced options and configurations, refer to the official documentation.

Competitor Comparisons

tesseract

68,574

Tesseract Open Source OCR Engine (main repository)

Pros of Tesseract

Widely recognized as one of the most accurate open-source OCR engines
Supports a vast array of languages and scripts
Can be integrated into various applications and workflows

Cons of Tesseract

Requires more technical expertise to set up and use effectively
Limited built-in PDF handling capabilities
May require additional pre-processing steps for optimal results

Code Comparison

Tesseract (command-line usage):

tesseract input.png output -l eng

OCRmyPDF (command-line usage):

ocrmypdf input.pdf output.pdf

OCRmyPDF is a higher-level tool that uses Tesseract as its OCR engine. It provides a more user-friendly interface for PDF-specific OCR tasks, including automatic image preprocessing, PDF/A output, and metadata preservation. Tesseract, on the other hand, is a lower-level OCR engine that offers more flexibility for integration into custom applications but requires additional tools and steps for PDF processing.

Because OCRmyPDF delegates recognition to Tesseract, both offer the same OCR accuracy and language support. OCRmyPDF's contribution is simplifying the process of adding OCR to PDF files, which makes it more accessible for users who primarily work with PDFs and want a streamlined workflow.

rembg

20,619

Rembg is a tool to remove images background

Pros of rembg

Specialized in background removal from images
Supports various input and output formats (PNG, JPG, WebP)
Offers both CLI and Python API for integration

Cons of rembg

Limited to a single task (background removal)
May require more manual intervention for complex images
Smaller community and fewer contributors compared to OCRmyPDF

Code Comparison

rembg:

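from rembg import remove
from PIL import Image

input_path = 'input.png'
output_path = 'output.png'
# Avoid naming the variable `input`, which would shadow the built-in function
input_image = Image.open(input_path)
output_image = remove(input_image)
output_image.save(output_path)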

OCRmyPDF:

import ocrmypdf

input_file = 'input.pdf'
output_file = 'output.pdf'
ocrmypdf.ocr(input_file, output_file, deskew=True, optimize=3)

Summary

While rembg focuses on background removal from images, OCRmyPDF specializes in adding OCR layers to PDF files. rembg offers simplicity for its specific task, while OCRmyPDF provides a broader range of PDF-related features. The choice between the two depends on the specific image or document processing needs of the user.

pytesseract

6,231

A Python wrapper for Google Tesseract

Pros of pytesseract

Lightweight and focused on OCR functionality
Easy to integrate into existing Python projects
Provides direct access to Tesseract OCR engine features

Cons of pytesseract

Limited to OCR functionality only, no PDF handling
Requires manual image preprocessing for optimal results
Less user-friendly for non-programmers

Code Comparison

pytesseract:

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('image.png'))
print(text)

OCRmyPDF:

import ocrmypdf

ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

Summary

pytesseract is a Python wrapper for the Tesseract OCR engine, providing direct access to OCR functionality. It is lightweight and easy to integrate into existing projects, but it requires more manual work for image preprocessing and lacks PDF handling capabilities.

OCRmyPDF, on the other hand, is a more comprehensive tool that focuses on adding OCR layers to PDF files. It includes features like automatic image preprocessing, PDF handling, and optimization, making it more user-friendly for end-users but potentially less flexible for developers needing fine-grained control over the OCR process.

EasyOCR

27,439

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

Pros of EasyOCR

Supports 80+ languages out of the box
Easier to use for general OCR tasks, especially for images
Accepts image input in several forms (file path, URL, NumPy array, or bytes)

Cons of EasyOCR

May be slower for large-scale PDF processing
Less specialized for PDF-specific optimizations
Doesn't include advanced PDF manipulation features

Code Comparison

EasyOCR:

import easyocr
reader = easyocr.Reader(['en'])
result = reader.readtext('image.jpg')

OCRmyPDF:

import ocrmypdf
ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

EasyOCR is more straightforward for general OCR tasks on images, while OCRmyPDF is designed specifically for PDF processing, with features such as deskewing, optimization, and PDF/A output. EasyOCR offers multi-language support by default, whereas OCRmyPDF relies on Tesseract language packs.

The choice depends on the use case: EasyOCR suits general-purpose text extraction from images across many languages, and OCRmyPDF suits PDF-centric workflows that need a searchable PDF as output.

README

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
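
The annotated command above can also be assembled programmatically. The following is a minimal sketch, not part of OCRmyPDF itself: `build_ocrmypdf_command` and `run` are hypothetical helper names, and the sketch uses only the flags shown in the listing above. It assumes `ocrmypdf` is on the `PATH` when the command is eventually run.

```python
import shlex
import subprocess


def build_ocrmypdf_command(
    input_pdf,
    output_pdf,
    languages=("eng",),
    rotate_pages=False,
    deskew=False,
    title=None,
    jobs=None,
    output_type=None,
):
    """Return an ocrmypdf argument list (hypothetical helper, README flags only)."""
    # Multiple languages are joined with '+', as in `-l eng+fra`.
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if rotate_pages:
        cmd.append("--rotate-pages")
    if deskew:
        cmd.append("--deskew")
    if title is not None:
        cmd += ["--title", title]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def run(cmd):
    """Execute the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


command = build_ocrmypdf_command(
    "input_scanned.pdf",
    "output_searchable.pdf",
    languages=("eng", "fra"),
    rotate_pages=True,
    deskew=True,
    title="My PDF",
    jobs=4,
    output_type="pdfa",
)
# Print a shell-quoted version for inspection; call run(command) to execute it.
print(shlex.join(command))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with values such as a title that contains spaces.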

See the release notes for details on the latest changes.

Main features

Generates a searchable PDF/A file from a regular PDF
Places OCR text accurately below the image to ease copy / paste
Keeps the exact resolution of the original embedded images
When possible, inserts OCR information as a "lossless" operation without disrupting any other content
Optimizes PDF images, often producing files smaller than the input file
If requested, deskews and/or cleans the image before performing OCR
Validates input and output files
Distributes work across all available CPU cores
Uses Tesseract OCR engine to recognize more than 100 languages
Keeps your private data private.
Scales properly to handle files with thousands of pages.
Battle-tested on millions of PDFs.

For details: please consult the documentation.

Motivation

I searched the web for a free command line tool to OCR PDF files: I found many, but none of them were really satisfying:

Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
Or they did not handle accents and multilingual characters
Or they changed the resolution of the embedded images
Or they generated ridiculously large PDF files
Or they crashed when trying to OCR
Or they did not produce valid PDF files
On top of that, none of them produced PDF/A files (a format dedicated to long-term storage)

...so I decided to develop my own tool.

Installation

Linux, Windows, macOS and FreeBSD are supported. Docker images are also available, for both x64 and ARM.

| Operating system | Install command |
| --- | --- |
| Debian, Ubuntu | `apt install ocrmypdf` |
| Windows Subsystem for Linux | `apt install ocrmypdf` |
| Fedora | `dnf install ocrmypdf` |
| macOS (Homebrew) | `brew install ocrmypdf` |
| macOS (MacPorts) | `port install ocrmypdf` |
| macOS (nix) | `nix-env -i ocrmypdf` |
| LinuxBrew | `brew install ocrmypdf` |
| FreeBSD | `pkg install py-ocrmypdf` |
| OpenBSD | `pkg_add ocrmypdf` |
| Ubuntu Snap | `snap install ocrmypdf` |

For everyone else, see our documentation for installation steps.

Languages

OCRmyPDF uses Tesseract for OCR, and relies on its language packs. For Linux users, you can often find packages that provide language packs:

# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack

# Arch Linux users
pacman -S tesseract-data-eng tesseract-data-deu # Example: Install the English and German language packs

# OpenBSD users
pkg_info -aQ tesseract  # Display a list of all Tesseract language packs
pkg_add tesseract-cym  # Example: Install the Welsh language pack

# brew macOS users
brew install tesseract-lang

You can then pass the -l LANG argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple languages can be requested.

OCRmyPDF supports Tesseract 4.1.1+. It will automatically use whichever version it finds first on the PATH environment variable. On Windows, if PATH does not provide a Tesseract binary, we use the highest version number that is installed according to the Windows Registry.
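
Before requesting languages with `-l`, it can help to check which language packs the Tesseract on `PATH` can actually see. The sketch below is one possible way to do that using Tesseract's `--list-langs` option. The helper names are hypothetical, and the header line format in `sample` is an assumption about typical Tesseract output, not something OCRmyPDF guarantees.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # 'List of available languages in "/usr/share/tessdata/" (3):'
    if lines and lines[0].lower().startswith("list of"):
        lines = lines[1:]
    return sorted(lines)


def installed_languages():
    """Ask the first tesseract found on PATH which language packs it sees."""
    tesseract = shutil.which("tesseract")
    if tesseract is None:
        return []
    result = subprocess.run(
        [tesseract, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions may print the list on stderr, so check both.
    return parse_list_langs(result.stdout or result.stderr)


def missing_languages(requested, available):
    """Return the codes in a request like 'eng+fra' that are not available."""
    return [code for code in requested.split("+") if code not in available]


# Offline demonstration with sample output (assumed format).
sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nfra\nosd\n'
print(missing_languages("eng+deu", parse_list_langs(sample)))
```

In real use, `missing_languages("eng+fra", installed_languages())` would report any packs that still need to be installed with the package manager commands listed earlier.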

Documentation and support

Once OCRmyPDF is installed, the built-in help which explains the command syntax and options can be accessed via:

ocrmypdf --help

Our documentation is served on Read the Docs.

Please report issues on our GitHub issues page, and follow the issue template for quick response.

Feature demo

# Add an OCR layer and convert to PDF/A
ocrmypdf input.pdf output.pdf

# Convert an image to single page PDF
ocrmypdf input.jpg output.pdf

# Add OCR to a file in place (only modifies file on success)
ocrmypdf myfile.pdf myfile.pdf

# OCR with non-English languages (look up your language's ISO 639-3 code)
ocrmypdf -l fra LeParisien.pdf LeParisien.pdf

# OCR multilingual documents
ocrmypdf -l eng+fra Bilingual-English-French.pdf Bilingual-English-French.pdf

# Deskew (straighten crooked pages)
ocrmypdf --deskew input.pdf output.pdf
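
The single-file commands above extend naturally to a folder of scans. The following is a rough sketch of one way to plan such a batch from Python. `plan_batch` is a hypothetical helper, the directory names are placeholders, and skipping outputs that already exist is a design choice of this example rather than OCRmyPDF behavior.

```python
import tempfile
from pathlib import Path


def plan_batch(src_dir, dst_dir, extra_args=("--deskew",)):
    """Build one ocrmypdf command per PDF in src_dir, skipping finished outputs."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    commands = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        target = dst_dir / pdf.name
        if target.exists():
            continue  # an output with this name exists from an earlier run
        commands.append(["ocrmypdf", *extra_args, str(pdf), str(target)])
    return commands


# Demonstration on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src, out = Path(tmp, "scans"), Path(tmp, "ocr")
    src.mkdir()
    out.mkdir()
    (src / "a.pdf").touch()
    (src / "b.pdf").touch()
    (out / "a.pdf").touch()  # pretend a.pdf was already processed
    planned = plan_batch(src, out)

print([Path(cmd[-1]).name for cmd in planned])
```

Each planned command can then be passed to `subprocess.run(cmd, check=True)`. Writing to a separate output directory keeps the original scans untouched.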

For more features, see the documentation.

Requirements

In addition to the required Python version, OCRmyPDF requires external program installations of Ghostscript and Tesseract OCR. OCRmyPDF is pure Python, and runs on pretty much everything: Linux, macOS, Windows and FreeBSD.
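
Since Tesseract and Ghostscript are separate programs, a quick check that they are reachable can save a failed run. This is a minimal sketch using only the standard library. `find_missing` is a hypothetical helper, and the executable names (`tesseract`, `gs`, and `gswin64c` on 64-bit Windows) are assumptions about typical installations.

```python
import shutil

# Assumption: typical executable names. Ghostscript is usually `gs` on
# Unix-like systems and commonly `gswin64c` on 64-bit Windows.
REQUIRED_PROGRAMS = {
    "Tesseract OCR": ("tesseract",),
    "Ghostscript": ("gs", "gswin64c"),
}


def find_missing(required=REQUIRED_PROGRAMS, which=shutil.which):
    """Return labels of required programs with no candidate executable on PATH."""
    missing = []
    for label, candidates in required.items():
        if not any(which(name) for name in candidates):
            missing.append(label)
    return missing


print(find_missing() or "all external programs found")
```

The `which` parameter exists only so the lookup can be swapped out in tests; normal use calls `find_missing()` with no arguments.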

Press & Media

Going paperless with OCRmyPDF
Converting a scanned document into a compressed searchable PDF with redactions
c't 1-2014, page 59: Detailed presentation of OCRmyPDF v1.0 in the leading German IT magazine c't
heise Open Source, 09/2014: Texterkennung mit OCRmyPDF
heise Durchsuchbare PDF-Dokumente mit OCRmyPDF erstellen
Excellent Utilities: OCRmyPDF
LinuxUser Texterkennung mit OCRmyPDF und Scanbd automatisieren
Y Combinator discussion

Business enquiries

OCRmyPDF would not be the software that it is today without companies and users choosing to provide support for feature development and consulting enquiries. We are happy to discuss all enquiries, whether for extending the existing feature set, or integrating OCRmyPDF into a larger system.

License

The OCRmyPDF software is licensed under the Mozilla Public License 2.0 (MPL-2.0). This license permits integration of OCRmyPDF with other code, including commercial and closed source, but asks you to publish source-level modifications you make to OCRmyPDF.

Some components of OCRmyPDF have other licenses, as indicated by standard SPDX license identifiers or the DEP5 copyright and licensing information file. Generally speaking, non-core code is licensed under MIT, and the documentation and test files are licensed under Creative Commons Attribution-ShareAlike 4.0 (CC-BY-SA 4.0).

Disclaimer

The software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
