OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Top Related Projects

60,774

Tesseract Open Source OCR Engine (main repository)

13,565

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

16,195

Rembg is a tool to remove images background

A Python wrapper for Google Tesseract

23,625

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

Quick Overview

OCRmyPDF is an open-source command-line tool and Python library that adds an OCR text layer to scanned PDF files. It uses the Tesseract OCR engine to recognize text in page images and embeds the results back into the PDF, making the document searchable and its text selectable while preserving the original layout and formatting.

Pros

  • Preserves the original PDF layout and metadata
  • Supports multiple languages through Tesseract language packs; the languages to search for are passed as a hint with -l
  • Offers various optimization options, including image processing and PDF compression
  • Can be used as both a command-line tool and a Python library

Cons

  • Requires installation of several dependencies, which can be complex for some users
  • OCR accuracy depends on the quality of the input images and Tesseract's capabilities
  • Processing large or complex PDFs can be time-consuming

Code Examples

  1. Basic OCR processing of a PDF file:

    import ocrmypdf

    ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

  2. OCR with language specification and optimization:

    import ocrmypdf

    # skip_text=True leaves pages that already contain text untouched
    ocrmypdf.ocr('input.pdf', 'output.pdf', language=['eng', 'fra'], optimize=3, skip_text=True)

  3. Using OCRmyPDF as a library with custom options:

    import ocrmypdf

    # Keyword arguments mirror the command-line options
    options = {
        'deskew': True,               # straighten crooked pages before OCR
        'clean': True,                # clean pages before OCR (requires the unpaper program)
        'optimize': 2,                # lossy image optimization
        'output_type': 'pdf',         # plain PDF instead of PDF/A
        'sidecar': 'ocr_text.txt',    # also write the recognized text to a text file
    }

    ocrmypdf.ocr('input.pdf', 'output.pdf', **options)

Getting Started

To get started with OCRmyPDF, follow these steps:

  1. Install the OCRmyPDF Python package (Tesseract OCR and Ghostscript are external programs and must be installed separately):

    pip install ocrmypdf
    
  2. Basic usage from the command line:

    ocrmypdf input.pdf output.pdf
    
  3. Basic usage in Python:

    import ocrmypdf
    
    ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)
    

For more advanced options and configurations, refer to the official documentation.
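
Because OCRmyPDF is scriptable, a common next step is to process a whole folder of scans. The following is a minimal sketch, not an official recipe: the helper names `plan_batch` and `run_batch` and the `_ocr` file suffix are assumptions made for illustration. Only `ocrmypdf.ocr` and its `deskew` and `skip_text` keywords come from the examples above.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir):
    """Pair every PDF in input_dir with an output path in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [
        (pdf, output_dir / f"{pdf.stem}_ocr.pdf")  # hypothetical naming scheme
        for pdf in sorted(input_dir.glob("*.pdf"))
    ]


def run_batch(input_dir, output_dir):
    """OCR each planned file; needs OCRmyPDF, Tesseract and Ghostscript installed."""
    import ocrmypdf  # imported lazily so plan_batch works without OCRmyPDF

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for source, target in plan_batch(input_dir, output_dir):
        # skip_text=True leaves pages that already contain text untouched
        ocrmypdf.ocr(source, target, deskew=True, skip_text=True)
```

When calling `run_batch` from a script, it is safest to do so under an `if __name__ == '__main__':` guard, since OCRmyPDF may start worker processes.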

Competitor Comparisons

60,774

Tesseract Open Source OCR Engine (main repository)

Pros of Tesseract

  • Widely recognized as one of the most accurate open-source OCR engines
  • Supports a vast array of languages and scripts
  • Can be integrated into various applications and workflows

Cons of Tesseract

  • Requires more technical expertise to set up and use effectively
  • Limited built-in PDF handling capabilities
  • May require additional pre-processing steps for optimal results

Code Comparison

Tesseract (command-line usage):

tesseract input.png output -l eng

OCRmyPDF (command-line usage):

ocrmypdf input.pdf output.pdf

OCRmyPDF is a higher-level tool that uses Tesseract as its OCR engine. It provides a more user-friendly interface for PDF-specific OCR tasks, including automatic image preprocessing, PDF/A output, and metadata preservation. Tesseract, on the other hand, is a lower-level OCR engine that offers more flexibility for integration into custom applications but requires additional tools and steps for PDF processing.

Because OCRmyPDF delegates recognition to Tesseract, OCR accuracy and language support are the same in both. What OCRmyPDF adds is a streamlined workflow for adding OCR to PDF files, which makes it more accessible for users who primarily work with PDFs.

13,565

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Pros of OCRmyPDF

  • Actively maintained with regular updates and bug fixes
  • Extensive documentation and user guides available
  • Supports a wide range of PDF manipulation features beyond OCR

Cons of OCRmyPDF

  • May have a steeper learning curve for beginners
  • Requires additional dependencies to be installed
  • Can be resource-intensive for large PDF files

Code Comparison

OCRmyPDF:

import ocrmypdf

ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True, optimize=1, skip_text=True)

Summary

OCRmyPDF is a powerful tool for adding OCR text layers to PDF files. It offers a wide range of features and is actively maintained. However, it may require some technical knowledge to set up and use effectively. The repository provides comprehensive documentation and support for various PDF manipulation tasks beyond OCR.

16,195

Rembg is a tool to remove images background

Pros of rembg

  • Specialized in background removal from images
  • Supports various input and output formats (PNG, JPG, WebP)
  • Offers both CLI and Python API for integration

Cons of rembg

  • Limited to a single task (background removal)
  • May require more manual intervention for complex images
  • Performs no OCR or PDF processing, so it is not a substitute for OCRmyPDF

Code Comparison

rembg:

from rembg import remove
from PIL import Image

input_path = 'input.png'
output_path = 'output.png'
input_image = Image.open(input_path)   # avoid shadowing the built-in input()
output_image = remove(input_image)     # image with the background removed
output_image.save(output_path)

OCRmyPDF:

import ocrmypdf

input_file = 'input.pdf'
output_file = 'output.pdf'
ocrmypdf.ocr(input_file, output_file, deskew=True, optimize=3)

Summary

While rembg focuses on background removal from images, OCRmyPDF specializes in adding OCR layers to PDF files. rembg offers simplicity for its specific task, while OCRmyPDF provides a broader range of PDF-related features. The choice between the two depends on the specific image or document processing needs of the user.

A Python wrapper for Google Tesseract

Pros of pytesseract

  • Lightweight and focused on OCR functionality
  • Easy to integrate into existing Python projects
  • Provides direct access to Tesseract OCR engine features

Cons of pytesseract

  • Limited to OCR functionality only, no PDF handling
  • Requires manual image preprocessing for optimal results
  • Less user-friendly for non-programmers

Code Comparison

pytesseract:

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('image.png'))
print(text)

OCRmyPDF:

import ocrmypdf

ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

Summary

pytesseract is a Python wrapper for the Tesseract OCR engine, providing direct access to OCR functionality. It's lightweight and easy to integrate into existing projects but requires more manual work for image preprocessing and lacks PDF handling capabilities.

OCRmyPDF, on the other hand, is a more comprehensive tool that focuses on adding OCR layers to PDF files. It includes features like automatic image preprocessing, PDF handling, and optimization, making it more user-friendly for end-users but potentially less flexible for developers needing fine-grained control over the OCR process.

23,625

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

Pros of EasyOCR

  • Supports multiple languages (80+) out of the box
  • Easier to use for general OCR tasks, especially for images
  • Works directly on images rather than PDFs, so it fits image-centric workflows

Cons of EasyOCR

  • May be slower for large-scale PDF processing
  • Less specialized for PDF-specific optimizations
  • Doesn't include advanced PDF manipulation features

Code Comparison

EasyOCR:

import easyocr
reader = easyocr.Reader(['en'])
result = reader.readtext('image.jpg')

OCRmyPDF:

import ocrmypdf
ocrmypdf.ocr('input.pdf', 'output.pdf', deskew=True)

EasyOCR is the more straightforward choice for extracting text from images, and it covers many languages out of the box. OCRmyPDF is built for PDFs: it adds a searchable text layer to the file and offers PDF-specific features such as deskewing and optimization. Choose EasyOCR for general-purpose OCR on images, and OCRmyPDF for PDF-centric workflows.

README

OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
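
The annotated command above can also be assembled from a script. The sketch below is one possible way to do it: the `build_command` helper and its parameter names are hypothetical, while every flag it emits is taken from the listing above.

```python
import shlex


def build_command(input_pdf, output_pdf, languages=("eng",), rotate_pages=False,
                  deskew=False, title=None, jobs=None, output_type=None):
    """Return the ocrmypdf argument list for the chosen options."""
    cmd = ["ocrmypdf", "-l", "+".join(languages)]  # languages are joined with '+'
    if rotate_pages:
        cmd.append("--rotate-pages")
    if deskew:
        cmd.append("--deskew")
    if title is not None:
        cmd += ["--title", title]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    return cmd + [input_pdf, output_pdf]


cmd = build_command("input_scanned.pdf", "output_searchable.pdf",
                    languages=("eng", "fra"), rotate_pages=True, deskew=True,
                    title="My PDF", jobs=4, output_type="pdfa")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems with values such as the title.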

See the release notes for details on the latest changes.

Main features

  • Generates a searchable PDF/A file from a regular PDF
  • Places OCR text accurately below the image to ease copy / paste
  • Keeps the exact resolution of the original embedded images
  • When possible, inserts OCR information as a "lossless" operation without disrupting any other content
  • Optimizes PDF images, often producing files smaller than the input file
  • If requested, deskews and/or cleans the image before performing OCR
  • Validates input and output files
  • Distributes work across all available CPU cores
  • Uses Tesseract OCR engine to recognize more than 100 languages
  • Keeps your private data private.
  • Scales properly to handle files with thousands of pages.
  • Battle-tested on millions of PDFs.

For details: please consult the documentation.

Motivation

I searched the web for a free command line tool to OCR PDF files: I found many, but none of them were really satisfying:

  • Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
  • Or they did not handle accents and multilingual characters
  • Or they changed the resolution of the embedded images
  • Or they generated ridiculously large PDF files
  • Or they crashed when trying to OCR
  • Or they did not produce valid PDF files
  • On top of that, none of them produced PDF/A files (a format designed for long-term storage)

...so I decided to develop my own tool.

Installation

Linux, Windows, macOS and FreeBSD are supported. Docker images are also available, for both x64 and ARM.

Operating system              Install command
Debian, Ubuntu                apt install ocrmypdf
Windows Subsystem for Linux   apt install ocrmypdf
Fedora                        dnf install ocrmypdf
macOS (Homebrew)              brew install ocrmypdf
macOS (MacPorts)              port install ocrmypdf
macOS (nix)                   nix-env -i ocrmypdf
LinuxBrew                     brew install ocrmypdf
FreeBSD                       pkg install py-ocrmypdf
Conda                         conda install ocrmypdf
Ubuntu Snap                   snap install ocrmypdf

For everyone else, see our documentation for installation steps.

Languages

OCRmyPDF uses Tesseract for OCR, and relies on its language packs. For Linux users, you can often find packages that provide language packs:

# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack

# Arch Linux users
pacman -S tesseract-data-eng tesseract-data-deu # Example: Install the English and German language packs

# brew macOS users
brew install tesseract-lang

You can then pass the -l LANG argument to OCRmyPDF to give a hint as to what languages it should search for. Multiple languages can be requested.
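
Before passing -l, it can help to confirm that the requested language packs are actually installed. The sketch below parses the text printed by `tesseract --list-langs`. The function names are hypothetical, and the assumed output layout (one header line followed by one language code per line) should be checked against your Tesseract version.

```python
def installed_languages(list_langs_output):
    """Parse the output of `tesseract --list-langs` into a set of codes."""
    lines = list_langs_output.strip().splitlines()
    # Assumption: the first line is a header such as
    # 'List of available languages in "/usr/share/tessdata/" (3):'
    return {line.strip() for line in lines[1:] if line.strip()}


def missing_languages(requested, list_langs_output):
    """Return the requested codes that Tesseract does not report as installed."""
    installed = installed_languages(list_langs_output)
    return [code for code in requested if code not in installed]


sample_output = (
    'List of available languages in "/usr/share/tessdata/" (3):\n'
    "eng\nfra\nosd\n"
)
print(missing_languages(["eng", "deu"], sample_output))
```

In a real script the text would come from something like `subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True).stdout`.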

OCRmyPDF supports Tesseract 4.1.1+. It will automatically use whichever version it finds first on the PATH environment variable. On Windows, if PATH does not provide a Tesseract binary, we use the highest version number that is installed according to the Windows Registry.
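
To see which Tesseract a script would pick up, and whether it meets the 4.1.1 minimum stated above, the version string can be parsed by hand. This is only an illustrative sketch and not how OCRmyPDF itself performs the check. The helper names are hypothetical, and it assumes the first dotted number in the `tesseract --version` output is the Tesseract version.

```python
import re

MINIMUM_TESSERACT = (4, 1, 1)  # minimum version stated above


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from `tesseract --version` output."""
    # Assumption: the first x.y.z number in the output belongs to Tesseract
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version number found in tesseract output")
    return tuple(int(part) for part in match.groups())


def is_supported(version_output):
    """True when the reported version is at least MINIMUM_TESSERACT."""
    return parse_tesseract_version(version_output) >= MINIMUM_TESSERACT
```

`shutil.which("tesseract")` shows which binary is found first on PATH, and `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)` provides the text to parse.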

Documentation and support

Once OCRmyPDF is installed, the built-in help which explains the command syntax and options can be accessed via:

ocrmypdf --help

Our documentation is served on Read the Docs.

Please report issues on our GitHub issues page, and follow the issue template for quick response.

Requirements

In addition to the required Python version (3.8+), OCRmyPDF requires external program installations of Ghostscript and Tesseract OCR. OCRmyPDF is pure Python, and runs on pretty much everything: Linux, macOS, Windows and FreeBSD.
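
A quick preflight check for these requirements might look like the sketch below. The function name is hypothetical, and the executable names are assumptions: Ghostscript is typically `gs` on Unix-like systems but is named differently on Windows.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 8)  # minimum Python version stated above
# Assumed executable names; adjust for your platform
REQUIRED_PROGRAMS = ("tesseract", "gs")


def missing_requirements(python_version=sys.version_info, which=shutil.which):
    """Return a list describing every requirement that is not satisfied."""
    problems = []
    if tuple(python_version[:2]) < REQUIRED_PYTHON:
        problems.append("python>=3.8")
    # which() returns None when a program is not found on PATH
    problems += [name for name in REQUIRED_PROGRAMS if which(name) is None]
    return problems
```

Calling `missing_requirements()` with no arguments inspects the running interpreter and the current PATH. The parameters exist so the check can be exercised without changing the environment.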

Press & Media

Business enquiries

OCRmyPDF would not be the software that it is today without companies and users choosing to provide support for feature development and consulting enquiries. We are happy to discuss all enquiries, whether for extending the existing feature set, or integrating OCRmyPDF into a larger system.

License

The OCRmyPDF software is licensed under the Mozilla Public License 2.0 (MPL-2.0). This license permits integration of OCRmyPDF with other code, including commercial and closed source, but asks you to publish source-level modifications you make to OCRmyPDF.

Some components of OCRmyPDF have other licenses, as indicated by standard SPDX license identifiers or the DEP5 copyright and licensing information file. Generally speaking, non-core code is licensed under MIT, and the documentation and test files are licensed under Creative Commons ShareAlike 4.0 (CC-BY-SA 4.0).

Disclaimer

The software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.