VikParuchuri / surya

OCR, layout analysis, reading order, table recognition in 90+ languages

Quick Overview

Surya is an open-source document OCR (Optical Character Recognition) toolkit. It performs text recognition in 90+ languages, line-level text detection in any language, layout analysis, reading order detection, and table recognition.

Pros

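  • OCR that scores higher than Tesseract on average similarity in the project's own benchmarks
  • OCR in 90+ languages; text detection, layout analysis, and reading order work with any language
  • Open-source, with model weights that download automatically on first run
  • Built on modern deep learning models (a modified efficientvit for detection, a modified donut for recognition)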

Cons

  • Specialized for document OCR; it will likely not work on photos or other images
  • Aimed at printed text rather than handwriting
  • Commercial use of the model weights is restricted (cc-by-nc-sa-4.0, with a waiver for smaller organizations)
  • Default GPU batch sizes assume substantial VRAM (about 20GB for recognition), so settings may need tuning on smaller hardware

Code Examples

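  1. Basic OCR processing:
from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

# Load the detection and recognition models once and reuse them
det_processor, det_model = load_det_processor(), load_det_model()
rec_model, rec_processor = load_rec_model(), load_rec_processor()

image = Image.open("path/to/image.jpg")
predictions = run_ocr([image], [["en"]], det_model, det_processor, rec_model, rec_processor)
  2. Specifying language for OCR (reusing the models loaded above):
# Pass one list of language codes per image; "hi" is Hindi
predictions = run_ocr([image], [["hi"]], det_model, det_processor, rec_model, rec_processor)
  3. Batch processing multiple images:
image_paths = ["image1.jpg", "image2.png", "image3.tiff"]
images = [Image.open(path) for path in image_paths]
langs = [["en"]] * len(images)  # one language list per image
predictions = run_ocr(images, langs, det_model, det_processor, rec_model, rec_processor)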

Getting Started

To get started with Surya, follow these steps:

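  1. Install Surya using pip (Python 3.10+ and PyTorch are required):
pip install surya-ocr
  2. Import Surya and load the detection and recognition models in your Python script:
from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

det_processor, det_model = load_det_processor(), load_det_model()
rec_model, rec_processor = load_rec_model(), load_rec_processor()
  3. Process an image and retrieve the OCR results:
image = Image.open("path/to/your/image.jpg")
predictions = run_ocr([image], [["en"]], det_model, det_processor, rec_model, rec_processor)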

For more advanced usage and configuration options, refer to the project's documentation on GitHub.

Competitor Comparisons

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

Pros of transformers

  • Extensive library with support for numerous pre-trained models and architectures
  • Large community and frequent updates, ensuring compatibility with latest research
  • Comprehensive documentation and examples for various NLP tasks

Cons of transformers

  • Steeper learning curve due to its extensive features and options
  • Higher resource requirements for running and fine-tuning large models
  • May be overkill for simpler NLP tasks or projects with limited scope

Code comparison

transformers:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")[0]
print(f"Label: {result['label']}, Score: {result['score']:.4f}")

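surya (it has no sentiment classifier; the closest equivalent is running OCR on a document image):

from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

image = Image.open("path/to/image.jpg")
predictions = run_ocr([image], [["en"]], load_det_model(), load_det_processor(), load_rec_model(), load_rec_processor())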

Summary

transformers is a comprehensive library for state-of-the-art NLP models, offering a wide range of pre-trained models and extensive documentation. It's ideal for complex NLP tasks and research projects. surya, on the other hand, is a focused document OCR toolkit: it detects and recognizes text in images and PDFs and does not classify text. The two address different problems, so the choice depends on whether you need general NLP models or document OCR.

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • Extensive optimization techniques for large-scale model training
  • Supports a wide range of hardware configurations and distributed training
  • Actively maintained by Microsoft with frequent updates and improvements

Cons of DeepSpeed

  • Steeper learning curve due to its complexity and extensive features
  • May be overkill for smaller projects or individual researchers
  • Requires more setup and configuration compared to simpler alternatives

Code Comparison

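Surya (text line detection on a document image):

from PIL import Image
from surya.detection import batch_text_detection
from surya.model.detection.model import load_model, load_processor

image = Image.open("path/to/image.jpg")
model, processor = load_model(), load_processor()
predictions = batch_text_detection([image], model, processor)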

DeepSpeed:

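import deepspeed
import torch

# Assumes `model` (a torch.nn.Module) and `ds_config` (a DeepSpeed config) are defined elsewhere
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)
# Use a float tensor shaped to match the model's expected input
output = model(torch.tensor([1.0, 2.0, 3.0]))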

Summary

DeepSpeed is a powerful library for optimizing large-scale model training, offering extensive features and support for distributed computing. It's well-maintained by Microsoft but may be complex for smaller projects. Surya, on the other hand, is a document OCR toolkit that ships pretrained models for inference on images and PDFs. The two are not interchangeable: DeepSpeed helps you train and serve large models, while Surya extracts text and layout from documents.

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

  • Mature, widely-used deep learning framework with extensive community support
  • Comprehensive documentation and large ecosystem of tools/libraries
  • Supports a wide range of deep learning models and applications

Cons of PyTorch

  • Larger codebase and steeper learning curve for beginners
  • Higher computational requirements for running models
  • A general-purpose framework with no ready-made OCR models, unlike Surya (which itself requires PyTorch)

Code Comparison

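Surya (OCR):

from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

image = Image.open("path/to/image.jpg")
predictions = run_ocr([image], [["en"]], load_det_model(), load_det_processor(), load_rec_model(), load_rec_processor())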

PyTorch (general neural network):

import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

Summary

PyTorch is a comprehensive deep learning framework suitable for various AI tasks, while Surya is a specialized document OCR toolkit that runs on top of PyTorch. PyTorch offers more flexibility and a larger ecosystem, but Surya provides ready-made models and a simpler interface for OCR, text detection, layout analysis, and table recognition.

TensorFlow code and pre-trained models for BERT

Pros of BERT

  • Widely adopted and extensively researched in the NLP community
  • Supports multiple languages and has pre-trained models available
  • Backed by Google, ensuring ongoing development and support

Cons of BERT

  • Requires significant computational resources for training and fine-tuning
  • Can be complex to implement and optimize for specific tasks
  • May not perform as well on domain-specific tasks without extensive fine-tuning

Code Comparison

BERT example:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

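Surya example (text line detection; Surya does not produce text embeddings):

from PIL import Image
from surya.detection import batch_text_detection
from surya.model.detection.model import load_model, load_processor

image = Image.open("path/to/image.jpg")
model, processor = load_model(), load_processor()
predictions = batch_text_detection([image], model, processor)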

Summary

BERT provides a pre-trained language model for various NLP tasks, while Surya is specialized for document OCR and works on images and PDFs rather than on text. BERT offers more flexibility for language understanding but requires more setup, whereas Surya provides a small set of functions for detection, recognition, and layout analysis.

An Open Source Machine Learning Framework for Everyone

Pros of TensorFlow

  • Extensive ecosystem with wide industry adoption and support
  • Comprehensive documentation and large community for troubleshooting
  • Supports multiple programming languages and platforms

Cons of TensorFlow

  • Steeper learning curve for beginners
  • Can be resource-intensive and slower for smaller projects
  • More complex setup and configuration process

Code Comparison

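Surya (Python, inference only; the README documents no training API):

from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

image = Image.open("path/to/image.jpg")
predictions = run_ocr([image], [["en"]], load_det_model(), load_det_processor(), load_rec_model(), load_rec_processor())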

TensorFlow (Python):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(x_train, y_train, epochs=5)
predictions = model.predict(x_test)

Summary

Surya ships pretrained models for document OCR and exposes inference functions, while TensorFlow offers flexibility and control over model architecture and the training process. Surya needs only a few lines of code for OCR, whereas TensorFlow provides a more detailed and customizable approach to building and training models of any kind.

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

  • More comprehensive and feature-rich, supporting a wide range of NLP tasks
  • Backed by Facebook AI Research, with regular updates and a large community
  • Extensive documentation and examples for various use cases

Cons of fairseq

  • Steeper learning curve due to its complexity and extensive features
  • Requires more computational resources for training and inference
  • Less focused on a specific task compared to Surya's specialization in document OCR

Code Comparison

fairseq:

from fairseq.models.transformer import TransformerModel

model = TransformerModel.from_pretrained('/path/to/model', checkpoint_file='model.pt')
# translate() takes the raw sentence and handles encoding and decoding internally
translated = model.translate('Hello world')
print(translated)

Surya:

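from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

image = Image.open("path/to/image.jpg")
predictions = run_ocr([image], [["en"]], load_det_model(), load_det_processor(), load_rec_model(), load_rec_processor())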

Summary

fairseq is a more comprehensive toolkit for sequence-to-sequence tasks, offering a wide range of features and models. It has extensive documentation, but may be more complex to use and resource-intensive. Surya, on the other hand, is focused on document OCR and may be easier to use for that task, but it does not cover the range of sequence modeling tasks that fairseq supports.

README

Surya

Surya is a document OCR toolkit that does:

  • OCR in 90+ languages that benchmarks favorably vs cloud services
  • Line-level text detection in any language
  • Layout analysis (table, image, header, etc detection)
  • Reading order detection
  • Table recognition (detecting rows/columns)

It works on a range of documents (see usage and benchmarks for more details).

[Images: Detection, OCR, Layout, Reading Order, Table Recognition]

Surya is named for the Hindu sun god, who has universal vision.

Community

Discord is where we discuss future development.

Examples

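| Name | Detection | OCR | Layout | Order | Table Rec |
|---|---|---|---|---|---|
| Japanese | Image | Image | Image | Image | Image |
| Chinese | Image | Image | Image | Image | |
| Hindi | Image | Image | Image | Image | |
| Arabic | Image | Image | Image | Image | |
| Chinese + Hindi | Image | Image | Image | Image | |
| Presentation | Image | Image | Image | Image | Image |
| Scientific Paper | Image | Image | Image | Image | Image |
| Scanned Document | Image | Image | Image | Image | Image |
| New York Times | Image | Image | Image | Image | |
| Scanned Form | Image | Image | Image | Image | Image |
| Textbook | Image | Image | Image | Image | |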

Hosted API

There is a hosted API for all surya models available here:

  • Works with PDF, images, word docs, and powerpoints
  • Consistent speed, with no latency spikes
  • High reliability and uptime

Commercial usage

I want surya to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the Datalab API. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.

Installation

You'll need Python 3.10+ and PyTorch. If you're not using a Mac or a GPU machine, you may need to install the CPU version of torch first. See here for more details.

Install with:

pip install surya-ocr

Model weights will automatically download the first time you run surya.

Usage

  • Inspect the settings in surya/settings.py. You can override any settings with environment variables.
  • Your torch device will be automatically detected, but you can override this. For example, TORCH_DEVICE=cuda.

Interactive App

I've included a streamlit app that lets you interactively try Surya on images or PDF files. Run it with:

pip install streamlit
surya_gui

OCR (text recognition)

This command will write out a json file with the detected text and bboxes:

surya_ocr DATA_PATH
  • DATA_PATH can be an image, pdf, or folder of images/pdfs
  • --langs is an optional (but recommended) argument that specifies the language(s) to use for OCR. You can comma separate multiple languages. Use the language name or two-letter ISO code from here. Surya supports the 90+ languages found in surya/languages.py.
  • --lang_file if you want to use a different language for different PDFs/images, you can optionally specify languages in a file. The format is a JSON dict with the keys being filenames and the values as a list, like {"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}.
  • --images will save images of the pages and detected text lines (optional)
  • --results_dir specifies the directory to save results to instead of the default
  • --max specifies the maximum number of pages to process if you don't want to process everything
  • --start_page specifies the page number to start processing from
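The `--lang_file` format described above can be produced with a few lines of Python. This is a minimal sketch: the helper name `write_lang_file`, the example filenames, and the output path are placeholders rather than anything surya requires.

```python
import json
from pathlib import Path


def write_lang_file(langs_by_file, out_path):
    """Write a --lang_file mapping: filename -> list of language codes."""
    for name, langs in langs_by_file.items():
        if not isinstance(langs, list) or not langs:
            raise ValueError(f"{name}: expected a non-empty list of languages")
    Path(out_path).write_text(json.dumps(langs_by_file, indent=2), encoding="utf-8")
    return out_path


# Hypothetical inputs: two PDFs with different language sets.
lang_map = {"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}
```

After calling `write_lang_file(lang_map, "langs.json")`, the file could be passed as `surya_ocr DATA_PATH --lang_file langs.json`.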

The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

  • text_lines - the detected text and bounding boxes for each line
    • text - the text in the line
    • confidence - the confidence of the model in the detected text (0-1)
    • polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
    • bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  • languages - the languages specified for the page
  • page - the page number in the file
  • image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
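As a rough illustration of the structure described above, the following sketch walks a parsed results dictionary and yields the text lines above a confidence cutoff. The helper name `iter_text_lines`, the sample data, and the exact JSON representation of bboxes (lists of numbers) are assumptions made for the example; in practice you would load the `results.json` that `surya_ocr` writes.

```python
import json


def iter_text_lines(results, min_confidence=0.0):
    """Yield (filename, page, text, bbox) for each OCR line at or above a confidence cutoff."""
    for name, pages in results.items():
        for page in pages:
            for line in page["text_lines"]:
                if line["confidence"] >= min_confidence:
                    yield name, page["page"], line["text"], line["bbox"]


# Hypothetical sample shaped like the structure described above.
sample = json.loads("""
{"doc": [{"page": 1, "languages": ["en"], "image_bbox": [0, 0, 800, 600],
  "text_lines": [
    {"text": "Hello", "confidence": 0.98, "bbox": [10, 10, 120, 40],
     "polygon": [[10, 10], [120, 10], [120, 40], [10, 40]]},
    {"text": "w0rld", "confidence": 0.41, "bbox": [10, 50, 120, 80],
     "polygon": [[10, 50], [120, 50], [120, 80], [10, 80]]}
  ]}]}
""")

for name, page, text, bbox in iter_text_lines(sample, min_confidence=0.5):
    print(name, page, text, bbox)
```

Filtering on `confidence` is one simple way to drop lines the model was unsure about before further processing.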

Performance tips

Setting the RECOGNITION_BATCH_SIZE env var properly will make a big difference when using a GPU. Each batch item will use 40MB of VRAM, so very high batch sizes are possible. The default is a batch size of 512, which will use about 20GB of VRAM. Tuning the batch size may also help on CPU, depending on your core count; the default CPU batch size is 32.
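The arithmetic behind these numbers (512 items at 40MB each is roughly 20GB) can be turned into a small helper for picking a batch size on a different GPU. This is a sketch under stated assumptions: the function names, the 90% headroom, and the 1GB = 1000MB conversion are illustrative choices, and real memory use may differ from the per-item estimate.

```python
import os


def batch_size_for_vram(vram_gb, mb_per_item=40, headroom_pct=90):
    """Largest batch size that fits in `headroom_pct` percent of the available VRAM."""
    usable_mb = int(vram_gb * 1000) * headroom_pct // 100
    return max(1, usable_mb // mb_per_item)


def surya_env(vram_gb, device="cuda"):
    """Copy of the current environment with the recognition batch size and device set."""
    env = dict(os.environ)
    env["RECOGNITION_BATCH_SIZE"] = str(batch_size_for_vram(vram_gb))
    env["TORCH_DEVICE"] = device
    return env


# Example: a 24GB card with 10% of memory left free.
print(batch_size_for_vram(24))
```

The resulting dictionary could be passed as `env=` to `subprocess.run` when launching `surya_ocr`, or the same values exported in the shell.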

From python

from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

image = Image.open(IMAGE_PATH)
langs = ["en"] # Replace with your languages - optional but recommended
det_processor, det_model = load_det_processor(), load_det_model()
rec_model, rec_processor = load_rec_model(), load_rec_processor()

predictions = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)

Compilation

The following models have support for compilation. You will need to set the following environment variables to enable compilation:

  • Recognition: COMPILE_RECOGNITION=true
  • Detection: COMPILE_DETECTOR=true
  • Layout: COMPILE_LAYOUT=true
  • Table recognition: COMPILE_TABLE_REC=true

Alternatively, you can also set COMPILE_ALL=true which will compile all models.

Here are the speedups on an A10 GPU:

| Model | Time per page (s) | Compiled time per page (s) | Speedup (%) |
|---|---|---|---|
| Recognition | 0.657556 | 0.56265 | 14.43314334 |
| Detection | 0.108808 | 0.10521 | 3.306742151 |
| Layout | 0.27319 | 0.27063 | 0.93707676 |
| Table recognition | 0.0219 | 0.01938 | 11.50684932 |
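The speedup figures are consistent with the relative reduction in time per page, (baseline - compiled) / baseline. A short sketch to reproduce them from the timings above (the function name is illustrative):

```python
def speedup_percent(baseline_s, compiled_s):
    """Relative reduction in time per page, as a percentage of the baseline."""
    return (baseline_s - compiled_s) / baseline_s * 100


# (time per page, compiled time per page) in seconds, taken from the table above
timings = {
    "Recognition": (0.657556, 0.56265),
    "Detection": (0.108808, 0.10521),
    "Layout": (0.27319, 0.27063),
    "Table recognition": (0.0219, 0.01938),
}

for model, (baseline, compiled) in timings.items():
    print(f"{model}: {speedup_percent(baseline, compiled):.2f}%")
```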

Text line detection

This command will write out a json file with the detected bboxes.

surya_detect DATA_PATH
  • DATA_PATH can be an image, pdf, or folder of images/pdfs
  • --images will save images of the pages and detected text lines (optional)
  • --max specifies the maximum number of pages to process if you don't want to process everything
  • --results_dir specifies the directory to save results to instead of the default

The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

  • bboxes - detected bounding boxes for text
    • bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
    • polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
    • confidence - the confidence of the model in the detected text (0-1)
  • vertical_lines - vertical lines detected in the document
    • bbox - the axis-aligned line coordinates.
  • page - the page number in the file
  • image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.

Performance tips

Setting the DETECTOR_BATCH_SIZE env var properly will make a big difference when using a GPU. Each batch item will use 440MB of VRAM, so very high batch sizes are possible. The default is a batch size of 36, which will use about 16GB of VRAM. Tuning the batch size may also help on CPU, depending on your core count; the default CPU batch size is 6.

From python

from PIL import Image
from surya.detection import batch_text_detection
from surya.model.detection.model import load_model, load_processor

image = Image.open(IMAGE_PATH)
model, processor = load_model(), load_processor()

# predictions is a list of dicts, one per image
predictions = batch_text_detection([image], model, processor)

Layout and reading order

This command will write out a json file with the detected layout and reading order.

surya_layout DATA_PATH
  • DATA_PATH can be an image, pdf, or folder of images/pdfs
  • --images will save images of the pages and detected text lines (optional)
  • --max specifies the maximum number of pages to process if you don't want to process everything
  • --results_dir specifies the directory to save results to instead of the default

The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

  • bboxes - detected bounding boxes for text
    • bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
    • polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
    • position - the reading order of the box.
    • label - the label for the bbox. One of Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Figure, Section-header, Table, Form, Table-of-contents, Handwriting, Text, Text-inline-math.
    • top_k - the top-k other potential labels for the box. A dictionary with labels as keys and confidences as values.
  • page - the page number in the file
  • image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
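Given a page dictionary shaped as described above, the `position` field can be used to put layout blocks into reading order. The following is a minimal sketch; the helper name, the choice of labels to skip, and the sample page are assumptions for illustration.

```python
def blocks_in_reading_order(page, skip_labels=("Page-header", "Page-footer")):
    """Return (label, bbox) pairs sorted by `position`, dropping unwanted labels."""
    kept = [box for box in page["bboxes"] if box["label"] not in skip_labels]
    return [(box["label"], box["bbox"]) for box in sorted(kept, key=lambda box: box["position"])]


# Hypothetical page dictionary shaped like the structure described above.
page = {
    "page": 1,
    "image_bbox": [0, 0, 800, 1000],
    "bboxes": [
        {"label": "Text", "position": 2, "bbox": [50, 200, 750, 600]},
        {"label": "Page-header", "position": 0, "bbox": [50, 10, 750, 40]},
        {"label": "Section-header", "position": 1, "bbox": [50, 100, 750, 150]},
    ],
}

for label, bbox in blocks_in_reading_order(page):
    print(label, bbox)
```

Skipping headers and footers is a common choice when reassembling body text, but pass `skip_labels=()` to keep every block.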

Performance tips

Setting the LAYOUT_BATCH_SIZE env var properly will make a big difference when using a GPU. Each batch item will use 220MB of VRAM, so very high batch sizes are possible. The default is a batch size of 32, which will use about 7GB of VRAM. Tuning the batch size may also help on CPU, depending on your core count; the default CPU batch size is 4.

From python

from PIL import Image
from surya.detection import batch_text_detection
from surya.layout import batch_layout_detection
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.layout.model import load_model as load_layout_model
from surya.model.layout.processor import load_processor as load_layout_processor

image = Image.open(IMAGE_PATH)
model = load_layout_model()
processor = load_layout_processor()
det_model = load_det_model()
det_processor = load_det_processor()

# layout_predictions is a list of dicts, one per image
line_predictions = batch_text_detection([image], det_model, det_processor)
layout_predictions = batch_layout_detection([image], model, processor, line_predictions)

Table Recognition

This command will write out a json file with the detected table cells and row/column ids, along with row/column bounding boxes. If you want to get a formatted markdown table, check out the tabled repo.

surya_table DATA_PATH
  • DATA_PATH can be an image, pdf, or folder of images/pdfs
  • --images will save images of the pages and detected table cells + rows and columns (optional)
  • --max specifies the maximum number of pages to process if you don't want to process everything
  • --results_dir specifies the directory to save results to instead of the default
  • --detect_boxes specifies if cells should be detected. By default, they're pulled out of the PDF, but this is not always possible.
  • --skip_table_detection tells table recognition not to detect tables first. Use this if your image is already cropped to a table.

The results.json file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

  • rows - detected table rows
    • bbox - the bounding box of the table row
    • row_id - the id of the row
  • cols - detected table columns
    • bbox - the bounding box of the table column
    • col_id - the id of the column
  • cells - detected table cells
    • bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
    • text - if text could be pulled out of the pdf, the text of this cell.
  • page - the page number in the file
  • table_idx - the index of the table on the page (sorted in vertical order)
  • image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.

Performance tips

Setting the TABLE_REC_BATCH_SIZE env var properly will make a big difference when using a GPU. Each batch item will use 150MB of VRAM, so very high batch sizes are possible. The default is a batch size of 64, which will use about 10GB of VRAM. Tuning the batch size may also help on CPU, depending on your core count; the default CPU batch size is 8.

From python

See table_recognition.py for a code sample. Table recognition depends on extracting cells, so it is a little more involved to set up than other model types.

Limitations

  • This is specialized for document OCR. It will likely not work on photos or other images.
  • It is for printed text, not handwriting (though it may work on some handwriting).
  • The text detection model has trained itself to ignore advertisements.
  • You can find language support for OCR in surya/languages.py. Text detection, layout analysis, and reading order will work with any language.

Troubleshooting

If OCR isn't working properly:

  • Try increasing resolution of the image so the text is bigger. If the resolution is already very high, try decreasing it to no more than a 2048px width.
  • Preprocessing the image (binarizing, deskewing, etc) can help with very old/blurry images.
  • You can adjust DETECTOR_BLANK_THRESHOLD and DETECTOR_TEXT_THRESHOLD if you don't get good results. DETECTOR_BLANK_THRESHOLD controls the space between lines - any prediction below this number will be considered blank space. DETECTOR_TEXT_THRESHOLD controls how text is joined - any number above this is considered text. DETECTOR_TEXT_THRESHOLD should always be higher than DETECTOR_BLANK_THRESHOLD, and both should be in the 0-1 range. Looking at the heatmap from the debug output of the detector can tell you how to adjust these (if you see faint things that look like boxes, lower the thresholds, and if you see bboxes being joined together, raise the thresholds).

Manual install

If you want to develop surya, you can install it manually:

  • git clone https://github.com/VikParuchuri/surya.git
  • cd surya
  • poetry install - installs main and dev dependencies
  • poetry shell - activates the virtual environment

Benchmarks

OCR

[Figure: benchmark chart, surya vs. tesseract]

| Model | Time per page (s) | Avg similarity (⬆) |
|---|---|---|
| surya | 0.62 | 0.97 |
| tesseract | 0.45 | 0.88 |

Full language results

Tesseract is CPU-based, and surya is CPU or GPU. I tried to cost-match the resources used, so I used a 1xA6000 (48GB VRAM) for surya, and 28 CPU cores for Tesseract (same price on Lambda Labs/DigitalOcean).

Google Cloud Vision

I benchmarked OCR against Google Cloud vision since it has similar language coverage to Surya.

[Figure: benchmark chart, surya vs. Google Cloud Vision]

Full language results

Methodology

I measured normalized sentence similarity (0-1, higher is better) based on a set of real-world and synthetic pdfs. I sampled PDFs from common crawl, then filtered out the ones with bad OCR. I couldn't find PDFs for some languages, so I also generated simple synthetic PDFs for those.

I used the reference line bboxes from the PDFs with both tesseract and surya, to just evaluate the OCR quality.

For Google Cloud, I aligned the output from Google Cloud with the ground truth. I had to skip RTL languages since they didn't align well.

Text line detection

[Figure: benchmark chart]

| Model | Time (s) | Time per page (s) | Precision | Recall |
|---|---|---|---|---|
| surya | 50.2099 | 0.196133 | 0.821061 | 0.956556 |
| tesseract | 74.4546 | 0.290838 | 0.631498 | 0.997694 |

Tesseract is CPU-based, and surya is CPU or GPU. I ran the benchmarks on a system with an A10 GPU, and a 32 core CPU. This was the resource usage:

  • tesseract - 32 CPU cores, or 8 workers using 4 cores each
  • surya - 36 batch size, for 16GB VRAM usage

Methodology

Surya predicts line-level bboxes, while tesseract and others predict word-level or character-level. It's hard to find 100% correct datasets with line-level annotations. Merging bboxes can be noisy, so I chose not to use IoU as the metric for evaluation.

I instead used coverage, which calculates:

  • Precision - how well the predicted bboxes cover ground truth bboxes
  • Recall - how well ground truth bboxes cover predicted bboxes

First calculate coverage for each bbox, then add a small penalty for double coverage, since we want the detection to have non-overlapping bboxes. Anything with a coverage of 0.5 or higher is considered a match.

Then we calculate precision and recall for the whole dataset.
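To make the coverage idea concrete, here is a rough sketch on integer pixel grids. It is not the benchmark's implementation: the rasterization approach, the function names, and the size of the double-coverage penalty (0.1 per doubly covered pixel) are all assumptions chosen for illustration.

```python
def pixels(box):
    """Set of integer pixel coordinates inside an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return {(x, y) for x in range(x1, x2) for y in range(y1, y2)}


def coverage(box, others, double_penalty=0.1):
    """Fraction of `box` covered by `others`, minus a penalty for pixels covered more than once."""
    target = pixels(box)
    if not target:
        return 0.0
    counts = {}
    for other in others:
        for point in pixels(other) & target:
            counts[point] = counts.get(point, 0) + 1
    covered = len(counts)
    doubled = sum(1 for count in counts.values() if count > 1)
    return max(0.0, (covered - double_penalty * doubled) / len(target))


def match_rate(boxes, others, threshold=0.5):
    """Share of `boxes` whose coverage by `others` reaches the threshold."""
    if not boxes:
        return 0.0
    hits = sum(coverage(box, others) >= threshold for box in boxes)
    return hits / len(boxes)


# Hypothetical boxes: the prediction covers 60% of the first ground truth line only.
ground_truth = [(0, 0, 10, 10), (20, 0, 30, 10)]
predicted = [(0, 0, 10, 6)]
print(match_rate(ground_truth, predicted), match_rate(predicted, ground_truth))
```

Following the definitions above, precision would correspond to `match_rate(ground_truth, predicted)` and recall to `match_rate(predicted, ground_truth)`, each computed over the whole dataset.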

Layout analysis

| Layout Type | Precision | Recall |
|---|---|---|
| Image | 0.91265 | 0.93976 |
| List | 0.80849 | 0.86792 |
| Table | 0.84957 | 0.96104 |
| Text | 0.93019 | 0.94571 |
| Title | 0.92102 | 0.95404 |

Time per image: 0.13 seconds on GPU (A10).

Methodology

I benchmarked the layout analysis on Publaynet, which was not in the training data. I had to align publaynet labels with the surya layout labels. I was then able to find coverage for each layout type:

  • Precision - how well the predicted bboxes cover ground truth bboxes
  • Recall - how well ground truth bboxes cover predicted bboxes

Reading Order

88% mean accuracy, and 0.4 seconds per image on an A10 GPU. See methodology for notes: this benchmark is not a perfect measure of accuracy, and is more useful as a sanity check.

Methodology

I benchmarked the reading order on the layout dataset from here, which was not in the training data. Unfortunately, this dataset is fairly noisy, and not all the labels are correct. It was very hard to find a dataset annotated with reading order and also layout information. I wanted to avoid using a cloud service for the ground truth.

The accuracy is computed by finding if each pair of layout boxes is in the correct order, then taking the % that are correct.
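The pairwise accuracy described above can be sketched as follows. The function name and the input format (one reading-order position per box, in the same box order for ground truth and prediction, with distinct positions) are assumptions for the example.

```python
from itertools import combinations


def pairwise_order_accuracy(true_positions, predicted_positions):
    """Fraction of box pairs whose relative order agrees between prediction and ground truth."""
    pairs = list(combinations(range(len(true_positions)), 2))
    if not pairs:
        return 1.0
    correct = sum(
        (true_positions[i] < true_positions[j]) == (predicted_positions[i] < predicted_positions[j])
        for i, j in pairs
    )
    return correct / len(pairs)


# Four boxes; the prediction swaps the two middle ones, so 5 of 6 pairs are correct.
print(pairwise_order_accuracy([0, 1, 2, 3], [0, 2, 1, 3]))
```

A pairwise measure like this gives partial credit for mostly correct orderings, whereas exact-match scoring would count a single swap as a complete failure.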

Table Recognition

| Model | Row Intersection | Col Intersection | Time Per Image |
|---|---|---|---|
| Surya | 0.97 | 0.93 | 0.03 |
| Table transformer | 0.72 | 0.84 | 0.02 |

Higher is better for intersection, which is the percentage of the actual row/column overlapped by the predictions.

Methodology

The benchmark uses a subset of Fintabnet from IBM. It has labeled rows and columns. After table recognition is run, the predicted rows and columns are compared to the ground truth. There is an additional penalty for predicting too many or too few rows/columns.

Running your own benchmarks

You can benchmark the performance of surya on your machine.

  • Follow the manual install instructions above.
  • poetry install --group dev - installs dev dependencies

Text line detection

This will evaluate tesseract and surya for text line detection across a randomly sampled set of images from doclaynet.

python benchmark/detection.py --max 256
  • --max controls how many images to process for the benchmark
  • --debug will render images and detected bboxes
  • --pdf_path will let you specify a pdf to benchmark instead of the default data
  • --results_dir will let you specify a directory to save results to instead of the default one

Text recognition

This will evaluate surya and optionally tesseract on multilingual pdfs from common crawl (with synthetic data for missing languages).

python benchmark/recognition.py --tesseract
  • --max controls how many images to process for the benchmark
  • --debug 2 will render images with detected text
  • --results_dir will let you specify a directory to save results to instead of the default one
  • --tesseract will run the benchmark with tesseract. You have to run sudo apt-get install tesseract-ocr-all to install all tesseract data, and set TESSDATA_PREFIX to the path to the tesseract data folder.
  • Set RECOGNITION_BATCH_SIZE=864 to use the same batch size as the benchmark.
  • Set RECOGNITION_BENCH_DATASET_NAME=vikp/rec_bench_hist to use the historical document data for benchmarking. This data comes from the tapuscorpus.

Layout analysis

This will evaluate surya on the publaynet dataset.

python benchmark/layout.py
  • --max controls how many images to process for the benchmark
  • --debug will render images with detected text
  • --results_dir will let you specify a directory to save results to instead of the default one

Reading Order

python benchmark/ordering.py
  • --max controls how many images to process for the benchmark
  • --debug will render images with detected text
  • --results_dir will let you specify a directory to save results to instead of the default one

Table Recognition

python benchmark/table_recognition.py --max 1024 --tatr
  • --max controls how many images to process for the benchmark
  • --debug will render images with detected text
  • --results_dir will let you specify a directory to save results to instead of the default one
  • --tatr specifies whether to also run table transformer

Training

Text detection was trained on 4x A6000s for 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified efficientvit architecture for semantic segmentation.

Text recognition was trained on 4x A6000s for 2 weeks. It was trained using a modified donut model (GQA, MoE layer, UTF-16 decoding, layer config changes).

Thanks

This work would not have been possible without amazing open source AI work:

Thank you to everyone who makes open source AI possible.