clovaai / deep-text-recognition-benchmark

Text recognition (optical character recognition) with deep learning methods, ICCV 2019


Top Related Projects

  • CRAFT-pytorch: Official implementation of Character Region Awareness for Text Detection (CRAFT)
  • PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
  • EasyOCR: Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
  • DETR: End-to-End Object Detection with Transformers

Quick Overview

The deep-text-recognition-benchmark repository is a comprehensive framework for scene text recognition (STR) research. It provides implementations of various STR models, datasets, and evaluation metrics, allowing researchers to easily compare and benchmark different approaches in the field of text recognition from images.

Pros

  • Offers a unified framework for multiple STR models, making it easier to compare different approaches
  • Includes a wide range of datasets and evaluation metrics for thorough benchmarking
  • Provides detailed documentation and instructions for easy setup and usage
  • Supports both training and evaluation of STR models

Cons

  • Requires significant computational resources for training and evaluating models
  • May have a steep learning curve for newcomers to the field of scene text recognition
  • Some datasets used in the benchmark may require separate licenses or permissions
  • Limited to Python and PyTorch, which may not suit all researchers' preferences

Code Examples

  1. Loading a pre-trained model and performing inference (a sketch: opt is the parsed-options namespace used throughout the repository, and image loading/preprocessing is only indicated):

import torch
from model import Model

model = Model(opt)  # opt selects the Transformation/FeatureExtraction/SequenceModeling/Prediction modules
model.load_state_dict(torch.load('path/to/pretrained_model.pth'))
model.eval()

image = load_image('path/to/image.jpg')  # placeholder: load the image and convert it to a normalized tensor
pred = model(image, text_for_pred)       # text_for_pred: dummy target sequence required by the Attn decoder

  2. Training a model on a custom dataset (train.py normally builds opt from its own argparse options):

from train import train

opt = get_args()  # placeholder for parsing the command-line arguments train.py expects
train(opt)

  3. Evaluating a model on a benchmark dataset:

from test import test

opt = get_args()  # placeholder for parsing the command-line arguments test.py expects
test(opt)
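
All three examples assume opt, the options namespace that train.py, test.py, and demo.py build from their command-line arguments. The sketch below is a hypothetical minimal namespace showing only the module-selection options listed in the Arguments section further down; each script defines the full option set via argparse.

from argparse import Namespace

# Hypothetical minimal namespace; Model(opt) needs many more fields
# (image size, character set, hidden sizes, ...) before it can actually be built.
opt = Namespace(
    Transformation="TPS",        # [None | TPS]
    FeatureExtraction="ResNet",  # [VGG | RCNN | ResNet]
    SequenceModeling="BiLSTM",   # [None | BiLSTM]
    Prediction="Attn",           # [CTC | Attn]
    saved_model="TPS-ResNet-BiLSTM-Attn.pth",
)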

Getting Started

  1. Clone the repository:

    git clone https://github.com/clovaai/deep-text-recognition-benchmark.git
    cd deep-text-recognition-benchmark
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Download the lmdb datasets and place them under the data_lmdb_release/ directory (as referenced by the commands below).

  4. Train a model:

    python train.py --train_data data_lmdb_release/training --valid_data data_lmdb_release/validation
    
  5. Evaluate a model:

    python test.py --eval_data data_lmdb_release/evaluation --saved_model saved_models/best_accuracy.pth
    

Competitor Comparisons

CRAFT-pytorch: Official implementation of Character Region Awareness for Text Detection (CRAFT)

Pros of CRAFT-pytorch

  • Focuses specifically on text detection, providing a more specialized solution
  • Implements a character-level approach, potentially offering better accuracy for complex layouts
  • Includes pre-trained models for immediate use

Cons of CRAFT-pytorch

  • Limited to text detection only, requiring additional steps for full OCR pipeline
  • May have higher computational requirements due to character-level processing

Code Comparison

CRAFT-pytorch:

from craft_text_detector import Craft

craft = Craft(output_dir=output_dir, crop_type="poly", cuda=cuda)
prediction_result = craft.detect_text(image_path)

deep-text-recognition-benchmark:

from model import Model

model = Model(opt)  # opt selects the four STR stages (Trans/Feat/Seq/Pred)
preds = model(image_tensors, text_for_pred)  # text_for_pred: dummy target sequence for the Attn decoder

Summary

CRAFT-pytorch excels in text detection with its character-level approach, while deep-text-recognition-benchmark offers a more comprehensive solution for both detection and recognition. CRAFT-pytorch may be preferred for specialized text detection tasks, whereas deep-text-recognition-benchmark provides a more end-to-end solution for OCR applications.
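
As a rough illustration of how the two projects can be combined, the sketch below detects text regions with the craft_text_detector package and hands the crops to a hypothetical recognize_crops helper standing in for this repository's demo.py-style inference. Treat it as an assumption-laden outline rather than either project's official pipeline.

from PIL import Image
from craft_text_detector import Craft

def detect_and_recognize(image_path, recognize_crops, output_dir="craft_out/"):
    # Detection: CRAFT returns word/line regions for the input image.
    craft = Craft(output_dir=output_dir, crop_type="box", cuda=False)
    result = craft.detect_text(image_path)

    # Crop each detected box so a recognizer can read it.
    image = Image.open(image_path).convert("RGB")
    crops = []
    for box in result["boxes"]:  # each box is a list of 4 corner points
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        crops.append(image.crop((min(xs), min(ys), max(xs), max(ys))))

    # Recognition: recognize_crops is a hypothetical helper wrapping Model + demo-style preprocessing.
    return recognize_crops(crops)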


PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

Pros of PaddleOCR

  • Comprehensive end-to-end OCR system with detection, recognition, and layout analysis
  • Supports multiple languages and offers pre-trained models for various scenarios
  • Provides tools for data annotation, model training, and deployment

Cons of PaddleOCR

  • Steeper learning curve due to its extensive features and components
  • Requires familiarity with the PaddlePaddle framework
  • May be overkill for simple text recognition tasks

Code Comparison

Deep-text-recognition-benchmark:

model = Model(opt)  # e.g. the TPS-ResNet-BiLSTM-Attn configuration
converter = AttnLabelConverter(opt.character)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0).to(device)  # used with the Attn prediction head

PaddleOCR:

model = build_model(config['Architecture'])
loss_class = build_loss(config['Loss'])
optimizer = build_optimizer(config['Optimizer'], model)

Summary

Deep-text-recognition-benchmark focuses specifically on text recognition benchmarking, offering a streamlined approach for comparing various recognition models. It's ideal for researchers and developers looking to evaluate and improve text recognition algorithms.

PaddleOCR, on the other hand, provides a full-fledged OCR toolkit with a wider range of features, including text detection and layout analysis. It's more suitable for production-ready OCR applications but requires more time to master its extensive capabilities.

Choose Deep-text-recognition-benchmark for focused text recognition research, and PaddleOCR for comprehensive OCR solutions in real-world applications.
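
For a sense of that trade-off, here is a minimal PaddleOCR usage sketch to contrast with the module-level training code above. It assumes the paddleocr pip package is installed; the exact nesting of the result varies slightly between PaddleOCR versions.

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")  # downloads pretrained detection/recognition models on first use
result = ocr.ocr("image.jpg", cls=True)
for line in result[0]:  # each line: [bounding box, (text, confidence)]; some versions return the lines directly
    box, (text, confidence) = line
    print(text, confidence)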


EasyOCR: Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

Pros of EasyOCR

  • User-friendly API with simple installation and usage
  • Supports 80+ languages out of the box
  • Provides pre-trained models for quick deployment

Cons of EasyOCR

  • Less flexibility for customization and fine-tuning
  • May have lower accuracy on specific datasets or complex scenarios
  • Limited options for model architecture experimentation

Code Comparison

EasyOCR:

import easyocr
reader = easyocr.Reader(['en'])
result = reader.readtext('image.jpg')

deep-text-recognition-benchmark:

from model import Model
model = Model(opt)
model.load_state_dict(torch.load(opt.saved_model))
preds = model(image_tensors, text_for_pred, is_train=False)  # text_for_pred: dummy target sequence for the Attn decoder

EasyOCR offers a more straightforward API, while deep-text-recognition-benchmark provides greater control over the model and training process. The latter is better suited for researchers and developers who need to customize the OCR pipeline, while EasyOCR is ideal for quick integration and out-of-the-box functionality across multiple languages.


DETR: End-to-End Object Detection with Transformers

Pros of DETR

  • More versatile: Can handle multiple object detection tasks beyond text recognition
  • End-to-end architecture: Simplifies the pipeline by eliminating hand-crafted components
  • Transformer-based approach: Potentially better at capturing long-range dependencies

Cons of DETR

  • Higher computational requirements: Transformer architecture can be more resource-intensive
  • Less specialized: May not perform as well on specific text recognition tasks
  • Steeper learning curve: More complex architecture may require more time to understand and implement

Code Comparison

DETR:

class DETR(nn.Module):
    def __init__(self, num_classes, hidden_dim, nheads, num_encoder_layers, num_decoder_layers):
        super().__init__()
        self.transformer = Transformer(hidden_dim, nheads, num_encoder_layers, num_decoder_layers)

Deep Text Recognition Benchmark:

class Model(nn.Module):
    def __init__(self, opt):
        super(Model, self).__init__()
        self.stages = {'Trans': opt.Transformation, 'Feat': opt.FeatureExtraction,
                       'Seq': opt.SequenceModeling, 'Pred': opt.Prediction}


README

What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis

| paper | training and evaluation data | failure cases and cleansed label | pretrained model | Baidu ver(passwd:rryk) |

Official PyTorch implementation of our four-stage STR framework, into which most existing STR models fit.
Using this framework makes it possible to analyze the module-wise contributions to performance in terms of accuracy, speed, and memory demand, under one consistent set of training and evaluation datasets.
Such analyses remove the obstacles that current comparisons face in understanding the performance gains of the existing modules.
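
As a concrete illustration (a sketch, not code from the repository), the stage choices used by the training commands later in this README compose into familiar models; model.py wires the selected modules together at runtime.

# Mapping of well-known STR models onto the four-stage framework, using the module
# names accepted by --Transformation / --FeatureExtraction / --SequenceModeling / --Prediction.
FOUR_STAGE_CONFIGS = {
    "CRNN": dict(Transformation="None", FeatureExtraction="VGG",
                 SequenceModeling="BiLSTM", Prediction="CTC"),
    "TRBA": dict(Transformation="TPS", FeatureExtraction="ResNet",
                 SequenceModeling="BiLSTM", Prediction="Attn"),  # best-accuracy model in the paper
}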

Honors

Based on this framework, we took 1st place in the ICDAR2013 Focused Scene Text and ICDAR2019 ArT challenges, and 3rd place in ICDAR2017 COCO-Text and ICDAR2019 ReCTS (task 1).
The differences between our paper and the ICDAR challenge entries are summarized here.

Updates

Aug 3, 2020: added guideline to use Baidu warpctc which reproduces CTC results of our paper.
Dec 27, 2019: added FLOPS in our paper, and minor updates such as log_dataset.txt and ICDAR2019-NormalizedED.
Oct 22, 2019: added confidence score, and arranged the output form of training logs.
Jul 31, 2019: The paper was accepted at the International Conference on Computer Vision (ICCV), Seoul 2019, as an oral presentation.
Jul 25, 2019: added code for floating-point 16 (fp16) calculation; check @YacobBY's pull request
Jul 16, 2019: added ST_spe.zip dataset, word images contain special characters in SynthText (ST) dataset, see this issue
Jun 24, 2019: added gt.txt of failure cases that contains path and label of each image, see image_release_190624.zip
May 17, 2019: also uploaded resources to Baidu Netdisk, and added Run demo (check @sharavsambuu's Colab demo as well)
May 9, 2019: updated the PyTorch version from 1.0.1 to 1.1.0, switched to torch.nn.CTCLoss instead of torch-baidu-ctc, and made various minor updates.

Getting Started

Dependency

  • This work was tested with PyTorch 1.3.1, CUDA 10.1, Python 3.6 and Ubuntu 16.04.
    You may need pip3 install torch==1.3.1.
    In the paper, experiments were performed with PyTorch 0.4.1 and CUDA 9.0.
  • requirements : lmdb, pillow, torchvision, nltk, natsort
pip3 install lmdb pillow torchvision nltk natsort

Download the lmdb datasets for training and evaluation from here

data_lmdb_release.zip contains the following:
training datasets: MJSynth (MJ)[1] and SynthText (ST)[2]
validation datasets: the union of the training sets of IC13[3], IC15[4], IIIT[5], and SVT[6].
evaluation datasets: benchmark evaluation datasets, consisting of IIIT[5], SVT[6], IC03[7], IC13[3], IC15[4], SVTP[8], and CUTE[9].
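
To peek inside one of these lmdb databases, a hedged sketch is shown below. It assumes the key layout used by this repository's LmdbDataset ('num-samples', 'image-%09d', 'label-%09d', 1-indexed) and an example sub-path inside data_lmdb_release; dataset.py remains the authoritative reference.

import io
import lmdb
from PIL import Image

# Open one of the released databases read-only (the path is an assumed example).
env = lmdb.open("data_lmdb_release/training/MJ/MJ_train",
                readonly=True, lock=False, readahead=False, meminit=False)
with env.begin(write=False) as txn:
    n_samples = int(txn.get("num-samples".encode()))
    index = 1  # samples are 1-indexed
    label = txn.get(f"label-{index:09d}".encode()).decode("utf-8")
    image = Image.open(io.BytesIO(txn.get(f"image-{index:09d}".encode()))).convert("RGB")
print(n_samples, label, image.size)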

Run demo with pretrained model

  1. Download pretrained model from here
  2. Add image files to test into demo_image/
  3. Run demo.py (add --sensitive option if you use case-sensitive model)
CUDA_VISIBLE_DEVICES=0 python3 demo.py \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn \
--image_folder demo_image/ \
--saved_model TPS-ResNet-BiLSTM-Attn.pth

prediction results

| demo images | TRBA (TPS-ResNet-BiLSTM-Attn) | TRBA (case-sensitive version) |
| (image) | available | Available |
| (image) | shakeshack | SHARESHACK |
| (image) | london | Londen |
| (image) | greenstead | Greenstead |
| (image) | toast | TOAST |
| (image) | merry | MERRY |
| (image) | underground | underground |
| (image) | ronaldo | RONALDO |
| (image) | bally | BALLY |
| (image) | university | UNIVERSITY |

Training and evaluation

  1. Train CRNN[10] model
CUDA_VISIBLE_DEVICES=0 python3 train.py \
--train_data data_lmdb_release/training --valid_data data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC
  2. Test the CRNN[10] model. If you want to evaluate IC15-2077, check the data filtering part.
CUDA_VISIBLE_DEVICES=0 python3 test.py \
--eval_data data_lmdb_release/evaluation --benchmark_all_eval \
--Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC \
--saved_model saved_models/None-VGG-BiLSTM-CTC-Seed1111/best_accuracy.pth
  3. Try training and testing our best-accuracy model, TRBA (TPS-ResNet-BiLSTM-Attn), as well (download the pretrained model).
CUDA_VISIBLE_DEVICES=0 python3 train.py \
--train_data data_lmdb_release/training --valid_data data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
CUDA_VISIBLE_DEVICES=0 python3 test.py \
--eval_data data_lmdb_release/evaluation --benchmark_all_eval \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn \
--saved_model saved_models/TPS-ResNet-BiLSTM-Attn-Seed1111/best_accuracy.pth

Arguments

  • --train_data: folder path to training lmdb dataset.
  • --valid_data: folder path to validation lmdb dataset.
  • --eval_data: folder path to evaluation (with test.py) lmdb dataset.
  • --select_data: select training data. Default is MJ-ST, which means MJ and ST are used as training data.
  • --batch_ratio: assign the ratio for each selected dataset within a batch. Default is 0.5-0.5, which means 50% of the batch is filled with MJ and the other 50% is filled with ST.
  • --data_filtering_off: skip data filtering when creating LmdbDataset.
  • --Transformation: select Transformation module [None | TPS].
  • --FeatureExtraction: select FeatureExtraction module [VGG | RCNN | ResNet].
  • --SequenceModeling: select SequenceModeling module [None | BiLSTM].
  • --Prediction: select Prediction module [CTC | Attn].
  • --saved_model: assign saved model to evaluation.
  • --benchmark_all_eval: evaluate with the 10 evaluation dataset versions, the same as Table 1 in our paper.

Download failure cases and cleansed labels from here

image_release.zip contains failure case images and benchmark evaluation images with cleansed labels.

When you need to train on your own dataset or on non-Latin language datasets.

  1. Create your own lmdb dataset.
pip3 install fire
python3 create_lmdb_dataset.py --inputPath data/ --gtFile data/gt.txt --outputPath result/

The structure of the data folder is as follows:

data
├── gt.txt
└── test
    ├── word_1.png
    ├── word_2.png
    ├── word_3.png
    └── ...

Each line of gt.txt should have the form {imagepath}\t{label}\n (a small helper sketch for generating this file appears after this list).
For example:

test/word_1.png Tiredness
test/word_2.png kills
test/word_3.png A
...
  2. Modify --select_data, --batch_ratio, and opt.character; see this issue.
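
The helper below is a minimal sketch (not part of the repository) for writing gt.txt in the {imagepath}\t{label}\n format expected by create_lmdb_dataset.py; the labels dictionary is a hypothetical stand-in for however your annotations are stored.

import os

def write_gt_file(data_dir, labels, gt_name="gt.txt"):
    # labels: mapping from image path (relative to data_dir) to its transcription.
    with open(os.path.join(data_dir, gt_name), "w", encoding="utf-8") as f:
        for image_path, label in labels.items():
            f.write(f"{image_path}\t{label}\n")

write_gt_file("data/", {"test/word_1.png": "Tiredness",
                        "test/word_2.png": "kills",
                        "test/word_3.png": "A"})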

Acknowledgements

This implementation is based on the repositories crnn.pytorch and ocr_attention.

Reference

[1] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In Workshop on Deep Learning, NIPS, 2014.
[2] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016.
[3] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras. ICDAR 2013 robust reading competition. In ICDAR, pages 1484–1493, 2013.
[4] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015.
[5] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
[6] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, pages 1457–1464, 2011.
[7] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. ICDAR 2003 robust reading competitions. In ICDAR, pages 682–687, 2003.
[8] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, pages 569–576, 2013.
[9] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. In ESWA, volume 41, pages 8027–8048, 2014.
[10] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. In TPAMI, volume 39, pages 2298–2304, 2017.


Citation

Please consider citing this work in your publications if it helps your research.

@inproceedings{baek2019STRcomparisons,
  title={What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis},
  author={Baek, Jeonghun and Kim, Geewook and Lee, Junyeop and Park, Sungrae and Han, Dongyoon and Yun, Sangdoo and Oh, Seong Joon and Lee, Hwalsuk},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year={2019},
  pubstate={published},
  tppubtype={inproceedings}
}

Contact

Feel free to contact us if there are any questions:
for code/paper, Jeonghun Baek (ku21fang@gmail.com); for collaboration, hwalsuk.lee@navercorp.com (our team leader).

License

Copyright (c) 2019-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.