Scene text recognition

Dataset

CUTE80 is the first curved text dataset and consists of 80 curved text images.
SCUT-CTW1500 is a curved text dataset that includes over 10,000 text annotations in 1,500 images.
Total-Text has 1,555 scene images and 9,330 annotated words, covering three text orientations: horizontal, multi-oriented, and curved.
WordArt is a dataset that primarily features challenging artistic text.
ReCTS is a large-scale dataset of 25,000 images that mainly focuses on reading Chinese text on signboards.

MLT19 is a real-world dataset for Multi-Lingual scene Text (MLT) detection and recognition, consisting of 20,000 images that contain text in 10 languages.

For end-to-end text recognition

What are all the scene text in the image? Do not translate.

Results on MLT19.

Impact of image resolution on recognition performance on the MLT19 English subset.

Image size	Precision ↑	Recall ↑	F1 ↑
128	47.10%	58.88%	52.34%
256	74.64%	86.67%	80.21%
512	86.23%	83.69%	84.94%
1024	90.58%	85.14%	87.78%
2048	92.75%	89.12%	89.46%
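
As a rough illustration of how the precision, recall, and F1 columns of the image-resolution table relate, the sketch below scores a list of predicted words against a list of ground-truth words. The matching protocol is not stated in this section, so the exact, case-sensitive, one-to-one matching used here is an assumption, and the names `f1_from` and `precision_recall_f1` as well as the sample words are hypothetical.

```python
from collections import Counter


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(predicted_words, gt_words):
    """Score predicted words against ground-truth words for one image.

    Assumption: a prediction counts as correct only on an exact,
    case-sensitive string match, and each ground-truth word can be
    matched at most once (multiset intersection).
    """
    pred_counts = Counter(predicted_words)
    gt_counts = Counter(gt_words)
    # Counter & Counter keeps the minimum count of each shared word.
    matched = sum((pred_counts & gt_counts).values())
    precision = matched / len(predicted_words) if predicted_words else 0.0
    recall = matched / len(gt_words) if gt_words else 0.0
    return precision, recall, f1_from(precision, recall)


if __name__ == "__main__":
    # Hypothetical model output and annotation for a single image.
    predicted = ["STOP", "Cafe", "Cafe"]
    ground_truth = ["STOP", "Cafe", "OPEN", "24"]
    print(precision_recall_f1(predicted, ground_truth))
    # Precision and recall taken from the 128-pixel row of the table.
    print(round(f1_from(47.10, 58.88), 2))
```

Applied to the 128-pixel row, `f1_from(47.10, 58.88)` evaluates to about 52.34, in line with the reported F1 value.
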
Illustration of word-level scene text recognition. In the answers of GPT-4V, we highlight the characters that match the ground truth (GT) in green and the characters that do not match in red. GPT-4V can recognize curved, slanted, and artistic English text, while common-style Chinese text cannot be recognized.

Method	CUTE80	SCUT-CTW1500	Total-Text	WordArt	ReCTS
GPT-4V	88.0%	62.0%	66.0%	62.0%	0
Supervised-SOTA	98.6%	87.0%	90.1%	68.2%	94.0%
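
The percentages in the word-level table can be read as the share of cropped word images whose prediction matches the ground truth. This section does not spell out the metric, so the sketch below assumes exact string matching after trimming whitespace; `word_accuracy`, its `ignore_case` option, and the sample strings are hypothetical and only meant to make the computation concrete.

```python
def word_accuracy(predictions, ground_truths, ignore_case=False):
    """Fraction of predictions that exactly match their ground-truth word.

    Assumption: one prediction per word image, compared as whole strings
    after stripping surrounding whitespace.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not ground_truths:
        return 0.0

    def normalize(text):
        text = text.strip()
        return text.casefold() if ignore_case else text

    correct = sum(
        normalize(pred) == normalize(gt)
        for pred, gt in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)


if __name__ == "__main__":
    # Hypothetical predictions for four word crops.
    preds = ["COFFEE", "Starbuck", "open ", "SALE"]
    gts = ["COFFEE", "Starbucks", "OPEN", "SALE"]
    print(f"{word_accuracy(preds, gts):.1%}")
    print(f"{word_accuracy(preds, gts, ignore_case=True):.1%}")
```

Whether case and punctuation are normalized before comparison can shift such scores noticeably, which is why the option is exposed rather than fixed.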
