- CUTE80 is the first curved text dataset; it consists of 80 curved text images.
- SCUT-CTW1500 is a curved text dataset that includes over 10,000 text annotations in 1,500 images.
- Total-Text has 1,555 scene images and 9,330 annotated words covering three text orientations: horizontal, multi-oriented, and curved.
- WordArt is a dataset which primarily features challenging artistic text.
- ReCTS is a large-scale dataset of 25,000 images that mainly focuses on reading Chinese text on signboards.
- MLT19 is a real-world dataset for Multi-Lingual scene Text (MLT) detection and recognition, consisting of 20,000 images with text in 10 languages.
- For word-level text recognition
What is the scene text in the image?
- For end-to-end text recognition
What are all the scene text in the image? Do not translate.
- For ReCTS in Chinese
图片中的场景文字是什么? (English: "What is the scene text in the image?")
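As an illustration only, the prompt choice listed above could be organized as a small lookup. The names `PROMPTS` and `select_prompt` and the task keys `"word"` and `"end_to_end"` are hypothetical and are not taken from the evaluation code; only the prompt strings come from the list above. The sketch assumes that ReCTS images always receive the Chinese prompt.

```python
# Prompt strings copied from the list above; the dictionary keys are hypothetical.
PROMPTS = {
    "word": "What is the scene text in the image?",
    "end_to_end": "What are all the scene text in the image? Do not translate.",
    "rects": "图片中的场景文字是什么?",
}


def select_prompt(task: str, dataset: str = "") -> str:
    """Return the prompt for a task, using the Chinese prompt for ReCTS.

    Assumption: ReCTS overrides the task-level prompt regardless of `task`.
    """
    if dataset.lower() == "rects":
        return PROMPTS["rects"]
    if task not in ("word", "end_to_end"):
        raise ValueError(f"unknown task: {task!r}")
    return PROMPTS[task]


if __name__ == "__main__":
    print(select_prompt("word", "CUTE80"))
    print(select_prompt("end_to_end", "MLT19"))
    print(select_prompt("word", "ReCTS"))
```

Keeping the prompts in one table makes it easy to confirm that every image of a given benchmark is queried with identical wording.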
Results of word-level scene text recognition.

| Method | CUTE80 | SCUT-CTW1500 | Total-Text | WordArt | ReCTS |
|---|---|---|---|---|---|
| GPT-4V | 88.0% | 62.0% | 66.0% | 62.0% | 0 |
| Supervised-SOTA | 98.6% | 87.0% | 90.1% | 68.2% | 94.0% |
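Results of MLT19.

| Method | Language | Precision ↑ | Recall ↑ | F1 ↑ |
|---|---|---|---|---|
| GPT-4V | Arabic | 16.44% | 16.67% | 16.55% |
| GPT-4V | English | 86.57% | 78.77% | 82.49% |
| GPT-4V | French | 83.0% | 83.84% | 83.42% |
| GPT-4V | Chinese | 1.2% | 1.56% | 1.36% |
| GPT-4V | German | 73.65% | 86.29% | 79.47% |
| GPT-4V | Korean | 10.83% | 12.39% | 11.56% |
| GPT-4V | Japanese | 11.9% | 11.9% | 11.9% |
| GPT-4V | Italian | 62.7% | 67.52% | 65.02% |
| GPT-4V | Bangla | 2.53% | 2.63% | 2.58% |
| GPT-4V | Hindi | 7.29% | 8.33% | 7.78% |
| GPT-4V | All language | 43.04% | 45.42% | 44.2% |
| Supervised-SOTA | All language | 74.16% | 52.91% | 61.76% |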
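The F1 column is the harmonic mean of precision and recall. The following minimal sketch recomputes F1 for two rows of the MLT19 results as a sanity check. The function name `f1_score` and the `rows` dictionary are illustrative; how precision and recall themselves were obtained is not covered here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both values are zero
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs (in percent) copied from the GPT-4V rows of the MLT19 results.
rows = {
    "English": (86.57, 78.77),
    "All language": (43.04, 45.42),
}

if __name__ == "__main__":
    for name, (p, r) in rows.items():
        print(f"{name}: F1 = {f1_score(p, r):.2f}%")
```

For these two rows the recomputed values round to the reported 82.49% and 44.2%.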
Impact of image resolution on recognition performance on the MLT19 English subset.

| Image size | Precision ↑ | Recall ↑ | F1 ↑ |
|---|---|---|---|
| 128 | 47.10% | 58.88% | 52.34% |
| 256 | 74.64% | 86.67% | 80.21% |
| 512 | 86.23% | 83.69% | 84.94% |
| 1024 | 90.58% | 85.14% | 87.78% |
| 2048 | 92.75% | 89.12% | 89.46% |
Illustration of word-level scene text recognition. In the answers of GPT-4V, we highlight the characters that match the ground truth (GT) in green and the characters that do not match in red. GPT-4V can recognize curved, slanted, and artistic English text, while common-style Chinese text cannot be recognized.