Scene text recognition

Dataset

Word-level text recognition

  • CUTE80 is the first curved-text dataset; it consists of 80 curved-text images.
  • SCUT-CTW1500 is a curved-text dataset that includes over 10,000 text annotations in 1,500 images.
  • Total-Text has 1,555 scene images and 9,330 annotated words in 3 text orientations: horizontal, multi-oriented, and curved.
  • WordArt is a dataset that primarily features challenging artistic text.
  • ReCTS is a large-scale dataset of 25,000 images that mainly focuses on reading Chinese text on signboards.

End-to-end text recognition

  • MLT19 is a real dataset for Multi-Lingual scene Text (MLT) detection and recognition, which consists of 20,000 images containing text from 10 languages.

Prompt

  • For word-level text recognition
    What is the scene text in the image?
    
  • For end-to-end text recognition
    What are all the scene text in the image? Do not translate.
    
  • For ReCTS, in Chinese (in English: "What is the scene text in the image?")
    图片中的场景文字是什么?
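
The mapping from task to prompt listed above can be sketched as a small lookup. This is only an illustration: the function name `select_prompt` and the task keys are hypothetical, and only the three prompt strings are taken from the list above.

```python
# Hypothetical helper: the names and keys below are illustrative assumptions.
# Only the three prompt strings come from the prompt list in this document.
PROMPTS = {
    "word": "What is the scene text in the image?",
    "end_to_end": "What are all the scene text in the image? Do not translate.",
    "word_zh": "图片中的场景文字是什么?",
}


def select_prompt(task: str, dataset: str = "") -> str:
    """Return the prompt for a task; word-level ReCTS uses the Chinese prompt."""
    if task not in ("word", "end_to_end"):
        raise ValueError(f"unknown task: {task!r}")
    if task == "word" and dataset == "ReCTS":
        return PROMPTS["word_zh"]
    return PROMPTS[task]


if __name__ == "__main__":
    print(select_prompt("word", "CUTE80"))
    print(select_prompt("word", "ReCTS"))
    print(select_prompt("end_to_end", "MLT19"))
```

Keeping the prompts in one table makes it easy to check that every dataset is queried with exactly the wording reported here.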
    

Results

  • Results of word-level scene text recognition.

    | Method | CUTE80 | SCUT-CTW1500 | Total-Text | WordArt | ReCTS |
    | --- | --- | --- | --- | --- | --- |
    | GPT-4V | 88.0% | 62.0% | 66.0% | 62.0% | 0% |
    | Supervised-SOTA | 98.6% | 87.0% | 90.1% | 68.2% | 94.0% |
  • Results of MLT19.

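    | Method | Language | Precision ↑ | Recall ↑ | F1 ↑ |
    | --- | --- | --- | --- | --- |
    | GPT-4V | Arabic | 16.44% | 16.67% | 16.55% |
    | GPT-4V | English | 86.57% | 78.77% | 82.49% |
    | GPT-4V | French | 83.0% | 83.84% | 83.42% |
    | GPT-4V | Chinese | 1.2% | 1.56% | 1.36% |
    | GPT-4V | German | 73.65% | 86.29% | 79.47% |
    | GPT-4V | Korean | 10.83% | 12.39% | 11.56% |
    | GPT-4V | Japanese | 11.9% | 11.9% | 11.9% |
    | GPT-4V | Italian | 62.7% | 67.52% | 65.02% |
    | GPT-4V | Bangla | 2.53% | 2.63% | 2.58% |
    | GPT-4V | Hindi | 7.29% | 8.33% | 7.78% |
    | GPT-4V | All languages | 43.04% | 45.42% | 44.2% |
    | Supervised-SOTA | All languages | 74.16% | 52.91% | 61.76% |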
  • Impact of image resolution on recognition performance on the MLT19 English subset.

    | Image size | Precision ↑ | Recall ↑ | F1 ↑ |
    | --- | --- | --- | --- |
    | 128 | 47.10% | 58.88% | 52.34% |
    | 256 | 74.64% | 86.67% | 80.21% |
    | 512 | 86.23% | 83.69% | 84.94% |
    | 1024 | 90.58% | 85.14% | 87.78% |
    | 2048 | 92.75% | 89.12% | 89.46% |
  • Illustration of word-level scene text recognition. In GPT-4V's answers, characters that match the ground truth (GT) are highlighted in green and characters that do not match are highlighted in red. GPT-4V can recognize curved, slanted, and artistic English text, while common-style Chinese text cannot be recognized.
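
The F1 column in the MLT19 results appears to be the harmonic mean of precision and recall. The page does not state the formula, so this is an assumption, but it reproduces the reported values for the rows checked here. A minimal sketch, with the hypothetical helper name `f1_score`:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent).

    Assumption: this is the standard F1 definition; the source page
    does not spell out how its F1 column was computed.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (precision, recall) pairs copied from the GPT-4V rows of the MLT19 table.
    rows = {
        "English": (86.57, 78.77),
        "All languages": (43.04, 45.42),
    }
    for language, (p, r) in rows.items():
        print(f"{language}: F1 = {f1_score(p, r):.2f}%")
```

Because the harmonic mean is pulled toward the smaller of the two values, a method with high precision but low recall (such as the Supervised-SOTA row, 74.16% and 52.91%) ends up with an F1 closer to its recall.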