Models

All models are available for non-commercial research purposes only.

Overview

Models are provided for the following speech recognition datasets: LRS2, LRS3, CMLR, CMU-MOSEAS, GRID, Lombard GRID, and TCD-TIMIT.

Video-to-Text List

  • For the CMU-MOSEAS dataset, the video-to-text list can be found in the ${lipreading_root}/labels/${dataset}/${language_code} folder.

  • For single-language datasets, the video-to-text list can be found in the ${lipreading_root}/labels/${dataset} folder.
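
The two folder layouts above can be sketched as a small path helper. This is only an illustration: the function name `labels_dir` and the example values (`CMUMOSEAS`, `es`, `LRS2`) are assumptions, not names confirmed by this repository; substitute the actual dataset folder names and language codes found under `labels`.

```python
from pathlib import Path
from typing import Optional


def labels_dir(lipreading_root: str, dataset: str,
               language_code: Optional[str] = None) -> Path:
    """Return the folder expected to hold the video-to-text list.

    Hypothetical helper: it only mirrors the two path templates
    described above and does not check that the folder exists.
    """
    base = Path(lipreading_root) / "labels" / dataset
    # Multilingual datasets (CMU-MOSEAS) add one subfolder per language.
    return base / language_code if language_code else base


# Single-language dataset: ${lipreading_root}/labels/${dataset}
print(labels_dir("/data/lipreading", "LRS2").as_posix())
# Multilingual dataset: ${lipreading_root}/labels/${dataset}/${language_code}
print(labels_dir("/data/lipreading", "CMUMOSEAS", "es").as_posix())
```

Keeping the language code optional lets one helper cover both cases without a dataset-specific branch.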

Details

Lip Reading Sentences 2 (LRS2) [1]

| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | | | |
| - | 26.1 | GoogleDrive or BaiduDrive(key: 48l1) | 186 |
| Language Models | | | |
| - | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive(key: 53rc) | 9358 |

Lip Reading Sentences 3 (LRS3) [2]

| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | | | |
| - | 32.3 | GoogleDrive or BaiduDrive(key: 1b1s) | 186 |
| Language Models | | | |
| - | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive(key: mi3c) | 18577 |

Chinese Mandarin Lip Reading (CMLR) [3]

| Components | CER | url | size (MB) |
|---|---|---|---|
| Visual-only | | | |
| - | 8.0 | GoogleDrive or BaiduDrive(key: 7eq1) | 195 |
| Language Models | | | |
| - | - | GoogleDrive or BaiduDrive(key: k8iv) | 187 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive(key: 1ret) | 3721 |

CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS) [4]

| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | | | |
| Spanish | 44.5 | GoogleDrive or BaiduDrive(key: m35h) | 186 |
| Portuguese | 51.4 | GoogleDrive or BaiduDrive(key: wk2h) | 186 |
| French | 58.6 | GoogleDrive or BaiduDrive(key: t1hf) | 186 |
| Language Models | | | |
| Spanish | - | GoogleDrive or BaiduDrive(key: 0mii) | 180 |
| Portuguese | - | GoogleDrive or BaiduDrive(key: l6ag) | 179 |
| French | - | GoogleDrive or BaiduDrive(key: 6tan) | 179 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive(key: vsic) | 3040 |

GRID [5]

| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | | | |
| Overlapped | 1.2 | GoogleDrive or BaiduDrive(key: d8d2) | 186 |
| Unseen | 4.8 | GoogleDrive or BaiduDrive(key: ttsh) | 186 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive(key: 16l9) | 1141 |

Pass .mpg to the --video-ext argument to match the video file extension used by the GRID dataset.

Lombard GRID [6]

| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | | | |
| Unseen (Front Plain) | 4.9 | GoogleDrive or BaiduDrive(key: 38ds) | 186 |
| Unseen (Side Plain) | 8.0 | GoogleDrive or BaiduDrive(key: k6m0) | 186 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive(key: cusv) | 309 |

Pass .mov to the --video-ext argument to match the video file extension used by the Lombard GRID dataset.
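
The --video-ext notes for GRID (.mpg) and Lombard GRID (.mov) can be illustrated with a minimal sketch of how such an argument might be parsed and used to pick out video files. This is not the repository's actual script: the helper names `build_parser` and `select_videos`, the `.mp4` default, and the sample filenames are all assumptions made for illustration.

```python
import argparse
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the --video-ext flag comes from the text above.
    parser = argparse.ArgumentParser(description="Illustrative sketch only")
    parser.add_argument(
        "--video-ext",
        default=".mp4",  # assumed default, not confirmed by the source
        help="extension of the video files, e.g. .mpg or .mov",
    )
    return parser


def select_videos(filenames: List[str], video_ext: str) -> List[str]:
    """Keep only the filenames whose suffix matches video_ext (case-insensitive)."""
    ext = video_ext.lower()
    return [name for name in filenames if Path(name).suffix.lower() == ext]


# Simulate passing "--video-ext .mpg" on the command line.
args = build_parser().parse_args(["--video-ext", ".mpg"])
files = ["s1/clip_a.mpg", "s1/clip_a.align", "s1/notes.txt"]  # made-up names
print(select_videos(files, args.video_ext))
```

argparse exposes `--video-ext` as `args.video_ext`, so the same filter works for Lombard GRID by passing `.mov` instead.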

TCD-TIMIT [7]

| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | | | |
| Overlapped | 16.9 | GoogleDrive or BaiduDrive(key: jh65) | 186 |
| Unseen | 21.8 | GoogleDrive or BaiduDrive(key: n2gr) | 186 |
| Language Models | | | |
| - | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive(key: bnm8) | 930 |

References

[1] Afouras, T., Chung, J. S., Senior, A., Vinyals, O. & Zisserman, A. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).

[2] Afouras, T., Chung, J. S. & Zisserman, A. LRS3-TED: a large-scale dataset for visual speech recognition. Preprint at arXiv (2018).

[3] Zhao, Y., Xu, R. & Song, M. A cascade sequence-to-sequence model for Chinese Mandarin lip reading. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, pp. 1–6 (2019).

[4] Zadeh, A. B. et al. CMU-MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 1801–1812 (2020).

[5] Cooke, M., Barker, J., Cunningham, S. & Shao, X. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, vol. 120, pp. 2421–2424 (2006).

[6] Alghamdi, N., Maddock, S., Marxer, R., Barker, J. & Brown, G. J. A corpus of audio-visual Lombard speech with frontal and profile views. The Journal of the Acoustical Society of America, vol. 143, pp. EL523–EL529 (2018).

[7] Harte, N. & Gillen, E. TCD-TIMIT: an audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, vol. 17, pp. 603–615 (2015).