ALL models are available for non-commercial research purposes only.
It supports a number of datasets for speech recognition:
- Lip Reading Sentences 2 (LRS2)
- Lip Reading Sentences 3 (LRS3)
- Chinese Mandarin Lip Reading (CMLR)
- CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)
- GRID
- Lombard GRID
- TCD-TIMIT

- For the CMU-MOSEAS dataset, the video-to-text list can be found in the `${lipreading_root}/labels/${dataset}/${language_code}` folder.
- For single-language datasets, the video-to-text list can be found in the `${lipreading_root}/labels/${dataset}` folder.
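
The two folder layouts above can be resolved with one small helper. This is a minimal sketch, not code from the repository: the function name `label_dir` is hypothetical, and the dataset folder names and language codes in the usage note are placeholders, since the exact spellings used on disk are not stated here.

```python
from pathlib import Path
from typing import Optional


def label_dir(lipreading_root: str, dataset: str,
              language_code: Optional[str] = None) -> Path:
    """Return the folder holding the video-to-text list.

    Hypothetical helper: mirrors the two layouts described above.
    - multilingual (CMU-MOSEAS): <root>/labels/<dataset>/<language_code>
    - single language:           <root>/labels/<dataset>
    """
    base = Path(lipreading_root) / "labels" / dataset
    # Only append the language sub-folder when a code is supplied.
    return base / language_code if language_code else base
```

For example, `label_dir("/path/to/root", "LRS2")` yields `/path/to/root/labels/LRS2`, while passing a third argument appends the language sub-folder.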
Lip Reading Sentences 2 (LRS2) [1]
Components | WER | url | size (MB) |
---|---|---|---|
Visual-only | |||
- | 26.1 | GoogleDrive or BaiduDrive(key: 48l1) | 186 |
Language Models | |||
- | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
Landmarks | |||
- | - | GoogleDrive or BaiduDrive(key: 53rc) | 9358 |
Lip Reading Sentences 3 (LRS3) [2]
Components | WER | url | size (MB) |
---|---|---|---|
Visual-only | |||
- | 32.3 | GoogleDrive or BaiduDrive(key: 1b1s) | 186 |
Language Models | |||
- | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
Landmarks | |||
- | - | GoogleDrive or BaiduDrive(key: mi3c) | 18577 |
Chinese Mandarin Lip Reading (CMLR) [3]
Components | CER | url | size (MB) |
---|---|---|---|
Visual-only | |||
- | 8.0 | GoogleDrive or BaiduDrive(key: 7eq1) | 195 |
Language Models | |||
- | - | GoogleDrive or BaiduDrive(key: k8iv) | 187 |
Landmarks | |||
- | - | GoogleDrive or BaiduDrive(key: 1ret) | 3721 |
CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS) [4]
Components | WER | url | size (MB) |
---|---|---|---|
Visual-only | |||
Spanish | 44.5 | GoogleDrive or BaiduDrive(key: m35h) | 186 |
Portuguese | 51.4 | GoogleDrive or BaiduDrive(key: wk2h) | 186 |
French | 58.6 | GoogleDrive or BaiduDrive(key: t1hf) | 186 |
Language Models | |||
Spanish | - | GoogleDrive or BaiduDrive(key: 0mii) | 180 |
Portuguese | - | GoogleDrive or BaiduDrive(key: l6ag) | 179 |
French | - | GoogleDrive or BaiduDrive(key: 6tan) | 179 |
Landmarks | |||
- | - | GoogleDrive or BaiduDrive(key: vsic) | 3040 |
GRID [5]
Components | WER | url | size (MB) |
---|---|---|---|
Visual-only | |||
Overlapped | 1.2 | GoogleDrive or BaiduDrive(key: d8d2) | 186 |
Unseen | 4.8 | GoogleDrive or BaiduDrive(key: ttsh) | 186 |
Landmarks | |||
- | - | GoogleDrive or BaiduDrive(key: 16l9) | 1141 |
You can pass `.mpg` to the `--video-ext` argument to match the video file extension used by the GRID dataset.
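
To illustrate what an extension option of this kind typically does, here is a minimal sketch of a parser that accepts `--video-ext` and a helper that collects matching files. It is not the repository's implementation: `--data-dir`, `build_parser`, `find_videos`, and the `.mp4` default are all assumptions made for this example.

```python
import argparse
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only --video-ext is named in this document."""
    parser = argparse.ArgumentParser(
        description="List dataset videos by extension (illustrative only).")
    parser.add_argument("--data-dir", required=True,
                        help="Assumed option: root folder of the dataset videos.")
    parser.add_argument("--video-ext", default=".mp4",
                        help="Video filename extension; the .mp4 default is an assumption.")
    return parser


def find_videos(data_dir: str, video_ext: str) -> List[Path]:
    """Recursively collect files whose suffix matches video_ext."""
    # Accept both "mpg" and ".mpg".
    ext = video_ext if video_ext.startswith(".") else "." + video_ext
    return sorted(
        p for p in Path(data_dir).rglob("*")
        if p.is_file() and p.suffix.lower() == ext.lower()
    )
```

With this sketch, parsing `["--data-dir", "GRID", "--video-ext", ".mpg"]` and calling `find_videos(args.data_dir, args.video_ext)` would return only the `.mpg` files under `GRID`.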
Lombard GRID [6]
Components | WER | url | size (MB) |
---|---|---|---|
Visual-only | |||
Unseen (Front Plain) | 4.9 | GoogleDrive or BaiduDrive(key: 38ds) | 186 |
Unseen (Side Plain) | 8.0 | GoogleDrive or BaiduDrive(key: k6m0) | 186 |
Landmarks | |||
- | - | GoogleDrive or BaiduDrive(key: cusv) | 309 |
You can pass `.mov` to the `--video-ext` argument to match the video file extension used by the Lombard GRID dataset.
TCD-TIMIT [7]
Components | WER | url | size (MB) |
---|---|---|---|
Visual-only | |||
Overlapped | 16.9 | GoogleDrive or BaiduDrive(key: jh65) | 186 |
Unseen | 21.8 | GoogleDrive or BaiduDrive(key: n2gr) | 186 |
Language Models | |||
- | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
Landmarks | |||
- | - | GoogleDrive or BaiduDrive(key: bnm8) | 930 |
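
The WER and CER columns in the tables above are edit-distance error rates: the number of word (or character) substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below shows the standard calculation under that definition. It is not the evaluation script used to produce the reported numbers, and the function names and the sample sentences in the usage note are made up for illustration.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Return WER (unit="word") or CER (unit="char") as a percentage."""
    if unit == "word":
        ref: List[str] = reference.split()
        hyp: List[str] = hypothesis.split()
    elif unit == "char":
        # Assumption for this sketch: spaces are ignored when scoring characters.
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

For instance, scoring the hypothesis "bin blue at f nine now" against the reference "bin blue at f two now" counts one substitution over six reference words.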
[1] Afouras, T., Chung, J. S., Senior, A., Vinyals, O. & Zisserman, A. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
[2] Afouras, T., Chung, J. S. & Zisserman, A. LRS3-TED: a large-scale dataset for visual speech recognition. Preprint at arXiv (2018).
[3] Zhao, Y., Xu, R. & Song, M. A cascade sequence-to-sequence model for Chinese Mandarin lip reading. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, pp. 1–6 (2019).
[4] Zadeh, A. B. et al. CMU-MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 1801–1812 (2020).
[5] Cooke, M., Barker, J., Cunningham, S. & Shao, X. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, vol. 120, pp. 2421–2424 (2006).
[6] Alghamdi, N., Maddock, S., Marxer, R., Barker, J. & Brown, G. J. A corpus of audio-visual Lombard speech with frontal and profile views. The Journal of the Acoustical Society of America, vol. 143, pp. EL523–EL529 (2018).
[7] Harte, N. & Gillen, E. TCD-TIMIT: an audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, vol. 17, pp. 603–615 (2015).