Models

ALL models are available for non-commercial research purposes only.

Overview

Models are provided for the following speech recognition datasets:

Lip Reading Sentences 2 (LRS2)
Lip Reading Sentences 3 (LRS3)
Chinese Mandarin Lip Reading (CMLR)
CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)
GRID
Lombard GRID
TCD-TIMIT

Video-to-Text List

For the CMU-MOSEAS dataset, the video-to-text list can be found in the ${lipreading_root}/labels/${dataset}/${language_code} folder.
For datasets with a single language, the video-to-text list can be found in the ${lipreading_root}/labels/${dataset} folder.
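
The two folder layouts above can be resolved with a small helper. This is a minimal sketch, not part of the repository: the function name video_to_text_dir, the set MULTILINGUAL_DATASETS, and the example dataset folder names and language codes are assumptions made for illustration, so check them against the actual contents of the labels folder.

```python
from pathlib import Path
from typing import Optional

# Hypothetical helper: it only mirrors the folder layout described above.
# The dataset name used as a folder name is an assumption.
MULTILINGUAL_DATASETS = {"CMU-MOSEAS"}


def video_to_text_dir(
    lipreading_root: str, dataset: str, language_code: Optional[str] = None
) -> Path:
    """Return the folder expected to hold the video-to-text list."""
    base = Path(lipreading_root) / "labels" / dataset
    if dataset in MULTILINGUAL_DATASETS:
        # Multilingual datasets have one subfolder per language code.
        if not language_code:
            raise ValueError(f"{dataset} requires a language code")
        return base / language_code
    # Single-language datasets keep the list directly under the dataset folder.
    return base
```

For example, video_to_text_dir("/data/lipreading", "CMU-MOSEAS", "es") points at the per-language subfolder, while omitting the language code for a single-language dataset returns the dataset folder itself.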

Details

Lip Reading Sentences 2 (LRS2) [1]

Components	WER (%)	URL	Size (MB)
Visual-only
-	26.1	GoogleDrive or BaiduDrive(key: 48l1)	186
Language Models
-	-	GoogleDrive or BaiduDrive(key: 59u2)	180
Landmarks
-	-	GoogleDrive or BaiduDrive(key: 53rc)	9358

Lip Reading Sentences 3 (LRS3) [2]

Components	WER (%)	URL	Size (MB)
Visual-only
-	32.3	GoogleDrive or BaiduDrive(key: 1b1s)	186
Language Models
-	-	GoogleDrive or BaiduDrive(key: 59u2)	180
Landmarks
-	-	GoogleDrive or BaiduDrive(key: mi3c)	18577

Chinese Mandarin Lip Reading (CMLR) [3]

Components	CER (%)	URL	Size (MB)
Visual-only
-	8.0	GoogleDrive or BaiduDrive(key: 7eq1)	195
Language Models
-	-	GoogleDrive or BaiduDrive(key: k8iv)	187
Landmarks
-	-	GoogleDrive or BaiduDrive(key: 1ret)	3721

CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS) [4]

Components	WER (%)	URL	Size (MB)
Visual-only
Spanish	44.5	GoogleDrive or BaiduDrive(key: m35h)	186
Portuguese	51.4	GoogleDrive or BaiduDrive(key: wk2h)	186
French	58.6	GoogleDrive or BaiduDrive(key: t1hf)	186
Language Models
Spanish	-	GoogleDrive or BaiduDrive(key: 0mii)	180
Portuguese	-	GoogleDrive or BaiduDrive(key: l6ag)	179
French	-	GoogleDrive or BaiduDrive(key: 6tan)	179
Landmarks
-	-	GoogleDrive or BaiduDrive(key: vsic)	3040

GRID [5]

Components	WER (%)	URL	Size (MB)
Visual-only
Overlapped	1.2	GoogleDrive or BaiduDrive(key: d8d2)	186
Unseen	4.8	GoogleDrive or BaiduDrive(key: ttsh)	186
Landmarks
-	-	GoogleDrive or BaiduDrive(key: 16l9)	1141

Pass .mpg to the --video-ext argument to match the video file extension used in the GRID dataset.

Lombard GRID [6]

Components	WER (%)	URL	Size (MB)
Visual-only
Unseen (Front Plain)	4.9	GoogleDrive or BaiduDrive(key: 38ds)	186
Unseen (Side Plain)	8.0	GoogleDrive or BaiduDrive(key: k6m0)	186
Landmarks
-	-	GoogleDrive or BaiduDrive(key: cusv)	309

Pass .mov to the --video-ext argument to match the video file extension used in the Lombard GRID dataset.
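
The --video-ext values noted for GRID and Lombard GRID can be kept in one lookup when scripting runs over several datasets. The sketch below is illustrative only: the names VIDEO_EXT_BY_DATASET and video_ext_args are hypothetical, and for datasets without a documented extension it returns no extra arguments, so whatever default the tool uses applies.

```python
from typing import Dict, List

# Extensions taken from the notes in this README; other datasets are
# intentionally absent because no extension is documented for them.
VIDEO_EXT_BY_DATASET: Dict[str, str] = {
    "GRID": ".mpg",
    "Lombard GRID": ".mov",
}


def video_ext_args(dataset: str) -> List[str]:
    """Return the extra command-line arguments for a dataset, if any."""
    ext = VIDEO_EXT_BY_DATASET.get(dataset)
    # An empty list leaves the tool's default extension untouched.
    return ["--video-ext", ext] if ext else []
```

The returned list can be appended to the argument list of whichever script accepts --video-ext.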

TCD-TIMIT [7]

Components	WER (%)	URL	Size (MB)
Visual-only
Overlapped	16.9	GoogleDrive or BaiduDrive(key: jh65)	186
Unseen	21.8	GoogleDrive or BaiduDrive(key: n2gr)	186
Language Models
-	-	GoogleDrive or BaiduDrive(key: 59u2)	180
Landmarks
-	-	GoogleDrive or BaiduDrive(key: bnm8)	930

References

[1] Afouras, T., Chung, J. S., Senior, A., Vinyals, O. & Zisserman, A. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).

[2] Afouras, T., Chung, J. S. & Zisserman, A. LRS3-TED: a large-scale dataset for visual speech recognition. Preprint at arXiv (2018).

[3] Zhao, Y., Xu, R. & Song, M. A cascade sequence-to-sequence model for Chinese Mandarin lip reading. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, pp. 1–6 (2019).

[4] Zadeh, A. B. et al. CMU-MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 1801–1812 (2020).

[5] Cooke, M., Barker, J., Cunningham, S. & Shao, X. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, vol. 120, pp. 2421–2424 (2006).

[6] Alghamdi, N., Maddock, S., Marxer, R., Barker, J. & Brown, G. J. A corpus of audio-visual Lombard speech with frontal and profile views. The Journal of the Acoustical Society of America, vol. 143, pp. EL523–EL529 (2018).

[7] Harte, N. & Gillen, E. TCD-TIMIT: an audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, vol. 17, pp. 603–615 (2015).
