Wav2vec

Wav2vec (more precisely, wav2vec 2.0) is a speech recognition model pre-trained with self-supervision on unlabelled audio, introduced by Baevski et al. (2020).

Wav2vec is composed of several parts. A CNN acts as a feature encoder and encodes the raw audio signal into a sequence of fixed-size latent representations, which are fed to a Transformer that contextualizes them. Similarly to masked language modeling (MLM), some of the representations are masked out, leaving the Transformer to predict their quantized forms.
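The encoder and masking steps can be sketched in plain Python to get a feel for the sequence lengths involved. This is only an illustration under stated assumptions: the kernel widths and strides, the span-start probability of 0.065 and the span length of 10 are the values reported in the wav2vec 2.0 paper and are not given in this note, and the per-frame Bernoulli sampling of span starts is a simplification of the paper's sampling without replacement. The function names are hypothetical.

```python
import random

# Assumed (kernel width, stride) of each convolutional block, as reported
# for the base configuration in the paper; not stated in this note.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples, layers=CONV_LAYERS):
    """Number of latent frames produced by unpadded strided convolutions."""
    length = num_samples
    for kernel, stride in layers:
        length = (length - kernel) // stride + 1
    return length


def receptive_field(layers=CONV_LAYERS):
    """Return (input samples seen by one frame, hop between frames in samples)."""
    field, hop = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * hop
        hop *= stride
    return field, hop


def sample_span_mask(length, start_prob=0.065, span=10, rng=None):
    """Mask spans of frames before they reach the Transformer.

    Simplification: every frame independently becomes a span start with
    probability start_prob; spans may overlap and are cut at the end.
    """
    rng = rng or random.Random()
    mask = [False] * length
    for start in range(length):
        if rng.random() < start_prob:
            for t in range(start, min(start + span, length)):
                mask[t] = True
    return mask


frames = num_frames(16000)      # one second of 16 kHz audio -> 49 frames
field, hop = receptive_field()  # 400 samples (25 ms) seen, 320 samples (20 ms) hop
mask = sample_span_mask(frames, rng=random.Random(0))
print(frames, field, hop, sum(mask))
```

Under these assumptions the Transformer sees roughly 49 latent vectors per second of audio, and the masked positions are the ones whose quantized forms it has to predict.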

The model is then fine-tuned on a labelled dataset using the Connectionist Temporal Classification (CTC) loss. The authors showed that, thanks to the pre-training, the model is able to surpass previous approaches even with a small amount of supervised data.
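To make the fine-tuning objective more concrete, here is a minimal sketch of the CTC negative log-likelihood for a single utterance, computed with the standard forward (alpha) recursion. It is a didactic version, not the code used by the authors: it works on plain probabilities rather than log-probabilities, so it would underflow on long sequences, and the function name and the choice of label 0 as the blank are assumptions.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one utterance.

    probs:  list of T per-frame probability distributions over labels
    target: list of label ids without blanks
    """
    # Interleave blanks with the target: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # alpha[s] = total probability of all alignments of the frames seen so far
    # that end in position s of the extended target
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank in between
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or in the trailing blank
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


# Two frames, labels {0: blank, 1: "a"}, uniform predictions, target "a".
# The valid alignments are "a-", "-a" and "aa", each with probability 0.25.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_neg_log_likelihood(uniform, [1]))  # equals -log(0.75)
```

The loss sums over every frame-level alignment that collapses to the target transcript, which is why fine-tuning needs only transcripts and no frame-level labels.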

TODO:

quantization
CNN structure
Gumbel Softmax
Using a CNN instead of absolute positional embeddings for the Transformer
Contrastive loss
Diversity loss for the masked prediction
Usage of SpecAugment during fine-tuning
