---
tags:
  - ml
  - speech_to_text
---

# Wav2vec

Wav2vec is a speech recognition model pre-trained on unlabelled audio, introduced by Baevski et al. (2020).

Wav2vec is composed of several parts. A CNN acts as a feature encoder: it encodes the raw audio signal into fixed-size latent representations, which are fed to a Transformer that contextualizes them. Similarly to MLM, some of the latent representations are masked out, and the Transformer has to identify their quantized forms (a contrastive objective).
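
A minimal PyTorch sketch of this pipeline, assuming illustrative dimensions, layer counts, and masking; the class names and all hyperparameters below are my own, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """CNN feature encoder: downsamples raw audio into latent frames."""
    def __init__(self, dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
        )

    def forward(self, wav):                 # wav: (batch, samples)
        z = self.conv(wav.unsqueeze(1))     # (batch, dim, frames)
        return z.transpose(1, 2)            # (batch, frames, dim)

class ContextNetwork(nn.Module):
    """Transformer that contextualizes the (partially masked) latents."""
    def __init__(self, dim=512, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.mask_emb = nn.Parameter(torch.randn(dim))  # learned mask vector

    def forward(self, z, mask):             # mask: (batch, frames), bool
        z = torch.where(mask.unsqueeze(-1), self.mask_emb, z)
        return self.encoder(z)

wav = torch.randn(2, 16000)                 # one second of 16 kHz audio
z = FeatureEncoder()(wav)
# the paper masks contiguous spans; uniform random masking here for brevity
mask = torch.rand(z.shape[:2]) < 0.5
c = ContextNetwork()(z, mask)
print(z.shape, c.shape)                     # latent and contextualized frames
```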

The model is then fine-tuned on a labelled dataset using the CTC loss. The authors showed that, thanks to the pre-training, the model is able to surpass the competition even with a small amount of supervised data.
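
The fine-tuning objective can be sketched with PyTorch's `nn.CTCLoss`: a linear head maps each contextualized frame to character logits, and CTC aligns the frame sequence to the transcript. The head, vocabulary size, and shapes below are illustrative assumptions, not the paper's setup:

```python
import torch
import torch.nn as nn

vocab_size = 32                         # e.g. characters plus a blank token (illustrative)
frames, batch, dim = 799, 2, 512
c = torch.randn(batch, frames, dim)     # contextualized frames from the Transformer

head = nn.Linear(dim, vocab_size)       # per-frame character logits
log_probs = head(c).log_softmax(-1).transpose(0, 1)   # CTC expects (frames, batch, vocab)

targets = torch.randint(1, vocab_size, (batch, 20))   # dummy transcripts, 0 = blank
input_lengths = torch.full((batch,), frames)
target_lengths = torch.full((batch,), 20)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()     # fine-tuning would update the head and Transformer weights
```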

TODO:

- Quantization
- CNN structure
- Gumbel softmax (a sketch follows this list)
- Using a CNN instead of absolute positional embeddings for the Transformer
- Contrastive loss
- Diversity loss for the masked prediction
- Usage of SpecAugment during fine-tuning
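
For the quantization and Gumbel softmax items above, here is a hedged sketch of product quantization with a straight-through Gumbel softmax; the group/codebook sizes and the names `to_logits` and `codebook` are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

groups, entries, dim = 2, 320, 512      # product-quantization setup (illustrative)
b, t = 2, 799                           # batch size, number of latent frames

latents = torch.randn(b, t, dim)        # CNN feature-encoder output
to_logits = nn.Linear(dim, groups * entries)            # logits over each codebook group
codebook = torch.randn(groups, entries, dim // groups)  # learnable code vectors

logits = to_logits(latents).view(b, t, groups, entries)
# hard=True: one-hot choice in the forward pass, soft gradient in the backward pass
onehot = F.gumbel_softmax(logits, tau=2.0, hard=True, dim=-1)
# pick one code vector per group and concatenate the groups
q = torch.einsum("btge,ged->btgd", onehot, codebook).reshape(b, t, dim)
print(q.shape)   # quantized targets for the masked prediction: (2, 799, 512)
```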