diff --git a/beginners_guite_to_asr.md b/beginners_guite_to_asr.md
new file mode 100644
index 0000000..5b5118f
--- /dev/null
+++ b/beginners_guite_to_asr.md
@@ -0,0 +1,29 @@
# Beginner's guide to ASR

I'm writing this as a beginner in speech ML (TTS, ASR/STT). These are the notes
that have helped me get started.

## Common libraries

- loading files: `librosa`
- processing input data: `spark`

## Common values

- sampling rate: 16k

## Common metrics

- Word Error Rate (WER) -- word-level edit distance divided by the number of
  true words:

$$
\operatorname{WER}(y_\text{pred}, y_\text{true}) = \frac{
    \operatorname{ed}_w(y_\text{pred}, y_\text{true})
  }{
    |y_\text{true}|
  }
$$

## Common approaches

diff --git a/ctc.md b/ctc.md
new file mode 100644
index 0000000..c205f73
--- /dev/null
+++ b/ctc.md
@@ -0,0 +1,103 @@
# Connectionist Temporal Classification (CTC)

CTC is an objective function for the classification of token sequences,
introduced by [Graves et al. (2006)](https://www.cs.toronto.edu/~graves/icml_2006.pdf).
It addresses the problem of classifying tokens in variable-length sequences.

## The goal

Imagine a sequence of $N$ tokens in which you have to find $M$ labels (speech
recognition), where $M \leq N$. You don't know where in those $N$ tokens the
$M$ labels appear, so you cannot compute a traditional per-token negative log
likelihood as the loss.

CTC approaches the problem in the following fashion:
- it defines a decoding algorithm that goes from $N$ tokens to $M$ labels
- it shows how to compute the probability of a sequence of $M$ labels, so that
  cross-entropy can be used as a loss

## Decoding

Each of the $N$ tokens is classified into one of $M + 1$ labels, the extra
label being the blank $-$. Decoding then works as follows:

- immediate repetitions of the same label count as one
- blanks are tossed

So $aa-a-a-bb--$ decodes to $aaab$. We call the predicted labelling an
*extended labelling*, since it extends the true labelling with blanks $-$.

However, it's not straightforward to obtain the model's predicted extended
labelling: there are exponentially many (in $N$) labellings the model can
predict, each of them represented by exponentially many extended labellings.
So we can use either

- a crude approximation -- sometimes works and is fast
- a heuristic -- works better but is slower

### Crude approximation -- best path decoding

For each token we take the most probable label. Note that this can select a
labelling different from the most probable one.

### Heuristic -- beam search

As we decode from the beginning, we keep the $k$ most probable labellings (not
extended). At the end of the $N$ tokens, we select the most probable sequence
out of the $k$ we remember.

## Probability of a given sequence

To use cross-entropy we would like to compute $p(l|x)$, the likelihood of the
sequence of $M$ labels $l$ given the sequence of $N$ tokens. As discussed
above, there is an exponential number of extended labellings $l'$ that
correspond to $l$. However, we can use dynamic programming to compute the
probability in $O(MN)$ time.

We imagine an idealised extended labelling $l'$ for $l$ that has blanks at the
beginning, at the end and between every two characters. Let us then define
$\alpha_t(s)$ as the probability that the first $t$ predicted tokens cover the
first $s$ tokens of $l'$.
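As a concrete reference, the whole forward computation that the rest of this
section derives can be sketched in a few lines of NumPy. This is a minimal
sketch, assuming `probs[t, k]` holds $y^t_k$ with the blank at index 0 and a
non-empty labelling; a real implementation would work in log space for
numerical stability. The base case and the recursive step it implements are
spelled out next.

```python
import numpy as np

def ctc_forward(probs: np.ndarray, labels: list[int], blank: int = 0) -> float:
    """Computes p(l | x) for one sequence.

    probs:  [N, M + 1] array, probs[t, k] = y_k^t (per-token label probabilities).
    labels: the target labelling l as label indices (without blanks), non-empty.
    """
    # Idealised extended labelling l': blanks at the beginning, the end and
    # between every two characters, e.g. [a, b] -> [-, a, -, b, -].
    ext = [blank]
    for c in labels:
        ext += [c, blank]

    N, S = probs.shape[0], len(ext)
    alpha = np.zeros((N, S))

    # Base case: the first token is either the initial blank or the first label.
    alpha[0, 0] = probs[0, blank]
    alpha[0, 1] = probs[0, ext[1]]

    for t in range(1, N):
        for s in range(S):
            total = alpha[t - 1, s]                          # stay on l'_s
            if s >= 1:
                total += alpha[t - 1, s - 1]                 # advance from l'_{s-1}
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]                 # skip the blank l'_{s-1}
            alpha[t, s] = total * probs[t, ext[s]]

    # A valid l' can end either in the final blank or in the last label.
    return alpha[N - 1, S - 1] + alpha[N - 1, S - 2]
```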
For the base case we set:

$$
\begin{aligned}
  \alpha_1(0) &= y^1_{-} \\
  \alpha_1(1) &= y^1_{l'_1}
\end{aligned}
$$

where $y_k^t$ is the probability of predicting label $k$ at token $t$. We then
iterate over $t$. It's more helpful to imagine the computation as a grid (rows
are $s$, columns are $t$), as in the following image (taken from the original
paper):

![Dynamic programming grid to compute the probability of "CAT" in T
tokens.](./imgs/ctc_dynamic_programming.png)

The step is:

$$
\begin{aligned}
\alpha_t(s)
&= \left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\right)y_-^t &&\;\text{if}\; l'_s = - \\
&= \left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\right)y_{l'_s}^t &&\;\text{if}\; l'_s \ne - \land l'_{s-2} = l'_s \\
&= \left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\right)y_{l'_s}^t &&\;\text{if}\; l'_s \ne - \land l'_{s-2} \ne l'_s \\
\end{aligned}
$$

The three cases correspond to:

1. If $l'_s$ is a blank, it can either be a new blank entered from the previous
   non-blank ($\alpha_{t-1}(s-1)$), or a repetition of a blank, in which case
   the predictions up to $t-1$ already had to cover the whole $l'_{1:s}$
   ($\alpha_{t-1}(s)$).
2. If $l'_s$ is a non-blank character and it is the same as the previous
   non-blank character, our options are the same as above: the blank $l'_{s-1}$
   has to be visited, so that $l'_s$ and $l'_{s-2}$ are not decoded as one
   character.
3. If $l'_s$ is a non-blank character and it differs from the previous
   non-blank character, we can additionally skip the blank
   ($\alpha_{t-1}(s-2)$).


## Implementation

TODO

diff --git a/hubert.md b/hubert.md
new file mode 100644
index 0000000..2284bf1
--- /dev/null
+++ b/hubert.md
@@ -0,0 +1,45 @@
# HuBERT

Hidden-unit BERT (HuBERT) is a pre-trained speech recognition model introduced
by [Hsu et al. (2021)](https://arxiv.org/pdf/2106.07447). HuBERT closely
resembles [wav2vec](./wav2vec.md) in that it pre-trains with self-supervised
learning by classifying quantized input features.

## Architecture

HuBERT is composed of a CNN feature encoder, a Transformer that contextualizes
the features, and a quantization model. The authors use k-means clustering
that assigns each span of speech a cluster. To make the cluster labels more
robust, an ensemble of $k$ clusterings with different parameters is used for
each span. The loss takes a multi-label form:

$$
L_m(f; X, \{Z^{(k)}\}_k, M) =
\sum_{t \in M} \sum_k \log p_f^{(k)}\left(z_t^{(k)} \mid \tilde{X}, t\right)
$$

In later stages of training the clusters are derived from the trained
representations of the Transformer, rather than from the raw input. Using
clustering metrics, the authors show that clustering Transformer features
yields better clusters.

## Interesting comparisons to wav2vec

The authors point out that they quantize the input directly, rather than the
features extracted by the CNN. They argue that since the CNN encoding is
lossy, quantizing it works with less information and therefore yields worse
targets.

## Ablation studies

The authors experiment with a convex combination of losses on masked and
unmasked input spans. They show that when training only on masked inputs, the
model is more resilient to bad clustering. However, with good clustering,
training on unmasked spans as well is favorable.

TODO:
- *freeze step* parameter during fine-tuning (as in wav2vec)
- What exactly is MFCC?
- What are the dev-other sets? Is there some other dev set? (LibriSpeech has
  dev-clean, dev-other, test-clean and test-other splits.)
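As a footnote to the architecture section above, here is a rough PyTorch
sketch of the masked multi-label loss. All tensor names are mine, not the
authors'; `F.cross_entropy` computes the *negative* of the log-probability
term in $L_m$ (i.e. the quantity that is minimized), and the unmasked-span
term of the full loss is omitted.

```python
import torch
import torch.nn.functional as F

def masked_multilabel_loss(
    logits: list[torch.Tensor],   # one [T, num_clusters_k] tensor per clustering k
    targets: list[torch.Tensor],  # one [T] tensor of cluster ids per clustering k
    mask: torch.Tensor,           # [T] bool, True where the input span was masked
) -> torch.Tensor:
    """Sum the cross-entropy over masked timesteps and over the ensemble of
    clusterings, mirroring the L_m term above."""
    loss = 0.0
    for logits_k, targets_k in zip(logits, targets):
        loss = loss + F.cross_entropy(logits_k[mask], targets_k[mask], reduction="sum")
    return loss
```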
diff --git a/imgs/ctc_dynamic_programming.png b/imgs/ctc_dynamic_programming.png
new file mode 100644
index 0000000..52cf15c
Binary files /dev/null and b/imgs/ctc_dynamic_programming.png differ
diff --git a/imgs/transformer_architecture.png b/imgs/transformer_architecture.png
new file mode 100644
index 0000000..ce6f6ff
Binary files /dev/null and b/imgs/transformer_architecture.png differ
diff --git a/transformer.md b/transformer.md
new file mode 100644
index 0000000..53275ab
--- /dev/null
+++ b/transformer.md
@@ -0,0 +1,68 @@
# Transformer

The Transformer is a revolutionary architecture for sequence-to-sequence (or
just sequence) tasks, introduced by [Vaswani et al.
(2017)](https://arxiv.org/pdf/1706.03762).

## Architecture

The model is composed of two sub-models: an encoder and a decoder. The encoder
receives the input sequence and makes the processed information available to
the decoder. The decoder predicts the target sequence using the past true
target tokens (teacher forcing) and the input information provided by the
encoder.

The **encoder** digests the input sequence via positional and semantic
embedding matrices. For text this means that each sub-word gets a positional
and a semantic embedding. Though both embeddings can be trained, in the
original paper only the semantic embedding matrix was trained, while the
positional embeddings were computed. The input then goes through several
identical blocks. Each block consists of [multi-head
self-attention](./transformer_self_attention.md) and feed-forward layers.

![Transformer architecture.](./imgs/transformer_architecture.png)

The **decoder** is similar to the encoder, yet not entirely the same. The
decoder too ingests its input through embedded tokens, which also pass through
several identical layers. The layers are also similar to the encoder's layers:
1. self-attention
2. encoder-decoder attention
3. feed-forward layers

### Attentions

There is a whole note about [self-attention](./transformer_self_attention.md),
so I mention only the specifics of its use here. The attention mechanism is
used in 3 places:
1. self-attention in the encoder
2. self-attention in the decoder
3. encoder-decoder attention

All of them follow the basic computation but differ in some details. 1. is
vanilla self-attention as described in the note mentioned above. 2. is the
same, except the attention above the diagonal is masked out to prevent the
decoder from accessing future tokens. The diagonal itself is not masked, since
the decoder's input is shifted by one. 3. is the same as 1. except the keys and
values are supplied by the processed sequence *at the end of the encoder*.
This allows the decoder to ask for information from the input in order to
generate the output.

### Feed-forward layers

The feed-forward part consists of two linear layers with a ReLU between them,
applied to each token separately. The first layer scales the dimension up 4x,
while the second scales it back down to the original size.

### Residual connections

In the original article there were two residual connections:
1. from before the self-attention to after it
2. from before the feed-forward layers to after them

Right after summing with the output of the skipped layer(s), there is a
[layer normalization](./layer_normalization.md).
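A minimal PyTorch sketch of one encoder block with these residual connections
and this (post-layer-norm) placement might look as follows; the dimensions are
illustrative, and dropout and padding masks are omitted.

```python
import torch
from torch import nn

class PostLNEncoderBlock(nn.Module):
    """Original (post-LN) Transformer encoder block: residual sum first,
    layer normalization right after it."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. residual around self-attention, then layer norm
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        # 2. residual around the feed-forward layers, then layer norm
        x = self.norm2(x + self.ff(x))
        return x
```

The pre-activation variant discussed next only moves the normalization to the
start of each branch, e.g. `x = x + self.attn(self.norm1(x), ...)`.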
However, it was later discovered (as it was for ResNet in 2016) that putting
the layer normalization inside the computation branch, before any other
computation, is beneficial: the residual path then contains only sums and
therefore doesn't disturb the gradient. The first usage of these
**pre-activation residual** blocks in Transformers appears (AFAIK) in the
[Sparse Transformers paper from 2019](./sparse_transformer.md).

diff --git a/transformer_self_attention.md b/transformer_self_attention.md
index b0ffea3..733e92d 100644
--- a/transformer_self_attention.md
+++ b/transformer_self_attention.md
@@ -34,6 +34,40 @@
the next multiplication of the previously mentioned matrix by $V$. First
multiplication takes $O(N^2)$ space and $O(N^2E)$ time, the second $O(NE)$
space and $O(N^2E)$ time.

## Multi-head self-attention

In the usual setting the self-attention computation is done several times in
parallel to allow the model to capture several sets of token dependencies.
These sets are called heads, and each head computes its own queries, keys and
values.

However, this is implemented simply by computing the queries, keys and values
once from the full embeddings and splitting them into #heads chunks of
#heads-times smaller dimension. With a few transpositions we achieve the same
effect as if we ran the computation #heads times:

```python
import math

import torch
from torch import nn

num_heads = 8
dim = 512  # embedding dimension, must be divisible by num_heads

linear_queries = nn.Linear(dim, dim)
linear_keys = nn.Linear(dim, dim)
linear_values = nn.Linear(dim, dim)

def per_head(z: torch.Tensor) -> torch.Tensor:
    """Receives [batch_size, seq_length, dim], outputs
    [batch_size, num_heads, seq_length, dim // num_heads]."""
    batch_size, seq_length, _ = z.shape
    z = torch.reshape(z, (batch_size, seq_length, num_heads, -1))
    return torch.permute(z, (0, 2, 1, 3))

def multi_head_self_attention(x: torch.Tensor) -> torch.Tensor:
    batch_size, seq_length, _ = x.shape

    # Project once, then split into heads: [batch, heads, seq, dim // heads].
    queries = per_head(linear_queries(x))
    keys = per_head(linear_keys(x))
    values = per_head(linear_values(x))

    # Scaled dot-product attention, computed for all heads at once.
    d_z = dim // num_heads
    attention = queries @ torch.transpose(keys, -2, -1) / math.sqrt(d_z)
    results_per_head = torch.softmax(attention, dim=-1) @ values

    # Concatenate the heads back into [batch, seq, dim].
    return torch.reshape(
        torch.permute(results_per_head, (0, 2, 1, 3)),
        (batch_size, seq_length, -1),
    )
```

---

For some visualizations visit the [Illustrated
Transformer](http://jalammar.github.io/illustrated-transformer/).

diff --git a/wav2vec.md b/wav2vec.md
new file mode 100644
index 0000000..7196b48
--- /dev/null
+++ b/wav2vec.md
@@ -0,0 +1,22 @@
# Wav2vec

Wav2vec is a pre-trained speech recognition model introduced by [Baevski et
al. (2020)](https://arxiv.org/pdf/2006.11477).

Wav2vec is composed of several parts. A CNN acts as a feature encoder and
encodes the audio signal into fixed-size representations, which are fed to a
Transformer that contextualizes them. Similarly to masked language modelling,
spans of the representations are masked out, leaving the Transformer to
predict their *quantized forms*.

The model is then fine-tuned on a labelled dataset using the [CTC
loss](./ctc.md). The authors showed that, thanks to the pre-training, the
model is able to surpass the competition even with a small amount of
supervised data.

TODO:
- quantization
- CNN structure
- Gumbel Softmax
- using a CNN instead of absolute embeddings for the Transformer
- contrastive loss
- diversity loss for the masked prediction
- usage of SpecAugment during fine-tuning
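As a placeholder for the contrastive-loss item in the TODO above, here is a
rough sketch of the objective as I understand it from the paper: for each
masked position, the true quantized latent has to be identified among sampled
distractors using cosine similarity. Tensor names are made up, distractor
sampling and the diversity term are omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(
    context: torch.Tensor,      # [T, dim] Transformer outputs at masked positions
    quantized: torch.Tensor,    # [T, dim] true quantized latents at those positions
    distractors: torch.Tensor,  # [T, K, dim] quantized latents from other positions
    temperature: float = 0.1,
) -> torch.Tensor:
    # Put the true latent at index 0 of the candidate set.
    candidates = torch.cat([quantized.unsqueeze(1), distractors], dim=1)  # [T, K + 1, dim]
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)  # [T, K + 1]
    targets = torch.zeros(context.shape[0], dtype=torch.long)
    return F.cross_entropy(sims / temperature, targets)
```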
diff --git a/whisper.md b/whisper.md
new file mode 100644
index 0000000..0f8f4d7
--- /dev/null
+++ b/whisper.md
@@ -0,0 +1,32 @@
# Whisper

Whisper is a pre-trained speech-to-text model introduced by [Radford et al.
(2022)](https://arxiv.org/pdf/2212.04356). Compared to other STT models like
[HuBERT](./hubert.md) or [wav2vec](./wav2vec.md), Whisper aims to be ready out
of the box for zero-shot transcription.

Instead of pre-training an encoder, the authors chose to train a speech
transcriber end-to-end. This means that the model cannot be used for other
speech-encoding tasks. The authors argue that pre-trained encoders always lack
equally well-trained decoders, and that fine-tuning decoders only on dedicated
datasets decreases the system's robustness.

## Architecture

Whisper is a [Transformer encoder-decoder](./transformer.md) model that
digests audio via two fairly small convolution layers. The model is trained to
do multiple tasks:
1. multilingual transcription (though with a focus on English)
2. translation from any language into English
3. speech detection (is somebody speaking or not?)

Additionally, the model is trained to detect the language of the input and, on
some inputs, to predict quantized timestamps for each token. To control these
predictions the decoder receives a rather complicated sequence of tokens with
several special ones (apart from the utterance). The special tokens give the
model a place to predict
- whether somebody is speaking,
- the language,

and instruct it whether it should
- translate,
- predict timestamps or not.
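To make the special-token sequence a bit more concrete, here is roughly what
the decoder prompt looks like for plain English transcription without
timestamps. This is only an illustration: the token spellings follow my
reading of the paper's figure and may differ from the released tokenizer.

```python
# Hypothetical decoder prompt for: transcribe English speech, no timestamps.
prompt = [
    "<|startoftranscript|>",
    "<|en|>",            # predicted language
    "<|transcribe|>",    # task: transcribe (vs. <|translate|>)
    "<|notimestamps|>",  # skip timestamp prediction
]
# The model then generates the transcript tokens, followed by an end-of-text
# token; if nobody is speaking, it can instead emit a dedicated no-speech token.
```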