STT models
dburian committed Oct 4, 2024
1 parent 25817eb commit 24c12ef
Showing 9 changed files with 333 additions and 0 deletions.
29 changes: 29 additions & 0 deletions beginners_guite_to_asr.md
@@ -0,0 +1,29 @@
# Beginner's guide to ASR

I'm writing this as a beginner in speech ML (TTS, ASR/STT) myself. This is what
has helped me get started.

## Common libraries

- loading files: `librosa`
- processing input data: `spark`

## Common values

- sampling rate: 16 kHz
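
A minimal sketch of loading an audio file with `librosa` and resampling to 16 kHz
on load (the file name is just a made-up example):

```python
import librosa

# librosa resamples to the requested rate on load; sr=None would keep the native rate.
waveform, sampling_rate = librosa.load("speech_sample.wav", sr=16_000)
print(waveform.shape, sampling_rate)  # (num_samples,) 16000
```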

## Common metrics

- Word Error Rate (WER) -- word edit distance over the number of true words:

$$
\operatorname{WER}(y_\text{pred}, y_\text{true}) = \frac{
\operatorname{ed}_w(y_\text{pred}, y_\text{true})
}{
|y_\text{true}|
}
$$
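
A small sketch computing this, with the word-level edit distance implemented as
plain dynamic programming (no external library):

```python
def word_edit_distance(pred: list[str], true: list[str]) -> int:
    # Classic Levenshtein distance over words instead of characters.
    dist = [[0] * (len(true) + 1) for _ in range(len(pred) + 1)]
    for i in range(len(pred) + 1):
        dist[i][0] = i
    for j in range(len(true) + 1):
        dist[0][j] = j
    for i in range(1, len(pred) + 1):
        for j in range(1, len(true) + 1):
            substitution = dist[i - 1][j - 1] + (pred[i - 1] != true[j - 1])
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1, substitution)
    return dist[-1][-1]


def wer(y_pred: str, y_true: str) -> float:
    pred, true = y_pred.split(), y_true.split()
    return word_edit_distance(pred, true) / len(true)


print(wer("the cat sat", "the cat sat down"))  # 0.25
```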

## Common approaches


103 changes: 103 additions & 0 deletions ctc.md
@@ -0,0 +1,103 @@
# Connectionist Temporal Classification (CTC)

CTC is an objective function for the classification of token sequences, introduced by
[Graves et al. (2006)](https://www.cs.toronto.edu/~graves/icml_2006.pdf). It
addresses the problem of classifying tokens in variable-length sequences.

## The goal

Imagine a sequence of $N$ tokens in which you have to find $M$ labels (as in speech
recognition), where $M \leq N$. You don't know where in those $N$ tokens the $M$
labels appear, so you cannot compute a traditional per-token negative
log-likelihood loss.

CTC approaches the problem in the following fashion:
- it defines a decoding algorithm to go from $N$ tokens to $M$ labels
- it shows how to compute the probability of a sequence of $M$ labels, so that
  cross-entropy can be used as a loss

## Decoding

Each of the $N$ tokens is classified into one of the possible labels plus one
extra label: the blank $-$. The decoding then goes as follows:

- immediate repetition of the same label counts as one
- blanks are tossed

So for $aa-a-a-bb--$ we decode $aaab$. We call the model's predicted labelling an
*extended labelling*, since it extends the true labelling with blanks $-$.
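
A tiny sketch of this collapse rule (the alphabet and blank symbol are only
illustrative):

```python
BLANK = "-"

def collapse(extended: str) -> str:
    """Merge immediate repetitions of the same label, then drop blanks."""
    decoded = []
    previous = None
    for label in extended:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)

print(collapse("aa-a-a-bb--"))  # aaab
```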

However, it's not straightforward to obtain the model's predicted extended
labelling: there are exponentially many (in $N$) labellings the model can
predict, each of them represented by exponentially many extended labellings. So
we can use either

- a crude approximation -- sometimes works and is fast
- a heuristic -- works better but is slower

### Crude approximation -- best path decoding

For each token we take the most probable label. Note that this can select a
different sequence than the most probable labelling.
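
A sketch of best path decoding from a matrix of per-token label probabilities,
reusing the `collapse` helper above (the label set is illustrative):

```python
import numpy as np

labels = ["-", "a", "b"]  # index 0 is the blank

def best_path_decode(probs: np.ndarray) -> str:
    """probs has shape [N, num_labels]; take the argmax per token and collapse."""
    extended = "".join(labels[i] for i in probs.argmax(axis=-1))
    return collapse(extended)

# 4 tokens; the most probable path is "a", "a", "-", "b", which decodes to "ab".
probs = np.array([
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.2, 0.2],
    [0.1, 0.2, 0.7],
])
print(best_path_decode(probs))  # ab
```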

### Heuristics -- beam search

As we decode from the beginning, we keep the $k$ most probable labellings (not
extended). After all $N$ tokens, we select the most probable sequence out of the
$k$ we kept.

## Probability of a given sequence

To use cross-entropy we would like to compute $p(l|x)$, the likelihood of the
sequence of $M$ labels $l$ given the sequence of $N$ tokens. As discussed above,
there is an exponential number of extended labellings $l'$ that correspond to
$l$. However, we can use dynamic programming to compute the probability in
$O(MN)$ time.

We imagine an idealised extended labelling $l'$ for $l$ that has blanks at the
beginning, at the end, and between every two characters. Let us then define
$\alpha_t(s)$ as the probability that the first $t$ predicted tokens cover the
first $s$ tokens of $l'$. We set:

$$
\begin{aligned}
\alpha_1(0) &= y^1_{-} \\
\alpha_1(1) &= y^1_{l'_1}
\end{aligned}
$$

where $y_k^t$ is the probability of predicting label $k$ at token $t$. We
iterate over $t$. It's more helpful to imagine it as a grid (rows are $s$,
columns are $t$) as in the following image (taken from the original paper):

![Dynamic programming grid to compute
probability of "CAT" in T tokens.](./imgs/ctc_dynamic_programming.png)

The step is:

$$
\begin{aligned}
\alpha_t(s)
&= \left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\right)y_-^t &&\;\text{if}\; l'_s = - \\
&= \left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\right)y_{l'_s}^t &&\;\text{if}\; l'_s \ne - \land l'_{s-2} = l'_s \\
&= \left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\right)y_{l'_s}^t &&\;\text{if}\; l'_s \ne - \land l'_{s-2} \ne l'_s \\
\end{aligned}
$$

The above cases correspond to:

1. if $l'_s$ is a blank, it can be a new one ($\alpha_{t-1}(s-1)$) or a repetition
   of the previous one, in which case the predictions up to $t-1$ already need to
   cover the whole $l'_{1:s}$ ($\alpha_{t-1}(s)$)
2. if $l'_s$ is a non-blank character and it's the same as the previous non-blank
   character, our options are the same as above: $l'_{s-1}$ needs to be a blank
   that separates $l'_s$ from $l'_{s-2}$ so that they are not decoded as one.
3. if $l'_s$ is a non-blank character and it's different from the previous
   non-blank character, we can additionally skip the blank
   ($\alpha_{t-1}(s-2)$).
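
A rough numpy sketch of this forward pass; the layout -- `probs[t, k]` holding
$y^t_k$ with the blank at index 0 -- is my own assumption:

```python
import numpy as np

def ctc_forward(probs: np.ndarray, label_ids: list[int], blank: int = 0) -> float:
    """Returns p(l | x) by summing alphas over the extended labelling l'.

    Assumes a non-empty labelling and probs of shape [N, num_labels].
    """
    N = probs.shape[0]
    # Extended labelling: blank, l_1, blank, l_2, ..., l_M, blank.
    extended = [blank]
    for label in label_ids:
        extended.extend([label, blank])
    S = len(extended)

    alpha = np.zeros((N, S))
    alpha[0, 0] = probs[0, blank]
    alpha[0, 1] = probs[0, extended[1]]

    for t in range(1, N):
        for s in range(S):
            total = alpha[t - 1, s]
            if s - 1 >= 0:
                total += alpha[t - 1, s - 1]
            # The diagonal skip is only allowed for non-blanks that differ
            # from the previous non-blank character.
            if s - 2 >= 0 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, extended[s]]

    # l' may or may not end with the trailing blank.
    return alpha[-1, -1] + alpha[-1, -2]
```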


## Implementation

TODO
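
For now, a minimal sketch using PyTorch's built-in `torch.nn.CTCLoss` (all sizes
below are made up; PyTorch puts the blank at index 0 by default):

```python
import torch

N_TOKENS, BATCH, NUM_LABELS = 50, 4, 28  # 27 characters + the blank at index 0

# Log-probabilities the model would output, shape [N_tokens, batch, num_labels].
log_probs = torch.randn(N_TOKENS, BATCH, NUM_LABELS, requires_grad=True).log_softmax(dim=-1)

# True labellings, padded to the longest one; label ids are in 1..num_labels-1.
targets = torch.randint(low=1, high=NUM_LABELS, size=(BATCH, 12))
input_lengths = torch.full((BATCH,), N_TOKENS)
target_lengths = torch.tensor([12, 9, 11, 7])

ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # would propagate into the model in real training
print(loss.item())
```
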
45 changes: 45 additions & 0 deletions hubert.md
@@ -0,0 +1,45 @@
# HuBERT

Hidden-unit BERT (HuBERT) is a pre-trained speech recognition model introduced
by [Hsu et al. (2021)](https://arxiv.org/pdf/2106.07447). HuBERT closely
resembles [wav2vec](./wav2vec.md) in that it pre-trains using self-supervised
learning by classifying quantized input features.

## Architecture

HuBERT is composed of a CNN feature encoder, a Transformer that contextualizes
the features, and a quantization model. The authors use k-means clustering that
assigns each span of speech to a cluster. To make the cluster labels more
robust, they use an ensemble of clusterings (indexed by $k$) with different
parameters. The masked-prediction loss then sums over all clusterings:
$$
L_m(f; X, \{Z^{(k)}\}_k, M) =
\sum_{t \in M} \sum_k \log p_f^{(k)}\left(z_t^{(k)} \mid \tilde{X}, t\right)
$$
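
A rough sketch of how this masked multi-cluster prediction might be computed as a
cross-entropy (the negative of the summed log-probabilities above); the module
names and shapes are my own assumptions, not the paper's code:

```python
import torch

def masked_cluster_loss(
    features: torch.Tensor,               # [batch, T, dim], Transformer output on the masked input
    cluster_targets: list[torch.Tensor],  # one [batch, T] long tensor of cluster ids per clustering k
    heads: torch.nn.ModuleList,           # one linear classification head per clustering k
    mask: torch.Tensor,                   # [batch, T] bool, True where the input was masked
) -> torch.Tensor:
    loss = torch.tensor(0.0)
    for head, targets in zip(heads, cluster_targets):
        logits = head(features)           # [batch, T, num_clusters_k]
        # Cross-entropy only over the masked positions.
        loss = loss + torch.nn.functional.cross_entropy(
            logits[mask], targets[mask], reduction="sum"
        )
    return loss
```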

In later stages of training the clusters are derived from the trained
representations of the transformer, rather than from the raw input. Using
clustering metrics, the authors show that clustering based on Transformer
features yields better clusters.

## Interesting comparisons to wav2vec

The authors point out that they quantize the input directly, rather than
quantizing the features extracted by the CNN. They argue that since the CNN
encoding is lossy, quantizing the CNN features leaves the quantizer with less
information and therefore produces worse targets.

## Ablation studies

The authors experiment with a convex combination of losses on masked and
unmasked input spans. They show that when training only on masked inputs, the
model is more resilient to bad clustering. However, with good clustering,
training on unmasked spans as well is favorable.

TODO:
- *freeze step* parameter during fine-tuning (as in wav2vec)
- What exactly is MFCC
- What are dev-other sets? Is there some other dev set? dev-clean, dev-other,
test-clean, test-other


Binary file added imgs/ctc_dynamic_programming.png
Binary file added imgs/transformer_architecture.png
68 changes: 68 additions & 0 deletions transformer.md
@@ -0,0 +1,68 @@
# Transformer

The Transformer is a revolutionary architecture for sequence-to-sequence (or just
sequence) tasks, introduced by [Vaswani et al.
(2017)](https://arxiv.org/pdf/1706.03762).

## Architecture

The model is composed of two sub-models: an encoder and a decoder. The encoder
receives the input sequence and makes the processed information available to the
decoder. The decoder predicts the target sequence using past true target tokens
(teacher forcing) and the input information provided by the encoder.

**Encoder** digests the input sequence via positional and semantic embedding
matrices. For text this means that each sub-word gets a positional and a semantic
embedding. Though both embeddings can be trained, in the original paper only the
semantic embedding matrix was trained, while the positional embeddings were
computed (sinusoidal). Then the input goes through several identical blocks. Each
block consists of [multi-head self-attention](./transformer_self_attention.md) and
feed-forward layers.
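
A small sketch of the computed (sinusoidal) positional embeddings used in the
original paper, assuming an even `dim`:

```python
import math

import torch

def sinusoidal_positional_embeddings(seq_length: int, dim: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/dim)), PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))."""
    positions = torch.arange(seq_length, dtype=torch.float32).unsqueeze(1)
    # 1 / 10000^(2i/dim) for i = 0, 1, ..., dim/2 - 1
    inv_freq = torch.exp(-math.log(10_000.0) * torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    embeddings = torch.zeros(seq_length, dim)
    embeddings[:, 0::2] = torch.sin(positions * inv_freq)
    embeddings[:, 1::2] = torch.cos(positions * inv_freq)
    return embeddings
```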

![Transformer architecture.](./imgs/transformer_architecture.png)

**Decoder** is similar to the encoder, yet not entirely the same. The decoder too
ingests its input through embedded tokens, which also pass through several
identical layers. The layers are similar to the encoder's layers:
1. self-attention
2. encoder-decoder attention
3. feed-forward layers.

### Attentions

There is a whole note about [self-attention](./transformer_self_attention.md),
so here I mention only the specifics of its use. There are 3 usages of the
attention mechanism:
1. self-attention in encoder
2. self-attention in decoder
3. encoder-decoder attention

All types follow the same basic computation but differ in some details. 1. is
vanilla self-attention as described in the above-mentioned note. 2. is the same,
except the attention above the diagonal is masked out to prevent the decoder from
accessing future tokens. The diagonal itself is not masked since the decoder's
input is shifted by one. 3. is the same as 1., except the keys and values are
supplied by the processed sequence *at the end of the encoder*. This allows the
decoder to ask for information from the input when generating the output.
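
A small sketch of the mask used in 2., applied to the attention scores before the
softmax:

```python
import torch

seq_length = 5
scores = torch.randn(seq_length, seq_length)  # queries x keys, before softmax

# Everything above the diagonal is masked; the diagonal itself stays visible.
causal_mask = torch.triu(torch.ones(seq_length, seq_length, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
attention = torch.softmax(masked_scores, dim=-1)
```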

### Feed-forward layers

The feed-forward layers consist of two linear layers with a ReLU between them,
applied to each token separately. The first layer scales the input dimension up
4x, while the second scales it back down to the original size.
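
A sketch of one such position-wise feed-forward block, following the 4x
convention above:

```python
import torch

class FeedForward(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.up = torch.nn.Linear(dim, 4 * dim)
        self.down = torch.nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied to each token separately: x has shape [batch, seq_length, dim].
        return self.down(torch.relu(self.up(x)))
```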

### Residual connections

In the original paper there were two residual connections:
1. from before the self-attention to after it
2. from before the feed-forward layers to after them

Right after summing with the output of the skipped layer(s), there is a
[layer-normalization](./layer_normalization.md).

However, it was later discovered (as it was with ResNets in 2016) that putting
the layer normalization inside the computation branch, before any other
computation, is beneficial: the residual path then only contains sums and
therefore doesn't disturb the gradient. The first usage of these **pre-activation
residual** blocks in Transformers appears (AFAIK) in the [Sparse Transformers
paper from 2019](./sparse_transformer.md).
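
A tiny sketch contrasting the two orderings for a single sub-layer (the linear
layer just stands in for attention or the feed-forward block):

```python
import torch

norm = torch.nn.LayerNorm(64)
sublayer = torch.nn.Linear(64, 64)  # stand-in for attention or feed-forward
x = torch.randn(2, 10, 64)

post_ln = norm(x + sublayer(x))  # original Transformer: sum first, then normalize
pre_ln = x + sublayer(norm(x))   # pre-activation variant: normalize inside the branch
```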

34 changes: 34 additions & 0 deletions transformer_self_attention.md
@@ -34,6 +34,40 @@ the next multiplication of the previously mentioned matrix by $V$. First
multiplication takes $O(N^2)$ space and $O(N^2E)$ time, the second $O(NE)$ space
and $O(N^2E)$ time.

## Multi-head self-attention

In a normal setting the self-attention computation is done several times, to
allow the attention to model several sets of token dependencies. These sets are
called heads. Each head computes its own queries, keys and values.

However, this is implemented simply by computing the queries, keys and values for
all heads at once from the full embeddings, each with a #heads-times smaller
dimension. With clever transpositions we achieve the same effect as if we were
doing the computation #heads times:

```python
import math

import torch


def multi_head_self_attention(
    x: torch.Tensor,  # [batch_size, seq_length, dim]
    linear_queries: torch.nn.Linear,
    linear_keys: torch.nn.Linear,
    linear_values: torch.nn.Linear,
    num_heads: int,
) -> torch.Tensor:
    batch_size, seq_length, dim = x.shape

    def per_head(z: torch.Tensor) -> torch.Tensor:
        """Receives [batch_size, seq_length, dim], outputs
        [batch_size, num_heads, seq_length, dim // num_heads]."""
        z = torch.reshape(z, (batch_size, seq_length, num_heads, -1))
        return torch.permute(z, (0, 2, 1, 3))

    queries = per_head(linear_queries(x))
    keys = per_head(linear_keys(x))
    values = per_head(linear_values(x))

    # [batch_size, num_heads, seq_length, seq_length] attention scores
    attention = queries @ torch.transpose(keys, -2, -1)
    d_z = dim // num_heads
    results_per_head = torch.softmax(attention / math.sqrt(d_z), dim=-1) @ values

    # Concatenate the heads back into a single [batch_size, seq_length, dim] tensor.
    return torch.reshape(
        torch.permute(results_per_head, (0, 2, 1, 3)),
        (batch_size, seq_length, -1),
    )
```

---
For some visualizations visit [Illustrated
transformer](http://jalammar.github.io/illustrated-transformer/).
22 changes: 22 additions & 0 deletions wav2vec.md
@@ -0,0 +1,22 @@
# Wav2vec

Wav2vec (more precisely, wav2vec 2.0) is a pre-trained speech recognition model
introduced by [Baevski et al. (2020)](https://arxiv.org/pdf/2006.11477).

Wav2vec is composed of several parts. A CNN acts as a feature encoder and encodes
the audio signal into fixed-sized representations, which are fed to a Transformer
that contextualizes them. Similarly to MLM, some of the representations are
masked out, leaving the Transformer to predict their *quantized forms*.
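
The details are still on the TODO list below, but roughly, the contrastive part of
the objective might look like this sketch (the way distractors are indexed and the
temperature are my assumptions, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(
    context: torch.Tensor,        # [num_masked, dim], Transformer outputs at masked positions
    quantized: torch.Tensor,      # [num_masked, dim], quantized targets at those positions
    distractor_ids: torch.Tensor, # [num_masked, K] indices into `quantized` used as negatives
    temperature: float = 0.1,
) -> torch.Tensor:
    # For each masked position: the true quantized target plus K distractors.
    targets_and_negatives = torch.cat(
        [quantized.unsqueeze(1), quantized[distractor_ids]], dim=1
    )  # [num_masked, K + 1, dim]
    sims = F.cosine_similarity(
        context.unsqueeze(1), targets_and_negatives, dim=-1
    ) / temperature  # [num_masked, K + 1]
    # The true quantized target sits at index 0.
    return F.cross_entropy(sims, torch.zeros(len(sims), dtype=torch.long))
```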

The model is then fine-tuned on a labelled dataset using [CTC loss](./ctc.md).
The authors showed that thanks to the pre-training, even with a small amount of
supervised data the model is able to surpass the competition.

TODO:
- quantization
- CNN structure
- Gumbel Softmax
- Using CNN instead of absolute embeddings for the transformer
- Contrastive loss
- diversity loss for the masked prediction
- usage of SpecAugment during fine-tuning
32 changes: 32 additions & 0 deletions whisper.md
@@ -0,0 +1,32 @@
# Whisper

Whisper is a pre-trained speech-to-text model introduced by [Radford et al.
(2022)](https://arxiv.org/pdf/2212.04356). Compared to other STT models like
[HuBERT](./hubert.md) or [wav2vec](./wav2vec.md), Whisper aims to be ready out
of the box for zero-shot transcription.

Instead of pre-training an encoder, the authors chose to train the speech
transcriber end-to-end. This means that the model cannot be used for other
speech-encoding tasks. The authors argue that pre-trained encoders always lack an
equally well-trained decoder and that fine-tuning decoders on dedicated datasets
decreases the system's robustness.

## Architecture

Whisper is a [Transformer encoder-decoder](./transformer.md) model that digests
audio via a fairly small stack of 2 convolution layers. The model is trained to
do multiple tasks:
1. multilingual transcription (focusing on English, though)
2. any-language-to-English translation
3. speech detection (is somebody speaking or not?)

Additionally, the model is trained to detect the language of the input and, on
some inputs, to predict quantized timestamps for each token. To control these
predictions, the decoder receives a rather complicated sequence of tokens with
several special ones (apart from the utterance itself; sketched below). The
special tokens give the model a place to predict
- whether somebody is speaking,
- the language,

and instruct it whether it should
- translate,
- predict timestamps or not.
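
Schematically, the decoder's input then looks something like this (placeholder
names, not the model's actual special-token vocabulary):

```
<start-of-transcript> <detected language> <task: transcribe | translate> <timestamps on/off>
    ...predicted text, possibly interleaved with timestamp tokens...
<end-of-transcript>
```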
