STT models
dburian committed Oct 4, 2024
1 parent 25817eb commit 24c12ef
Showing 9 changed files with 333 additions and 0 deletions.
29 changes: 29 additions & 0 deletions beginners_guite_to_asr.md
@@ -0,0 +1,29 @@
# Beginner's guide to ASR

I'm writing this as a beginner in speech ML (TTS, ASR/STT) myself. This is what
has helped me get started.

## Common libraries

- loading files: `librosa`
- processing input data: `spark`

## Common values

- sampling rate: 16 kHz
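
A minimal sketch of loading an audio file with `librosa` and resampling to 16 kHz
on load (the file name is just a made-up example):

```python
import librosa

# librosa resamples to the requested rate on load; sr=None would keep the native rate.
waveform, sampling_rate = librosa.load("speech_sample.wav", sr=16_000)
print(waveform.shape, sampling_rate)  # (num_samples,) 16000
```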

## Common metrics

- Word Error Rate (WER) -- word edit distance over the number of true words:

$$
\operatorname{WER}(y_\text{pred}, y_\text{true}) = \frac{
\operatorname{ed}_w(y_\text{pred}, y_\text{true})
}{
|y_\text{true}|
}
$$
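
A small sketch computing this, with the word-level edit distance implemented as
plain dynamic programming (no external library):

```python
def word_edit_distance(pred: list[str], true: list[str]) -> int:
    # Classic Levenshtein distance over words instead of characters.
    dist = [[0] * (len(true) + 1) for _ in range(len(pred) + 1)]
    for i in range(len(pred) + 1):
        dist[i][0] = i
    for j in range(len(true) + 1):
        dist[0][j] = j
    for i in range(1, len(pred) + 1):
        for j in range(1, len(true) + 1):
            substitution = dist[i - 1][j - 1] + (pred[i - 1] != true[j - 1])
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1, substitution)
    return dist[-1][-1]


def wer(y_pred: str, y_true: str) -> float:
    pred, true = y_pred.split(), y_true.split()
    return word_edit_distance(pred, true) / len(true)


print(wer("the cat sat", "the cat sat down"))  # 0.25
```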

## Common approaches


103 changes: 103 additions & 0 deletions ctc.md
@@ -0,0 +1,103 @@
# Connectionist Temporal Classification (CTC)

CTC is an objective function for the classification of token sequences, introduced by
[Graves et al. (2006)](https://www.cs.toronto.edu/~graves/icml_2006.pdf). It
addresses the problem of classifying tokens in variable-length sequences.

## The goal

Imagine a sequence of $N$ tokens in which you have to find $M$ labels (as in speech
recognition), where $M \leq N$. You don't know where in those $N$ tokens the $M$
labels appear, so you cannot compute a traditional per-token negative
log-likelihood loss.

CTC approaches the problem in the following fashion:
- it defines a decoding algorithm to go from $N$ tokens to $M$ labels
- it shows how to compute the probability of a sequence of $M$ labels, so that
  cross-entropy can be used as a loss

## Decoding

Each of the $N$ tokens is classified into one of the possible labels plus one
extra label: the blank $-$. The decoding then goes as follows:

- immediate repetition of the same label counts as one
- blanks are tossed

So for $aa-a-a-bb--$ we decode $aaab$. We call the model's predicted labelling an
*extended labelling*, since it extends the true labelling with blanks $-$.
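
A tiny sketch of this collapse rule (the alphabet and blank symbol are only
illustrative):

```python
BLANK = "-"

def collapse(extended: str) -> str:
    """Merge immediate repetitions of the same label, then drop blanks."""
    decoded = []
    previous = None
    for label in extended:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)

print(collapse("aa-a-a-bb--"))  # aaab
```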

However, it's not straightforward to obtain the model's predicted extended
labelling: there are exponentially many (in $N$) labellings the model can
predict, each of them represented by exponentially many extended labellings. So
we can use either

- a crude approximation -- sometimes works and is fast
- a heuristic -- works better but is slower

### Crude approximation -- best path decoding

For each token we take the most probable label. Note that this can select a
different sequence than the most probable labelling.
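
A sketch of best path decoding from a matrix of per-token label probabilities,
reusing the `collapse` helper above (the label set is illustrative):

```python
import numpy as np

labels = ["-", "a", "b"]  # index 0 is the blank

def best_path_decode(probs: np.ndarray) -> str:
    """probs has shape [N, num_labels]; take the argmax per token and collapse."""
    extended = "".join(labels[i] for i in probs.argmax(axis=-1))
    return collapse(extended)

# 4 tokens; the most probable path is "a", "a", "-", "b", which decodes to "ab".
probs = np.array([
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.2, 0.2],
    [0.1, 0.2, 0.7],
])
print(best_path_decode(probs))  # ab
```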

### Heuristics -- beam search

As we decode from the beginning, we keep the $k$ most probable labellings (not
extended). After all $N$ tokens, we select the most probable sequence out of the
$k$ we kept.

## Probability of a given sequence

To use cross-entropy we would like to compute $p(l|x)$, the likelihood of the
sequence of $M$ labels $l$ given the sequence of $N$ tokens. As discussed above,
there is an exponential number of extended labellings $l'$ that correspond to
$l$. However, we can use dynamic programming to compute the probability in
$O(MN)$ time.

We imagine an idealised extended labelling $l'$ for $l$ that has blanks at the
beginning, at the end, and between every two characters. Let us then define
$\alpha_t(s)$ as the probability that the first $t$ predicted tokens cover the
first $s$ tokens of $l'$. We set:

$$
\begin{aligned}
\alpha_1(0) &= y^1_{-} \\
\alpha_1(1) &= y^1_{l'_1}
\end{aligned}
$$

where $y_k^t$ is the probability of predicting label $k$ at token $t$. We
iterate over $t$. It's more helpful to imagine it as a grid (rows are $s$,
columns are $t$) as in the following image (taken from the original paper):

![Dynamic programming grid to compute
probability of "CAT" in T tokens.](./imgs/ctc_dynamic_programming.png)

The step is:

$$
\begin{aligned}
\alpha_t(s)
&= \left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\right)y_-^t &&\;\text{if}\; l'_s = - \\
&= \left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\right)y_{l'_s}^t &&\;\text{if}\; l'_s \ne - \land l'_{s-2} = l'_s \\
&= \left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\right)y_{l'_s}^t &&\;\text{if}\; l'_s \ne - \land l'_{s-2} \ne l'_s \\
\end{aligned}
$$

The above cases correspond to:

1. if $l'_s$ is a blank, it can be a new one ($\alpha_{t-1}(s-1)$) or a repetition
   of the previous one, in which case the predictions up to $t-1$ already need to
   cover the whole $l'_{1:s}$ ($\alpha_{t-1}(s)$)
2. if $l'_s$ is a non-blank character and it's the same as the previous non-blank
   character, our options are the same as above: $l'_{s-1}$ needs to be a blank
   that separates $l'_s$ from $l'_{s-2}$ so that they are not decoded as one.
3. if $l'_s$ is a non-blank character and it's different from the previous
   non-blank character, we can additionally skip the blank
   ($\alpha_{t-1}(s-2)$).
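
A rough numpy sketch of this forward pass; the layout -- `probs[t, k]` holding
$y^t_k$ with the blank at index 0 -- is my own assumption:

```python
import numpy as np

def ctc_forward(probs: np.ndarray, label_ids: list[int], blank: int = 0) -> float:
    """Returns p(l | x) by summing alphas over the extended labelling l'.

    Assumes a non-empty labelling and probs of shape [N, num_labels].
    """
    N = probs.shape[0]
    # Extended labelling: blank, l_1, blank, l_2, ..., l_M, blank.
    extended = [blank]
    for label in label_ids:
        extended.extend([label, blank])
    S = len(extended)

    alpha = np.zeros((N, S))
    alpha[0, 0] = probs[0, blank]
    alpha[0, 1] = probs[0, extended[1]]

    for t in range(1, N):
        for s in range(S):
            total = alpha[t - 1, s]
            if s - 1 >= 0:
                total += alpha[t - 1, s - 1]
            # The diagonal skip is only allowed for non-blanks that differ
            # from the previous non-blank character.
            if s - 2 >= 0 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, extended[s]]

    # l' may or may not end with the trailing blank.
    return alpha[-1, -1] + alpha[-1, -2]
```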


## Implementation

TODO
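
For now, a minimal sketch using PyTorch's built-in `torch.nn.CTCLoss` (all sizes
below are made up; PyTorch puts the blank at index 0 by default):

```python
import torch

N_TOKENS, BATCH, NUM_LABELS = 50, 4, 28  # 27 characters + the blank at index 0

# Log-probabilities the model would output, shape [N_tokens, batch, num_labels].
log_probs = torch.randn(N_TOKENS, BATCH, NUM_LABELS, requires_grad=True).log_softmax(dim=-1)

# True labellings, padded to the longest one; label ids are in 1..num_labels-1.
targets = torch.randint(low=1, high=NUM_LABELS, size=(BATCH, 12))
input_lengths = torch.full((BATCH,), N_TOKENS)
target_lengths = torch.tensor([12, 9, 11, 7])

ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # would propagate into the model in real training
print(loss.item())
```
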
45 changes: 45 additions & 0 deletions hubert.md
@@ -0,0 +1,45 @@
# HuBERT

Hidden-unit BERT (HuBERT) is a pre-trained speech recognition model introduced
by [Hsu et al. (2021)](https://arxiv.org/pdf/2106.07447). HuBERT closely
resembles [wav2vec](./wav2vec.md) in that it pre-trains using self-supervised
learning by classifying quantized input features.

## Architecture

HuBERT is composed of a CNN feature encoder, a Transformer that contextualizes
the features, and a quantization model. The authors use k-means clustering that
assigns each span of speech to a cluster. To make the cluster labels more
robust, they use an ensemble of clusterings (indexed by $k$) with different
parameters. The masked-prediction loss then sums over all clusterings:
$$
L_m(f; X, \{Z^{(k)}\}_k, M) =
\sum_{t \in M} \sum_k \log p_f^{(k)}\left(z_t^{(k)} \mid \tilde{X}, t\right)
$$
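
A rough sketch of how this masked multi-cluster prediction might be computed as a
cross-entropy (the negative of the summed log-probabilities above); the module
names and shapes are my own assumptions, not the paper's code:

```python
import torch

def masked_cluster_loss(
    features: torch.Tensor,               # [batch, T, dim], Transformer output on the masked input
    cluster_targets: list[torch.Tensor],  # one [batch, T] long tensor of cluster ids per clustering k
    heads: torch.nn.ModuleList,           # one linear classification head per clustering k
    mask: torch.Tensor,                   # [batch, T] bool, True where the input was masked
) -> torch.Tensor:
    loss = torch.tensor(0.0)
    for head, targets in zip(heads, cluster_targets):
        logits = head(features)           # [batch, T, num_clusters_k]
        # Cross-entropy only over the masked positions.
        loss = loss + torch.nn.functional.cross_entropy(
            logits[mask], targets[mask], reduction="sum"
        )
    return loss
```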

In later stages of training the clusters are derived from the trained
representations of the transformer, rather than from the raw input. Using
clustering metrics, the authors show that clustering based on Transformer
features yields better clusters.

## Interesting comparisons to wav2vec

The authors point out that they quantize the input directly, rather than
quantizing the features extracted by the CNN. They argue that since the CNN
encoding is lossy, quantizing the CNN features leaves the quantizer with less
information and therefore produces worse targets.

## Ablation studies

The authors experiment with a convex combination of losses on masked and
unmasked input spans. They show that when training only on masked inputs, the
model is more resilient to bad clustering. However, with good clustering,
training on unmasked spans as well is favorable.

TODO:
- *freeze step* parameter during fine-tuning (as in wav2vec)
- What exactly is MFCC
- What are dev-other sets? Is there some other dev set? dev-clean, dev-other,
test-clean, test-other


Binary file added imgs/ctc_dynamic_programming.png
Binary file added imgs/transformer_architecture.png
68 changes: 68 additions & 0 deletions transformer.md
@@ -0,0 +1,68 @@
# Transformer

The Transformer is a revolutionary architecture for sequence-to-sequence (or just
sequence) tasks, introduced by [Vaswani et al.
(2017)](https://arxiv.org/pdf/1706.03762).

## Architecture

The model is composed of two sub-models: an encoder and a decoder. The encoder
receives the input sequence and makes the processed information available to the
decoder. The decoder predicts the target sequence using past true target tokens
(teacher forcing) and the input information provided by the encoder.

**Encoder** digests the input sequence via positional and semantic embedding
matrices. For text this means that each sub-word gets a positional and a semantic
embedding. Though both embeddings can be trained, in the original paper only the
semantic embedding matrix was trained, while the positional embeddings were
computed (sinusoidal). Then the input goes through several identical blocks. Each
block consists of [multi-head self-attention](./transformer_self_attention.md) and
feed-forward layers.
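
A small sketch of the computed (sinusoidal) positional embeddings used in the
original paper, assuming an even `dim`:

```python
import math

import torch

def sinusoidal_positional_embeddings(seq_length: int, dim: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/dim)), PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))."""
    positions = torch.arange(seq_length, dtype=torch.float32).unsqueeze(1)
    # 1 / 10000^(2i/dim) for i = 0, 1, ..., dim/2 - 1
    inv_freq = torch.exp(-math.log(10_000.0) * torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    embeddings = torch.zeros(seq_length, dim)
    embeddings[:, 0::2] = torch.sin(positions * inv_freq)
    embeddings[:, 1::2] = torch.cos(positions * inv_freq)
    return embeddings
```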

![Transformer architecture.](./imgs/transformer_architecture.png)

**Decoder** is similar to the encoder, yet not entirely the same. The decoder too
ingests its input through embedded tokens, which also pass through several
identical layers. The layers are similar to the encoder's layers:
1. self-attention
2. encoder-decoder attention
3. feed-forward layers.

### Attentions

There is a whole note about [self-attention](./transformer_self_attention.md),
so here I mention only the specifics of its use. There are 3 usages of the
attention mechanism:
1. self-attention in encoder
2. self-attention in decoder
3. encoder-decoder attention

All types follow the same basic computation but differ in some details. 1. is
vanilla self-attention as described in the above-mentioned note. 2. is the same,
except the attention above the diagonal is masked out to prevent the decoder from
accessing future tokens. The diagonal itself is not masked since the decoder's
input is shifted by one. 3. is the same as 1., except the keys and values are
supplied by the processed sequence *at the end of the encoder*. This allows the
decoder to ask for information from the input when generating the output.
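
A small sketch of the mask used in 2., applied to the attention scores before the
softmax:

```python
import torch

seq_length = 5
scores = torch.randn(seq_length, seq_length)  # queries x keys, before softmax

# Everything above the diagonal is masked; the diagonal itself stays visible.
causal_mask = torch.triu(torch.ones(seq_length, seq_length, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
attention = torch.softmax(masked_scores, dim=-1)
```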

### Feed-forward layers

The feed-forward layers consist of two linear layers with a ReLU between them,
applied to each token separately. The first layer scales the input dimension up
4x, while the second scales it back down to the original size.
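
A sketch of one such position-wise feed-forward block, following the 4x
convention above:

```python
import torch

class FeedForward(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.up = torch.nn.Linear(dim, 4 * dim)
        self.down = torch.nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied to each token separately: x has shape [batch, seq_length, dim].
        return self.down(torch.relu(self.up(x)))
```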

### Residual connections

In the original paper there were two residual connections:
1. from before the self-attention to after it
2. from before the feed-forward layers to after them

Right after summing with the output of the skipped layer(s), there is a
[layer-normalization](./layer_normalization.md).

However, it was later discovered (as it was with ResNets in 2016) that putting
the layer normalization inside the computation branch, before any other
computation, is beneficial: the residual path then only contains sums and
therefore doesn't disturb the gradient. The first usage of these **pre-activation
residual** blocks in Transformers appears (AFAIK) in the [Sparse Transformers
paper from 2019](./sparse_transformer.md).
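
A tiny sketch contrasting the two orderings for a single sub-layer (the linear
layer just stands in for attention or the feed-forward block):

```python
import torch

norm = torch.nn.LayerNorm(64)
sublayer = torch.nn.Linear(64, 64)  # stand-in for attention or feed-forward
x = torch.randn(2, 10, 64)

post_ln = norm(x + sublayer(x))  # original Transformer: sum first, then normalize
pre_ln = x + sublayer(norm(x))   # pre-activation variant: normalize inside the branch
```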

34 changes: 34 additions & 0 deletions transformer_self_attention.md
@@ -34,6 +34,40 @@ the next multiplication of the previously mentioned matrix by $V$. First
multiplication takes $O(N^2)$ space and $O(N^2E)$ time, the second $O(NE)$ space
and $O(N^2E)$ time.

## Multi-head self-attention

In a normal setting the self-attention computation is done several times, to
allow the attention to model several sets of token dependencies. These sets are
called heads. Each head computes its own queries, keys and values.

However, this is implemented simply by computing the queries, keys and values for
all heads at once from the full embeddings, each with a #heads-times smaller
dimension. With clever transpositions we achieve the same effect as if we were
doing the computation #heads times:

```python
import math

import torch


def multi_head_self_attention(
    x: torch.Tensor,  # [batch_size, seq_length, dim]
    linear_queries: torch.nn.Linear,
    linear_keys: torch.nn.Linear,
    linear_values: torch.nn.Linear,
    num_heads: int,
) -> torch.Tensor:
    batch_size, seq_length, dim = x.shape

    def per_head(z: torch.Tensor) -> torch.Tensor:
        """Receives [batch_size, seq_length, dim], outputs
        [batch_size, num_heads, seq_length, dim // num_heads]."""
        z = torch.reshape(z, (batch_size, seq_length, num_heads, -1))
        return torch.permute(z, (0, 2, 1, 3))

    queries = per_head(linear_queries(x))
    keys = per_head(linear_keys(x))
    values = per_head(linear_values(x))

    # [batch_size, num_heads, seq_length, seq_length] attention scores
    attention = queries @ torch.transpose(keys, -2, -1)
    d_z = dim // num_heads
    results_per_head = torch.softmax(attention / math.sqrt(d_z), dim=-1) @ values

    # Concatenate the heads back into a single [batch_size, seq_length, dim] tensor.
    return torch.reshape(
        torch.permute(results_per_head, (0, 2, 1, 3)),
        (batch_size, seq_length, -1),
    )
```

---
For some visualizations visit [Illustrated
transformer](http://jalammar.github.io/illustrated-transformer/).
22 changes: 22 additions & 0 deletions wav2vec.md
@@ -0,0 +1,22 @@
# Wav2vec

Wav2vec (more precisely, wav2vec 2.0) is a pre-trained speech recognition model
introduced by [Baevski et al. (2020)](https://arxiv.org/pdf/2006.11477).

Wav2vec is composed of several parts. A CNN acts as a feature encoder and encodes
the audio signal into fixed-sized representations, which are fed to a Transformer
that contextualizes them. Similarly to MLM, some of the representations are
masked out, leaving the Transformer to predict their *quantized forms*.
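
The details are still on the TODO list below, but roughly, the contrastive part of
the objective might look like this sketch (the way distractors are indexed and the
temperature are my assumptions, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(
    context: torch.Tensor,        # [num_masked, dim], Transformer outputs at masked positions
    quantized: torch.Tensor,      # [num_masked, dim], quantized targets at those positions
    distractor_ids: torch.Tensor, # [num_masked, K] indices into `quantized` used as negatives
    temperature: float = 0.1,
) -> torch.Tensor:
    # For each masked position: the true quantized target plus K distractors.
    targets_and_negatives = torch.cat(
        [quantized.unsqueeze(1), quantized[distractor_ids]], dim=1
    )  # [num_masked, K + 1, dim]
    sims = F.cosine_similarity(
        context.unsqueeze(1), targets_and_negatives, dim=-1
    ) / temperature  # [num_masked, K + 1]
    # The true quantized target sits at index 0.
    return F.cross_entropy(sims, torch.zeros(len(sims), dtype=torch.long))
```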

The model is then fine-tuned on a labelled dataset using [CTC loss](./ctc.md).
The authors showed that thanks to the pre-training, even with a small amount of
supervised data the model is able to surpass the competition.

TODO:
- quantization
- CNN structure
- Gumbel Softmax
- Using CNN instead of absolute embeddings for the transformer
- Contrastive loss
- diversity loss for the masked prediction
- usage of SpecAugment during fine-tuning
32 changes: 32 additions & 0 deletions whisper.md
@@ -0,0 +1,32 @@
# Whisper

Whisper is a pre-trained speech-to-text model introduced by [Radford et al.
(2022)](https://arxiv.org/pdf/2212.04356). Compared to other STT models like
[HuBERT](./hubert.md) or [wav2vec](./wav2vec.md), Whisper aims to be ready out
of the box for zero-shot transcription.

Instead of pre-training an encoder, the authors chose to train the speech
transcriber end-to-end. This means that the model cannot be used for other
speech-encoding tasks. The authors argue that pre-trained encoders always lack an
equally well-trained decoder and that fine-tuning decoders on dedicated datasets
decreases the system's robustness.

## Architecture

Whisper is a [Transformer encoder-decoder](./transformer.md) model that digests
audio via a fairly small stack of 2 convolution layers. The model is trained to
do multiple tasks:
1. multilingual transcription (focusing on English, though)
2. any-language-to-English translation
3. speech detection (is somebody speaking or not?)

Additionally, the model is trained to detect the language of the input and, on
some inputs, to predict quantized timestamps for each token. To control these
predictions, the decoder receives a rather complicated sequence of tokens with
several special ones (apart from the utterance itself; sketched below). The
special tokens give the model a place to predict
- whether somebody is speaking,
- the language,

and instruct it whether it should
- translate,
- predict timestamps or not.
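
Schematically, the decoder's input then looks something like this (placeholder
names, not the model's actual special-token vocabulary):

```
<start-of-transcript> <detected language> <task: transcribe | translate> <timestamps on/off>
    ...predicted text, possibly interleaved with timestamp tokens...
<end-of-transcript>
```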
