# Beginner's guide to ASR

I'm writing this as a beginner in speech ML (TTS, ASR/STT); this is what has helped me get started.

## Common libraries

- loading files: `librosa`
- processing input data: `spark`

## Common values

- sampling rate: 16 kHz

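For example, a minimal sketch of loading a file with `librosa` resampled to the common 16 kHz rate (the path is a placeholder):

```python
import librosa

# Load an audio file, resampling to the common 16 kHz rate.
# "audio.wav" is a placeholder path.
waveform, sampling_rate = librosa.load("audio.wav", sr=16_000)
print(waveform.shape, sampling_rate)  # (num_samples,), 16000
```
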
## Common metrics

- Word Error Rate (WER) -- word edit distance over the number of true words:

$$
\operatorname{WER}(y_\text{pred}, y_\text{true}) = \frac{
\operatorname{ed}_w(y_\text{pred}, y_\text{true})
}{
|y_\text{true}|
}
$$

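As a minimal sketch of the formula (names are mine; in practice a library such as `jiwer` provides a tested implementation):

```python
def wer(pred_words: list[str], true_words: list[str]) -> float:
    """Word Error Rate: word-level edit distance over the number of true words."""
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(true_words) + 1) for _ in range(len(pred_words) + 1)]
    for i in range(len(pred_words) + 1):
        d[i][0] = i
    for j in range(len(true_words) + 1):
        d[0][j] = j
    for i in range(1, len(pred_words) + 1):
        for j in range(1, len(true_words) + 1):
            sub = d[i - 1][j - 1] + (pred_words[i - 1] != true_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(true_words)

print(wer("the cat sat".split(), "the cat sat down".split()))  # 1/4 = 0.25
```
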
## Common approaches

# Connectionist Temporal Classification (CTC)

CTC is an objective function for the classification of token sequences, introduced by [Graves et al. (2006)](https://www.cs.toronto.edu/~graves/icml_2006.pdf). It addresses the problem of classifying tokens in variable-length sequences.

## The goal

Imagine a sequence of $N$ tokens in which you have to find $M$ labels (as in speech recognition), where $M \leq N$. You don't know where in those $N$ tokens the $M$ labels appear, so you cannot compute the loss with a traditional per-token negative log-likelihood.

CTC loss approaches the problem in the following fashion:

- it designs a decoding algorithm to go from $N$ tokens to $M$ labels
- it shows how to compute the probability of a sequence of $M$ labels, so that cross-entropy can be used as a loss

## Decoding

Each of the $N$ tokens is classified into one of the alphabet's labels plus an extra blank label $-$. The decoding is then as follows:

- immediate repetitions of the same label count as one
- blanks are tossed

So for $aa-a-a-bb--$ we decode $aaab$. We call the model's predicted labelling an *extended labelling*, since it extends the true labelling by adding blanks $-$.

However, it's not straightforward to obtain the model's most probable labelling, since there are exponentially many (in $N$) labellings the model can predict, each of them represented by exponentially many extended labellings. So we can use either

- a crude approximation -- sometimes works and is fast
- heuristics -- work better but are slower

### Crude approximation -- best path decoding

For each token we take the most probable label. Note that this can lead to selecting a different sequence than the most probable one.

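A minimal NumPy sketch of best path decoding, assuming per-token label probabilities with the blank at index 0 (my convention, the paper doesn't fix one):

```python
import numpy as np

def best_path_decode(probs: np.ndarray, blank: int = 0) -> list[int]:
    """Greedy CTC decoding: per-token argmax, collapse repeats, toss blanks.

    probs: (N, K) per-token label probabilities, blank at index 0.
    """
    best = probs.argmax(axis=1)  # most probable label for each token
    decoded = []
    prev = None
    for label in best:
        # Keep a label only if it's not a blank and not an immediate repetition.
        if label != blank and label != prev:
            decoded.append(int(label))
        prev = label
    return decoded
```
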
### Heuristics -- beam search

As we decode from the beginning, we keep the $k$ most probable labellings (not extended). At the end of the $N$ tokens, we select the most probable sequence out of the $k$ we remember.

## Probability of a given sequence

To use cross-entropy we would like to compute $p(l|x)$, the likelihood of the sequence of $M$ labels $l$ given the sequence of $N$ tokens $x$. As discussed above, there is an exponential number of extended labellings $l'$ that correspond to $l$. However, we can use dynamic programming to compute the probability in $O(MN)$ time.

We imagine an idealised extended labelling $l'$ for $l$ that has blanks at the beginning, at the end and between every two characters. Let us then define $\alpha_t(s)$ as the probability that the first $t$ predicted tokens cover the first $s$ tokens of $l'$. We set:

$$
\begin{aligned}
\alpha_1(0) &= y^1_{-} \\
\alpha_1(1) &= y^1_{l'_1}
\end{aligned}
$$

where $y_k^t$ is the probability of predicting label $k$ at token $t$. We iterate over $t$. It's more helpful to imagine it as a grid (rows are $s$, columns are $t$) as in the following image (taken from the original paper):

![Dynamic programming grid to compute the probability of "CAT" in T tokens.](./imgs/ctc_dynamic_programming.png)

The step is:

$$
\alpha_t(s) =
\begin{cases}
\left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\right) y_-^t & \text{if}\ l'_s = - \\
\left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\right) y_{l'_s}^t & \text{if}\ l'_s \ne - \land l'_{s-2} = l'_s \\
\left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\right) y_{l'_s}^t & \text{if}\ l'_s \ne - \land l'_{s-2} \ne l'_s
\end{cases}
$$

The above computations correspond to:

1. if $l'_s$ is a blank, it can be a new one ($\alpha_{t-1}(s-1)$) or a repetition of the previous one, in which case the prediction at $t-1$ already had to cover the whole $l'_{1:s}$ ($\alpha_{t-1}(s)$)
2. if $l'_s$ is a non-blank label and it's the same as the previous non-blank label, our options are the same as above: $l'_{s-1}$ is necessarily a blank, and it must be kept so that $l'_s$ and $l'_{s-2}$ are not decoded as one
3. if $l'_s$ is a non-blank label and it's different from the previous non-blank label, we can additionally skip the blank ($\alpha_{t-1}(s-2)$).

## Implementation

TODO

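Until then, a minimal NumPy sketch of the $\alpha$ recursion above (kept in probability space for clarity; a real implementation works in log-space for numerical stability, and all names are mine):

```python
import numpy as np

def ctc_forward(probs: np.ndarray, labels: list[int], blank: int = 0) -> float:
    """Computes p(l | x) via the alpha recursion in O(M * N).

    probs: (N, K) per-token label probabilities; labels: the M true labels.
    """
    N = probs.shape[0]
    # Extended labelling l': blanks at the beginning, end and between labels.
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)  # S = 2M + 1

    alpha = np.zeros((N, S))
    alpha[0, 0] = probs[0, ext[0]]  # start with the leading blank
    alpha[0, 1] = probs[0, ext[1]]  # or directly with the first label
    for t in range(1, N):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # The skip is only allowed between two distinct non-blank labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1, -1] + alpha[-1, -2]
```
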
# HuBERT

Hidden-unit BERT (HuBERT) is a pre-trained speech recognition model introduced by [Hsu et al. (2021)](https://arxiv.org/pdf/2106.07447). HuBERT closely resembles [wav2vec](./wav2vec.md) in that it pre-trains using self-supervised learning by classifying quantized input features.

## Architecture

HuBERT is a model composed of a CNN feature encoder, a Transformer that contextualizes the features, and a quantization model. The authors use k-means clustering that assigns each span of speech a cluster. To make the cluster labels more robust, they use an ensemble of $k$ clusterings with different parameters for each span. The loss takes a multi-label form:

$$
L_m(f; X, \{Z^{(k)}\}_k, M) =
\sum_{t \in M} \sum_k \log p_f^{(k)}\left(z_t^{(k)} \mid \tilde{X}, t\right)
$$

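A sketch of this masked loss as a minimization objective (so negated relative to the sum of log-probabilities above); shapes and names are my assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def masked_multi_cluster_loss(logits_per_clustering, targets_per_clustering, mask):
    """Sum of cross-entropies over masked timesteps and over the k clusterings.

    logits_per_clustering: list of (T, C_k) tensors, one per clustering
    targets_per_clustering: list of (T,) tensors with cluster ids z_t^{(k)}
    mask: (T,) bool tensor marking the masked timesteps M
    """
    loss = torch.zeros(())
    for logits, z in zip(logits_per_clustering, targets_per_clustering):
        # cross_entropy = -log p(z_t | X~, t), summed over masked steps only
        loss = loss + F.cross_entropy(logits[mask], z[mask], reduction="sum")
    return loss
```
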
In later stages of training the clusters are derived from the trained representations of the Transformer rather than from the raw input. Using clustering metrics, the authors show that clustering based on Transformer features yields better clusters.

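A hypothetical sketch of the pseudo-labelling step with scikit-learn's k-means (per the paper, the first iteration clusters MFCC features and later iterations cluster Transformer features; the shapes here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Per-frame features to be clustered: MFCCs in the first iteration,
# Transformer representations in later ones. Shapes are illustrative.
frame_features = np.random.randn(1000, 39)

kmeans = KMeans(n_clusters=100).fit(frame_features)
z = kmeans.predict(frame_features)  # one cluster id (pseudo-label) per frame
```
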
## Interesting comparisons to wav2vec

The authors point out that they quantize the input directly rather than quantizing the features extracted by the CNN. They argue that since the CNN encoding is lossy, quantizing it preserves less information about the input and therefore yields worse targets.

## Ablation studies

The authors experiment with a convex combination of losses on masked and unmasked input spans. They show that when training only on masked inputs, the model is more resilient to bad clustering. With good clustering, however, including unmasked spans is favorable.

TODO:
- *freeze step* parameter during fine-tuning (as in wav2vec)
- What exactly is MFCC?
- What are dev-other sets? Is there some other dev set? dev-clean, dev-other, test-clean, test-other

# Transformer

The Transformer is a revolutionary architecture for sequence-to-sequence (or just sequence) tasks, introduced by [Vaswani et al. (2017)](https://arxiv.org/pdf/1706.03762).

## Architecture

The model is composed of two sub-models: an encoder and a decoder. The encoder receives the input sequence and makes the processed information available to the decoder. The decoder predicts the target sequence using past true target tokens (teacher forcing) and the input information provided by the encoder.

**Encoder** digests the input sequence via positional and semantic embedding matrices. For text this means that each sub-word gets a positional and a semantic embedding. Though both embeddings can be trained, in the original paper only the semantic embedding matrix was trained, while the positional embeddings were computed. The input then goes through several identical blocks. Each block consists of [multi-head self-attention](./transformer_self_attention.md) and feed-forward layers.

![Transformer architecture.](./imgs/transformer_architecture.png)

**Decoder** is similar to the encoder, yet not entirely the same. The decoder too ingests input through embedded tokens, which also pass through several identical layers. The layers are similar to the encoder's layers:

1. self-attention
2. encoder-decoder attention
3. feed-forward layers

### Attentions

There is a whole note about [self-attention](./transformer_self_attention.md), so I only mention the specifics of its use here. There are 3 usages of the attention mechanism:

1. self-attention in the encoder
2. self-attention in the decoder
3. encoder-decoder attention

All types follow the basic computation but differ in some details. 1. is vanilla self-attention as described in the above-mentioned note. 2. is the same, except the attention above the diagonal is masked out to prevent the decoder from accessing future tokens; the diagonal itself is not masked since the decoder's input is shifted by one (see the sketch below). 3. is the same as 1., except the keys and values are supplied by the processed sequence *at the end of the encoder*. This allows the decoder to ask for information from the input to generate the output.

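A small PyTorch sketch of the causal mask in 2. (illustrative only):

```python
import torch

T = 5
scores = torch.randn(T, T)  # raw attention scores, rows = query positions

# Mask the upper diagonal so position i attends only to positions <= i;
# the diagonal stays unmasked because the decoder input is shifted by one.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = scores.softmax(dim=-1)  # each row sums to 1, no future positions
```
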
### Feed-forward layers

The feed-forward sub-block consists of two feed-forward layers with a ReLU between them, applied to each token separately. The first layer scales the input dimension 4x, while the other scales it back down to the original size.

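As a sketch (using the original paper's $d_\text{model} = 512$):

```python
import torch.nn as nn

d_model = 512  # hidden size used in the original paper

# Position-wise feed-forward block: expand 4x, ReLU, project back.
# It is applied to every token independently.
feed_forward = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.ReLU(),
    nn.Linear(4 * d_model, d_model),
)
```
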
### Residual connections

In the original article there were two residual connections:

1. from before the self-attention to after it
2. from before the feed-forward layers to after them

Right after summing with the output of the skipped layer(s), there were [layer normalizations](./layer_normalization.md).

However, it was later discovered (as with ResNet in 2016) that putting the layer normalization inside the computation branch, before any other computation, is beneficial: the residual connection then contains only sums and therefore doesn't disturb the gradient. The first usage of these **pre-activation residual** blocks in Transformers appears (AFAIK) in the [Sparse Transformers paper from 2019](./sparse_transformer.md).

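A sketch of such a pre-activation (pre-norm) residual block (names are mine):

```python
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Computes x + fn(norm(x)): the skip path is a pure sum,
    so it doesn't disturb the gradient."""

    def __init__(self, dim: int, fn: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn  # e.g. self-attention or the feed-forward block

    def forward(self, x):
        return x + self.fn(self.norm(x))
```
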
# Wav2vec

Wav2vec is a pre-trained speech recognition model introduced by [Baevski et al. (2020)](https://arxiv.org/pdf/2006.11477).

Wav2vec is composed of several parts. A CNN acts as a feature encoder and encodes the audio signal into fixed-size representations, which are fed to a Transformer that contextualizes them. Similarly to MLM, some of the representations are masked out, leaving the Transformer to predict their *quantized forms*.

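The prediction is trained with a contrastive objective: the Transformer output at a masked position must identify the true quantized representation among distractors. A sketch for a single masked timestep (shapes and names are mine, not the paper's code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, negatives, temperature=0.1):
    """Contrastive objective at one masked timestep.

    context:   (D,)   transformer output c_t at the masked position
    quantized: (D,)   true quantized target q_t
    negatives: (K, D) distractor quantized representations
    """
    candidates = torch.cat([quantized.unsqueeze(0), negatives], dim=0)  # (K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(0), candidates, dim=-1)
    logits = sims / temperature
    # The true quantized target sits at index 0 of the candidates.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```
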
The model is then fine-tuned on a labelled dataset using the [CTC loss](./ctc.md). The authors showed that thanks to the pre-training, even with a small amount of supervised data the model is able to surpass the competition.

TODO:
- quantization
- CNN structure
- Gumbel Softmax
- Using CNN instead of absolute embeddings for the transformer
- Contrastive loss
- diversity loss for the masked prediction
- usage of SpecAugment during fine-tuning

# Whisper

Whisper is a pre-trained speech-to-text model introduced by [Radford et al. (2022)](https://arxiv.org/pdf/2212.04356). Compared to other STT models like [HuBERT](./hubert.md) or [wav2vec](./wav2vec.md), Whisper aims to be ready out of the box for zero-shot transcription.

Instead of training just an encoder, the authors chose to train a speech transcriber end-to-end. This means the model cannot be used for other speech-encoding tasks. The authors argue that pre-trained encoders always lack equally well-trained decoders, and that fine-tuning decoders only on dedicated datasets decreases the system's robustness.

## Architecture

Whisper is a [Transformer encoder-decoder](./transformer.md) model that digests audio via a fairly small stem of two convolution layers. The model is trained to do multiple tasks:

1. multilingual transcription (focusing on English, though)
2. any-language-to-English translation
3. speech detection (is somebody speaking or not?)

Additionally, the model is trained to detect the language of the input, and on some inputs even to predict quantized timestamps for each token. To control these predictions the decoder receives a rather complicated sequence of tokens with several special ones (apart from the utterance). The special tokens give the model a place to predict

- whether somebody is speaking,
- the language,

and instruct it whether it should

- translate,
- predict timestamps or not.

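For zero-shot use, the released `whisper` package wraps all of this (a minimal sketch; the audio path is a placeholder):

```python
import whisper  # the openai-whisper package

model = whisper.load_model("base")

# Language detection, task control and timestamps are all handled internally
# through the special decoder tokens described above.
result = model.transcribe("audio.wav")
print(result["text"])
```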