Skip to content

Commit

Permalink
fast fourier transform and related
Browse files Browse the repository at this point in the history
  • Loading branch information
dburian committed Oct 5, 2024
1 parent 24c12ef commit 56ffde9
Show file tree
Hide file tree
Showing 6 changed files with 305 additions and 0 deletions.
89 changes: 89 additions & 0 deletions beginners_guite_to_asr.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,95 @@ $$
}
$$

## Glossary

### Frame

Part of the signal in time in between two timestamps.

### Window, Hop length

Some functions are **windowed**, meaning they are applied repeatedly for different
frames. *Windows are part of frames*, where window size doesn't have to be equal,
but often is to size of frame. The distance between consecutive windows is
called **hop length**, i.e. the overlap of two consecutive windows.


### Frequency, Phase, Amplitude

If we take *pure sound* we can study its 3 basic properties:

$$
a * \sin(2\pi * (ft - \phi),
$$

where

- $a$ is **amplitude** -- height of signal's peak (loudness of it)
- $f$ is **frequency** of the sound -- the number of cycles per second
- $t$ is time in seconds
- $\phi \in [0, 1)$ is **phase** of the signal -- horizontal shift of the sine
wave


### Amplitude envelope

The max of waveform for each frame. Gives an idea of loudness of a signal.
Sensitive to outliers.

$$
\text{AE}_t = \max_{i \in t} s(i)
$$

where

- $t$ - frame index
- $i$ - timestamp index in a frame
- $s(i)$ - amplitude at timestamp $i$

### Root-mean-square energy

The root of mean of square amplitude within each frame. Describes mean loudness
within a frame, with bias towards high amplitude (loudnesses)

$$
\text{RMS}_t = \sqrt{\sum_{i \in t} s(i)^2}
$$

where

- $t$ - frame index
- $i$ - timestamp index in a frame
- $s(i)$ - amplitude at timestamp $i$

### Zero-crossing rate

Number of times signal crossed `y=0` line (going from positive to negative or
vice-versa). ZCR can tell us about the sound's:

- percussiveness vs pitch -- both have high ZCR, but percussive sound's ZCR
varies more
- presence/absence of voice -- voiced sounds tend to have higher ZCR
- pitch -- assuming monotonic pitch, ZCR would correspond to frequency

$$
\text{ZCR}_t = \frac{1}{2}
\sum_{i \in t} |
\operatorname{sgn}\left(s(i)\right) -
\operatorname{sgn}\left(s(i+1)\right)
|
$$

where

- $t$ - frame index
- $i$ - timestamp index in a frame
- $s(i)$ - amplitude at timestamp $i$

## Common approaches


## Good sources of info

- [Audio Signal Processing for ML
(YouTube)](https://www.youtube.com/playlist?list=PL-wATfeyAMNqIee7cH3q1bh4QJFAaeNv0)
53 changes: 53 additions & 0 deletions discrete_fourier_transform.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
# Discrete Fourier transform

Discrete Fourier transform (DFT) is a function $\mathcal{F}: \mathbb{C}^n
\rightarrow \mathbb{C}^n$:

$$
\mathcal{F}(x)_j = \sum_{k = 0}^{n -1} x_k \cdot \omega_n^{jk},
\quad j \in \{0..n-1\},
$$

where $\omega_n$ is the [n-th primitive root of 1](./primitive_root_of_one.md).

## What it is

DFT is essentially evaluating polynomial $(x_0, x_1, \ldots, x_{n-1})$ at points
$(\omega_n^{0}, \omega_n^{1}, \ldots, \omega_n^{n-1})$. The reason why we chose
$\omega_n^j$, is due to the characteristics of n-th primitive roots of one,
thanks to which we can evaluate DFT in $O(n\log n)$ time using algorithm called
[Fast Fourier Transform](./fft.md).

## There is an inverse too

Since DFT is a linear projection:
$$
\mathcal{F}(\mathbf{x}) = \mathbf{\Omega} \mathbf{x}, \;\text{s.t.}\;
\mathbf{\Omega}_jk = \omega^{jk}
$$


There is also an inverse:
$$
\mathbf{\Omega^{-1}} = \frac{\overline{\mathbf{\Omega}}}{n}
$$


## Why is it useful

DFT is useful for many things. Here is some short list to give you an idea.

### Multiplying polynomials

For a naive multiplication of two polynomials of size $n$, we would need
quadratic time. Since polynomial of size $n$ is defined by arbitrary, but different
$n$ points. We can do the following:
1. Evaluate each polynomial in $n$ points using DFT -- $O(n\log n)$
2. Multiply corresponding pair of coefficients together -- $O(n)$
3. Do inverse of DFT -- $O(n\log n)$

### Spectral analysis of a signal

For a signal (e.g. sound) we can use DFT to find pure frequencies the signal is
made up of. This is somehow more involved and there is a separate note dedicated
to [Spectral analysis of sound](./spectral_analysis_of_sound.md).
129 changes: 129 additions & 0 deletions fft.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
# Fast Fourier transform

Fast Fourier transform (FFT) computes [Discrete Fourier
transform (DFT)](./discrete_fourier_transform.md) in $O(n\log n)$ time.

Naive evaluation would run in $O(n^2)$ time since we need to evaluate polynomial
$x_0, \ldots, x_{n-1}$ in $n$ points $(P(z_0), \ldots, P(z_{n-1}))$.

## Idea of FFT

The idea of FFT is pretty simple:

- We assume: $n = 2^l$ for some $l$, if $n$ is smaller, we pad it with zeros.

- divide the polynomial $P$ into even and odd powers:

$$
\begin{align}
P(z) &=
\left(
x_0 z^0 +
x_2 z^2 +
x_4 z^4 +
\ldots +
x_{n - 2} z^{n - 2}
\right) \\
&\quad +
z \left(
x_1 z^0 +
x_3 z^2 +
x_5 z^4 +
\ldots +
x_{n - 1} z^{n - 2}
\right) \\
&= P_{e}(z^2) + z P_o(z^2)
\end{align}
$$

- now we have to evaluate two polynomials that are half the original size at $n$
points

- Further, we strategically pick the $n$ points, s.t. they are paired:

$$
z_k = - z_j, \,\text{for some}\, i,j \in {0, \ldots, n-1}
$$

- now we only have to evaluate two polynomials that are half the original size
at half the points since $z_k^2 = z_j^2$:

$$
\begin{align}
P(z_k) &= P_e(z_k^2) + z_k P_o(z_k^2) \\
P(z_j) &= P_e(z_k^2) - z_j P_o(z_k^2) \\
\end{align}
$$

- We apply the same idea recursively, which gives us $\log_2 n$ iterations. The
number of $z$s to evaluate also decreases by a factor of 2 ($z$s are
paired), which means at level $l$ we are going to be computing $\frac{1}{2^l}$
$z$s. However the number of branches also grows exponentially which mean
at each depth, we are going to be doing $n$ computations. This means $O(n\log
n)$ time in total.

The only hiccup is that after the first iteration, the $z$s are all going
to be positive and we will not be able to find $k, j$ s.t. $z^2_j = -
z^2_k$. This is at least true for **real numbers**.

For complex numbers there is a convenient option for $z_k$ to be the [n-th
root of 1](./primitive_root_of_one.md) at the power of $k$. This means that we
define $z$ as:

$$
z_k = \omega^k,
$$

where $\omega$ is n-th primitive root of one. Thanks to the characteristics of
primitive roots, the following are true:

- $\omega^j \ne \omega^k$ for $0 \le j < k < n$ -- ergo we will get value of $P$
at $n$ different points
- For even $n$: $\omega^{\frac{n}{2} + j} = \omega^\frac{n}{2} \cdot \omega^j =
-\omega^j$ -- ergo we can find $j, k$ such that $z_k$ and $z_j$ are paired
- For even $n$: $\omega^2$ is $\frac{n}{2}$-th primitive root of 1
- all powers are unique and less then one $(\omega^2)^0 < (\omega^2)^1 <
\ldots < (\omega^2)^{\frac{n}{2} - 1} < (\omega^2)^\frac{n}{2} = 1
- meaning we can find pairing on every depth as long as $n = 2^l$

### Summary

The reason why the algorithm is as fast is thanks to the fact that we can
divide the problem to 2 parts, each of which is four times easier than the
original solution (half the polynomial size, half of the evaluation points).

## Algorithm

```python
def fft(x: list[float], n: int, w: complex) -> list[complex]:
"""Computes fft.
Parameters
----------
x
Vector of n numbers.
n
n = 2**l for some l
w
Primitive n-th root of 1
Returns
-------
Vector of n complex Fourier coefficients.
"""
if n == 1:
return x[0]

evens = fft(x[::2], n/2, w**2)
odds = fft(x[1::2], n/2, w**2)

coefs = [0 for _ in range(n)]
w_to_i = 1
for i in range(n/2):
coefs[i] = evens[i] + w_to_i * odds[i]
# w^{n/2 + i} = -w^i
coefs[n/2 + i] = evens[i] - w_to_i * odds[i]
w_to_i *= w

return coefs
```
Binary file added imgs/5_primitive_root_of_one.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
25 changes: 25 additions & 0 deletions primitive_root_of_one.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Primitive root of 1

In complex numbers, 1 has many n-th roots, but not all of them are primitive.
Number $x \in \mathbb{C}$ is primitive n-th root of 1 if $x^n = 1$ and $x^0,
x^1, \ldots, x^{n-1} \ne 1$.

Each complex number has at least 2 N-th primitive roots: $\omega =
e^{2{\pi}i/n}$ and $\overline{\omega} = e^{-2{\pi}i/n}$. These come from [Eulers
formula](./eulers_formula.md). If we imagine complex numbers on 2D plane,
computing their power is like rotating them counter-clockwise (for $\omega$) or
clockwise (for $\overline{\omega}$). We get N-th primitive root, by splitting
the circle in $n$ parts ($2\pi/n$) and taking the first one.

![5-th primitive root of one. From Labyrint algoritmů
(https://pruvodce.ucw.cz/)](./imgs/5_primitive_root_of_one.png)

## Characteristics

N-th primitive roots have some nice properties:

- $\omega^j \ne \omega^k$ for $0 \le j < k < n$
- For even $n$: $\omega^\frac{n}{2} = \sqrt{\omega^n} = \sqrt{1} = -1$ (it
cannot be 1 since $\frac{n}{2} < n$ and $\omega$ is n-th primitive root of 1)
- For even $n$: $\omega^{\frac{n}{2} + j} = \omega^\frac{n}{2} \cdot \omega^j =
-\omega^j$
9 changes: 9 additions & 0 deletions spectral_analysis_of_sound.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Spectral analysis of sound signal

Spectral analysis of sound signal describes how to split up any sound signal to
pure frequencies the sound is made up of. This is done using [Discrete Fourier
transform (DFT)](./discrete_fourier_transform.md).

TODO:
- why we can do it -- what properties are necessary, some intuition for it
- how do we do it

0 comments on commit 56ffde9

Please sign in to comment.