fast fourier transform and related

dburian · Oct 5, 2024 · 56ffde9 · 56ffde9
1 parent 24c12ef
commit 56ffde9
Show file tree

Hide file tree

Showing 6 changed files with 305 additions and 0 deletions.
diff --git a/beginners_guite_to_asr.md b/beginners_guite_to_asr.md
@@ -24,6 +24,95 @@ $$
   }
 $$
 
+## Glossary
+
+### Frame
+
+Part of the signal in time in between two timestamps. 
+
+### Window, Hop length
+
+Some functions are **windowed**, meaning they are applied repeatedly for different
+frames. *Windows are part of frames*, where window size doesn't have to be equal,
+but often is to size of frame. The distance between consecutive windows is
+called **hop length**, i.e. the overlap of two consecutive windows.
+
+
+### Frequency, Phase, Amplitude
+
+If we take *pure sound* we can study its 3 basic properties:
+
+$$
+a * \sin(2\pi * (ft - \phi),
+$$
+
+where
+
+- $a$ is **amplitude** -- height of signal's peak (loudness of it)
+- $f$ is **frequency** of the sound -- the number of cycles per second
+- $t$ is time in seconds
+- $\phi \in [0, 1)$ is **phase** of the signal -- horizontal shift of the sine
+  wave
+
+
+### Amplitude envelope
+
+The max of waveform for each frame. Gives an idea of loudness of a signal.
+Sensitive to outliers.
+
+$$
+\text{AE}_t = \max_{i \in t} s(i)
+$$
+
+where
+
+- $t$ - frame index
+- $i$ - timestamp index in a frame
+- $s(i)$ - amplitude at timestamp $i$
+
+### Root-mean-square energy
+
+The root of mean of square amplitude within each frame. Describes mean loudness
+within a frame, with bias towards high amplitude (loudnesses)
+
+$$
+\text{RMS}_t = \sqrt{\sum_{i \in t} s(i)^2}
+$$
+
+where
+
+- $t$ - frame index
+- $i$ - timestamp index in a frame
+- $s(i)$ - amplitude at timestamp $i$
+
+### Zero-crossing rate
+
+Number of times signal crossed `y=0` line (going from positive to negative or
+vice-versa). ZCR can tell us about the sound's:
+
+- percussiveness vs pitch -- both have high ZCR, but percussive sound's ZCR
+  varies more
+- presence/absence of voice -- voiced sounds tend to have higher ZCR
+- pitch -- assuming monotonic pitch, ZCR would correspond to frequency
+
+$$
+\text{ZCR}_t = \frac{1}{2}
+  \sum_{i \in t} |
+    \operatorname{sgn}\left(s(i)\right) -
+    \operatorname{sgn}\left(s(i+1)\right)
+  |
+$$
+
+where
+
+- $t$ - frame index
+- $i$ - timestamp index in a frame
+- $s(i)$ - amplitude at timestamp $i$
+
 ## Common approaches
 
 
+## Good sources of info
+
+- [Audio Signal Processing for ML
+  (YouTube)](https://www.youtube.com/playlist?list=PL-wATfeyAMNqIee7cH3q1bh4QJFAaeNv0)
diff --git a/discrete_fourier_transform.md b/discrete_fourier_transform.md
@@ -0,0 +1,53 @@
+# Discrete Fourier transform
+
+Discrete Fourier transform (DFT) is a function $\mathcal{F}: \mathbb{C}^n
+\rightarrow \mathbb{C}^n$:
+
+$$
+\mathcal{F}(x)_j = \sum_{k = 0}^{n -1} x_k \cdot \omega_n^{jk},
+\quad j \in \{0..n-1\},
+$$
+
+where $\omega_n$ is the [n-th primitive root of 1](./primitive_root_of_one.md).
+
+## What it is
+
+DFT is essentially evaluating polynomial $(x_0, x_1, \ldots, x_{n-1})$ at points
+$(\omega_n^{0}, \omega_n^{1}, \ldots, \omega_n^{n-1})$. The reason why we chose
+$\omega_n^j$, is due to the characteristics of n-th primitive roots of one,
+thanks to which we can evaluate DFT in $O(n\log n)$ time using algorithm called
+[Fast Fourier Transform](./fft.md).
+
+## There is an inverse too
+
+Since DFT is a linear projection:
+$$
+\mathcal{F}(\mathbf{x}) = \mathbf{\Omega} \mathbf{x}, \;\text{s.t.}\;
+\mathbf{\Omega}_jk = \omega^{jk}
+$$
+
+
+There is also an inverse:
+$$
+\mathbf{\Omega^{-1}} = \frac{\overline{\mathbf{\Omega}}}{n}
+$$
+
+
+## Why is it useful
+
+DFT is useful for many things. Here is some short list to give you an idea.
+
+### Multiplying polynomials
+
+For a naive multiplication of two polynomials of size $n$, we would need
+quadratic time. Since polynomial of size $n$ is defined by arbitrary, but different
+$n$ points. We can do the following:
+1. Evaluate each polynomial in $n$ points using DFT -- $O(n\log n)$
+2. Multiply corresponding pair of coefficients together -- $O(n)$
+3. Do inverse of DFT -- $O(n\log n)$
+
+### Spectral analysis of a signal
+
+For a signal (e.g. sound) we can use DFT to find pure frequencies the signal is
+made up of. This is somehow more involved and there is a separate note dedicated
+to [Spectral analysis of sound](./spectral_analysis_of_sound.md).
diff --git a/fft.md b/fft.md
@@ -0,0 +1,129 @@
+# Fast Fourier transform
+
+Fast Fourier transform (FFT) computes [Discrete Fourier
+transform (DFT)](./discrete_fourier_transform.md) in $O(n\log n)$ time.
+
+Naive evaluation would run in $O(n^2)$ time since we need to evaluate polynomial
+$x_0, \ldots, x_{n-1}$ in $n$ points $(P(z_0), \ldots, P(z_{n-1}))$.
+
+## Idea of FFT
+
+The idea of FFT is pretty simple:
+
+- We assume: $n = 2^l$ for some $l$, if $n$ is smaller, we pad it with zeros.
+
+- divide the polynomial $P$ into even and odd powers:
+
+$$
+\begin{align}
+P(z) &=
+\left(
+  x_0 z^0 +
+  x_2 z^2 +
+  x_4 z^4 +
+  \ldots +
+  x_{n - 2} z^{n - 2}
+\right) \\
+&\quad +
+z \left(
+  x_1 z^0 +
+  x_3 z^2 +
+  x_5 z^4 +
+  \ldots +
+  x_{n - 1} z^{n - 2}
+\right) \\
+&= P_{e}(z^2) + z P_o(z^2)
+\end{align}
+$$
+
+- now we have to evaluate two polynomials that are half the original size at $n$
+  points
+
+- Further, we strategically pick the $n$ points, s.t. they are paired:
+
+$$
+z_k = - z_j, \,\text{for some}\, i,j \in {0, \ldots, n-1}
+$$
+
+- now we only have to evaluate two polynomials that are half the original size
+  at half the points since $z_k^2 = z_j^2$:
+
+$$
+\begin{align}
+P(z_k) &= P_e(z_k^2) + z_k P_o(z_k^2) \\
+P(z_j) &= P_e(z_k^2) - z_j P_o(z_k^2) \\
+\end{align}
+$$
+
+- We apply the same idea recursively, which gives us $\log_2 n$ iterations. The
+  number of $z$s to evaluate also decreases by a factor of 2 ($z$s are
+  paired), which means at level $l$ we are going to be computing $\frac{1}{2^l}$
+  $z$s. However the number of branches also grows exponentially which mean
+  at each depth, we are going to be doing $n$ computations. This means $O(n\log
+  n)$ time in total.
+
+The only hiccup is that after the first iteration, the $z$s are all going
+to be positive and we will not be able to find $k, j$ s.t. $z^2_j = -
+z^2_k$. This is at least true for **real numbers**.
+
+For complex numbers there is a convenient option for $z_k$ to be the [n-th
+root of 1](./primitive_root_of_one.md) at the power of $k$. This means that we
+define $z$ as:
+
+$$
+z_k = \omega^k,
+$$
+
+where $\omega$ is n-th primitive root of one. Thanks to the characteristics of
+primitive roots, the following are true:
+
+- $\omega^j \ne \omega^k$ for $0 \le j < k < n$ -- ergo we will get value of $P$
+  at $n$ different points
+- For even $n$: $\omega^{\frac{n}{2} + j} = \omega^\frac{n}{2} \cdot \omega^j =
+  -\omega^j$ -- ergo we can find $j, k$ such that $z_k$ and $z_j$ are paired
+- For even $n$: $\omega^2$ is $\frac{n}{2}$-th primitive root of 1
+  - all powers are unique and less then one $(\omega^2)^0 < (\omega^2)^1 <
+    \ldots < (\omega^2)^{\frac{n}{2} - 1} < (\omega^2)^\frac{n}{2} = 1
+  - meaning we can find pairing on every depth as long as $n = 2^l$
+
+### Summary
+
+The reason why the algorithm is as fast is thanks to the fact that we can
+divide the problem to 2 parts, each of which is four times easier than the
+original solution (half the polynomial size, half of the evaluation points).
+
+## Algorithm
+
+```python
+def fft(x: list[float], n: int, w: complex) -> list[complex]:
+  """Computes fft.
+
+  Parameters
+  ----------
+  x
+    Vector of n numbers.
+  n
+    n = 2**l for some l
+  w
+    Primitive n-th root of 1
+
+  Returns
+  -------
+  Vector of n complex Fourier coefficients.
+  """
+  if n == 1:
+    return x[0]
+
+  evens = fft(x[::2], n/2, w**2)
+  odds = fft(x[1::2], n/2, w**2)
+
+  coefs = [0 for _ in range(n)]
+  w_to_i = 1
+  for i in range(n/2):
+    coefs[i] = evens[i] + w_to_i * odds[i]
+    # w^{n/2 + i} = -w^i
+    coefs[n/2 + i] = evens[i] - w_to_i * odds[i]
+    w_to_i *= w
+
+  return coefs
+```
diff --git a/imgs/5_primitive_root_of_one.png b/imgs/5_primitive_root_of_one.png
diff --git a/primitive_root_of_one.md b/primitive_root_of_one.md
@@ -0,0 +1,25 @@
+# Primitive root of 1
+
+In complex numbers, 1 has many n-th roots, but not all of them are primitive.
+Number $x \in \mathbb{C}$ is primitive n-th root of 1 if $x^n = 1$ and $x^0,
+x^1, \ldots, x^{n-1} \ne 1$.
+
+Each complex number has at least 2 N-th primitive roots: $\omega =
+e^{2{\pi}i/n}$ and $\overline{\omega} = e^{-2{\pi}i/n}$. These come from [Eulers
+formula](./eulers_formula.md). If we imagine complex numbers on 2D plane,
+computing their power is like rotating them counter-clockwise (for $\omega$) or
+clockwise (for $\overline{\omega}$). We get N-th primitive root, by splitting
+the circle in $n$ parts ($2\pi/n$) and taking the first one.
+
+![5-th primitive root of one. From Labyrint algoritmů
+(https://pruvodce.ucw.cz/)](./imgs/5_primitive_root_of_one.png)
+
+## Characteristics
+
+N-th primitive roots have some nice properties:
+
+- $\omega^j \ne \omega^k$ for $0 \le j < k < n$
+- For even $n$: $\omega^\frac{n}{2} = \sqrt{\omega^n} = \sqrt{1} = -1$ (it
+  cannot be 1 since $\frac{n}{2} < n$ and $\omega$ is n-th primitive root of 1)
+- For even $n$: $\omega^{\frac{n}{2} + j} = \omega^\frac{n}{2} \cdot \omega^j =
+  -\omega^j$
diff --git a/spectral_analysis_of_sound.md b/spectral_analysis_of_sound.md
@@ -0,0 +1,9 @@
+# Spectral analysis of sound signal
+
+Spectral analysis of sound signal describes how to split up any sound signal to
+pure frequencies the sound is made up of. This is done using [Discrete Fourier
+transform (DFT)](./discrete_fourier_transform.md).
+
+TODO:
+- why we can do it -- what properties are necessary, some intuition for it
+- how do we do it