Merge pull request #371 from particle1331/dev
Attention and transformers
particle1331 committed Jun 9, 2024
2 parents 37bf98c + 5cd051b commit d0d58e9
Showing 30 changed files with 348,528 additions and 330,243 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -131,5 +131,4 @@ models/
trained_models/
data/
lightning_logs/
docs/nb/mlops/task-queue/distributed-task-queue/
docs/nb/dl/checkpoint.pt
TODO.md
2 changes: 1 addition & 1 deletion Makefile
@@ -1,8 +1,8 @@
# See https://madewithml.com/courses/mlops/makefile/
.PHONY: docs
dev:
tox -e build

.PHONY: docs
docs:
rm -rf docs/_build
tox -e build
2 changes: 1 addition & 1 deletion README.md
@@ -30,7 +30,7 @@ if you find that this is not the case (as I oftentimes do)!
```
git clone git@github.com:particle1331/ok-transformer.git
cd ok-transformer
pip install -r build-requirements.txt
pip install -r requirements-build.txt
make docs
```

4 changes: 2 additions & 2 deletions docs/_config.yml
@@ -35,8 +35,8 @@ sphinx:
html_js_files:
- https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.4/require.min.js
html_theme_options:
pygment_light_style: "tango"
pygment_dark_style: "monokai"
pygments_light_style: "tango"
pygments_dark_style: "monokai"
use_download_button: false
repository_url: https://github.com/particle1331/ok-transformer
use_issues_button: true
9 changes: 7 additions & 2 deletions docs/_toc.yml
@@ -13,9 +13,14 @@ parts:
- file: nb/dl/03-cnn
- file: nb/dl/04-lm
- file: nb/dl/05-training
# - file: nb/dl/06-residuals
- file: nb/dl/07-attention

- caption: Engineering and MLOps
# - caption: NLP / LLMs
# chapters:
# - file: nb/dl/06-gpt
# - file: nb/dl/08-translation

- caption: ML Engineering & MLOps
chapters:
- file: nb/mlops/01-intro
- file: nb/mlops/02-package
Binary file added docs/img/nn/03-VGG_classes.png
Binary file added docs/img/nn/03-lenet-timeline.png
Binary file added docs/img/nn/05-grad-path-distribution.png
Binary file added docs/img/nn/05-residual-unroll.png
1,199 changes: 1,199 additions & 0 deletions docs/img/nn/05-resnet_block.svg
452 changes: 226 additions & 226 deletions docs/img/nn/05-singular-ellipsoid.svg
Binary file added docs/img/nn/07-alibi.png
Binary file added docs/img/nn/07-inference-extrapolation.png
Binary file added docs/img/nn/07-mha.png
Binary file added docs/img/nn/07-preln.png
153 changes: 153 additions & 0 deletions docs/img/nn/07-qkv.svg
316 changes: 316 additions & 0 deletions docs/img/nn/07-transformer.svg
2 changes: 1 addition & 1 deletion docs/nb/dl/00-backprop.ipynb
@@ -31,7 +31,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, we look at **backpropagation** on computational graphs as an algorithm for efficiently computing gradients. Backprop involves local message passing of activations in the forward pass, and gradients in the backward pass. The resulting time complexity is linear in the number of size of the network, i.e. the total number of weights and neurons for neural nets. Neural networks are computational graphs with nodes for differentiable operations. This fact allows scaling training large neural networks. We implement a minimal scalar-valued **autograd engine** and a neural net library on top it to train a small regression model."
"In this notebook, we introduce the **backpropagation algorithm** for efficient gradient computation on computational graphs. Backpropagation involves local message passing of activations in the forward pass, and gradients in the backward pass. The resulting time complexity is linear in the number of size of the network, i.e. the total number of weights and neurons for neural networks. Neural networks are computational graphs with nodes for differentiable operations. This fact allows scaling training large neural networks. We will implement a minimal scalar-valued **autograd engine** and a neural net library on top it to train a small regression model."
]
},
{
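The backprop description in the cell above can be made concrete with a minimal scalar autograd engine of the kind the notebook builds. The sketch below is illustrative only; the class and method names are assumptions, not the notebook's actual code.

```python
import math

class Value:
    """A scalar node in a computational graph, with reverse-mode autodiff."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None      # local gradient message
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(out)/d(self) = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order, then pass gradient messages from output to inputs
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


# usage: gradients of y = tanh(x*w + b) with respect to the leaves
x, w, b = Value(2.0), Value(-3.0), Value(1.0)
y = (x * w + b).tanh()
y.backward()
print(x.grad, w.grad, b.grad)
```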
853 changes: 113 additions & 740 deletions docs/nb/dl/01-intro.ipynb

Large diffs are not rendered by default.

262 changes: 143 additions & 119 deletions docs/nb/dl/03-cnn.ipynb

Large diffs are not rendered by default.

50 changes: 23 additions & 27 deletions docs/nb/dl/04-lm.ipynb
@@ -30,7 +30,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, we introduce a **character-level** language model. This model can be used to generate new names using a Markov process after learning from a dataset of names. Our focus will be on introducing the overall framework of language modeling that includes probabilistic modeling and optimization. This notebook draws on relevant parts of [this lecture series](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)."
"In this notebook, we introduce a **character-level** language model. This model can be used to generate new names using an autoregressive Markov process after learning from a dataset of names. Our focus will be on introducing the overall framework of language modeling that includes probabilistic modeling and optimization. This notebook draws on relevant parts of [this lecture series](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)."
]
},
{
@@ -18869,7 +18869,6 @@
"metadata": {},
"outputs": [],
"source": [
"# Fitting the bigram model\n",
"bigram_model = CountingModel()\n",
"bigram_model.fit(bigram_train)"
]
@@ -18928,6 +18927,13 @@
"evaluate_model(bigram_model, bigram_valid)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Increasing the context size:"
]
},
{
"cell_type": "code",
"execution_count": 19,
@@ -18954,7 +18960,6 @@
}
],
"source": [
"# Fitting the bigram model\n",
"trigram_model = CountingModel()\n",
"trigram_model.fit(trigram_train)\n",
"evaluate_model(trigram_model, trigram_valid)"
@@ -19412,13 +19417,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[001/500] loss=3.8112\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[001/500] loss=3.8112\n",
"[051/500] loss=2.7564\n",
"[101/500] loss=2.6348\n",
"[151/500] loss=2.6056\n",
@@ -19475,13 +19474,7 @@
"p(n|o)=0.1125 nll=2.1847\n",
"p(.|n)=0.1774 nll=1.7295\n",
"p(l|.)=0.0513 nll=2.9692\n",
"p(l|l)=0.1499 nll=1.8978\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"p(l|l)=0.1499 nll=1.8978\n",
"...\n",
"nll = 2.5284 (overall)\n"
]
@@ -21464,25 +21457,27 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Remark.** That one yellow dot is for \"qu\". :)"
"**Remark.** That one yellow dot is for \"qu\". "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(dl/04-lm/character-embeddings)=\n",
"## Character embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we implement a language model that learns \n",
"In this section, we implement a language model that uses learns \n",
"**character embeddings**.\n",
"Instead of learning a large lookup table for each character sequence, we learn an \n",
"embedding vector for each character which are concatenated to represent a character\n",
"sequence. This approach leads to better generalization."
"This approach leads to better generalization due to the added complexity \n",
"of having character encodings as learnable vectors, in contrast to our \n",
"earlier approach of learning a large lookup table for each sequence of \n",
"characters as context."
]
},
{
@@ -21512,7 +21507,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The first component of the network is an embedding matrix, where each row corresponds to an embedding vector. Then, the embedding vectors are concatenated in the correct order and passed to the two-layer MLP. Here the first layer applies tanh nonlinearity, while the second layer simply performs a linear operation to get the logits. "
"The first component of the network is an embedding matrix, where each row corresponds to an embedding vector. Then, the embedding vectors are concatenated in the correct order and passed to the two-layer MLP to get the logits after a dense operation."
]
},
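The embedding matrix feeding a two-layer MLP described above can be sketched as follows; the vocabulary size, block size, and layer widths are illustrative assumptions rather than the notebook's actual values.

```python
import torch
import torch.nn as nn

class CharMLP(nn.Module):
    """Embed each context character, concatenate, then a two-layer MLP (sketch)."""

    def __init__(self, vocab_size=27, block_size=3, emb_dim=10, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # one row per character
        self.fc1 = nn.Linear(block_size * emb_dim, hidden)
        self.fc2 = nn.Linear(hidden, vocab_size)          # logits over next character

    def forward(self, x):                                 # x: (B, block_size) indices
        e = self.emb(x)                                    # (B, block_size, emb_dim)
        e = e.view(e.shape[0], -1)                         # concatenate in order
        h = torch.tanh(self.fc1(e))
        return self.fc2(h)

logits = CharMLP()(torch.randint(0, 27, (32, 3)))          # -> shape (32, 27)
```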
{
@@ -41050,14 +41045,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Appendix: WaveNet"
"(dl/04-lm/temporal-convolutions)=\n",
"## Appendix: Causal convolutions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One problem with our previous network is that sequential information from the inputs are mixed or squashed too fast (i.e. in one layer). We can make this network deeper by adding dense layers, but it still does not solve this problem. In this section, we implement a convolutional neural network architecture similar to {cite}`wavenet`. This allows the character embeddings to be fused slowly."
"One problem with our previous network is that sequential information from the inputs are mixed or squashed too fast (i.e. in one layer). We can make this network deeper by adding dense layers, but it still does not solve this problem. In this section, we implement a convolutional neural network architecture similar to **WaveNet** {cite}`wavenet`. This allows the character embeddings to be fused slowly."
]
},
{
@@ -41069,7 +41065,7 @@
"width: 700px\n",
"name: wavenet\n",
"---\n",
"Tree-like structure formed by a stack of dilated convolutional layers. {cite}`wavenet`\n",
"{cite}`wavenet` Tree-like structure formed by a stack of dilated *causal* convolutional layers. The term causal is used since the network is constrained so that an output node at position `t` can only depend on input nodes at position `t-k:t`.\n",
"```"
]
},
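The tree-like fusion in the figure can be realized with a stack of dilated causal 1-D convolutions. The following is one possible sketch; the channel sizes and number of layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1-D convolution, left-padded so the output at t depends only on inputs <= t."""

    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = (kernel_size - 1) * dilation

    def forward(self, x):                                 # x: (B, C, T)
        x = nn.functional.pad(x, (self.left_pad, 0))      # pad on the left only
        return super().forward(x)

# dilations 1, 2, 4 with kernel size 2 give a receptive field of 8 characters
net = nn.Sequential(
    CausalConv1d(16, 64, dilation=1), nn.Tanh(),
    CausalConv1d(64, 64, dilation=2), nn.Tanh(),
    CausalConv1d(64, 64, dilation=4), nn.Tanh(),
)
x = torch.randn(1, 16, 8)                                 # (batch, emb dim, block size)
print(net(x).shape)                                       # torch.Size([1, 64, 8])
```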
@@ -41210,7 +41206,7 @@
"source": [
"Information from characters can flow better through the network due to its heirarchical nature. \n",
"This allows us to increase embedding size and width, as well as make the network is deeper. \n",
"Note that we need three layers to combine all characters: 2 x 2 x 2 = 8 (block size). This looks like our previous network. Stride happens in the way character blocks are fed to the layers, so convolutions are not explicitly used."
"Note that we need three layers to combine all characters: 2 x 2 x 2 = 8 (block size). This looks like our previous network. Stride happens in the way character blocks are fed to the layers, so convolutions need not be explicitly used."
]
},
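As the note above suggests, the same fusion can be achieved without explicit convolution layers by reshaping consecutive pairs of embeddings before each dense layer. The sketch below assumes pairwise fusion; the layer name and sizes are illustrative.

```python
import torch
import torch.nn as nn

class FlattenConsecutive(nn.Module):
    """Fuse n consecutive time steps: (B, T, C) -> (B, T // n, n * C)."""

    def __init__(self, n=2):
        super().__init__()
        self.n = n

    def forward(self, x):
        B, T, C = x.shape
        return x.view(B, T // self.n, self.n * C)

# three pairwise fusions combine 2 x 2 x 2 = 8 context characters
x = torch.randn(32, 8, 10)                  # (batch, block size, embedding dim)
block = nn.Sequential(FlattenConsecutive(2), nn.Linear(2 * 10, 64), nn.Tanh())
print(block(x).shape)                        # torch.Size([32, 4, 64])
```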
{
