Merge pull request #371 from particle1331/dev
Attention and transformers
particle1331 committed Jun 9, 2024
2 parents 37bf98c + 5cd051b commit d0d58e9
Showing 30 changed files with 348,528 additions and 330,243 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -131,5 +131,4 @@ models/
trained_models/
data/
lightning_logs/
docs/nb/mlops/task-queue/distributed-task-queue/
docs/nb/dl/checkpoint.pt
TODO.md
2 changes: 1 addition & 1 deletion Makefile
@@ -1,8 +1,8 @@
# See https://madewithml.com/courses/mlops/makefile/
.PHONY: docs
dev:
tox -e build

.PHONY: docs
docs:
rm -rf docs/_build
tox -e build
2 changes: 1 addition & 1 deletion README.md
@@ -30,7 +30,7 @@ if you find that this is not the case (as I oftentimes do)!
```
git clone git@github.com:particle1331/ok-transformer.git
cd ok-transformer
pip install -r build-requirements.txt
pip install -r requirements-build.txt
make docs
```

4 changes: 2 additions & 2 deletions docs/_config.yml
@@ -35,8 +35,8 @@ sphinx:
html_js_files:
- https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.4/require.min.js
html_theme_options:
pygment_light_style: "tango"
pygment_dark_style: "monokai"
pygments_light_style: "tango"
pygments_dark_style: "monokai"
use_download_button: false
repository_url: https://github.com/particle1331/ok-transformer
use_issues_button: true
9 changes: 7 additions & 2 deletions docs/_toc.yml
@@ -13,9 +13,14 @@ parts:
- file: nb/dl/03-cnn
- file: nb/dl/04-lm
- file: nb/dl/05-training
# - file: nb/dl/06-residuals
- file: nb/dl/07-attention

- caption: Engineering and MLOps
# - caption: NLP / LLMs
# chapters:
# - file: nb/dl/06-gpt
# - file: nb/dl/08-translation

- caption: ML Engineering & MLOps
chapters:
- file: nb/mlops/01-intro
- file: nb/mlops/02-package
Binary file added docs/img/nn/03-VGG_classes.png
Binary file added docs/img/nn/03-lenet-timeline.png
Binary file added docs/img/nn/05-grad-path-distribution.png
Binary file added docs/img/nn/05-residual-unroll.png
1,199 changes: 1,199 additions & 0 deletions docs/img/nn/05-resnet_block.svg
452 changes: 226 additions & 226 deletions docs/img/nn/05-singular-ellipsoid.svg
Binary file added docs/img/nn/07-alibi.png
Binary file added docs/img/nn/07-inference-extrapolation.png
Binary file added docs/img/nn/07-mha.png
Binary file added docs/img/nn/07-preln.png
153 changes: 153 additions & 0 deletions docs/img/nn/07-qkv.svg
316 changes: 316 additions & 0 deletions docs/img/nn/07-transformer.svg
2 changes: 1 addition & 1 deletion docs/nb/dl/00-backprop.ipynb
@@ -31,7 +31,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, we look at **backpropagation** on computational graphs as an algorithm for efficiently computing gradients. Backprop involves local message passing of activations in the forward pass, and gradients in the backward pass. The resulting time complexity is linear in the number of size of the network, i.e. the total number of weights and neurons for neural nets. Neural networks are computational graphs with nodes for differentiable operations. This fact allows scaling training large neural networks. We implement a minimal scalar-valued **autograd engine** and a neural net library on top it to train a small regression model."
"In this notebook, we introduce the **backpropagation algorithm** for efficient gradient computation on computational graphs. Backpropagation involves local message passing of activations in the forward pass, and gradients in the backward pass. The resulting time complexity is linear in the number of size of the network, i.e. the total number of weights and neurons for neural networks. Neural networks are computational graphs with nodes for differentiable operations. This fact allows scaling training large neural networks. We will implement a minimal scalar-valued **autograd engine** and a neural net library on top it to train a small regression model."
]
},
{
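The backprop description in the cell above can be made concrete with a minimal scalar autograd engine of the kind the notebook builds. The sketch below is illustrative only; the class and method names are assumptions, not the notebook's actual code.

```python
import math

class Value:
    """A scalar node in a computational graph, with reverse-mode autodiff."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None      # local gradient message
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(out)/d(self) = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order, then pass gradient messages from output to inputs
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


# usage: gradients of y = tanh(x*w + b) with respect to the leaves
x, w, b = Value(2.0), Value(-3.0), Value(1.0)
y = (x * w + b).tanh()
y.backward()
print(x.grad, w.grad, b.grad)
```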
853 changes: 113 additions & 740 deletions docs/nb/dl/01-intro.ipynb

Large diffs are not rendered by default.

262 changes: 143 additions & 119 deletions docs/nb/dl/03-cnn.ipynb

Large diffs are not rendered by default.

50 changes: 23 additions & 27 deletions docs/nb/dl/04-lm.ipynb
@@ -30,7 +30,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, we introduce a **character-level** language model. This model can be used to generate new names using a Markov process after learning from a dataset of names. Our focus will be on introducing the overall framework of language modeling that includes probabilistic modeling and optimization. This notebook draws on relevant parts of [this lecture series](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)."
"In this notebook, we introduce a **character-level** language model. This model can be used to generate new names using an autoregressive Markov process after learning from a dataset of names. Our focus will be on introducing the overall framework of language modeling that includes probabilistic modeling and optimization. This notebook draws on relevant parts of [this lecture series](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)."
]
},
{
@@ -18869,7 +18869,6 @@
"metadata": {},
"outputs": [],
"source": [
"# Fitting the bigram model\n",
"bigram_model = CountingModel()\n",
"bigram_model.fit(bigram_train)"
]
@@ -18928,6 +18927,13 @@
"evaluate_model(bigram_model, bigram_valid)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Increasing the context size:"
]
},
{
"cell_type": "code",
"execution_count": 19,
@@ -18954,7 +18960,6 @@
}
],
"source": [
"# Fitting the bigram model\n",
"trigram_model = CountingModel()\n",
"trigram_model.fit(trigram_train)\n",
"evaluate_model(trigram_model, trigram_valid)"
@@ -19412,13 +19417,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[001/500] loss=3.8112\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[001/500] loss=3.8112\n",
"[051/500] loss=2.7564\n",
"[101/500] loss=2.6348\n",
"[151/500] loss=2.6056\n",
@@ -19475,13 +19474,7 @@
"p(n|o)=0.1125 nll=2.1847\n",
"p(.|n)=0.1774 nll=1.7295\n",
"p(l|.)=0.0513 nll=2.9692\n",
"p(l|l)=0.1499 nll=1.8978\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"p(l|l)=0.1499 nll=1.8978\n",
"...\n",
"nll = 2.5284 (overall)\n"
]
@@ -21464,25 +21457,27 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Remark.** That one yellow dot is for \"qu\". :)"
"**Remark.** That one yellow dot is for \"qu\". "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(dl/04-lm/character-embeddings)=\n",
"## Character embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we implement a language model that learns \n",
"In this section, we implement a language model that uses learns \n",
"**character embeddings**.\n",
"Instead of learning a large lookup table for each character sequence, we learn an \n",
"embedding vector for each character which are concatenated to represent a character\n",
"sequence. This approach leads to better generalization."
"This approach leads to better generalization due to the added complexity \n",
"of having character encodings as learnable vectors, in contrast to our \n",
"earlier approach of learning a large lookup table for each sequence of \n",
"characters as context."
]
},
{
@@ -21512,7 +21507,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The first component of the network is an embedding matrix, where each row corresponds to an embedding vector. Then, the embedding vectors are concatenated in the correct order and passed to the two-layer MLP. Here the first layer applies tanh nonlinearity, while the second layer simply performs a linear operation to get the logits. "
"The first component of the network is an embedding matrix, where each row corresponds to an embedding vector. Then, the embedding vectors are concatenated in the correct order and passed to the two-layer MLP to get the logits after a dense operation."
]
},
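The embedding matrix feeding a two-layer MLP described above can be sketched as follows; the vocabulary size, block size, and layer widths are illustrative assumptions rather than the notebook's actual values.

```python
import torch
import torch.nn as nn

class CharMLP(nn.Module):
    """Embed each context character, concatenate, then a two-layer MLP (sketch)."""

    def __init__(self, vocab_size=27, block_size=3, emb_dim=10, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # one row per character
        self.fc1 = nn.Linear(block_size * emb_dim, hidden)
        self.fc2 = nn.Linear(hidden, vocab_size)          # logits over next character

    def forward(self, x):                                 # x: (B, block_size) indices
        e = self.emb(x)                                    # (B, block_size, emb_dim)
        e = e.view(e.shape[0], -1)                         # concatenate in order
        h = torch.tanh(self.fc1(e))
        return self.fc2(h)

logits = CharMLP()(torch.randint(0, 27, (32, 3)))          # -> shape (32, 27)
```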
{
@@ -41050,14 +41045,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Appendix: WaveNet"
"(dl/04-lm/temporal-convolutions)=\n",
"## Appendix: Causal convolutions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One problem with our previous network is that sequential information from the inputs are mixed or squashed too fast (i.e. in one layer). We can make this network deeper by adding dense layers, but it still does not solve this problem. In this section, we implement a convolutional neural network architecture similar to {cite}`wavenet`. This allows the character embeddings to be fused slowly."
"One problem with our previous network is that sequential information from the inputs are mixed or squashed too fast (i.e. in one layer). We can make this network deeper by adding dense layers, but it still does not solve this problem. In this section, we implement a convolutional neural network architecture similar to **WaveNet** {cite}`wavenet`. This allows the character embeddings to be fused slowly."
]
},
{
@@ -41069,7 +41065,7 @@
"width: 700px\n",
"name: wavenet\n",
"---\n",
"Tree-like structure formed by a stack of dilated convolutional layers. {cite}`wavenet`\n",
"{cite}`wavenet` Tree-like structure formed by a stack of dilated *causal* convolutional layers. The term causal is used since the network is constrained so that an output node at position `t` can only depend on input nodes at position `t-k:t`.\n",
"```"
]
},
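The tree-like fusion in the figure can be realized with a stack of dilated causal 1-D convolutions. The following is one possible sketch; the channel sizes and number of layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1-D convolution, left-padded so the output at t depends only on inputs <= t."""

    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = (kernel_size - 1) * dilation

    def forward(self, x):                                 # x: (B, C, T)
        x = nn.functional.pad(x, (self.left_pad, 0))      # pad on the left only
        return super().forward(x)

# dilations 1, 2, 4 with kernel size 2 give a receptive field of 8 characters
net = nn.Sequential(
    CausalConv1d(16, 64, dilation=1), nn.Tanh(),
    CausalConv1d(64, 64, dilation=2), nn.Tanh(),
    CausalConv1d(64, 64, dilation=4), nn.Tanh(),
)
x = torch.randn(1, 16, 8)                                 # (batch, emb dim, block size)
print(net(x).shape)                                       # torch.Size([1, 64, 8])
```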
@@ -41210,7 +41206,7 @@
"source": [
"Information from characters can flow better through the network due to its heirarchical nature. \n",
"This allows us to increase embedding size and width, as well as make the network is deeper. \n",
"Note that we need three layers to combine all characters: 2 x 2 x 2 = 8 (block size). This looks like our previous network. Stride happens in the way character blocks are fed to the layers, so convolutions are not explicitly used."
"Note that we need three layers to combine all characters: 2 x 2 x 2 = 8 (block size). This looks like our previous network. Stride happens in the way character blocks are fed to the layers, so convolutions need not be explicitly used."
]
},
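As the note above suggests, the same fusion can be achieved without explicit convolution layers by reshaping consecutive pairs of embeddings before each dense layer. The sketch below assumes pairwise fusion; the layer name and sizes are illustrative.

```python
import torch
import torch.nn as nn

class FlattenConsecutive(nn.Module):
    """Fuse n consecutive time steps: (B, T, C) -> (B, T // n, n * C)."""

    def __init__(self, n=2):
        super().__init__()
        self.n = n

    def forward(self, x):
        B, T, C = x.shape
        return x.view(B, T // self.n, self.n * C)

# three pairwise fusions combine 2 x 2 x 2 = 8 context characters
x = torch.randn(32, 8, 10)                  # (batch, block size, embedding dim)
block = nn.Sequential(FlattenConsecutive(2), nn.Linear(2 * 10, 64), nn.Tanh())
print(block(x).shape)                        # torch.Size([32, 4, 64])
```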
{
