add notes about distributed training
dburian committed Jul 5, 2024
1 parent daa1df5 commit caa93b7
Showing 3 changed files with 64 additions and 0 deletions.
14 changes: 14 additions & 0 deletions distributed_training.md
@@ -0,0 +1,14 @@
# Distributed training

Training ML models on multiple GPUs/servers is called distributed training. The
vocabulary here is:

- *node* -- a single computing unit (server)
- *world size* -- the total number of processes (usually one process == one GPU) across all nodes
- *rank* -- the global index of a particular process

So for training on 2 servers, each with 4 GPUs, we have:

- 2 nodes (== 2 servers)
- a world size of 2 * 4 = 8
- ranks `[0, 1, 2, ..., 6, 7]`
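
A minimal sketch of how these terms appear in code, assuming a `torchrun`-style
launcher that sets the `RANK`, `WORLD_SIZE` and `LOCAL_RANK` environment
variables:

```python
import os

import torch
import torch.distributed as dist


def setup_distributed() -> tuple[int, int]:
    # Assumption: launched via torchrun, which sets these environment variables.
    rank = int(os.environ["RANK"])              # global process index, 0..world_size-1
    world_size = int(os.environ["WORLD_SIZE"])  # total processes across all nodes
    local_rank = int(os.environ["LOCAL_RANK"])  # process index within this node

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)           # bind this process to one GPU
    return rank, world_size
```

For the example above, `torchrun --nnodes 2 --nproc_per_node 4 train.py` (with
`train.py` standing in for the training script) would start the 8 processes.
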
24 changes: 24 additions & 0 deletions hf_datasets.md
@@ -0,0 +1,24 @@
# HuggingFace `datasets`

HuggingFace's `datasets` library combines several useful features:

- a dataset repository -- a source of data
- dataset transformation tools
- an API that is useful for ML training


## `IterableDataset`

`IterableDataset` is efficient for ML training since it is lazy: all
transformations are applied lazily and the data are *streamed* from disk.
Compared to the classic map-style `Dataset`, this means that

- it is more memory efficient,
- but it cannot support a full shuffle (it implements an approximate shuffle
  with a buffer; see the sketch below).
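
A small sketch of the streaming workflow; the dataset name is only an
illustrative example:

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: nothing is downloaded up front
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

# approximate shuffle: examples are drawn at random from a fixed-size buffer
stream = stream.shuffle(seed=42, buffer_size=10_000)

for example in stream.take(3):  # take()/skip() are also lazy
    print(example)
```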

## Sharding

Datasets can be split into multiple *shards* -- pieces of the dataset designed
to be processed on different [*nodes* (GPUs)](./distributed_training.md).
`split_dataset_by_node` retrieves the shards assigned to a given node.
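
A sketch of how this combines with the rank and world size from
[distributed training](./distributed_training.md); the rank/world size values
are hard-coded here only for illustration:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# rank/world_size would normally come from the distributed setup
node_dataset = split_dataset_by_node(dataset, rank=0, world_size=8)
```
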
26 changes: 26 additions & 0 deletions torch_dataloaders.md
@@ -0,0 +1,26 @@
# Torch `DataLoader` in a nutshell

The `DataLoader` class serves data to an ML model with a set of handy
features.


## `pin_memory`

When transferring data to a CUDA device, the data must first be moved to a
"page-locked" area in RAM, since the CUDA driver cannot directly access the
pageable memory into which RAM is normally divided. From the page-locked area,
CUDA copies the data straight into GPU memory.

*Pinning memory* means we avoid copying the data from a regular page in memory
to a page-locked location, by writing the data straight into the page-locked
memory location. This **speeds up** the data transfers between CPU and GPU.
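
A minimal sketch of the usual pattern -- pinned batches plus non-blocking
host-to-GPU copies; the synthetic dataset is just for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

for inputs, targets in loader:
    # non_blocking=True lets the copy from pinned memory overlap with compute
    inputs = inputs.to("cuda", non_blocking=True)
    targets = targets.to("cuda", non_blocking=True)
    # ... forward/backward pass ...
```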

## `num_workers`

If `DataLoader` is given at least one worker (`num_workers > 0`), it
spawns/forks new processes, each having access to the same dataset. It cycles
through the processes, giving each in turn a batch of indices (if
`batch_sampler` is given) or a single index (if `sampler` is given). In other
words, the sampler is global across all processes.

Afterwards, a global prefetch queue (of length `prefetch_factor *
num_workers`) joins the output data from all processes.
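
A sketch showing the relevant knobs together; the values are arbitrary
examples:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 32))
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,            # 4 worker processes load batches in parallel
    prefetch_factor=2,        # each worker keeps 2 batches ready -> 8 in flight
    persistent_workers=True,  # keep the workers alive between epochs
    pin_memory=True,
)
```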
