add notes about distributed training
dburian committed Jul 5, 2024
1 parent daa1df5 commit caa93b7
Showing 3 changed files with 64 additions and 0 deletions.
14 changes: 14 additions & 0 deletions distributed_training.md
@@ -0,0 +1,14 @@
# Distributed training

Training ML models on multiple GPUs/servers is called distributed training. The
vocabulary here is:

- *node* -- a single computing unit (server)
- *world size* -- the total number of processes (usually one process == one GPU) across all nodes
- *rank* -- the global index of a particular process

So for training on 2 servers, each with 4 GPUs, we have:

- 2 nodes (== 2 servers)
- a world size of 2 * 4 = 8
- ranks `[0, 1, 2, ..., 6, 7]`
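
A minimal sketch of how these terms appear in code, assuming a `torchrun`-style
launcher that sets the `RANK`, `WORLD_SIZE` and `LOCAL_RANK` environment
variables:

```python
import os

import torch
import torch.distributed as dist


def setup_distributed() -> tuple[int, int]:
    # Assumption: launched via torchrun, which sets these environment variables.
    rank = int(os.environ["RANK"])              # global process index, 0..world_size-1
    world_size = int(os.environ["WORLD_SIZE"])  # total processes across all nodes
    local_rank = int(os.environ["LOCAL_RANK"])  # process index within this node

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)           # bind this process to one GPU
    return rank, world_size
```

For the example above, `torchrun --nnodes 2 --nproc_per_node 4 train.py` (with
`train.py` standing in for the training script) would start the 8 processes.
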
24 changes: 24 additions & 0 deletions hf_datasets.md
@@ -0,0 +1,24 @@
# HuggingFace `datasets`

HuggingFace's `datasets` library combines several useful features:

- a dataset repository -- a source of data
- dataset transformation tools
- an API that is useful for ML training


## `IterableDataset`

`IterableDataset` is efficient for ML training since it is lazy: all
transformations are applied lazily and the data are *streamed* from disk.
Compared to the classic map-style `Dataset`, this means that

- it is more memory efficient,
- but it cannot support a full shuffle (it implements an approximate shuffle
  with a buffer; see the sketch below).
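
A small sketch of the streaming workflow; the dataset name is only an
illustrative example:

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: nothing is downloaded up front
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

# approximate shuffle: examples are drawn at random from a fixed-size buffer
stream = stream.shuffle(seed=42, buffer_size=10_000)

for example in stream.take(3):  # take()/skip() are also lazy
    print(example)
```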

## Sharding

Datasets can be split into multiple *shards* -- pieces of the dataset designed
to be processed on different [*nodes* (GPUs)](./distributed_training.md).
`split_dataset_by_node` retrieves the shards assigned to a given node.
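
A sketch of how this combines with the rank and world size from
[distributed training](./distributed_training.md); the rank/world size values
are hard-coded here only for illustration:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# rank/world_size would normally come from the distributed setup
node_dataset = split_dataset_by_node(dataset, rank=0, world_size=8)
```
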
26 changes: 26 additions & 0 deletions torch_dataloaders.md
@@ -0,0 +1,26 @@
# Torch `DataLoader` in a nutshell

The `DataLoader` class serves data to an ML model with a set of handy
features.


## `pin_memory`

When transferring data to a CUDA device, the data must first be moved to a
"page-locked" area in RAM, since the CUDA driver cannot directly access the
pageable memory into which RAM is normally divided. From the page-locked area,
CUDA copies the data straight into GPU memory.

*Pinning memory* means we avoid copying the data from a regular page in memory
to a page-locked location, by writing the data straight into the page-locked
memory location. This **speeds up** the data transfers between CPU and GPU.
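
A minimal sketch of the usual pattern -- pinned batches plus non-blocking
host-to-GPU copies; the synthetic dataset is just for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

for inputs, targets in loader:
    # non_blocking=True lets the copy from pinned memory overlap with compute
    inputs = inputs.to("cuda", non_blocking=True)
    targets = targets.to("cuda", non_blocking=True)
    # ... forward/backward pass ...
```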

## `num_workers`

If `DataLoader` is given at least one worker (`num_workers > 0`), it
spawns/forks new processes, each having access to the same dataset. It cycles
through the processes, giving each in turn a batch of indices (if
`batch_sampler` is given) or a single index (if `sampler` is given). In other
words, the sampler is global across all processes.

Afterwards, a global prefetch queue (of length `prefetch_factor *
num_workers`) joins the output data from all processes.
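
A sketch showing the relevant knobs together; the values are arbitrary
examples:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 32))
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,            # 4 worker processes load batches in parallel
    prefetch_factor=2,        # each worker keeps 2 batches ready -> 8 in flight
    persistent_workers=True,  # keep the workers alive between epochs
    pin_memory=True,
)
```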
