# Distributed training

Training ML models on multiple GPUs/servers is called distributed training. The
vocabulary here is:

- *node* -- a single computing unit (server)
- *world size* -- the total number of processes across all nodes (usually one
  process == one GPU)
- *rank* -- the index of a particular process

So for training on 2 servers, each with 4 GPUs, we have:

- 2 nodes (== 2 servers)
- a world size of 2 * 4 = 8
- ranks go `[0, 1, 2, ..., 6, 7]`
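A minimal sketch of how these values surface in PyTorch's `torch.distributed`
(assuming the processes are launched with `torchrun`, which sets the `RANK`,
`LOCAL_RANK` and `WORLD_SIZE` environment variables):

```python
import os

import torch.distributed as dist

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns;
# init_process_group reads them when using the default env:// init method.
dist.init_process_group(backend="nccl")

rank = dist.get_rank()                      # global process index, here 0..7
world_size = dist.get_world_size()          # total processes across nodes, here 8
local_rank = int(os.environ["LOCAL_RANK"])  # GPU index within this node, here 0..3

print(f"rank {rank}/{world_size}, using GPU {local_rank} on this node")
```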
# HuggingFace `datasets`

HuggingFace's `datasets` library combines several useful features:

- a dataset repository -- a source of data
- dataset transformation tools
- an API that is convenient for ML training

## `IterableDataset`

An `IterableDataset` is efficient for ML training since it is lazy: all
transformations are applied lazily and the data are *streamed* from disk.
Compared to a classic map-style `Dataset`, this means that:

- it is more memory efficient,
- but it cannot support a full shuffle (it implements an approximate shuffle
  using a buffer).
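A minimal sketch of both points (the dataset name `"c4"` is only an
illustrative choice):

```python
from datasets import load_dataset

# streaming=True yields an IterableDataset: nothing is materialized up
# front, examples are streamed from the source on demand.
ds = load_dataset("c4", "en", split="train", streaming=True)

# Approximate shuffle: a buffer of `buffer_size` examples is kept in memory
# and examples are drawn from it at random -- not a full global shuffle.
ds = ds.shuffle(seed=42, buffer_size=10_000)

for example in ds.take(3):
    print(example)
```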
## Sharding

Datasets can be split into multiple *shards* -- pieces of the dataset designed
to be processed on different [*nodes* (GPUs)](./distributed_training.md).
`split_dataset_by_node` retrieves the shards assigned to a given node.
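For example (a sketch reusing the streamed dataset from above and the world
size of 8 from the 2-node setup; in practice the `rank` would come from the
launcher rather than being hard-coded):

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

ds = load_dataset("c4", "en", split="train", streaming=True)

# Keep only the shards assigned to this process: with world_size == 8,
# each rank iterates over a disjoint 1/8th of the data.
ds = split_dataset_by_node(ds, rank=0, world_size=8)
```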
# Torch `DataLoader` in a nutshell

The `DataLoader` class serves data to an ML model and comes with a set of handy
features.

## `pin_memory`

When transferring data to a CUDA device, the data must first be moved to a
"page-locked" area in RAM, since CUDA drivers cannot access the pageable memory
into which RAM is normally divided. From the page-locked area, CUDA copies the
data straight into GPU memory.

*Pinning memory* means we avoid copying the data from a regular memory page to
a page-locked location, by placing the data straight into page-locked memory.
This **speeds up** data transfers between CPU and CUDA.
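A minimal sketch of how this is used (assumes a CUDA device is available; the
dataset is a toy example):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16))

# pin_memory=True makes the DataLoader place each fetched batch into
# page-locked RAM, so the host-to-GPU copy below is faster.
loader = DataLoader(dataset, batch_size=32, pin_memory=True)

for (batch,) in loader:
    # non_blocking=True lets the copy overlap with computation; it only
    # takes effect because the source tensor lives in pinned memory.
    batch = batch.to("cuda", non_blocking=True)
```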
## `num_workers`

If the `DataLoader` is given more than one worker, it spawns/forks new
processes, each with access to the same dataset. It cycles through the
processes, giving each in turn a batch of indices (if `batch_sampler` is given)
or a single index (if `sampler` is given). This means the sampler is global
across all processes.

Afterwards, a global prefetch queue (of length `prefetch_factor * num_workers`)
joins the output data from all worker processes.
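As a sketch (the parameter values are arbitrary, and the toy dataset is only
there to make the snippet self-contained):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16))

# Four worker processes are spawned; indices from the single, global
# sampler are handed to the workers in turn, one batch at a time.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    prefetch_factor=2,  # each worker loads up to 2 batches in advance,
                        # so up to 2 * 4 = 8 batches are prefetched in total
)

for (batch,) in loader:
    pass  # batches arrive already assembled by the workers
```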