Update/merge multi-gpu docs (#2021)
* merge multi-gpu docs

* extend slurm docs

* update links to elastic

* format docs and type hints in distrib parts

* reference multi-gpu/slurm in trainer args docs

* fix doctest

* typo

* doctest

* Apply suggestions from code review

Co-authored-by: Lucas Vazquez <lucasgouvaz@gmail.com>

* wall time

* Update docs/source/slurm.rst

Co-authored-by: Lucas Vazquez <lucasgouvaz@gmail.com>

* fix title

* update docs for weights summary

* update changelog

Co-authored-by: Lucas Vazquez <lucasgouvaz@gmail.com>
2 people authored and justusschock committed Jun 29, 2020
1 parent 6495f14 commit 93dbca9
Showing 6 changed files with 283 additions and 400 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -34,6 +34,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Re-Enable Logger's `ImportError`s ([#1938](https://github.com/PyTorchLightning/pytorch-lightning/pull/1938))

- Changed the default value of the Trainer argument `weights_summary` from `full` to `top` ([#2029](https://github.com/PyTorchLightning/pytorch-lightning/pull/2029))

### Deprecated

- Deprecated `ModelCheckpoint`'s attributes `best` and `kth_best_model` ([#1799](https://github.com/PyTorchLightning/pytorch-lightning/pull/1799))
3 changes: 3 additions & 0 deletions docs/source/apex.rst
@@ -10,6 +10,8 @@ Lightning offers 16-bit training for CPUs, GPUs and TPUs.
GPU 16-bit
-----------
Lightning uses NVIDIA apex to handle 16-bit precision training.
16-bit precision can cut your memory footprint roughly in half.
On Volta-architecture GPUs it can also give a dramatic training speed-up.
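
For example, once apex is installed (next section), 16-bit training is requested through the Trainer. A minimal sketch — depending on the release, the flag is either `precision=16` or the older `use_amp=True` that appears in the multi-GPU compatibility table further below:

.. code-block:: python

    from pytorch_lightning import Trainer

    # 16-bit on a single GPU (apex must be installed first)
    trainer = Trainer(gpus=1, precision=16)

    # older releases used the use_amp flag instead
    # trainer = Trainer(gpus=1, use_amp=True)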

To use 16-bit precision, do two things:

@@ -18,6 +20,7 @@ To use 16-bit precision, do two things:

Install apex
^^^^^^^^^^^^

.. code-block:: bash

    $ git clone https://github.com/NVIDIA/apex
134 changes: 119 additions & 15 deletions docs/source/multi_gpu.rst
@@ -81,9 +81,9 @@ when needed.

.. note:: For iterable datasets, we don't do this automatically.

Make Model Picklable
^^^^^^^^^^^^^^^^^^^^
It's very likely your code is already `picklable <https://docs.python.org/3/library/pickle.html>`_,
Make model pickleable
^^^^^^^^^^^^^^^^^^^^^
It's very likely your code is already `pickleable <https://docs.python.org/3/library/pickle.html>`_,
so you don't have to do anything to make this change.
However, if you run distributed and see an error like this:

@@ -122,22 +122,102 @@ is usually helpful.
i.e. in the stacktrace example here, there seems to be a lambda function somewhere in the user code
which cannot be pickled.
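
A common culprit is a lambda stored on the model or passed to a DataLoader. Replacing it with a named function, or a `functools.partial` of one, usually resolves the error. A hypothetical sketch (the `transform` name is only for illustration):

.. code-block:: python

    from functools import partial

    # not pickleable: lambdas cannot be serialized by the pickle module
    # transform = lambda x: x * 2

    # pickleable alternatives: a named function, or a partial of one
    def scale(x, factor=2):
        return x * factor

    transform = partial(scale, factor=2)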

GPU device selection
--------------------

You can select the GPU devices using ranges, a list of indices, or a string containing
a comma-separated list of GPU IDs:

.. testsetup::

k = 1

.. testcode::
:skipif: torch.cuda.device_count() < 2

# DEFAULT (int) specifies how many GPUs to use
Trainer(gpus=k)

# Above is equivalent to
Trainer(gpus=list(range(k)))

# Specify which GPUs to use (don't use if running on cluster)
Trainer(gpus=[0, 1])

# can also be a string
Trainer(gpus='0, 1')

# can also be -1 or '-1', this uses all available GPUs
# equivalent to list(range(torch.cuda.device_count()))
Trainer(gpus=-1)

The table below lists examples of possible input formats and how they are interpreted by Lightning.
Note in particular the difference between `gpus=0`, `gpus=[0]` and `gpus="0"`.

+---------------+-----------+---------------------+---------------------------------+
| `gpus` | Type | Parsed | Meaning |
+===============+===========+=====================+=================================+
| None | NoneType | None | CPU |
+---------------+-----------+---------------------+---------------------------------+
| 0 | int | None | CPU |
+---------------+-----------+---------------------+---------------------------------+
| 3 | int | [0, 1, 2] | first 3 GPUs |
+---------------+-----------+---------------------+---------------------------------+
| -1 | int | [0, 1, 2, ...] | all available GPUs |
+---------------+-----------+---------------------+---------------------------------+
| [0] | list | [0] | GPU 0 |
+---------------+-----------+---------------------+---------------------------------+
| [1, 3] | list | [1, 3] | GPUs 1 and 3 |
+---------------+-----------+---------------------+---------------------------------+
| "0" | str | [0] | GPU 0 |
+---------------+-----------+---------------------+---------------------------------+
| "3" | str | [3] | GPU 3 |
+---------------+-----------+---------------------+---------------------------------+
| "1, 3" | str | [1, 3] | GPUs 1 and 3 |
+---------------+-----------+---------------------+---------------------------------+
| "-1" | str | [0, 1, 2, ...] | all available GPUs |
+---------------+-----------+---------------------+---------------------------------+
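
For illustration only, here is a small helper that mimics the parsing rules in the table above. This is not Lightning's actual parser; `available_gpus` simply stands in for the number of visible devices:

.. code-block:: python

    from typing import List, Optional, Union

    def parse_gpus(gpus: Union[None, int, str, List[int]], available_gpus: int) -> Optional[List[int]]:
        # illustrative re-implementation of the table above
        if gpus is None or gpus == 0:
            return None                                    # CPU
        if isinstance(gpus, list):
            return gpus                                    # explicit device indices, e.g. [1, 3]
        if isinstance(gpus, str):
            if gpus.strip() == '-1':
                return list(range(available_gpus))         # all available GPUs
            return [int(idx) for idx in gpus.split(',')]   # "1, 3" -> [1, 3]
        if gpus == -1:
            return list(range(available_gpus))             # all available GPUs
        return list(range(gpus))                           # first n GPUs

    assert parse_gpus('1, 3', available_gpus=4) == [1, 3]
    assert parse_gpus(3, available_gpus=4) == [0, 1, 2]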

CUDA flags
^^^^^^^^^^

CUDA flags make certain GPUs visible to your script.
Lightning sets these for you automatically, so there's NO NEED to do this yourself.

.. testcode::

import os

# Lightning will set these according to what you pass to the Trainer
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

However, when using a cluster, Lightning will NOT set these flags (and you should not either).
SLURM will set these for you.
For more details see the `SLURM cluster guide <slurm.rst>`_.


Distributed modes
-----------------
Lightning allows multiple ways of training:

- Data Parallel (`distributed_backend='dp'`) (multiple-gpus, 1 machine)
- DistributedDataParallel (`distributed_backend='ddp'`) (multiple-gpus across many machines).
- DistributedDataParallel2 (`distributed_backend='ddp2'`) (dp in a machine, ddp across machines).
- DistributedDataParallel 2 (`distributed_backend='ddp2'`) (dp in a machine, ddp across machines).
- Horovod (`distributed_backend='horovod'`) (multi-machine, multi-gpu, configured at runtime)
- TPUs (`tpu_cores=8|x`) (tpu or TPU pod)

.. note:: If you request multiple GPUs without setting a mode, ddp will be automatically used.
.. note::
If you request multiple GPUs or nodes without setting a mode, ddp will be automatically used.

For a deeper understanding of what Lightning is doing, feel free to read this
`guide <https://medium.com/@_willfalcon/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565>`_.
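
As a quick reference, here is a minimal sketch of how each mode is requested; the `num_nodes` argument used for the multi-node examples is an assumption here and applies only when training across machines:

.. code-block:: python

    from pytorch_lightning import Trainer

    # DataParallel: 2 GPUs on one machine
    Trainer(gpus=2, distributed_backend='dp')

    # DistributedDataParallel: 8 GPUs per node across 2 nodes
    Trainer(gpus=8, num_nodes=2, distributed_backend='ddp')

    # DDP2: dp within each node, ddp across nodes
    Trainer(gpus=8, num_nodes=2, distributed_backend='ddp2')

    # Horovod: processes are configured at runtime (e.g. via horovodrun)
    Trainer(distributed_backend='horovod', gpus=1)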

Data Parallel (dp)
^^^^^^^^^^^^^^^^^^
`DataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel>`_ splits a batch across k GPUs. That is, if you have a batch of 32 and use dp with 2 gpus,
each GPU will process 16 samples, after which the root node will aggregate the results.


Data Parallel
^^^^^^^^^^^^^
`DataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel>`_ splits a batch across k GPUs.
That is, if you have a batch of 32 and use dp with 2 gpus, each GPU will process 16 samples,
after which the root node will aggregate the results.
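
As a plain-PyTorch sketch of what happens under the hood (requires at least 2 visible GPUs; the toy model is only for illustration):

.. code-block:: python

    import torch

    model = torch.nn.Linear(10, 2)
    dp_model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()

    x = torch.randn(32, 10).cuda()   # batch of 32
    out = dp_model(x)                # each GPU processes 16 samples; outputs are gathered on GPU 0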

.. warning:: DP use is discouraged by PyTorch and Lightning. Use ddp, which is more stable and at least 3x faster.

@@ -157,7 +237,7 @@ Distributed Data Parallel

3. Each process inits the model.

.. note:: Make sure to set the random seed so that each model inits with the same weights
.. note:: Make sure to set the random seed so that each model initializes with the same weights.

4. Each process performs a full forward and backward pass in parallel.

@@ -176,11 +256,11 @@ Distributed Data Parallel
Distributed Data Parallel 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^
In certain cases, it's advantageous to use all batches on the same machine instead of a subset.
For instance, you might want to compute an NCE loss where it pays to have more negative samples.

In this case, we can use ddp2, which behaves like dp within a machine and like ddp across nodes. DDP2 does the following:

1. Copies a subset of the data to each node.

2. Inits a model on each node.

@@ -297,7 +377,7 @@ In pseudocode, the full sequence is:
# use the full batch for something like softmax
full_out = model.training_step_end(all_results)
to illustrate why this is needed, let's look at dataparallel
To illustrate why this is needed, let's look at DataParallel:

.. testcode::

@@ -332,6 +412,30 @@ Validation and test step also have the same option when using dp
def test_step_end(self, batch_parts_outputs):
...
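
For example, here is a hedged sketch of how a full-batch loss can be computed in `training_step_end` when using dp — the `LitClassifier` model and its `forward` are assumptions, and the exact structure of `batch_parts_outputs` may differ between releases:

.. code-block:: python

    import torch
    import pytorch_lightning as pl

    class LitClassifier(pl.LightningModule):

        def training_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)  # runs on this GPU's slice of the batch
            return {'logits': logits, 'y': y}

        def training_step_end(self, batch_parts_outputs):
            # outputs from all GPUs are available here, so the loss
            # can be computed over the full batch
            logits = batch_parts_outputs['logits']
            y = batch_parts_outputs['y']
            loss = torch.nn.functional.cross_entropy(logits, y)
            return {'loss': loss}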


Distributed and 16-bit precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Due to an issue with apex and DistributedDataParallel (PyTorch and NVIDIA issue), Lightning does
not allow 16-bit and DP training. We tried to get this to work, but it's an issue on their end.

Below are the possible configurations we support.

+-------+---------+----+-----+---------+------------------------------------------------------------+
| 1 GPU | 1+ GPUs | DP | DDP | 16-bit | command |
+=======+=========+====+=====+=========+============================================================+
| Y | | | | | `Trainer(gpus=1)` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| Y | | | | Y | `Trainer(gpus=1, use_amp=True)` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| | Y | Y | | | `Trainer(gpus=k, distributed_backend='dp')` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| | Y | | Y | | `Trainer(gpus=k, distributed_backend='ddp')` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| | Y | | Y | Y | `Trainer(gpus=k, distributed_backend='ddp', use_amp=True)` |
+-------+---------+----+-----+---------+------------------------------------------------------------+


Implement Your Own Distributed (DDP) training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you need your own way to init PyTorch DDP, you can override :meth:`pytorch_lightning.core.LightningModule.`.
@@ -388,7 +492,7 @@ Lightning supports the use of PytorchElastic to enable fault-tolerent and elasti
Trainer(gpus=8, distributed_backend='ddp')
Following the `PytorchElastic Quickstart documentation <https://pytorch.org/elastic/0.2.0/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:
Following the `PytorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:

.. code-block:: bash
@@ -410,5 +514,5 @@ And then launch the elastic job with:
YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)
See the official `PytorchElastic documentation <https://pytorch.org/elastic/0.2.0/index.html>`_ for details
See the official `PytorchElastic documentation <https://pytorch.org/elastic>`_ for details
on installation and more use cases.