
Update/merge multi-gpu docs #2021

Merged (14 commits) on Jun 2, 2020
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -32,6 +32,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Re-Enable Logger's `ImportError`s ([#1938](https://github.com/PyTorchLightning/pytorch-lightning/pull/1938))

- Changed the default value of the Trainer argument `weights_summary` from `full` to `top` ([#2029](https://github.com/PyTorchLightning/pytorch-lightning/pull/2029))

### Deprecated

- Deprecated `ModelCheckpoint`'s attributes `best` and `kth_best_model` ([#1799](https://github.com/PyTorchLightning/pytorch-lightning/pull/1799))
3 changes: 3 additions & 0 deletions docs/source/apex.rst
@@ -10,6 +10,8 @@ Lightning offers 16-bit training for CPUs, GPUs and TPUs.
GPU 16-bit
-----------
Lightning uses NVIDIA apex to handle 16-bit precision training.
16-bit precision can cut your memory footprint in half.
If you are using Volta-architecture GPUs, it can also give a dramatic training speed-up.

To use 16-bit precision, do two things:

@@ -18,6 +20,7 @@ To use 16-bit precision, do two things:

Install apex
^^^^^^^^^^^^

.. code-block:: bash

$ git clone https://github.com/NVIDIA/apex
134 changes: 119 additions & 15 deletions docs/source/multi_gpu.rst
@@ -81,9 +81,9 @@ when needed.

.. note:: For iterable datasets, we don't do this automatically.

Make Model Picklable
^^^^^^^^^^^^^^^^^^^^
It's very likely your code is already `picklable <https://docs.python.org/3/library/pickle.html>`_,
Make model pickleable
^^^^^^^^^^^^^^^^^^^^^
It's very likely your code is already `pickleable <https://docs.python.org/3/library/pickle.html>`_,
so you don't have to do anything to make this change.
However, if you run distributed and see an error like this:

@@ -122,22 +122,102 @@ is usually helpful.
i.e. in the stacktrace example above, there appears to be a lambda function somewhere in the user code
which cannot be pickled.
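
For example, here is a minimal sketch (class and attribute names are made up) of the most common
culprit: a lambda stored on the model cannot be pickled, while a module-level function can.

.. code-block:: python

    import pickle

    class LitModelBroken:
        def __init__(self):
            # a lambda attribute cannot be pickled, so distributed training will fail
            self.activation = lambda x: x * 2

    # pickle.dumps(LitModelBroken())  # raises PicklingError: Can't pickle <lambda>

    def double(x):
        return x * 2

    class LitModelFixed:
        def __init__(self):
            # a module-level function is pickled by reference and works fine
            self.activation = double

    pickle.dumps(LitModelFixed())  # works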

GPU device selection
--------------------

You can select the GPU devices using ranges, a list of indices, or a string containing
a comma-separated list of GPU ids:

.. testsetup::

k = 1

.. testcode::
:skipif: torch.cuda.device_count() < 2

# DEFAULT (int) specifies how many GPUs to use
Trainer(gpus=k)

# Above is equivalent to
Trainer(gpus=list(range(k)))

# Specify which GPUs to use (don't use if running on cluster)
Trainer(gpus=[0, 1])

# can also be a string
Trainer(gpus='0, 1')

# can also be -1 or '-1', this uses all available GPUs
# equivalent to list(range(torch.cuda.device_count()))
Trainer(gpus=-1)

The table below lists examples of possible input formats and how they are interpreted by Lightning.
Note in particular the difference between `gpus=0`, `gpus=[0]` and `gpus="0"`.

+---------------+-----------+---------------------+---------------------------------+
| `gpus` | Type | Parsed | Meaning |
+===============+===========+=====================+=================================+
| None | NoneType | None | CPU |
+---------------+-----------+---------------------+---------------------------------+
| 0 | int | None | CPU |
+---------------+-----------+---------------------+---------------------------------+
| 3 | int | [0, 1, 2] | first 3 GPUs |
+---------------+-----------+---------------------+---------------------------------+
| -1 | int | [0, 1, 2, ...] | all available GPUs |
+---------------+-----------+---------------------+---------------------------------+
| [0] | list | [0] | GPU 0 |
+---------------+-----------+---------------------+---------------------------------+
| [1, 3] | list | [1, 3] | GPUs 1 and 3 |
+---------------+-----------+---------------------+---------------------------------+
| "0" | str | [0] | GPU 0 |
+---------------+-----------+---------------------+---------------------------------+
| "3" | str | [3] | GPU 3 |
+---------------+-----------+---------------------+---------------------------------+
| "1, 3" | str | [1, 3] | GPUs 1 and 3 |
+---------------+-----------+---------------------+---------------------------------+
| "-1" | str | [0, 1, 2, ...] | all available GPUs |
+---------------+-----------+---------------------+---------------------------------+
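
To make the distinction concrete, a short sketch based on the rows above:

.. code-block:: python

    # gpus=0 (int) is parsed as None: train on CPU
    Trainer(gpus=0)

    # gpus=[0] (list) and gpus="0" (str) both select GPU 0
    Trainer(gpus=[0])
    Trainer(gpus="0")

    # gpus=3 (int) selects the first 3 GPUs, while gpus="3" (str) selects GPU 3
    Trainer(gpus=3)
    Trainer(gpus="3")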

CUDA flags
^^^^^^^^^^

CUDA flags make certain GPUs visible to your script.
Lightning sets these for you automatically; there is NO NEED to do this yourself.

.. testcode::

import os

# Lightning will set these according to what you give the Trainer
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

However, when using a cluster, Lightning will NOT set these flags (and you should not either).
SLURM will set these for you.
For more details see the `SLURM cluster guide <slurm.rst>`_.
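
For example, a minimal sketch of a SLURM submission script (the script name and resource numbers
are made up); note that it never touches the CUDA flags:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --gres=gpu:4
    #SBATCH --ntasks-per-node=4

    # SLURM exports the CUDA flags for each task; do not set them yourself
    srun python my_lightning_script.py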


Distributed modes
-----------------
Lightning allows multiple ways of training:

- Data Parallel (`distributed_backend='dp'`) (multiple-gpus, 1 machine)
- DistributedDataParallel (`distributed_backend='ddp'`) (multiple-gpus across many machines).
- DistributedDataParallel2 (`distributed_backend='ddp2'`) (dp in a machine, ddp across machines).
- DistributedDataParallel 2 (`distributed_backend='ddp2'`) (dp in a machine, ddp across machines).
- Horovod (`distributed_backend='horovod'`) (multi-machine, multi-gpu, configured at runtime)
- TPUs (`tpu_cores=8|x`) (tpu or TPU pod)

.. note:: If you request multiple GPUs without setting a mode, ddp will be automatically used.
.. note::
If you request multiple GPUs or nodes without setting a mode, ddp will be automatically used.

For a deeper understanding of what Lightning is doing, feel free to read this
`guide <https://medium.com/@_willfalcon/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565>`_.
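
For example, a short sketch of selecting each backend (``num_nodes`` is assumed here to be the
Trainer argument for the number of machines in the multi-node cases):

.. code-block:: python

    # dp: 1 machine, multiple GPUs
    trainer = Trainer(gpus=2, distributed_backend='dp')

    # ddp: multiple GPUs, optionally across multiple machines
    trainer = Trainer(gpus=2, num_nodes=2, distributed_backend='ddp')

    # ddp2: dp within each machine, ddp across machines
    trainer = Trainer(gpus=2, num_nodes=2, distributed_backend='ddp2')

    # horovod: the processes are configured at runtime (e.g. by horovodrun)
    trainer = Trainer(distributed_backend='horovod')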

Data Parallel (dp)
^^^^^^^^^^^^^^^^^^
`DataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel>`_ splits a batch across k GPUs. That is, if you have a batch of 32 and use dp with 2 gpus,
each GPU will process 16 samples, after which the root node will aggregate the results.


Data Parallel
^^^^^^^^^^^^^
`DataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel>`_ splits a batch across k GPUs.
That is, if you have a batch of 32 and use dp with 2 gpus, each GPU will process 16 samples,
after which the root node will aggregate the results.
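
To see the splitting behaviour in isolation, here is a sketch using plain
``torch.nn.DataParallel`` (it assumes at least 2 visible GPUs):

.. code-block:: python

    import torch
    from torch import nn

    class Net(nn.Module):
        def forward(self, x):
            # each replica only sees its shard of the batch
            print("shard size:", x.size(0))
            return x * 2

    model = nn.DataParallel(Net().cuda(), device_ids=[0, 1])
    out = model(torch.randn(32, 10).cuda())  # prints "shard size: 16" twice
    print(out.size(0))  # 32 -> outputs are gathered back on the root GPU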

.. warning:: DP use is discouraged by PyTorch and Lightning. Use DDP, which is more stable and at least 3x faster.

@@ -157,7 +237,7 @@ Distributed Data Parallel

3. Each process inits the model.

.. note:: Make sure to set the random seed so that each model inits with the same weights
.. note:: Make sure to set the random seed so that each model initializes with the same weights.

4. Each process performs a full forward and backward pass in parallel.

@@ -176,11 +256,11 @@ Distributed Data Parallel
Distributed Data Parallel 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^
In certain cases, it's advantageous to use all batches on the same machine instead of a subset.
For instance, you might want to compute an NCE loss where it pays to have more negative samples.

In this case, we can use ddp2, which behaves like dp within a machine and ddp across nodes. DDP2 does the following:

1. Copies a subset of the data to each node.

2. Inits a model on each node.

@@ -297,7 +377,7 @@ In pseudocode, the full sequence is:
# use the full batch for something like softmax
full_out = model.training_step_end(all_results)

to illustrate why this is needed, let's look at dataparallel
To illustrate why this is needed, let's look at DataParallel:

.. testcode::

@@ -332,6 +412,30 @@ Validation and test step also have the same option when using dp
def test_step_end(self, batch_parts_outputs):
...
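
Putting the ``training_step_end`` discussion together, a minimal sketch of the pattern inside your
LightningModule (the ``'pred'`` and ``'target'`` keys and the loss are illustrative; ``torch`` is
assumed to be imported):

.. code-block:: python

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        # return the partial outputs computed on this GPU's shard of the batch
        return {'pred': y_hat, 'target': y}

    def training_step_end(self, batch_parts_outputs):
        # the outputs of all GPUs are gathered here, so losses that need
        # the full batch (softmax, NCE, ...) can be computed correctly
        preds = batch_parts_outputs['pred']
        targets = batch_parts_outputs['target']
        loss = torch.nn.functional.cross_entropy(preds, targets)
        return {'loss': loss}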


Distributed and 16-bit precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Due to an issue with apex and DistributedDataParallel (PyTorch and NVIDIA issue), Lightning does
not allow 16-bit and DP training. We tried to get this to work, but it's an issue on their end.

Below are the possible configurations we support.

+-------+---------+----+-----+---------+------------------------------------------------------------+
| 1 GPU | 1+ GPUs | DP | DDP | 16-bit | command |
+=======+=========+====+=====+=========+============================================================+
| Y | | | | | `Trainer(gpus=1)` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| Y | | | | Y | `Trainer(gpus=1, use_amp=True)` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| | Y | Y | | | `Trainer(gpus=k, distributed_backend='dp')` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| | Y | | Y | | `Trainer(gpus=k, distributed_backend='ddp')` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| | Y | | Y | Y | `Trainer(gpus=k, distributed_backend='ddp', use_amp=True)` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
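
For example, the multi-GPU 16-bit row of the table as code (a sketch using the table's
``use_amp`` flag):

.. code-block:: python

    # 8 GPUs on one machine, ddp, 16-bit precision
    trainer = Trainer(gpus=8, distributed_backend='ddp', use_amp=True)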


Implement Your Own Distributed (DDP) training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you need your own way to init PyTorch DDP you can override :meth:`pytorch_lightning.core.LightningModule.`.
@@ -388,7 +492,7 @@ Lightning supports the use of PytorchElastic to enable fault-tolerant and elastic training.
Trainer(gpus=8, distributed_backend='ddp')


Following the `PytorchElastic Quickstart documentation <https://pytorch.org/elastic/0.2.0/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:
Following the `PytorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:

.. code-block:: bash

@@ -410,5 +514,5 @@ And then launch the elastic job with:
YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)


See the official `PytorchElastic documentation <https://pytorch.org/elastic/0.2.0/index.html>`_ for details
See the official `PytorchElastic documentation <https://pytorch.org/elastic>`_ for details
on installation and more use cases.