Update Torch Elastic documentation #8248
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #8248    +/-   ##
=========================================
- Coverage      93%      88%      -5%
=========================================
  Files         212      212
  Lines       13695    13703       +8
=========================================
- Hits        12735    12060     -675
- Misses        960     1643     +683
eggsellent!
thanks @kaushikb11 !
Torch Distributed Elastic
-------------------------
Lightning supports the use of Torch Distributed Elastic to enable fault-tolerant and elastic distributed job scheduling. To use it, specify the 'ddp' or 'ddp2' backend and the number of GPUs you want to use in the Trainer.

.. code-block:: python

    Trainer(gpus=8, accelerator='ddp')

- Following the `TorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:
+ To launch a fault-tolerant job, run the following on all nodes.
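The launch command the new line refers to is not shown in this excerpt; a minimal sketch following the upstream TorchElastic quickstart (assuming PyTorch >= 1.9; NUM_NODES, NUM_GPUS, JOB_ID, HOST_NODE_ADDR, and train.py are placeholders) could look like:

.. code-block:: bash

    # Run this on every node. The c10d rendezvous backend ships with
    # torch.distributed, so no separate etcd server is required.
    python -m torch.distributed.run \
        --nnodes=NUM_NODES \
        --nproc_per_node=NUM_GPUS \
        --rdzv_id=JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=HOST_NODE_ADDR:29400 \
        train.py

Here HOST_NODE_ADDR:29400 is the host and port of the rendezvous endpoint; any node's address works as long as all nodes agree on it.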
@edenafek - do you think it's worth having a dedicated docs section on which schedulers/launchers Lightning supports? This could cover launching with SLURM, Torch Distributed Elastic, etc.
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
What does this PR do?
TorchElastic is now upstream, as part of torch.distributed, so this PR updates the Torch Elastic documentation accordingly.
Fixes #8243
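For context, a sketch of the entry-point change behind this (the old launcher module comes from the standalone torchelastic 0.2.x package; all flag values below are illustrative placeholders):

    # before: standalone torchelastic package (etcd rendezvous)
    python -m torchelastic.distributed.launch --nnodes=2 --nproc_per_node=8 \
        --rdzv_backend=etcd --rdzv_endpoint=ETCD_HOST:2379 --rdzv_id=JOB_ID train.py

    # after: bundled with PyTorch >= 1.9; c10d rendezvous needs no etcd
    python -m torch.distributed.run --nnodes=2 --nproc_per_node=8 \
        --rdzv_backend=c10d --rdzv_endpoint=HOST:29400 --rdzv_id=JOB_ID train.py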
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃