
Add support for Horovod as a distributed backend #1518

Closed
tgaddair opened this issue Apr 17, 2020 · 4 comments · Fixed by #1529

@tgaddair
Contributor

🚀 Feature

Horovod is a framework for performing data-parallel distributed training for PyTorch (in addition to other frameworks like TensorFlow and MXNet). It uses the allreduce technique to synchronously aggregate gradients across workers, similar to PyTorch's DDP API.

The goal of this feature is to implement support for Horovod as another distributed_backend option for PyTorch Lightning, providing an abstraction layer over the Horovod API so users don't need to modify their training code when scaling from one GPU to many.
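
For context, below is a minimal sketch of the manual Horovod wiring a user writes today, based on Horovod's documented PyTorch API (the tiny Linear model and optimizer are placeholders, not part of the proposal); the proposed backend would handle these steps inside the Trainer:

    # Sketch of manual Horovod + PyTorch integration (placeholder model/optimizer)
    import torch
    import horovod.torch as hvd

    hvd.init()                               # start Horovod, one process per GPU
    torch.cuda.set_device(hvd.local_rank())  # pin this process to its local GPU

    model = torch.nn.Linear(10, 2).cuda()    # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are allreduced across all workers
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Broadcast initial state from rank 0 so every worker starts identically
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)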

Motivation

At Uber, many of our researchers are interested in adopting PyTorch Lightning as a standard platform-level API. Because our infrastructure is highly integrated with Horovod, one of the prerequisites for adoption is to be able to run PyTorch Lightning using Horovod for distributed training.

We considered making this an internal layer built on top of PyTorch Lightning, but because Horovod is a popular API used by other companies in industry, we thought this would make the most sense as a contribution to PyTorch Lightning.

Pitch

With this change, all a user would need to do to enable Horovod support is make the following change to their Trainer to run on GPU (single or multiple):

    trainer = Trainer(distributed_backend='horovod', gpus=1)

Or to run on CPU:

    trainer = Trainer(distributed_backend='horovod')

Then the training script can be launched via the horovodrun command-line tool, where the host/GPU allocation is specified:

    horovodrun -np 8 -H host1:4,host2:4 python train.py

Alternatives

  1. Build Horovod support outside of PyTorch Lightning. This has been done by some users in the past, but it requires building a separate abstraction over Lightning. It would be difficult to keep such solutions in sync as Lightning continues to add new features, or to make them fully compatible with user LightningModules (if we need to use the same methods/hooks to implement the required functionality).

  2. Launch Horovod in-process as opposed to from a driver application. Horovod supports launching programmatically via the horovod.run API. However, this requires pickling code, which is prone to serialization errors for some models (see the sketch after this list). Most Horovod users are accustomed to using horovodrun / mpirun to launch their jobs. Also, using horovodrun allows us to decouple the training code from the resource requirements (num_gpus, etc.), which is useful for our users.
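
For illustration, a rough sketch of what alternative (2) could look like with the horovod.run API; the train function here is a hypothetical stand-in, and it is exactly this function that must be pickled and shipped to each worker:

    import horovod

    def train():
        # Runs inside each spawned worker; this function and everything it
        # closes over must be picklable -- the failure mode described above.
        import horovod.torch as hvd
        hvd.init()
        print(f"worker {hvd.rank()} of {hvd.size()}")
        # ... build the Trainer and fit the model here ...

    # Spawn 4 local worker processes, with no horovodrun/mpirun involved
    horovod.run(train, np=4)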

Additional context

A proof of concept has been implemented here: master...tgaddair:horovod

Docs and unit tests are forthcoming. But before creating a full PR, I wanted to get the thoughts of the PyTorch Lightning devs to see if this implementation aligns with your goals for the project.

cc @alsrgv

@tgaddair added the feature and help wanted labels Apr 17, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@williamFalcon
Contributor

yes! this is awesome. we talked about it early on but it seemed no one was using it much.
However, I do think this would be a great addition.

Mind submitting a PR with this change?
If you give me a day, I can refactor DDP so that the Horovod integration will be easier to do

@tgaddair
Contributor Author

tgaddair commented Apr 17, 2020

Hi @williamFalcon, thanks for getting back to me so quickly. I have a branch out there that is feature complete if you'd like to take a quick look: master...tgaddair:horovod

My plan was to submit the PR once I add docs / unit tests later today or Monday at the latest. Does that sound reasonable?

The code itself should be fully functional in its current form (tested on some simple models like MNIST with 2 GPUs / CPU), but I'd be happy to refactor if you think the abstractions can be improved.

@alsrgv
Contributor

alsrgv commented Apr 17, 2020

+1 for Horovod support. While DDP has come a long way, it does not offer a way to aggregate complex Python metric objects, which is easily doable with Horovod + mpi4py. I've also hit a lot of segfaults involving cv2, DataLoader multiprocessing, and DDP forking, while it all works dandy with Horovod's one-process-per-GPU ideology.
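
To make the metric point concrete, here is a minimal sketch using mpi4py's lowercase allgather, which pickles arbitrary Python objects under the hood; it assumes the script is launched under MPI (e.g. via horovodrun or mpirun), and the metrics dict is just an example:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Any picklable Python object works, not just tensors as with DDP collectives
    local_metrics = {"rank": comm.Get_rank(), "confusion": [[1, 2], [3, 4]]}

    # Every worker receives the full list of objects from all workers
    all_metrics = comm.allgather(local_metrics)
    if comm.Get_rank() == 0:
        print(all_metrics)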

@Borda added this to the 0.7.4 milestone Apr 17, 2020
@Borda modified the milestones: 0.7.4, v0.7.x Apr 18, 2021