Enable running pytorch.torchvision.train with distributed data parallel #1698

Merged: 9 commits from torchvision_ddp_fix into main on Aug 7, 2023

Conversation

@ohaijen (Contributor) commented on Aug 2, 2023:

Two main changes to enable running the following command in DistributedDataParallel mode:

CUDA_VISIBLE_DEVICES=<GPUs> python -m torch.distributed.launch \
    --nproc_per_node <NUM GPUs> \
    sparseml.torchvision.train <TRAIN.PY ARGUMENTS>
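
torch.distributed.launch spawns one worker process per GPU (as set by --nproc_per_node) and passes each worker a --local_rank argument, which this PR makes the training script accept. For example, on a machine with two GPUs the invocation might look like the following (the GPU ids and count are illustrative):

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nproc_per_node 2 \
    sparseml.torchvision.train <TRAIN.PY ARGUMENTS>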

Previously, the command above would throw an error, while running from the CLI would fall back to DataParallel mode. After this change, running from the CLI still uses DP, but the command above runs error-free with DDP, which is generally the desired behavior for multi-GPU training.

Concretely, the following changes were necessary (a sketch of the resulting logic follows this list):

  • Accept the local_rank argument that torch.distributed.launch sets for each worker process
  • In the distributed/DDP case, ensure the model is sent to a single GPU instead of being wrapped for DataParallel

Additionally, the documentation was updated to explain how to run with DDP.
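
A minimal sketch of the two resulting code paths, assuming an argparse-based entrypoint; this is illustrative, not the actual SparseML implementation, and the toy model is an assumption:

import argparse

import torch
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every worker it spawns
parser.add_argument("--local_rank", type=int, default=None)
args = parser.parse_args()

model = torch.nn.Linear(10, 2)  # stand-in for the real torchvision model

if args.local_rank is not None:
    # DDP path: one process per GPU, each pinned to its own device
    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)
    model = model.cuda(args.local_rank)
    model = DistributedDataParallel(model, device_ids=[args.local_rank])
elif torch.cuda.is_available():
    # CLI path: a single process, DataParallel across all visible GPUs
    model = torch.nn.DataParallel(model).cuda()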

@bfineran (Member) left a comment:

LGTM

model, device, _ = model_to_device(model=model, device=device)
ddp = False
if local_rank is not None:
    # pin this worker process to the GPU assigned by torch.distributed.launch
    torch.cuda.set_device(local_rank)
Review comment (Member) on the snippet above:

nice - looks like it will make a lot of our defaults work out of the box
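
For context, model_to_device is SparseML's helper for placing the model. Below is a hypothetical sketch of the device-selection logic such a helper might implement; the name, signature, and branches are assumptions, and only the three-value return matches the snippet above:

import torch

def model_to_device(model, device, local_rank=None):
    # resolve the target device and move (and possibly wrap) the model
    if local_rank is not None:
        # DDP worker: place the model on this process's GPU only
        device = f"cuda:{local_rank}"
        model = torch.nn.parallel.DistributedDataParallel(
            model.to(device), device_ids=[local_rank]
        )
        device_ids = [local_rank]
    elif device == "cuda" and torch.cuda.device_count() > 1:
        # single process: replicate across all visible GPUs
        device_ids = list(range(torch.cuda.device_count()))
        model = torch.nn.DataParallel(model.to(device), device_ids=device_ids)
    else:
        device_ids = None
        model = model.to(device)
    return model, device, device_ids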

@bfineran merged commit 8e4dc20 into main on Aug 7, 2023 with 10 checks passed, and deleted the torchvision_ddp_fix branch the same day.