Fix run_demo(demo_model_parallel, world_size) issue #2367

Merged
merged 5 commits into from
Jun 2, 2023

Conversation

TheMemoryDealer
Contributor

@TheMemoryDealer TheMemoryDealer commented May 31, 2023

Fixes #1750

Description

Fixes run_demo(demo_model_parallel, world_size) issue as described in #1750

Checklist

  • The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request.

cc @mrshenli @osalpekar @H-Huang @kwen2501

@netlify

netlify bot commented May 31, 2023

Deploy Preview for pytorch-tutorials-preview ready!

Name Link
🔨 Latest commit 86acfe7
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/647a3bc622d6be00081ef72f
😎 Deploy Preview https://deploy-preview-2367--pytorch-tutorials-preview.netlify.app

@github-actions github-actions bot added the distributed, docathon-h1-2023, and easy labels and removed the cla signed label May 31, 2023
@svekars svekars requested a review from subramen May 31, 2023 19:15
@svekars svekars changed the title Update ddp_tutorial.rst Update ddp_tutorial.rs May 31, 2023
@subramen
Contributor

subramen commented Jun 1, 2023

Changing the value of world_size affects the values of dev0 and dev1, so you will need to update how dev is calculated to account for the new world_size value.

Contributor

@subramen subramen left a comment

Please update the dev calculation in the model class to reflect the new world_size value.

In the function demo_model_parallel, dev0 and dev1 are computed so that each process gets two distinct GPUs. This is done by doubling the rank and taking the result modulo twice the world_size. Assuming 8 GPUs, world_size is set to 4, which creates 4 processes, each allocated two distinct GPUs. For instance, the first process (process 0) is assigned GPUs 0 and 1, the second process (process 1) is assigned GPUs 2 and 3, and so forth.
@TheMemoryDealer
Contributor Author

@subramen I've updated the calculation in a simple way to take the prior division into account. dev0 and dev1 are now calculated as:

    dev0 = (rank * 2) % (world_size * 2)
    dev1 = (rank * 2 + 1) % (world_size * 2)
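
For readers following along, here is a minimal sketch of the mapping those two lines produce. It assumes the 8-GPU, world_size = 4 case mentioned above; the helper name devices_for_rank is hypothetical and is not part of the tutorial.

```python
def devices_for_rank(rank: int, world_size: int) -> tuple[int, int]:
    """Return the two GPU indices used by the process with this rank.

    Hypothetical helper that mirrors the two lines quoted in this comment.
    """
    dev0 = (rank * 2) % (world_size * 2)
    dev1 = (rank * 2 + 1) % (world_size * 2)
    return dev0, dev1


n_gpus = 8                # assumed GPU count for illustration
world_size = n_gpus // 2  # one process per pair of GPUs

# Build the rank -> (dev0, dev1) mapping for every process.
mapping = {rank: devices_for_rank(rank, world_size) for rank in range(world_size)}

for rank, (dev0, dev1) in mapping.items():
    print(f"rank {rank}: dev0={dev0}, dev1={dev1}")
```

Under these assumptions each rank gets its own consecutive pair of GPU indices, and no index is shared between ranks.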

@TheMemoryDealer
Contributor Author

@subramen Thinking about it again, doesn't % (world_size * 2) simply cancel out world_size = n_gpus//2? Would it not make more sense to ignore world_size in the calculation and keep

dev0 = rank * 2
dev1 = rank * 2 + 1

?
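
The question can be checked with a short sketch. Since rank is always below world_size, rank * 2 + 1 is at most world_size * 2 - 1, so the modulus never changes the result. The function names below are hypothetical and exist only for this comparison.

```python
def with_modulus(rank: int, world_size: int) -> tuple[int, int]:
    """Device pair computed with the modulus, as in the earlier version."""
    return (rank * 2) % (world_size * 2), (rank * 2 + 1) % (world_size * 2)


def without_modulus(rank: int) -> tuple[int, int]:
    """Device pair computed from the rank alone."""
    return rank * 2, rank * 2 + 1


def modulus_is_redundant(world_size: int) -> bool:
    """True if both formulas agree for every valid rank (0 <= rank < world_size)."""
    return all(
        with_modulus(rank, world_size) == without_modulus(rank)
        for rank in range(world_size)
    )


# Check a range of world sizes; the values 1..8 are arbitrary illustration choices.
checked = [modulus_is_redundant(ws) for ws in range(1, 9)]
print(all(checked))
```

The two formulas would differ only for a rank at or above world_size, which cannot occur when one process is spawned per rank.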

@subramen
Contributor

subramen commented Jun 2, 2023

Yes, you don't actually need world_size anymore :) That is the update I was looking for.

It should work well now, assuming half as many processes as there are GPUs.
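
As a rough illustration of that assumption, the sketch below derives the process count from a GPU count and lists the device pair each rank would use. The helper names and the explicit error check are hypothetical; in a real script the GPU count would typically come from torch.cuda.device_count(), which is left out here so the example runs without a GPU.

```python
def model_parallel_world_size(n_gpus: int, gpus_per_process: int = 2) -> int:
    """Number of processes when each process owns gpus_per_process GPUs.

    Hypothetical helper; raises if there are too few GPUs for one process.
    """
    if n_gpus < gpus_per_process:
        raise ValueError(
            f"Need at least {gpus_per_process} GPUs, but got {n_gpus}"
        )
    return n_gpus // gpus_per_process


def device_plan(n_gpus: int) -> list[tuple[int, int]]:
    """List the (dev0, dev1) pair for each rank, using dev0 = rank * 2."""
    world_size = model_parallel_world_size(n_gpus)
    return [(rank * 2, rank * 2 + 1) for rank in range(world_size)]


print(device_plan(8))  # four processes, two GPUs each
print(device_plan(5))  # odd count: integer division leaves the last GPU unused
```

With an odd GPU count the integer division rounds down, so one GPU stays idle instead of being assigned an out-of-range partner.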
@svekars svekars changed the title Update ddp_tutorial.rs Fix run_demo(demo_model_parallel, world_size) issue Jun 2, 2023
@svekars svekars merged commit 420037e into pytorch:main Jun 2, 2023
8 checks passed
Labels
cla signed, distributed, docathon-h1-2023, easy
Development

Successfully merging this pull request may close these issues.

Fix Model Parallel demo world_size parameter in DDP Tutorial
4 participants