
fix selecting GPUs using CUDA_VISIBLE_DEVICES #2739

Merged
merged 2 commits into Lightning-AI:master on Aug 2, 2020

Conversation

ibeltagy
Contributor

What does this PR do?

Fixes #2407

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team July 28, 2020 15:34
@Borda Borda added the bug Something isn't working label Jul 28, 2020
@@ -528,7 +528,7 @@ def ddp_train(self, process_idx, q, model, is_master=False, proc_offset=0):
         if is_master:
             # source of truth is cuda for gpu idx
             gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
-            gpu_idx = int(gpus[self.local_rank])
+            gpu_idx = self.local_rank
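
For orientation, here is a minimal standalone sketch (not part of the diff; the scenario and values are illustrative) of how the removed and added lines resolve a device index when CUDA_VISIBLE_DEVICES is set:

    import os

    # Pretend this process was handed physical GPUs 4-7 by a scheduler.
    os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6,7'
    local_rank = 0  # rank of this worker on the node

    # Old behaviour: index into the CUDA_VISIBLE_DEVICES list and treat the
    # *physical* id as the device ordinal.
    gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    old_gpu_idx = int(gpus[local_rank])  # -> 4

    # New behaviour in this PR: use the local rank directly, since CUDA already
    # renumbers the visible devices to ordinals 0..N-1.
    new_gpu_idx = local_rank  # -> 0

    # With CUDA_VISIBLE_DEVICES='4,5,6,7' only ordinals 0-3 exist, so
    # torch.cuda.set_device(old_gpu_idx) would raise "invalid device ordinal".
    print(old_gpu_idx, new_gpu_idx)  # 4 0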
Contributor

This won’t work... because if you have access to GPUs 4,5,6,7 and you request “2,3”, you’re actually asking for “6,7”.

Contributor Author
@ibeltagy ibeltagy Jul 28, 2020

In my view, having PL run your model on GPUs 6,7 is the expected behavior in this case.

Contributor Author
@ibeltagy ibeltagy Jul 28, 2020

This fixes another problem with DDP: if gpus=3 and CUDA_VISIBLE_DEVICES=4,5,6,7, DDP will run only two jobs, on GPUs 5 and 6, and the job on GPU 4 won't work.

Contributor

I agree, but here's what happens:

gpus available: 0, 1, 2, 3, 4, 5
indexes:        0, 1, 2, 3, 4, 5
gpus[2] = 2

When you set CUDA_VISIBLE_DEVICES, your numbering changes:
CUDA_VISIBLE_DEVICES='2,4,5'
now your indexes are 0, 1, 2

So once you set visible devices the mapping changes:
gpus[0] = 2
gpus[2] = 5
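
To make that renumbering concrete, a quick standalone check (illustrative only; no GPU is touched):

    import os

    os.environ['CUDA_VISIBLE_DEVICES'] = '2,4,5'

    # CUDA now exposes three devices to this process, with ordinals 0, 1, 2.
    # Indexing the visible list recovers the physical ids described above:
    visible = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    print(visible[0])  # '2' -> ordinal 0 is physical GPU 2
    print(visible[2])  # '5' -> ordinal 2 is physical GPU 5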

Contributor

Can you share code that breaks, so I can reproduce and verify? Your fix might solve this problem, but it's likely to break other DDP settings.

@mergify mergify bot requested a review from a team July 28, 2020 19:55
@ibeltagy ibeltagy requested review from williamFalcon and removed request for a team July 31, 2020 05:18
@mergify mergify bot requested a review from a team July 31, 2020 05:19
@mergify mergify bot requested a review from a team July 31, 2020 12:16
@williamFalcon
Contributor

I suspect that although this will fix the problem you mentioned, it will break other setups. Mind adding a test first to show that it fails and then a test showing that it passes with the fix?

Our CI uses 2 GPUs, so you can base the test off of that.

Member
@Borda Borda left a comment

I think it is correct, just pls add a test for this case; our tests run on a device with 2x K80.

@mergify mergify bot requested a review from a team July 31, 2020 12:22
@ibeltagy
Contributor Author

Can you point me to a similar unit test that I can follow?

@Borda
Member

Borda commented Jul 31, 2020

Can you point me to a similar unit test that I can follow?

Have a look at tests/models/test_gpu.py.
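
For illustration, a rough standalone sketch of the kind of test that could cover this case (it is not taken from tests/models/test_gpu.py; select_gpu_ordinal is a hypothetical stand-in for the logic this PR changes):

    import pytest


    def select_gpu_ordinal(local_rank):
        # Hypothetical stand-in for the line changed in ddp_train: after this PR
        # the device ordinal is simply the local rank, because CUDA renumbers
        # the devices listed in CUDA_VISIBLE_DEVICES to ordinals 0..N-1.
        return local_rank


    @pytest.mark.parametrize("visible,local_rank", [
        ("4,5,6,7", 0),
        ("4,5,6,7", 3),
        ("2,4,5", 1),
    ])
    def test_gpu_ordinal_within_visible_range(monkeypatch, visible, local_rank):
        monkeypatch.setenv("CUDA_VISIBLE_DEVICES", visible)
        ordinal = select_gpu_ordinal(local_rank)
        # The ordinal must index into the visible set, otherwise
        # torch.cuda.set_device would fail with "invalid device ordinal".
        assert 0 <= ordinal < len(visible.split(","))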

@ananyahjha93 ananyahjha93 self-requested a review July 31, 2020 15:44
ibeltagy and others added 2 commits July 31, 2020 19:35
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
@codecov

codecov bot commented Jul 31, 2020

Codecov Report

Merging #2739 into master will increase coverage by 0%.
The diff coverage is 0%.

@@          Coverage Diff           @@
##           master   #2739   +/-   ##
======================================
  Coverage      91%     91%           
======================================
  Files          76      76           
  Lines        6787    6786    -1     
======================================
  Hits         6150    6150           
+ Misses        637     636    -1     

Member
@Borda Borda left a comment

Pls add a test for this case 🐰

@mergify mergify bot requested a review from a team August 1, 2020 07:18
@williamFalcon
Contributor

williamFalcon commented Aug 2, 2020

Ok, I tested this in a simple case and it worked. Then I merged it, tested on a more complicated node, and my worries came true.

This PR introduces a new bug: with ddp, the local rank will always be 0 on the master node... so when CUDA_VISIBLE_DEVICES is something else, the master should pull the 0th device index and NOT always run on device 0.

Here's the issue:

python train.py --gpus '4,5' --distributed_backend 'ddp'

With the fix in this PR, the GPUs used will actually be 0 and 5, since local_rank=0 always for the master process.

In the PR where I fix this issue, #2796, it will actually use GPUs 4 and 5, NOT 0 and 5: since the master has local_rank=0, it pulls the 0th GPU index, which is 4, and thus training starts correctly.
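
For clarity, a sketch of the mapping being described for #2796 (illustrative only, not the actual implementation):

    # python train.py --gpus '4,5' --distributed_backend 'ddp'
    requested_gpus = [4, 5]  # parsed from --gpus '4,5'

    # Master process on the node:
    master_local_rank = 0
    master_gpu = requested_gpus[master_local_rank]  # -> 4, not 0

    # Second DDP process on the node:
    worker_local_rank = 1
    worker_gpu = requested_gpus[worker_local_rank]  # -> 5

    print(master_gpu, worker_gpu)  # 4 5 -> training runs on GPUs 4 and 5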

@williamFalcon williamFalcon merged commit 38fce2e into Lightning-AI:master Aug 2, 2020
@williamFalcon williamFalcon mentioned this pull request Aug 2, 2020
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

cuda runtime error (101) : invalid device ordinal