
Set correct device ids in DDP [wip] #4297

Merged
merged 7 commits into master on Oct 24, 2020
Conversation

awaelchli
Contributor

@awaelchli awaelchli commented Oct 21, 2020

Fixes #3791
Fixes #4171
Fixes #3865 (maybe, need more info)

All of these commands work now and run on the correct devices:

Running on GPUs 1 and 2:

python image_classifier.py --gpus "1,2" 
python image_classifier.py --gpus "1,2" --distributed_backend ddp

Running on GPUs 1 and 2 with CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES="1,2" python image_classifier.py --gpus "0,1" 
CUDA_VISIBLE_DEVICES="1,2" python image_classifier.py --gpus "0,1" --distributed_backend ddp
# or equivalent
CUDA_VISIBLE_DEVICES="1,2" python image_classifier.py --gpus 2 --distributed_backend ddp

Non-consecutive ids also work. Running on GPUs 1 and 3:

python image_classifier.py --gpus "1,3" 
python image_classifier.py --gpus "1,3"  --distributed_backend ddp

Equivalent with CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES="1,3" python image_classifier.py --gpus "0,1" 
CUDA_VISIBLE_DEVICES="1,3" python image_classifier.py --gpus "0,1"  --distributed_backend ddp
# or 
CUDA_VISIBLE_DEVICES="1,3" python image_classifier.py --gpus 2  --distributed_backend ddp
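To illustrate why these invocations are equivalent, here is a minimal sketch of the remapping that `CUDA_VISIBLE_DEVICES` performs: the process-local ordinal `i` refers to the `i`-th entry of the mask, so under `CUDA_VISIBLE_DEVICES="1,3"` the ordinals 0 and 1 address physical GPUs 1 and 3. The helper name is hypothetical and only mirrors the semantics, not any Lightning API.

```python
def visible_to_physical(visible_devices: str, local_ordinal: int) -> int:
    """Map a process-local CUDA ordinal to the physical GPU id it refers to
    under the given CUDA_VISIBLE_DEVICES mask (hypothetical helper)."""
    physical_ids = [int(d) for d in visible_devices.split(",")]
    return physical_ids[local_ordinal]

# Under CUDA_VISIBLE_DEVICES="1,3", --gpus "0,1" (or --gpus 2) addresses
# physical GPUs 1 and 3, same as --gpus "1,3" without the mask:
print(visible_to_physical("1,3", 0))  # 1
print(visible_to_physical("1,3", 1))  # 3
```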

@awaelchli awaelchli added the labels bug (Something isn't working) and distributed (Generic distributed-related topic) on Oct 21, 2020
@awaelchli awaelchli added this to the 1.0.x milestone Oct 21, 2020
@pep8speaks

pep8speaks commented Oct 21, 2020

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-24 21:33:15 UTC

@awaelchli awaelchli changed the title Set correct device ids in DDP [WIP] Set correct device ids in DDP Oct 21, 2020
@awaelchli awaelchli changed the title [WIP] Set correct device ids in DDP [WIP] Set correct device ids in DDP [skip ci] Oct 21, 2020
@codecov
Copy link

codecov bot commented Oct 22, 2020

Codecov Report

Merging #4297 into master will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4297   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         111     111           
  Lines        8021    7993   -28     
======================================
- Hits         7459    7434   -25     
+ Misses        562     559    -3     

@awaelchli awaelchli changed the title [WIP] Set correct device ids in DDP [skip ci] Set correct device ids in DDP Oct 22, 2020
@awaelchli awaelchli marked this pull request as ready for review October 22, 2020 01:41
@mergify mergify bot requested a review from a team October 22, 2020 01:42
Contributor

@teddykoker teddykoker left a comment


LGTM :)

@mergify mergify bot requested a review from a team October 22, 2020 05:21
@awaelchli
Contributor Author

Please hold off on merging until I have @williamFalcon's review/approval, since it's his DDP code. Thanks!

@Borda Borda changed the title Set correct device ids in DDP Set correct device ids in DDP [wip] Oct 22, 2020
@williamFalcon
Contributor

looking now. this is great!

@awaelchli
Contributor Author

@ananthsub maybe you want to have a look here too. The more eyes on ddp the better :)

@@ -206,7 +206,7 @@ def select_accelerator(self):

         # ddp script mode uses the same flags as TE
         # TODO: decouple from TE
-        if os.environ.get('PL_DDP_PID', False):
+        if os.environ.get('PL_IN_DDP_SUBPROCESS', False):
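The gating pattern in this hunk can be sketched as follows: the parent trainer process sets `PL_IN_DDP_SUBPROCESS` before spawning workers, and each worker checks it to know it should not re-launch. The environment is passed explicitly here for illustration instead of reading `os.environ` directly; the helper name is hypothetical.

```python
def in_ddp_subprocess(env: dict) -> bool:
    """Return True when the given environment marks this process as a
    DDP worker spawned by the parent trainer (hypothetical helper)."""
    # Any non-empty value means the parent flagged this as a DDP subprocess.
    return bool(env.get("PL_IN_DDP_SUBPROCESS", False))

print(in_ddp_subprocess({}))                             # False
print(in_ddp_subprocess({"PL_IN_DDP_SUBPROCESS": "1"}))  # True
```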
Contributor

is there documentation around what env vars lightning sets anywhere?

Contributor Author

No, I'm sure there is no documentation for these. They are only used internally and not meant to be modified by the user.
Should they be documented? I'm not sure.

Contributor

@ananthsub ananthsub Oct 23, 2020

Ah, I meant for contributors/developers primarily. I should probably read the code again too, haha.

Contributor Author

Ah, good point. Where could env variables be documented? They are like global variables. Maybe at the top of the accelerator base class file. I have no good suggestions :)

@@ -180,7 +173,7 @@ def set_world_ranks(self, process_idx):
self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes

def model_to_device(self, model, process_idx):
Contributor

do we need process_idx at all then?

Contributor Author

Yes, we can probably remove it. The device the model is on is not related to the process index anyway, which was actually the root cause of the bug here.
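The root-cause fix described here can be sketched in isolation: the GPU ordinal must come from the user-configured device list indexed by local rank, not from the raw process index. The function below is an illustrative stand-in mirroring the names in the diff, not actual Lightning code.

```python
def device_for_rank(data_parallel_device_ids: list, local_rank: int) -> int:
    """Pick the GPU id for a worker (illustrative sketch of the fix).

    Pre-fix behavior was `gpu_idx = process_idx`, which breaks for
    --gpus "1,3": process 0 must use physical GPU 1, not GPU 0.
    """
    # Correct mapping: index the configured device list by the local rank.
    return data_parallel_device_ids[local_rank]

print(device_for_rank([1, 3], 0))  # 1
print(device_for_rank([1, 3], 1))  # 3
```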

Contributor

@ananthsub ananthsub left a comment

I added a few n00b questions but LGTM!

@@ -162,7 +162,7 @@ def set_world_ranks(self, process_idx):
         self.trainer.world_size = self.trainer.num_nodes * self.trainer.num_processes

     def model_to_device(self, model, process_idx, is_master):
-        gpu_idx = process_idx
+        gpu_idx = self.trainer.data_parallel_device_ids[self.trainer.local_rank]
Contributor

does this same fix apply to the ddp_torchelastic_accelerator? are there other DDP accelerators which need this change too?

Contributor Author

Yes, some of these accelerators, like torchelastic and ddp2, should also get this fix. The problem is that I didn't want to touch them because I can't test the fix myself, lacking a multi-node setup. It would be good to follow up on this.

Contributor

Sorry, I actually don't think the others need this fix.

The reason is that torchelastic and co. derive their process numbers (local rank, global rank, etc.) from the environment, so they don't need to do this.

Our DDP is a special case where we act like torchelastic, so we need to set this up manually.
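The distinction can be sketched briefly: launchers in the torchelastic family export the rank information as environment variables, so the accelerator only has to read them back. Variable names follow the torchelastic convention (`LOCAL_RANK`, `RANK`, `WORLD_SIZE`); the environment is passed explicitly here for illustration, and the helper itself is hypothetical.

```python
def ranks_from_env(env: dict) -> tuple:
    """Read rank info as a torchelastic-style launcher exports it
    (hypothetical helper; defaults cover a single-process run)."""
    local_rank = int(env.get("LOCAL_RANK", 0))
    global_rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return local_rank, global_rank, world_size

print(ranks_from_env({"LOCAL_RANK": "1", "RANK": "3", "WORLD_SIZE": "4"}))  # (1, 3, 4)
```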

@mergify
Contributor

mergify bot commented Oct 24, 2020

This pull request is now in conflict... :(

@williamFalcon
Contributor

ok, just verified that this works!

@williamFalcon williamFalcon merged commit 28d45a2 into master Oct 24, 2020
@williamFalcon
Contributor

@ananthsub, just pushed an rc.

Mind checking on slurm and TE to make sure this is fine so we don't run into issues during the patch on Tuesday?

@awaelchli awaelchli deleted the bugfix/device-ordinal-2 branch October 24, 2020 21:39
Labels
bug (Something isn't working), distributed (Generic distributed-related topic)
Projects
None yet
9 participants