ddp fix for trainer.test() + add basic ddp tests #2997

Merged: 33 commits, Aug 16, 2020

@awaelchli (Contributor) commented Aug 16, 2020

What does this PR do?

Fixes #2683
Fixes #2765
Fixes #2807 (docs for clarification)
Fixes #2537 (docs for clarification)
(maybe also #2901)

  • Choosing a random port is problematic in subprocesses, because each one can end up with a different port, and a random port can also collide with one already in use. Solution: find an unused port and use that.
  • Added a basic test for ddp mode. The test fails on master.
  • In ddp mode, only one .fit or one .test call is possible. A runtime error is now raised if the user makes multiple calls.
  • Updated the docs to describe the cases in which DDP cannot be used.
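
To illustrate the first and third points, here is a minimal sketch and not the code from this PR. It assumes the port is shared through the `MASTER_PORT` environment variable that `torch.distributed` reads for `env://` initialization. The names `find_free_port`, `ensure_master_port`, and `RunOnceGuard` are hypothetical helpers made up for this example.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        # The OS picked the port; read it back before the socket closes.
        return sock.getsockname()[1]


def ensure_master_port(env=None) -> str:
    """Set MASTER_PORT only if it is missing (hypothetical helper).

    Subprocesses that inherit the environment then all see the same value
    instead of each drawing its own random port.
    """
    env = os.environ if env is None else env
    if "MASTER_PORT" not in env:
        env["MASTER_PORT"] = str(find_free_port())
    return env["MASTER_PORT"]


class RunOnceGuard:
    """Hypothetical guard that allows a single .fit() or .test() call."""

    def __init__(self) -> None:
        self._already_ran = False

    def check(self, stage: str) -> None:
        # Raise on the second call, mirroring the behaviour described above.
        if self._already_ran:
            raise RuntimeError(
                f"Cannot call .{stage}() after a previous .fit() or .test() in DDP mode."
            )
        self._already_ran = True
```

Note that a port found this way can still be taken by another process between the lookup and its actual use, so this narrows the collision window rather than closing it.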

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Nope, in this one definitely not! :(

codecov bot commented Aug 16, 2020

Codecov Report

Merging #2997 into master will decrease coverage by 0%.
The diff coverage is 61%.

@@          Coverage Diff           @@
##           master   #2997   +/-   ##
======================================
- Coverage      90%     90%   -0%     
======================================
  Files          82      82           
  Lines        7628    7626    -2     
======================================
- Hits         6862    6860    -2     
  Misses        766     766           

@awaelchli mentioned this pull request Aug 16, 2020
@awaelchli marked this pull request as ready for review August 16, 2020 06:39
mergify bot requested a review from a team August 16, 2020 06:39
@awaelchli added the bug, distributed, docs, and ci labels Aug 16, 2020
@@ -893,10 +893,6 @@ def init_ddp_connection(self, global_rank: int, world_size: int, is_slurm_managi
log.info(f"initializing ddp: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank+1}/{world_size}")
torch_distrib.init_process_group(torch_backend, rank=global_rank, world_size=world_size)

"""
Contributor: Why did we remove this?

Contributor Author: It looks like it does not belong here. My question would be: why was it here in the first place?

mergify bot requested a review from a team August 16, 2020 11:41
@williamFalcon merged commit 188e06c into master Aug 16, 2020
@awaelchli deleted the bugfix/ddp-test branch August 16, 2020 16:02
ameliatqy pushed a commit to ameliatqy/pytorch-lightning that referenced this pull request Aug 17, 2020
* add ddp script variations
* add ddp test
* rename
* shell
* test
* test
* try call
* try without subprocess
* test
* display the error
* list all variations
* try string
* try copy env
* debug
* pythonpath
* path
* update test
* change
* simple ddp test
* replace
* remove random port
* random port
* str
* clean up
* check run spawn
* clean up
* docs
* docs
* update test
* docs
* changelog
* changelog
@Borda added this to the 0.9.0 milestone Aug 20, 2020