Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct CWD for ddp subprocesses when using Hydra #2719

Merged
merged 3 commits into from
Jul 28, 2020
Merged

Correct CWD for ddp subprocesses when using Hydra #2719

merged 3 commits into from
Jul 28, 2020

Conversation

yukw777
Copy link
Contributor

@yukw777 yukw777 commented Jul 27, 2020

What does this PR do?

Fixes #2691. This PR sets the cwd of the ddp subprocesses to the original cwd returned by Hydra (when it's enabled). I looked for specific tests, but I think this will be caught by current integration tests? I did verify this manually with my own training script that I used to reproduce the bug.

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • N/A Did you make sure to update the documentation with your changes?
  • N/A Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • N/A If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team July 27, 2020 03:48
@yukw777
Copy link
Contributor Author

yukw777 commented Jul 27, 2020

@omry @romesc-CMU

@mergify mergify bot requested a review from a team July 27, 2020 06:20
Copy link
Contributor

@omry omry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the integration tests are not testing a Hydra app with ddp why would they cover this scenario?

pytorch_lightning/trainer/distrib_data_parallel.py Outdated Show resolved Hide resolved
@yukw777
Copy link
Contributor Author

yukw777 commented Jul 27, 2020

@omry you're right, we don't have test coverage for this particular case where the training script is built using hydra. i just meant that this doesn't necessarily decrease the test coverage.

@Borda could you advise me on how we should add tests for this? Is it ok to merge this as is since we currently don't have tests for hydra scripts, or do we want to use this PR as an opportunity to add some (which will take more time of course)?

@@ -467,7 +468,12 @@ def spawn_ddp_children(self, model):
env_copy['LOCAL_RANK'] = f'{local_rank}'

# start process
proc = subprocess.Popen(command, env=env_copy)
# if hydra is available and initialized, make sure to set the cwd correctly
cwd: Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will break for non hydra users... no?
shouldn't cwd be set to something for non hydra people?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is the default value for cwd (https://docs.python.org/3/library/subprocess.html#subprocess.Popen). If None, it just uses whatever the current value for cwd is.

If cwd is not None, the function changes the working directory to cwd before executing the child. cwd can be a string, bytes or path-like object. In particular, the function looks for executable (or for the first item in args) relative to cwd if the executable path is a relative path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So currently, we're setting cwd to None for everyone, hence the problem for Hydra scripts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we pull the cwd correctly then and also use it for non hydra users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, but what should we set it to for non hydra users? do you think we need to save the original cwd and then use that? I feel like that's going down the slippery slope b/c we can't possibly anticipate how a script will change the cwd... for hydra scripts we can be sure b/c of the way it operates...

@mergify mergify bot requested a review from a team July 27, 2020 16:42
@Borda Borda requested a review from williamFalcon July 28, 2020 14:56
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a test for this update?

@mergify mergify bot requested a review from a team July 28, 2020 14:58
@codecov
Copy link

codecov bot commented Jul 28, 2020

Codecov Report

Merging #2719 into master will decrease coverage by 0%.
The diff coverage is 14%.

@@          Coverage Diff           @@
##           master   #2719   +/-   ##
======================================
- Coverage      91%     91%   -0%     
======================================
  Files          82      82           
  Lines        6784    6789    +5     
======================================
  Hits         6151    6151           
- Misses        633     638    +5     

@williamFalcon williamFalcon merged commit b7f613b into Lightning-AI:master Jul 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Subprocess launched in ddp have the wrong cwd when using hydra.
4 participants