
Fix disabling progress bar on non-zero ranks using Horovod backend #1709

Merged
15 commits merged into Lightning-AI:master on May 4, 2020

Conversation

tgaddair (Contributor) commented May 2, 2020

This PR also adds a barrier when restoring model weights, consistent with other distributed backends.
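For readers skimming the diff, here is a minimal sketch of the two behaviors the PR title and description mention: disabling the progress bar on non-zero ranks and adding a barrier before restoring weights. The surrounding names (horovod_train, restore_weights) are assumptions for illustration, not the actual Lightning code; only self.progress_bar_callback.disable() is taken from the discussion further down.

```python
import torch
import horovod.torch as hvd

def horovod_train(self, model):
    hvd.init()

    # Disable the progress bar on every rank except rank 0, so only a single
    # bar is rendered when training with the Horovod backend.
    if hvd.rank() != 0 and self.progress_bar_callback is not None:
        self.progress_bar_callback.disable()

    # Barrier before restoring model weights, consistent with the other
    # distributed backends. One common way to emulate a barrier with Horovod
    # is an allreduce on a dummy tensor: it returns only once every rank has
    # reached this point.
    hvd.allreduce(torch.tensor(0.0), name="barrier")
    self.restore_weights(model)  # hypothetical restore step for illustration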

mergify bot requested a review from a team on May 2, 2020 at 23:21
Borda added the "bug" (Something isn't working) label on May 3, 2020
pytorch_lightning/trainer/distrib_data_parallel.py (review comments resolved)
pytorch_lightning/trainer/trainer.py (review comments resolved)
mergify bot requested a review from a team on May 3, 2020 at 07:05
codecov bot commented May 3, 2020

Codecov Report

Merging #1709 into master will decrease coverage by 0%.
The diff coverage is 71%.

@@          Coverage Diff           @@
##           master   #1709   +/-   ##
======================================
- Coverage      88%     88%   -0%     
======================================
  Files          69      69           
  Lines        4135    4151   +16     
======================================
+ Hits         3659    3670   +11     
- Misses        476     481    +5     

Borda (Member) commented May 3, 2020

I have seen this error in other PRs as well:

________________ ERROR collecting tests/models/test_horovod.py _________________
tests/models/test_horovod.py:88: in <module>
    @pytest.mark.skipif(not _nccl_available(), reason="test requires Horovod with NCCL support")
tests/models/test_horovod.py:34: in _nccl_available
    return nccl_built(verbose=True)
/Users/runner/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/horovod/common/util.py:110: in wrapper
    retval = f(*args, **kwargs)
/Users/runner/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/horovod/common/util.py:155: in nccl_built
    raise RuntimeError('Failed to determine if NCCL support has been built. '
E   RuntimeError: Failed to determine if NCCL support has been built. Run again with --verbose for more details.

tgaddair (Contributor, Author) commented May 3, 2020

I have seen this error in other PRs as well:

________________ ERROR collecting tests/models/test_horovod.py _________________
tests/models/test_horovod.py:88: in <module>
    @pytest.mark.skipif(not _nccl_available(), reason="test requires Horovod with NCCL support")
tests/models/test_horovod.py:34: in _nccl_available
    return nccl_built(verbose=True)
/Users/runner/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/horovod/common/util.py:110: in wrapper
    retval = f(*args, **kwargs)
/Users/runner/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/horovod/common/util.py:155: in nccl_built
    raise RuntimeError('Failed to determine if NCCL support has been built. '
E   RuntimeError: Failed to determine if NCCL support has been built. Run again with --verbose for more details.

This is the same caching error we've seen before. Basically, something is causing this pattern:

  1. Install torch
  2. Install horovod
  3. Upgrade torch

If torch is upgraded but Horovod is not rebuilt afterwards, then Horovod will have been built against the wrong version of torch and will fail at runtime.
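For illustration, a small sketch (not part of this PR) of how such a stale build can be detected at runtime: when Horovod was compiled against an older torch, importing its torch bindings typically fails with an ImportError.

```python
def horovod_torch_bindings_ok() -> bool:
    """Return True if Horovod's torch extension loads against the installed torch.

    If torch was upgraded after Horovod was built, loading the compiled
    extension usually fails, which is the failure mode described above.
    """
    try:
        import horovod.torch  # noqa: F401
        return True
    except ImportError:
        return False
```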

Borda (Member) commented May 3, 2020

Thx, but how is it possible that the failures are somewhat random?
Also, since the last cache was created there has been no upgrade of torch or horovod.

tgaddair (Contributor, Author) commented May 3, 2020

Thx, but how is it possible that the failures are somewhat random?
Also, since the last cache was created there has been no upgrade of torch or horovod.

My guess is that the presence or absence of elements in the cache can vary based on a number of factors (like which host the test lands on, or the order in which different tests were run), so there will be times when we see a cache hit (which will cause the problem) and other times a cache miss (which will avoid it).

Borda (Member) commented May 3, 2020

@tgaddair I just reran the tests and they passed without any change, so I'm really curious what is happening with the cache...

Borda added the "ready" (PRs ready to be merged) label on May 3, 2020
mergify bot requested a review from a team on May 3, 2020 at 21:46
Borda (Member) commented May 4, 2020

@tgaddair I think that the cache skip is not needed anymore after #1725

tgaddair (Contributor, Author) commented May 4, 2020

@tgaddair I think that the cache skip is not needed anymore after #1725

Hey @Borda, I think that PR will fix it most of the time, but there are still a couple of cases where it could break:

  1. The latest version of PyTorch increases.
  2. The minimum version of PyTorch increases.

In either of those cases, the cache will miss on PyTorch but hit on Horovod, which will cause the issue.

I put together a quick change that should allow Horovod to use the cache optimistically, and only reinstall if it has to.
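The "optimistic cache" idea could look roughly like the sketch below (the real change lives in the CI scripts; the function name and exact commands are assumptions): keep the cached Horovod as long as its torch extension still imports, and rebuild it against the currently installed torch only when it does not.

```python
import subprocess
import sys

def ensure_horovod_matches_torch() -> None:
    """Reinstall Horovod only when the cached build no longer loads."""
    try:
        import horovod.torch  # noqa: F401
        return  # cached build still works against the installed torch
    except ImportError:
        pass

    # Rebuild Horovod against the torch version installed now, bypassing
    # pip's cache so the stale build is not reused.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--no-cache-dir", "--force-reinstall", "horovod",
    ])
```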

mergify bot (Contributor) commented May 4, 2020

This pull request is now in conflict... :(

tgaddair (Contributor, Author) commented May 4, 2020

Hey @Borda @williamFalcon, flakiness in the GitHub CI tests should be reduced with this change. Ready to land whenever you're ready.

williamFalcon merged commit f90afa2 into Lightning-AI:master on May 4, 2020
williamFalcon (Contributor) commented:

Do we need a better general solution? I thought we already made it so that the progress bar only showed on rank 0?

tgaddair (Contributor, Author) commented May 4, 2020

Do we need a better general solution? I thought we already made it so that the progress bar only showed on rank 0?

Currently it seems that every framework independently calls self.progress_bar_callback.disable() in its train function. I agree, it would be nice to generalize this across frameworks, along with a few other things (barriers, for instance).
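As a sketch of that generalization (class and method names here are hypothetical, not the actual Lightning API): the rank check and the disable() call could live once in a shared backend base class, so each backend only has to report its rank.

```python
class DistributedBackendBase:
    """Hypothetical shared base class for distributed backends."""

    def setup_progress_bar(self, trainer) -> None:
        # The pattern every backend currently repeats in its own train():
        # disable the progress bar everywhere except global rank 0.
        if self.global_rank() != 0 and trainer.progress_bar_callback is not None:
            trainer.progress_bar_callback.disable()

    def global_rank(self) -> int:
        raise NotImplementedError


class HorovodBackend(DistributedBackendBase):
    def global_rank(self) -> int:
        import horovod.torch as hvd
        return hvd.rank()
```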

tgaddair deleted the horovod-progress-bar branch on May 4, 2020 at 17:05
Borda (Member) commented May 4, 2020

Currently it seems that every framework independently calls self.progress_bar_callback.disable() in its train function. I agree, it would be nice to generalize this across frameworks, along with a few other things (barriers, for instance).

@awaelchli @hadim any idea? ^^

awaelchli (Contributor) commented:

I planned to add the zero_rank_only decorator to the progress bar callback and then get rid of the explicit .disable() calls. This would be consistent with how the other callbacks work. Given the comments above, horovod seems to be a special case where the rank must be determined in the init of the Trainer and not later as it is done in ddp.
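A rough sketch of that idea (the decorator, rank lookup, and hook names are illustrative only, not Lightning's implementation): wrap the progress bar's update hooks so they become no-ops on non-zero ranks, which removes the need for each backend to call .disable() explicitly. As noted above, this relies on the rank being known early enough, which is the tricky part for Horovod.

```python
import functools
import os

def rank_zero_only(fn):
    """Run the wrapped callback hook only on global rank 0 (illustrative)."""
    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        # How the rank is discovered differs per backend; an environment
        # variable is used here purely for illustration.
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
    return wrapped


class ProgressBar:
    @rank_zero_only
    def on_batch_end(self, trainer, pl_module):
        ...  # only rank 0 ever updates the bar
```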

Borda added this to the 0.7.6 milestone on May 5, 2020