
Fix disabling progress bar on non-zero ranks using Horovod backend #1709

Merged
15 commits merged into Lightning-AI:master on May 4, 2020

Conversation

tgaddair (Contributor) commented May 2, 2020

This PR also adds a barrier when restoring model weights, consistent with other distributed backends.
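For readers skimming the diff, here is a minimal sketch of the two behaviors the PR title and description mention: disabling the progress bar on non-zero ranks and adding a barrier before restoring weights. The surrounding names (horovod_train, restore_weights) are assumptions for illustration, not the actual Lightning code; only self.progress_bar_callback.disable() is taken from the discussion further down.

```python
import torch
import horovod.torch as hvd

def horovod_train(self, model):
    hvd.init()

    # Disable the progress bar on every rank except rank 0, so only a single
    # bar is rendered when training with the Horovod backend.
    if hvd.rank() != 0 and self.progress_bar_callback is not None:
        self.progress_bar_callback.disable()

    # Barrier before restoring model weights, consistent with the other
    # distributed backends. One common way to emulate a barrier with Horovod
    # is an allreduce on a dummy tensor: it returns only once every rank has
    # reached this point.
    hvd.allreduce(torch.tensor(0.0), name="barrier")
    self.restore_weights(model)  # hypothetical restore step for illustration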

mergify bot requested a review from a team on May 2, 2020 at 23:21
Borda added the "bug" (Something isn't working) label on May 3, 2020
pytorch_lightning/trainer/distrib_data_parallel.py (review comments resolved)
pytorch_lightning/trainer/trainer.py (review comments resolved)
mergify bot requested a review from a team on May 3, 2020 at 07:05
codecov bot commented May 3, 2020

Codecov Report

Merging #1709 into master will decrease coverage by 0%.
The diff coverage is 71%.

@@          Coverage Diff           @@
##           master   #1709   +/-   ##
======================================
- Coverage      88%     88%   -0%     
======================================
  Files          69      69           
  Lines        4135    4151   +16     
======================================
+ Hits         3659    3670   +11     
- Misses        476     481    +5     

Borda (Member) commented May 3, 2020

I have seen this error in other PRs as well:

________________ ERROR collecting tests/models/test_horovod.py _________________
tests/models/test_horovod.py:88: in <module>
    @pytest.mark.skipif(not _nccl_available(), reason="test requires Horovod with NCCL support")
tests/models/test_horovod.py:34: in _nccl_available
    return nccl_built(verbose=True)
/Users/runner/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/horovod/common/util.py:110: in wrapper
    retval = f(*args, **kwargs)
/Users/runner/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/horovod/common/util.py:155: in nccl_built
    raise RuntimeError('Failed to determine if NCCL support has been built. '
E   RuntimeError: Failed to determine if NCCL support has been built. Run again with --verbose for more details.

tgaddair (Contributor, Author) commented May 3, 2020

I have seen this error in other PRs as well:

________________ ERROR collecting tests/models/test_horovod.py _________________
tests/models/test_horovod.py:88: in <module>
    @pytest.mark.skipif(not _nccl_available(), reason="test requires Horovod with NCCL support")
tests/models/test_horovod.py:34: in _nccl_available
    return nccl_built(verbose=True)
/Users/runner/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/horovod/common/util.py:110: in wrapper
    retval = f(*args, **kwargs)
/Users/runner/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/horovod/common/util.py:155: in nccl_built
    raise RuntimeError('Failed to determine if NCCL support has been built. '
E   RuntimeError: Failed to determine if NCCL support has been built. Run again with --verbose for more details.

This is the same caching error we've seen before. Basically, something is causing this pattern:

  1. Install torch
  2. Install horovod
  3. Upgrade torch

If torch is upgraded but Horovod is not rebuilt afterwards, then Horovod will have been built against the wrong version of torch and will fail at runtime.
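For illustration, a small sketch (not part of this PR) of how such a stale build can be detected at runtime: when Horovod was compiled against an older torch, importing its torch bindings typically fails with an ImportError.

```python
def horovod_torch_bindings_ok() -> bool:
    """Return True if Horovod's torch extension loads against the installed torch.

    If torch was upgraded after Horovod was built, loading the compiled
    extension usually fails, which is the failure mode described above.
    """
    try:
        import horovod.torch  # noqa: F401
        return True
    except ImportError:
        return False
```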

Borda (Member) commented May 3, 2020

Thx, but how is it possible that the failures are somewhat random?
Also, since the last cache was created there has been no upgrade of torch or horovod.

tgaddair (Contributor, Author) commented May 3, 2020

Thx, but how is it possible that the failures are somewhat random?
Also, since the last cache was created there has been no upgrade of torch or horovod.

My guess is that the presence or absence of elements in the cache can vary based on a number of factors (like which host the test lands on, or the order in which different tests were run), so there will be times when we see a cache hit (which will cause the problem) and other times a cache miss (which will avoid it).

Borda (Member) commented May 3, 2020

@tgaddair I just reran the tests and they passed without any change, so I'm really curious what is happening with the cache...

Borda added the "ready" (PRs ready to be merged) label on May 3, 2020
mergify bot requested a review from a team on May 3, 2020 at 21:46
Borda (Member) commented May 4, 2020

@tgaddair I think that the cache skip is not needed anymore after #1725

tgaddair (Contributor, Author) commented May 4, 2020

@tgaddair I think that the cache skip is not needed anymore after #1725

Hey @Borda, I think that PR will fix it most of the time, but there are still a couple of cases where it could break:

  1. The latest version of PyTorch increases.
  2. The minimum version of PyTorch increases.

In either of those cases, the cache will miss on PyTorch but hit on Horovod, which will cause the issue.

I put together a quick change that should allow Horovod to use the cache optimistically, and only reinstall if it has to.
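The "optimistic cache" idea could look roughly like the sketch below (the real change lives in the CI scripts; the function name and exact commands are assumptions): keep the cached Horovod as long as its torch extension still imports, and rebuild it against the currently installed torch only when it does not.

```python
import subprocess
import sys

def ensure_horovod_matches_torch() -> None:
    """Reinstall Horovod only when the cached build no longer loads."""
    try:
        import horovod.torch  # noqa: F401
        return  # cached build still works against the installed torch
    except ImportError:
        pass

    # Rebuild Horovod against the torch version installed now, bypassing
    # pip's cache so the stale build is not reused.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--no-cache-dir", "--force-reinstall", "horovod",
    ])
```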

mergify bot (Contributor) commented May 4, 2020

This pull request is now in conflict... :(

tgaddair (Contributor, Author) commented May 4, 2020

Hey @Borda @williamFalcon, flakiness in the GitHub CI tests should be reduced with this change. Ready to land whenever you're ready.

williamFalcon merged commit f90afa2 into Lightning-AI:master on May 4, 2020
williamFalcon (Contributor) commented:

Do we need a better general solution? I thought we already made it so that the progress bar only showed on rank 0?

tgaddair (Contributor, Author) commented May 4, 2020

Do we need a better general solution? I thought we already made it so that the progress bar only showed on rank 0?

Currently it seems that every framework independently calls self.progress_bar_callback.disable() in its train function. I agree, it would be nice to generalize this across frameworks, along with a few other things (barriers, for instance).
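As a sketch of that generalization (class and method names here are hypothetical, not the actual Lightning API): the rank check and the disable() call could live once in a shared backend base class, so each backend only has to report its rank.

```python
class DistributedBackendBase:
    """Hypothetical shared base class for distributed backends."""

    def setup_progress_bar(self, trainer) -> None:
        # The pattern every backend currently repeats in its own train():
        # disable the progress bar everywhere except global rank 0.
        if self.global_rank() != 0 and trainer.progress_bar_callback is not None:
            trainer.progress_bar_callback.disable()

    def global_rank(self) -> int:
        raise NotImplementedError


class HorovodBackend(DistributedBackendBase):
    def global_rank(self) -> int:
        import horovod.torch as hvd
        return hvd.rank()
```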

tgaddair deleted the horovod-progress-bar branch on May 4, 2020 at 17:05
Borda (Member) commented May 4, 2020

Currently it seems that every framework independently calls self.progress_bar_callback.disable() in its train function. I agree, it would be nice to generalize this across frameworks, along with a few other things (barriers, for instance).

@awaelchli @hadim any idea? ^^

awaelchli (Contributor) commented:

I planned to add the zero_rank_only decorator to the progress bar callback and then get rid of the explicit .disable() calls. This would be consistent with how the other callbacks work. Given the comments above, horovod seems to be a special case where the rank must be determined in the init of the Trainer and not later as it is done in ddp.
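A rough sketch of that idea (the decorator, rank lookup, and hook names are illustrative only, not Lightning's implementation): wrap the progress bar's update hooks so they become no-ops on non-zero ranks, which removes the need for each backend to call .disable() explicitly. As noted above, this relies on the rank being known early enough, which is the tricky part for Horovod.

```python
import functools
import os

def rank_zero_only(fn):
    """Run the wrapped callback hook only on global rank 0 (illustrative)."""
    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        # How the rank is discovered differs per backend; an environment
        # variable is used here purely for illustration.
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
    return wrapped


class ProgressBar:
    @rank_zero_only
    def on_batch_end(self, trainer, pl_module):
        ...  # only rank 0 ever updates the bar
```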

Borda added this to the 0.7.6 milestone on May 5, 2020