
Raise an error when lightning replaces an existing sampler #2020

Conversation

@devashishshankar (Contributor) commented on May 30, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1995

Currently, the Trainer replaces the existing sampler with a DistributedSampler when running distributed training and `replace_sampler_ddp=True` (the default behaviour). If a user has configured their own sampler, this can lead to significantly different results between distributed and non-distributed training.

This PR fixes this by raising an error if the user has configured a sampler and uses `replace_sampler_ddp=True`. The recommended behaviour from now on is to either remove the sampler or set `replace_sampler_ddp=False`.
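
As a rough, user-facing illustration of the change (a sketch only: argument names follow the 0.8-era Trainer API, it assumes a machine with at least two GPUs, and `model` stands in for any LightningModule):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler
from pytorch_lightning import Trainer

# Toy imbalanced dataset; the custom sampler oversamples the rare class.
x = torch.randn(1000, 8)
y = (torch.rand(1000) > 0.9).long()
sampler = WeightedRandomSampler(weights=1.0 + 8.0 * y.float(), num_samples=len(y))
loader = DataLoader(TensorDataset(x, y), batch_size=32, sampler=sampler)

# With the default replace_sampler_ddp=True, DDP training used to swap `sampler`
# for a DistributedSampler silently; after this PR it raises an error instead.
# trainer = Trainer(gpus=2, distributed_backend="ddp")
# trainer.fit(model, loader)  # `model` is a placeholder LightningModule

# Recommended fixes: drop the custom sampler, or keep it and opt out of replacement:
trainer = Trainer(gpus=2, distributed_backend="ddp", replace_sampler_ddp=False)
```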

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@devashishshankar (Contributor, Author) commented:

@Borda I didn't realize that DistributedSampler has shuffle enabled by default! So my condition actually only makes sense when someone is using a custom sampler (i.e. something other than SequentialSampler/RandomSampler). I think that's a neater way to solve the problem, and it doesn't require any changes to the existing tests. I've updated the PR accordingly.

@pep8speaks commented on May 30, 2020

Hello @devashishshankar! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-01 20:47:58 UTC

@devashishshankar force-pushed the feature/1995_log_error_replace_sampler branch from 84175f6 to dc076ac on May 30, 2020 20:46
@devashishshankar force-pushed the feature/1995_log_error_replace_sampler branch from dc076ac to 2a849c3 on May 30, 2020 20:52
@devashishshankar marked this pull request as ready for review on May 30, 2020 21:00
@codecov (bot) commented on May 31, 2020

Codecov Report

Merging #2020 into master will increase coverage by 0%.
The diff coverage is 92%.

@@          Coverage Diff           @@
##           master   #2020   +/-   ##
======================================
  Coverage      88%     88%           
======================================
  Files          74      74           
  Lines        4654    4658    +4     
======================================
+ Hits         4076    4080    +4     
  Misses        578     578           

@@ -113,39 +113,48 @@ def auto_add_sampler(self, dataloader: DataLoader, train: bool) -> DataLoader:
        need_dist_sampler = (self.use_ddp or self.use_ddp2 or self.use_horovod or self.use_tpu)

        if self.replace_sampler_ddp and need_dist_sampler:
            if not isinstance(dataloader.sampler, (SequentialSampler, RandomSampler)):

Member commented:
This way you would also have to include SubsetRandomSampler and WeightedRandomSampler, I guess.

@devashishshankar (Contributor, Author) replied:
Hmm, shouldn't a warning really be raised if one of these is getting replaced by a DistributedSampler? The original motivation for this issue was that I had a WeightedRandomSampler, and Lightning replaced it, leading to a drop in validation accuracy. It took me a long time to debug. Hence I thought it would be a good idea to raise an error when an existing sampler is replaced.

Also, if someone uses one of these samplers, won't they get different results with distributed vs. non-distributed training?
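
To make the concern above concrete, here is a small self-contained comparison (a toy dataset made up for illustration, not the one from the linked issue) showing how the effective class balance changes when a WeightedRandomSampler is swapped for a DistributedSampler:

```python
import torch
from torch.utils.data import TensorDataset, WeightedRandomSampler
from torch.utils.data.distributed import DistributedSampler

# Toy imbalanced dataset: roughly 10% positive labels.
torch.manual_seed(0)
y = (torch.rand(10_000) > 0.9).long()
dataset = TensorDataset(torch.randn(10_000, 4), y)

# The user's WeightedRandomSampler rebalances classes to roughly 50/50 per epoch.
weights = 1.0 + 8.0 * y.float()  # positives get 9x the weight of negatives
weighted = WeightedRandomSampler(weights, num_samples=len(y))
print("positive rate with WeightedRandomSampler:",
      y[list(weighted)].float().mean().item())  # ~0.5

# A DistributedSampler only shards (and optionally shuffles) the raw indices,
# so each replica sees the original ~10% positive rate instead.
sharded = DistributedSampler(dataset, num_replicas=2, rank=0)
print("positive rate with DistributedSampler:",
      y[list(sharded)].float().mean().item())  # ~0.1
```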

Member replied:
Maybe you're right about that :)
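
For reference, a minimal standalone sketch of the guard being discussed in this thread; the function name, exception type, and message below are illustrative and need not match the merged code exactly:

```python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

def check_sampler_replaceable(dataloader: DataLoader, replace_sampler_ddp: bool,
                              need_dist_sampler: bool) -> None:
    """Refuse to silently override a sampler the user configured deliberately."""
    if replace_sampler_ddp and need_dist_sampler:
        # SequentialSampler/RandomSampler are the DataLoader defaults, so replacing
        # them with a (shuffling) DistributedSampler is harmless; anything else
        # (e.g. WeightedRandomSampler, SubsetRandomSampler) was set on purpose.
        if not isinstance(dataloader.sampler, (SequentialSampler, RandomSampler)):
            raise ValueError(
                "You seem to have configured a sampler in your DataLoader, which will "
                "be replaced by a DistributedSampler. Either remove the sampler from "
                "your DataLoader or set `replace_sampler_ddp=False` on the Trainer."
            )
```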

@Borda added the feature (Is an improvement or enhancement) label on Jun 1, 2020
@Borda added this to the 0.8.0 milestone on Jun 1, 2020
Resolved (outdated) review threads on pytorch_lightning/trainer/data_loading.py (two threads) and tests/trainer/test_dataloaders.py.
@Borda (Member) left a review:

LGTM 🚀

@williamFalcon merged commit ade3f36 into Lightning-AI:master on Jun 2, 2020
justusschock pushed a commit that referenced this pull request Jun 29, 2020
* Raise an error when lightning replaces an existing sampler

* Fix tests

* Simpler fix

* Fix tests

* Make inner method protected

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Labels
feature Is an improvement or enhancement
Development

Successfully merging this pull request may close these issues.

Log a warning/ raise an error when lightning replaces an existing Sampler
6 participants