
Fixing sampler logic for ddp with iterable dataset #1734

Merged 1 commit · May 5, 2020

Conversation

@twangnyc (Contributor) commented May 5, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes training with an iterable dataset under DDP.
Currently, when using an iterable dataset without explicitly setting a sampler, self.train_dataloader.sampler defaults to PyTorch's _InfiniteConstantSampler. This sampler has no 'set_epoch' attribute, which produces the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/root/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 372, in ddp_train
    self.run_pretrain_routine(model)
  File "/root/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 914, in run_pretrain_routine
    self.train()
  File "/root/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 332, in train
    self.train_dataloader.sampler.set_epoch(epoch)
AttributeError: '_InfiniteConstantSampler' object has no attribute 'set_epoch'
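
For context, here is a minimal sketch (the StreamDataset class is a hypothetical example, not part of this PR) showing PyTorch's default sampler for iterable-style datasets:

from torch.utils.data import DataLoader, IterableDataset

class StreamDataset(IterableDataset):
    # hypothetical iterable-style dataset used only for illustration
    def __iter__(self):
        return iter(range(10))

loader = DataLoader(StreamDataset(), batch_size=4)
print(type(loader.sampler).__name__)         # _InfiniteConstantSampler
print(hasattr(loader.sampler, 'set_epoch'))  # False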

The bug is caused by operator precedence between or and and in the following code snippet:

if self.use_ddp or self.use_horovod \
    and hasattr(self.train_dataloader.sampler, 'set_epoch'):

Because and binds more tightly than or, the condition is evaluated as self.use_ddp or (self.use_horovod and hasattr(...)), so the hasattr check is bypassed whenever DDP is enabled:

True or False and False    # evaluates to True
(True or False) and False  # evaluates to False
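
The fix, as a sketch of the intended change, is to group the distributed-mode checks so the hasattr guard always applies:

if (self.use_ddp or self.use_horovod) \
        and hasattr(self.train_dataloader.sampler, 'set_epoch'):
    self.train_dataloader.sampler.set_epoch(epoch)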

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team May 5, 2020 04:23
@codecov codecov bot commented May 5, 2020

Codecov Report

Merging #1734 into master will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #1734   +/-   ##
======================================
  Coverage      88%     88%           
======================================
  Files          69      69           
  Lines        4151    4151           
======================================
  Hits         3661    3661           
  Misses        490     490           

@Borda Borda added the bug Something isn't working label May 5, 2020
@Borda Borda added this to the 0.7.6 milestone May 5, 2020
@Borda Borda added the ready PRs ready to be merged label May 5, 2020
@mergify mergify bot requested a review from a team May 5, 2020 07:24
@ethanwharris (Member) left a comment


LGTM

@mergify mergify bot requested a review from a team May 5, 2020 11:01
@williamFalcon williamFalcon merged commit d6a0375 into Lightning-AI:master May 5, 2020
@twangnyc twangnyc deleted the fix_sampler_attr_ddp branch May 11, 2020 06:34
Labels
bug Something isn't working ready PRs ready to be merged
4 participants