
Keep the setting of user created DataLoader in replacing DistributedSampler #4650

Closed
wlkz opened this issue Nov 13, 2020 · 4 comments · Fixed by #5139
Labels
feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@wlkz

wlkz commented Nov 13, 2020

🚀 Feature

Motivation

As mentioned in #2789, the default behavior of replace_sampler_ddp is to create a new DistributedSampler, with the shuffle setting depending on the kind of dataloader (train vs. val/test). However, this behavior overrides the settings of the user-defined dataloader, such as shuffle or drop_last. A more reasonable solution is to read these settings directly from the user-created dataloader and apply the same settings to the DistributedSampler.

Pitch

For example, we can infer the shuffle setting from dataloader.sampler: if it is an instance of SequentialSampler, then shuffle=False.
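A minimal sketch of this inference, assuming the standard torch samplers: DataLoader(shuffle=True) installs a RandomSampler and shuffle=False a SequentialSampler, so the user's intent can be read off the sampler type (infer_shuffle is a hypothetical helper, not Lightning API):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

def infer_shuffle(dataloader: DataLoader) -> bool:
    # True only when the user's loader shuffles, i.e. uses a RandomSampler.
    return isinstance(dataloader.sampler, RandomSampler)

ds = TensorDataset(torch.arange(10))
shuffled = DataLoader(ds, shuffle=True)   # sampler is RandomSampler
ordered = DataLoader(ds, shuffle=False)   # sampler is SequentialSampler
```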

Alternatives

Set replace_sampler_ddp=False, and handle it by hand.
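Under this alternative, the sampler is built by hand so that shuffle and drop_last stay exactly as set, and the resulting loader is passed to the Trainer with replace_sampler_ddp=False. A sketch (num_replicas/rank are fixed here for illustration; in real DDP they come from the process group):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

ds = TensorDataset(torch.arange(10))
# Explicit sampler: shuffle/drop_last are whatever the user chooses.
sampler = DistributedSampler(ds, num_replicas=2, rank=0, shuffle=False)
loader = DataLoader(ds, batch_size=2, sampler=sampler, drop_last=True)
```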

Additional context

@wlkz wlkz added feature Is an improvement or enhancement help wanted Open to be worked on labels Nov 13, 2020
@github-actions
Contributor

Hi! thanks for your contribution!, great first issue!

@rohitgr7
Contributor

Set replace_sampler_ddp=False, and handle it by hand.

I think this is the ideal way to handle this. Else we will make things more complicated.

@wlkz
Author

wlkz commented Nov 14, 2020

What concerns me more is the side effect of the default setting replace_sampler_ddp=True. Neither Getting started - Lightning in 2 steps nor Common Use Cases - Multi-GPU training describes the magic of sampler replacement. In most cases it seems correct, but an unexpected setting change sometimes makes things worse. For example, some loss functions are batch-size sensitive, so the flag drop_last=True must be set on the dataloader. However, the replaced DistributedSampler ignores this setting, which produces an unexpected gradient update on the last batch. In this situation the error is hard to debug, as training works well on a single GPU but fails in DDP training. Things won't get better until you dig through the lengthy API reference of the Trainer and find that the replaced sampler messes everything up.
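The drop_last concern above can be illustrated in plain PyTorch: without drop_last=True the final batch is smaller than the rest, which is exactly what a batch-size-sensitive loss trips over when the flag is silently lost.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(10))  # 10 samples, batch_size=4

# drop_last defaults to False: the last batch has only 2 samples.
sizes = [len(batch[0]) for batch in DataLoader(ds, batch_size=4)]

# With drop_last=True every batch has exactly 4 samples.
sizes_dropped = [len(batch[0]) for batch in DataLoader(ds, batch_size=4, drop_last=True)]
```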

Set replace_sampler_ddp=False, and handle it by hand.

I think this is the ideal way to handle this. Else we will make things more complicated.

I totally agree with you. If it is the user's responsibility to handle this, I suggest adding a brief description of replace_sampler_ddp to Common Use Cases - Multi-GPU training, telling users that the drop_last and shuffle settings will be reset and that they should handle it by hand in this situation.

@stale

stale bot commented Dec 14, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Dec 14, 2020
@rohitgr7 rohitgr7 mentioned this issue Dec 14, 2020
11 tasks
@rohitgr7 rohitgr7 removed the won't fix This will not be worked on label Dec 16, 2020