Log a warning / raise an error when Lightning replaces an existing Sampler #1995
Labels
- feature (Is an improvement or enhancement)
- good first issue (Good for newcomers)
- help wanted (Open to be worked on)
🚀 Feature
First of all, thanks for this awesome project! Really enjoying using it!
Feature Request: Log a warning or raise a MisconfigurationException when lightning replaces an existing sampler with DistributedSampler.
Even though this behaviour is documented, it's not intuitive. Also, if someone has defined a custom sampler, it will lead to very different training results on a single GPU vs. DDP. So it's best to warn the user to remove the sampler from their code when using replace_sampler_ddp=True.

Motivation
I'm pretty new to Lightning, and recently moved one of my models to it. Unaware of the details, I chose the DDP training strategy (since it was recommended). For a long time, my model was not converging on the val set, and it took me a lot of time to realise what was happening: Lightning had replaced my DataLoader's sampler with DistributedSampler! In my particular model, balanced sampling is key to convergence (since the data is highly imbalanced). Lightning's behaviour makes sense in hindsight, and is well documented, but a simple warning or error when this happens could have saved me a lot of debugging time.
Additional context
The code change is very trivial. We can take advantage of the fact that a DataLoader's default sampler is SequentialSampler (when shuffle=False and sampler=None). See data_loading.py:115.
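A minimal sketch of the proposed check (the helper name and the warning wording are my own, not Lightning's actual code; the real change would live inside data_loading.py):

```python
import warnings

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler


def warn_on_custom_sampler(dataloader: DataLoader) -> None:
    """Hypothetical helper: warn before a custom sampler is replaced.

    A DataLoader built with sampler=None gets a SequentialSampler (or a
    RandomSampler when shuffle=True), so any other sampler type was set
    explicitly by the user.
    """
    if not isinstance(dataloader.sampler, (SequentialSampler, RandomSampler)):
        warnings.warn(
            f"You set a custom sampler ({type(dataloader.sampler).__name__}) "
            "on your DataLoader, but it will be replaced by a "
            "DistributedSampler because replace_sampler_ddp=True. Pass "
            "replace_sampler_ddp=False to the Trainer to keep your sampler.",
            UserWarning,
        )
```

The same isinstance check could instead raise a MisconfigurationException if a hard error is preferred over a warning.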
Does this make sense? Happy to submit a PR if it does.
On a separate note, can anyone here explain why exactly DistributedSampler is needed when using DDP? I can see that it helps if I have shuffle=False. It probably also helps maintain the sanity of an epoch (since otherwise the effective epoch length becomes num_processes * epoch). But let's say I am doing random shuffling and am cognizant of the fact that my epoch sizes will be larger: is there any real reason I need to use DistributedSampler for DDP? Or am I missing something?