
validation_epoch_end behavior with DDP #1479

Closed
VitorGuizilini opened this issue Apr 13, 2020 · 4 comments · Fixed by #2029
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

@VitorGuizilini
Contributor

I might be misunderstanding how PL works, but when using DDP the outputs my validation_epoch_end receives still only cover the batches from a single GPU; I thought they would be collated from all GPUs.
E.g. my validation dataset has 888 images, but when I validate on 8 GPUs (batch size of 1), I only get 111 batches in validation_epoch_end.
If that's correct, how can I produce metrics that combine information from all GPUs?

VitorGuizilini added the bug and help wanted labels on Apr 13, 2020
@WSzP

WSzP commented Apr 14, 2020

validation_step operates on a single batch of data from the validation set.
validation_epoch_end is called at the end of the validation epoch with the outputs of all validation steps.
So I believe your problem lies in validation_step. Can you show us your validation step, ideally the whole model?
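
For reference, a minimal sketch of how the two hooks fit together; the model and the `val_loss` key below are illustrative, not taken from this issue:

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        # Runs once per validation batch (per process when using DDP).
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        return {'val_loss': loss}

    def validation_epoch_end(self, outputs):
        # Receives the list of everything validation_step returned above;
        # under DDP this list only covers the batches seen by this process.
        avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': avg_loss}
```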

@VitorGuizilini
Contributor Author

VitorGuizilini commented Apr 14, 2020

I asked around and apparently that is the intended behaviour right now, i.e. validation_epoch_end is per-process, and we cannot access global information for metrics or logging. I was able to solve this by doing the all_reduce myself, with something like this:

```python
import torch.distributed as dist

# Move the metric to this process's GPU, then average it across all processes
metrics[key] = metrics[key].to('cuda:{}'.format(self.trainer.proc_rank))
dist.all_reduce(metrics[key])  # in-place sum over all processes
metrics[key] /= self.trainer.world_size
```

Not sure why I had to explicitly move the tensors to their process's device first (their devices were all set to -1).
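
For context, a sketch of how this workaround could sit inside validation_epoch_end; the `val_loss` key and the mean over outputs are illustrative, not from the original comment:

```python
import torch
import torch.distributed as dist
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def validation_epoch_end(self, outputs):
        # Under DDP each process only sees its own shard of the validation set,
        # so the per-process mean has to be averaged across processes by hand.
        metrics = {'val_loss': torch.stack([o['val_loss'] for o in outputs]).mean()}
        if dist.is_available() and dist.is_initialized():
            for key in metrics:
                metrics[key] = metrics[key].to('cuda:{}'.format(self.trainer.proc_rank))
                dist.all_reduce(metrics[key])            # sum across processes
                metrics[key] /= self.trainer.world_size  # turn the sum into a mean
        return metrics
```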

@xiadingZ

xiadingZ commented May 5, 2020

I would also like native support for this, e.g. an argument to validation_epoch_end that, when set to True, makes the metrics returned by validation_epoch_end be all_reduced and logged automatically, instead of only the per-process metrics.
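
Roughly something like the following; the `sync_dist` argument here is a hypothetical sketch of the requested opt-in, not an existing hook signature:

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    # Hypothetical signature: sync_dist is the requested opt-in, not a real argument.
    def validation_epoch_end(self, outputs, sync_dist=True):
        # With sync_dist=True, Lightning would all_reduce the returned metrics
        # across processes before logging them, instead of logging per-process values.
        avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': avg_loss}
```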

@hocop

hocop commented Mar 25, 2021

I am using the latest version, 1.2.2, and I get the same behavior.
@williamFalcon I see this was fixed in #2029, but I could not find how to make validation_epoch_end receive batches from all GPUs.
Could someone please give me a hint about what I can do to fix this?
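
One way to get a global metric on recent Lightning versions (a sketch, not an answer given in this thread): log the value with sync_dist=True so it is reduced across processes. This does not change what validation_epoch_end receives, but the logged epoch-level metric then covers all GPUs; the model and metric name are illustrative:

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        # sync_dist=True asks Lightning to reduce the logged value across
        # processes, so the epoch-level 'val_loss' reflects all GPUs, not one shard.
        self.log('val_loss', loss, on_epoch=True, sync_dist=True)
        return loss
```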
