
validation_epoch_end behavior with DDP #1479

Closed
VitorGuizilini opened this issue Apr 13, 2020 · 4 comments · Fixed by #2029
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

@VitorGuizilini
Contributor

I might be misunderstanding how PL works, but when using DDP the outputs my validation_epoch_end receives still only cover the batches from a single GPU; I thought they would be collated from all GPUs.
E.g. my validation dataset has 888 images, but when I validate on 8 GPUs (batch size of 1), I only get 111 batches in validation_epoch_end.
If that's correct, how can I produce metrics that combine information from all GPUs?

VitorGuizilini added the bug and help wanted labels on Apr 13, 2020
@WSzP

WSzP commented Apr 14, 2020

validation_step operates on a single batch of data from the validation set.
validation_epoch_end is called at the end of the validation epoch with the outputs of all validation steps.
So I believe your problem lies in validation_step. Can you show us your validation step, ideally the whole model?
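
For reference, a minimal sketch of how the two hooks fit together; the model and the `val_loss` key below are illustrative, not taken from this issue:

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        # Runs once per validation batch (per process when using DDP).
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        return {'val_loss': loss}

    def validation_epoch_end(self, outputs):
        # Receives the list of everything validation_step returned above;
        # under DDP this list only covers the batches seen by this process.
        avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': avg_loss}
```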

@VitorGuizilini
Contributor Author

VitorGuizilini commented Apr 14, 2020

I asked around and apparently that is the intended behaviour right now, i.e. validation_epoch_end is per-process, and we cannot access global information for metrics or logging. I was able to solve this by doing the all_reduce myself, with something like this:

```python
import torch.distributed as dist

# Move the metric to this process's GPU, then average it across all processes
metrics[key] = metrics[key].to('cuda:{}'.format(self.trainer.proc_rank))
dist.all_reduce(metrics[key])  # in-place sum over all processes
metrics[key] /= self.trainer.world_size
```

Not sure why I had to explicitly move the tensors to their process's device first (their devices were all set to -1).
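
For context, a sketch of how this workaround could sit inside validation_epoch_end; the `val_loss` key and the mean over outputs are illustrative, not from the original comment:

```python
import torch
import torch.distributed as dist
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def validation_epoch_end(self, outputs):
        # Under DDP each process only sees its own shard of the validation set,
        # so the per-process mean has to be averaged across processes by hand.
        metrics = {'val_loss': torch.stack([o['val_loss'] for o in outputs]).mean()}
        if dist.is_available() and dist.is_initialized():
            for key in metrics:
                metrics[key] = metrics[key].to('cuda:{}'.format(self.trainer.proc_rank))
                dist.all_reduce(metrics[key])            # sum across processes
                metrics[key] /= self.trainer.world_size  # turn the sum into a mean
        return metrics
```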

@xiadingZ

xiadingZ commented May 5, 2020

I would also like native support for this, e.g. an argument to validation_epoch_end that, when set to True, makes the metrics returned by validation_epoch_end be all_reduced and logged automatically, instead of only the per-process metrics.
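
Roughly something like the following; the `sync_dist` argument here is a hypothetical sketch of the requested opt-in, not an existing hook signature:

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    # Hypothetical signature: sync_dist is the requested opt-in, not a real argument.
    def validation_epoch_end(self, outputs, sync_dist=True):
        # With sync_dist=True, Lightning would all_reduce the returned metrics
        # across processes before logging them, instead of logging per-process values.
        avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': avg_loss}
```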

@hocop

hocop commented Mar 25, 2021

I am using the latest version, 1.2.2, and I get the same behavior.
@williamFalcon I see this was fixed in #2029, but I could not find how to make validation_epoch_end receive batches from all GPUs.
Could someone please give me a hint about what I can do to fix this?
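
One way to get a global metric on recent Lightning versions (a sketch, not an answer given in this thread): log the value with sync_dist=True so it is reduced across processes. This does not change what validation_epoch_end receives, but the logged epoch-level metric then covers all GPUs; the model and metric name are illustrative:

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        # sync_dist=True asks Lightning to reduce the logged value across
        # processes, so the epoch-level 'val_loss' reflects all GPUs, not one shard.
        self.log('val_loss', loss, on_epoch=True, sync_dist=True)
        return loss
```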
