
validation loops run the partial dataset with horovod #1684

Closed
thnkim opened this issue May 1, 2020 · 6 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)
Milestone
0.7.6

Comments

@thnkim commented May 1, 2020

Hello,
It seems to be the same issue as #1161.
When I use horovod, validation_step and validation_epoch_end are called multiple times.
Thank you.

@thnkim added the bug (Something isn't working) and help wanted (Open to be worked on) labels May 1, 2020
@thnkim changed the title from "validation and training loops run the partial dataset with horovod" to "validation loops run the partial dataset with horovod" May 1, 2020
@Borda (Member) commented May 1, 2020

@tgaddair pls ^^

@tgaddair (Contributor) commented May 1, 2020

I'll take a look.

@tgaddair (Contributor) commented May 1, 2020

Hey @thnkim, can you provide a minimum reproducible example that demonstrates the behavior you're describing?

I just ran a quick test with an MNIST dataset. With 1 GPU, it ran 3750 training steps and 1875 validation steps per epoch. With 2 GPUs, it ran 1876 training steps and 938 validation steps per worker, which is consistent with the expected behavior.

@thnkim (Author) commented May 2, 2020

Hi @tgaddair!
Thank you. It looks like I'm missing something.

As you mentioned, with 2 GPUs and horovod, my 1901 validation samples are split into 951 for one GPU and 951 (not 950) for the other.
Then validation_epoch_end() is called twice; the outputs are as follows:

Validation Accuracy: 95.0578% (904/951)
Validation Accuracy: 94.9527% (903/951)

I have two questions:

  1. How can I merge these two outputs into one?
  2. Since both processes handled 951 samples, I guess there will be one duplicated sample. Isn't that problematic?

Thank you!

@tgaddair (Contributor) commented May 2, 2020

Hey @thnkim, to answer your questions:

  1. With Horovod, every worker process is going to call validation_epoch_end() separately, which is why you're seeing it called twice (for -np 2). If you want to do something on only one of the workers, you can write some Horovod-specific code like this:

import horovod.torch as hvd

...

    def on_validation_end(self, outputs):
        if hvd.rank() == 0:
            # do something only on the first process
            ...

  2. That's the behavior of PyTorch's DistributedSampler, which is what PyTorch Lightning will use to distribute the dataset if you don't provide a sampler yourself. So one option would be to create your own sampler if it becomes an issue (see the sketch below), but in practice it shouldn't be a problem (the oversampled elements will change each epoch anyway). It would also be a good change to PyTorch itself to allow DistributedSampler not to pad every worker's sample list to the same length.
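
A minimal sketch of that custom-sampler option; the class name UnpaddedDistributedSampler and the strided-split strategy are illustrative, not an existing PyTorch API. Each worker gets a disjoint slice of the dataset and shorter slices are not padded, so no sample is evaluated twice per epoch:

from torch.utils.data import Sampler

class UnpaddedDistributedSampler(Sampler):
    """Gives worker `rank` the indices rank, rank + n, rank + 2n, ...
    without padding, so shards may differ in length by one sample."""

    def __init__(self, dataset, num_replicas, rank):
        self.dataset = dataset
        self.num_replicas = num_replicas
        self.rank = rank

    def __iter__(self):
        # strided split: no index appears on more than one worker
        return iter(range(self.rank, len(self.dataset), self.num_replicas))

    def __len__(self):
        return len(range(self.rank, len(self.dataset), self.num_replicas))

Because the shards can then have unequal lengths, any collective operation in the validation loop has to tolerate workers running different numbers of steps.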

@thnkim (Author) commented May 3, 2020

Thank you, @tgaddair!
I had guessed that the validation results from multiple processes were merged internally.
I did it using hvd.allreduce(). :)

And as for DistributedSampler, yes, it will not be a problem in my case.
Thank you again!
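
A minimal sketch of that hvd.allreduce() aggregation, assuming validation_step returns dicts with hypothetical "correct" and "total" counts for each worker's shard:

import horovod.torch as hvd
import torch

...

    def validation_epoch_end(self, outputs):
        # sum this worker's shard-local counts first
        correct = torch.tensor(sum(o["correct"] for o in outputs))
        total = torch.tensor(sum(o["total"] for o in outputs))
        # average=False turns the allreduce into a sum across workers
        correct = hvd.allreduce(correct, average=False)
        total = hvd.allreduce(total, average=False)
        if hvd.rank() == 0:
            acc = 100.0 * correct.item() / total.item()
            print(f"Validation Accuracy: {acc:.4f}% ({int(correct)}/{int(total)})")

Note that hvd.allreduce() is a collective, so every worker must call it; only the print is guarded by the rank check.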

@thnkim closed this as completed May 3, 2020
@Borda added this to the 0.7.6 milestone May 3, 2020