
validation loops run the partial dataset with horovod #1684

Closed
thnkim opened this issue May 1, 2020 · 6 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)
Milestone
0.7.6

Comments

@thnkim commented May 1, 2020

Hello,
It seems to be the same issue as #1161.
When I use horovod, validation_step and validation_epoch_end are called multiple times.
Thank you.

@thnkim added the bug (Something isn't working) and help wanted (Open to be worked on) labels May 1, 2020
@thnkim changed the title from "validation and training loops run the partial dataset with horovod" to "validation loops run the partial dataset with horovod" May 1, 2020
@Borda (Member) commented May 1, 2020

@tgaddair pls ^^

@tgaddair (Contributor) commented May 1, 2020

I'll take a look.

@tgaddair (Contributor) commented May 1, 2020

Hey @thnkim, can you provide a minimum reproducible example that demonstrates the behavior you're describing?

I just ran a quick test with an MNIST dataset. With 1 GPU, it ran 3750 training steps and 1875 validation steps per epoch. With 2 GPUs, it ran 1876 training steps and 938 validation steps per worker, which is consistent with the expected behavior.

@thnkim (Author) commented May 2, 2020

Hi @tgaddair!
Thank you. It looks like I'm missing something.

As you mentioned, with 2 GPUs and horovod, my 1901 validation samples are split into 951 for one GPU and 951 (not 950) for the other.
Then validation_epoch_end() is called twice; the outputs are as follows:

Validation Accuracy: 95.0578% (904/951)
Validation Accuracy: 94.9527% (903/951)

I have two questions:

  1. How can I merge these two outputs into one?
  2. Since both processes handled 951 samples, I guess there will be one duplicated sample. Isn't that problematic?

Thank you!

@tgaddair (Contributor) commented May 2, 2020

Hey @thnkim, to answer your questions:

  1. With Horovod, every worker process is going to call validation_epoch_end() separately, which is why you're seeing it called twice (for -np 2). If you want to do something on only one of the workers, you can write some Horovod-specific code like this:

import horovod.torch as hvd

...

    def on_validation_end(self, outputs):
        if hvd.rank() == 0:
            # do something only on the first process
            ...

  2. That's the behavior of PyTorch's DistributedSampler, which is what PyTorch Lightning will use to distribute the dataset if you don't provide a sampler yourself. So one option would be to create your own sampler if it becomes an issue (see the sketch below), but in practice it shouldn't be a problem (the oversampled elements will change each epoch anyway). It would also be a good change to PyTorch itself to allow DistributedSampler not to pad every worker's sample list to the same length.
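
A minimal sketch of that custom-sampler option; the class name UnpaddedDistributedSampler and the strided-split strategy are illustrative, not an existing PyTorch API. Each worker gets a disjoint slice of the dataset and shorter slices are not padded, so no sample is evaluated twice per epoch:

from torch.utils.data import Sampler

class UnpaddedDistributedSampler(Sampler):
    """Gives worker `rank` the indices rank, rank + n, rank + 2n, ...
    without padding, so shards may differ in length by one sample."""

    def __init__(self, dataset, num_replicas, rank):
        self.dataset = dataset
        self.num_replicas = num_replicas
        self.rank = rank

    def __iter__(self):
        # strided split: no index appears on more than one worker
        return iter(range(self.rank, len(self.dataset), self.num_replicas))

    def __len__(self):
        return len(range(self.rank, len(self.dataset), self.num_replicas))

Because the shards can then have unequal lengths, any collective operation in the validation loop has to tolerate workers running different numbers of steps.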

@thnkim (Author) commented May 3, 2020

Thank you, @tgaddair!
I had guessed that the validation results from multiple processes were merged internally.
I did it using hvd.allreduce(). :)

And as for DistributedSampler, yes, it will not be a problem in my case.
Thank you again!
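
A minimal sketch of that hvd.allreduce() aggregation, assuming validation_step returns dicts with hypothetical "correct" and "total" counts for each worker's shard:

import horovod.torch as hvd
import torch

...

    def validation_epoch_end(self, outputs):
        # sum this worker's shard-local counts first
        correct = torch.tensor(sum(o["correct"] for o in outputs))
        total = torch.tensor(sum(o["total"] for o in outputs))
        # average=False turns the allreduce into a sum across workers
        correct = hvd.allreduce(correct, average=False)
        total = hvd.allreduce(total, average=False)
        if hvd.rank() == 0:
            acc = 100.0 * correct.item() / total.item()
            print(f"Validation Accuracy: {acc:.4f}% ({int(correct)}/{int(total)})")

Note that hvd.allreduce() is a collective, so every worker must call it; only the print is guarded by the rank check.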

@thnkim closed this as completed May 3, 2020
@Borda added this to the 0.7.6 milestone May 3, 2020