Aggregate output of validation_end across all ddp processes #243
In validation_end I want to gather all outputs from validation_step, then use another file to calculate scores. But when I use multi-GPU DDP mode, I find that it launches multiple processes of validation_end to calculate the scores. How can I make it be called only once?

Comments
Hi, the way DDP works, each process is walled off from the others. I'm not sure you can transfer arbitrary tensors around, but we can look into it (likely using the dist library). validation_end will be called by every process to calculate all the scores. It would be helpful to know what you are trying to do, so we can see how to modify the code.
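For context, a minimal sketch of moving a tensor between DDP processes with the dist library (the tensor values are made up; it assumes the process group is already initialized, as it is inside Lightning's DDP workers):

import torch
import torch.distributed as dist

# DDP processes are isolated, so copying a tensor between them takes an
# explicit collective. Here rank 0's tensor ends up on every process.
t = torch.zeros(3, device="cuda")
if dist.get_rank() == 0:
    t = torch.tensor([1.0, 2.0, 3.0], device="cuda")
dist.broadcast(t, src=0)  # after this call, every rank holds rank 0's values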
It seems that what I want is to run the test inside the validation function. I found a workaround to do this, but only when …
If I want to run validation_end only once and record its score, instead of running it on each process, how can I do that?
So, let's break this down a bit.

Case 1 (current): each DDP process runs validation_end on the outputs from its own shard of the validation data, so every process computes its own scores.

Case 2 (I think this is the one we need to support): the outputs of validation_step are aggregated across all processes, and the scores are computed once over the full validation set.

If your data are uniformly shuffled and your batches are big enough, case 1 and case 2 are almost identical. If your batch is too small, the estimate will be a little bit off.
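To make the gap concrete, a small worked example with made-up numbers (two processes, unequal shard sizes):

# Hypothetical numbers: why case 1 (average of per-process averages) can
# differ from case 2 (average over all samples) when shards are unequal.
shard_a = [0.90, 0.80]   # 2 validation scores on process 0
shard_b = [0.50]         # 1 validation score on process 1

case1 = (sum(shard_a) / len(shard_a) + sum(shard_b) / len(shard_b)) / 2
case2 = sum(shard_a + shard_b) / len(shard_a + shard_b)

print(case1)  # 0.675     - mean of per-process means
print(case2)  # 0.7333... - true mean over the full validation set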
Agree that case 2 is important to support, and should maybe even be the default. That seems like what a new user would expect.
Yeah, agreed. I'll turn this into a ticket. The first approach I can think of is to make this change (all_reduce sums in place, so dividing by the world size gives the mean):

import torch.distributed as dist

out = validation_end(outputs)
for k, v in out.items():
    dist.all_reduce(v)                  # in-place sum of v across all processes
    out[k] = v / dist.get_world_size()  # average the summed metric
@neggert does it also make sense to do the same after training_step? I'm a bit concerned about the speed impact of adding these calls... so maybe only validation_end needs it, as it's called once per validation cycle?
I can't think of a good reason to do it after a training step. Any metrics measured batch-by-batch are going to be noisy anyway, so only taking numbers from one process shouldn't make much difference. It's possible there's some use case I haven't thought of, though.
I guess I'll also note that averaging across nodes isn't always going to be the right thing to do. Metric learning problems, for instance, actually need a full set of feature vectors to compute recall and NMI. The cleanest thing IMO would be to collect …
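A rough sketch of that kind of collection with torch.distributed (assumes an initialized process group and feature tensors of the same shape on every process; all_gather needs pre-allocated buffers):

import torch
import torch.distributed as dist

# Sketch: gather per-process feature tensors so set-level metrics
# (recall, NMI) can be computed over the full validation set.
def gather_features(features):
    buffers = [torch.zeros_like(features) for _ in range(dist.get_world_size())]
    dist.all_gather(buffers, features)  # every process receives every shard
    return torch.cat(buffers, dim=0)    # full set of feature vectors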
Also having this issue. Temporarily disabling the DistributedSampler so I can have a full validation set for each process.
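For reference, a rough sketch of that workaround (self.val_dataset and the batch size are placeholders; in Lightning versions that auto-inject a DistributedSampler, the trainer also has to be told not to, e.g. via replace_sampler_ddp=False where that flag exists):

from torch.utils.data import DataLoader

# Sketch: return a plain loader so every DDP process iterates the whole
# validation set instead of a per-rank shard.
def val_dataloader(self):
    return DataLoader(self.val_dataset, batch_size=32, shuffle=False)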
Hello @neggert @williamFalcon, did an update happen for this? I'm also trying to aggregate validation_end across DDP GPU processes on a single node. I'm trying to use dist.all_gather or dist.all_gather_multigpu, but I'm really unsure how it is done with PyTorch Lightning. I also think new users would expect info on how to aggregate from multiple processes on GPUs.
moving the discussion to #702 |