
Aggregate output of validation_end across all ddp processes #243

Closed
xiadingZ opened this issue Sep 23, 2019 · 12 comments
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@xiadingZ

In validation_end I want to gather all outputs from validation_step and then use another file to calculate scores. But when I use multi-GPU ddp mode, I find that validation_end is launched in multiple processes, each calculating scores. How can I make it run only once?

@xiadingZ added the question label Sep 23, 2019
@williamFalcon
Contributor

hi, the way ddp works, each process is walled off from the others. I’m not sure you can transfer arbitrary tensors around, but we can look into it (likely using the dist library).

validation_end will be called in every process to calculate all the scores.

It would be helpful to know what you are trying to do, so we can see how to modify the code.

@xiadingZ
Author

xiadingZ commented Sep 23, 2019

Essentially I want to run the test inside the validation function. I found a workaround: only do this when self.trainer.proc_rank == 0 in validation_end.
But there is another problem:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 815, in ddp_train
    self.__run_pretrain_routine(model)
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 925, in __run_pretrain_routine
    self.__train()
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 953, in __train
    self.run_tng_epoch()
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1004, in run_tng_epoch
    self.__run_evaluation(test=self.testing)
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1292, in __run_evaluation
    self.__add_tqdm_metrics(eval_out_metrics)
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 457, in __add_tqdm_metrics
    for k, v in metrics.items():
AttributeError: 'NoneType' object has no attribute 'items'

If I want to run validation_end only once and record its score, instead of running it in every process, how can I do that?
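A minimal sketch of that workaround (compute_scores is a hypothetical stand-in for the external scoring script, and validation_step is assumed to return a dict with a 'val_loss' tensor): the trainer expects validation_end to return a metrics dict in every process, so returning something even on the non-zero ranks avoids the NoneType crash above.

import torch

def validation_end(self, outputs):
    # average the per-batch losses that validation_step returned
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()

    if self.trainer.proc_rank == 0:
        # run the expensive external scoring only on rank 0
        score = compute_scores(outputs)   # hypothetical external scorer
    else:
        score = torch.tensor(0.)          # dummy value on the other ranks

    return {'val_loss': avg_loss, 'score': score}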

@williamFalcon
Contributor

so, let’s break this down a bit.

Case 1 (current):
Each process calls validation_end. If you log anything or monitor early stopping, only the metrics from proc 0 are used. This is equivalent to using only 1/nb_process of the data in that calculation.

Case 2 (i think this is the one we need to support):
Same as above, but use dist.all_reduce for each metric returned from validation_end to get the full value instead of the estimated 1/nb_process value.

example:
Case 1:
you calculate accuracy. The accuracy you log (only from proc 0) is then an estimate of the full accuracy.

Case 2:
you calculate accuracy, and dist.all_reduce gives you the full accuracy rather than an estimate.

If your data are uniformly shuffled and your batches are big enough, cases 1 and 2 are almost identical. If your batches are too small, the estimate will be a little off.
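A rough sketch of Case 2, assuming validation_step returns a dict with a 'val_acc' tensor and the DDP process group is already initialized (with the nccl backend, the tensor would need to live on the process's GPU):

import torch
import torch.distributed as dist

def validation_end(self, outputs):
    # each process averages accuracy over its own shard of the validation data
    acc = torch.stack([x['val_acc'] for x in outputs]).mean()

    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(acc, op=dist.ReduceOp.SUM)   # in-place sum across processes
        acc /= dist.get_world_size()                 # turn the sum into a mean

    return {'val_acc': acc}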

@neggert
Contributor

neggert commented Sep 23, 2019

Agree that case 2 is important to support, and should maybe even be the default. That seems like what a new user would expect.

@williamFalcon
Contributor

williamFalcon commented Sep 23, 2019

Yeah, agreed. I'll turn this into a ticket.

The first approach I can think of is to make this change:

import torch.distributed as dist

out = validation_end(outputs)
for k, v in out.items():
    dist.all_reduce(v)                   # sums v in place across processes (all_reduce returns None)
    out[k] = v / dist.get_world_size()   # average over processes

@williamFalcon changed the title from "How to call validation_end only once when using 'ddp' on single node" to "Aggregate output of validation_end across all ddp processes" Sep 23, 2019
@williamFalcon
Contributor

williamFalcon commented Sep 23, 2019

@neggert does it also make sense to do the same after training_step? I'm a bit concerned about the speed impact of adding these calls... so maybe only validation_end needs it as it's called once per validation cycle?

@williamFalcon added the feature and help wanted labels and removed the question label Sep 23, 2019
@neggert
Contributor

neggert commented Sep 24, 2019

I can't think of a good reason to do it after a training step. Any metrics measured batch-by-batch are going to be noisy anyway, so only taking numbers from one process shouldn't make much difference. It's possible there's some use case I haven't thought of, though.

@neggert
Contributor

neggert commented Sep 24, 2019

I guess I'll also note that averaging across nodes isn't always going to be the right thing to do. Metric learning problems, for instance, actually need a full set of feature vectors to compute recall and NMI.

The cleanest thing IMO would be to collect validation_step outputs from all nodes using something along the lines of all_gather, then pass the whole thing to validation_end on each node. I'm not sure how to actually do this, though.
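Something along those lines could look like the sketch below (a hypothetical helper, assuming every process contributes a feature tensor of the same shape, since all_gather requires equal sizes):

import torch
import torch.distributed as dist

def gather_validation_features(features):
    # one buffer per process, then gather every rank's tensor into them
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(features) for _ in range(world_size)]
    dist.all_gather(gathered, features)
    # every process now holds the feature vectors of the full validation set
    return torch.cat(gathered, dim=0)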

@williamFalcon
Contributor

FYI @adefazio.

@neggert Let's go for this behavior:

  • reduce each metric on validation_step only.

@magic282

magic282 commented Dec 7, 2019

Also having this issue. I'm temporarily disabling the DistributedSampler so that each process has the full validation set.
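For reference, that workaround amounts to returning a plain DataLoader for validation, something like the sketch below (self.val_dataset and the batch size are placeholders; depending on the Lightning version, the trainer may inject a DistributedSampler automatically, which would also need to be turned off):

from torch.utils.data import DataLoader

def val_dataloader(self):
    # plain DataLoader with no DistributedSampler, so every ddp process
    # iterates over the entire validation set rather than a shard
    return DataLoader(self.val_dataset, batch_size=32, shuffle=False)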

@brucemuller

Hello,

@neggert @williamFalcon did an update happen for this? I'm also trying to aggregate validation_end outputs across ddp GPU processes on a single node. I'm trying to use dist.all_gather or dist.all_gather_multigpu but am really unsure how to do it with PyTorch Lightning. I also think new users would expect guidance on how to aggregate results from multiple GPU processes.

@williamFalcon
Contributor

williamFalcon commented Feb 11, 2020

moving the discussion to #702
