
Aggregate output of validation_end across all ddp processes #243

Closed
xiadingZ opened this issue Sep 23, 2019 · 12 comments
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@xiadingZ

In validation_end I want to gather all outputs from validation_step and then use another file to calculate scores. But when I use multi-GPU ddp mode, I find that validation_end is launched in multiple processes, each calculating scores. How can I make it run only once?

@xiadingZ added the question label Sep 23, 2019
@williamFalcon
Contributor

hi, the way ddp works, each process is walled off from the others. I’m not sure you can transfer arbitrary tensors around, but we can look into it (likely using the dist library).

validation_end will be called in every process to calculate all the scores.

It would be helpful to know what you are trying to do, so we can see how to modify the code.

@xiadingZ
Author

xiadingZ commented Sep 23, 2019

Essentially I want to run the test inside the validation function. I found a workaround: only do this when self.trainer.proc_rank == 0 in validation_end.
But there is another problem:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 815, in ddp_train
    self.__run_pretrain_routine(model)
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 925, in __run_pretrain_routine
    self.__train()
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 953, in __train
    self.run_tng_epoch()
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1004, in run_tng_epoch
    self.__run_evaluation(test=self.testing)
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1292, in __run_evaluation
    self.__add_tqdm_metrics(eval_out_metrics)
  File "/home/dingxia/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 457, in __add_tqdm_metrics
    for k, v in metrics.items():
AttributeError: 'NoneType' object has no attribute 'items'

If I want to run validation_end only once and record its score, instead of running it in every process, how can I do that?
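A minimal sketch of that workaround (compute_scores is a hypothetical stand-in for the external scoring script, and validation_step is assumed to return a dict with a 'val_loss' tensor): the trainer expects validation_end to return a metrics dict in every process, so returning something even on the non-zero ranks avoids the NoneType crash above.

import torch

def validation_end(self, outputs):
    # average the per-batch losses that validation_step returned
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()

    if self.trainer.proc_rank == 0:
        # run the expensive external scoring only on rank 0
        score = compute_scores(outputs)   # hypothetical external scorer
    else:
        score = torch.tensor(0.)          # dummy value on the other ranks

    return {'val_loss': avg_loss, 'score': score}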

@williamFalcon
Contributor

so, let’s break this down a bit.

Case 1 (current):
Each process calls validation_end. If you log anything or monitor early stopping, only the metrics from proc 0 are used. This is equivalent to using only 1/nb_process of the data in that calculation.

Case 2 (i think this is the one we need to support):
Same as above, but use dist.all_reduce for each metric returned from validation_end to get the full value instead of the estimated 1/nb_process value.

example:
Case 1:
you calculate accuracy. The accuracy you log (only from proc 0) is then an estimate of the full accuracy.

Case 2:
you calculate accuracy, and dist.all_reduce gives you the full accuracy rather than an estimate.

If your data are uniformly shuffled and your batches are big enough, cases 1 and 2 are almost identical. If your batches are too small, the estimate will be a little off.
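A rough sketch of Case 2, assuming validation_step returns a dict with a 'val_acc' tensor and the DDP process group is already initialized (with the nccl backend, the tensor would need to live on the process's GPU):

import torch
import torch.distributed as dist

def validation_end(self, outputs):
    # each process averages accuracy over its own shard of the validation data
    acc = torch.stack([x['val_acc'] for x in outputs]).mean()

    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(acc, op=dist.ReduceOp.SUM)   # in-place sum across processes
        acc /= dist.get_world_size()                 # turn the sum into a mean

    return {'val_acc': acc}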

@neggert
Contributor

neggert commented Sep 23, 2019

Agree that case 2 is important to support, and should maybe even be the default. That seems like what a new user would expect.

@williamFalcon
Contributor

williamFalcon commented Sep 23, 2019

Yeah, agreed. I'll turn this into a ticket.

The first approach I can think of is to make this change:

import torch.distributed as dist

out = validation_end(outputs)
for k, v in out.items():
    dist.all_reduce(v)                   # sums v in place across processes (all_reduce returns None)
    out[k] = v / dist.get_world_size()   # average over processes

@williamFalcon changed the title from "How to call validation_end only once when using 'ddp' on single node" to "Aggregate output of validation_end across all ddp processes" Sep 23, 2019
@williamFalcon
Contributor

williamFalcon commented Sep 23, 2019

@neggert does it also make sense to do the same after training_step? I'm a bit concerned about the speed impact of adding these calls... so maybe only validation_end needs it as it's called once per validation cycle?

@williamFalcon added the feature and help wanted labels and removed the question label Sep 23, 2019
@neggert
Contributor

neggert commented Sep 24, 2019

I can't think of a good reason to do it after a training step. Any metrics measured batch-by-batch are going to be noisy anyway, so only taking numbers from one process shouldn't make much difference. It's possible there's some use case I haven't thought of, though.

@neggert
Contributor

neggert commented Sep 24, 2019

I guess I'll also note that averaging across nodes isn't always going to be the right thing to do. Metric learning problems, for instance, actually need a full set of feature vectors to compute recall and NMI.

The cleanest thing IMO would be to collect validation_step outputs from all nodes using something along the lines of all_gather, then pass the whole thing to validation_end on each node. I'm not sure how to actually do this, though.
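Something along those lines could look like the sketch below (a hypothetical helper, assuming every process contributes a feature tensor of the same shape, since all_gather requires equal sizes):

import torch
import torch.distributed as dist

def gather_validation_features(features):
    # one buffer per process, then gather every rank's tensor into them
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(features) for _ in range(world_size)]
    dist.all_gather(gathered, features)
    # every process now holds the feature vectors of the full validation set
    return torch.cat(gathered, dim=0)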

@williamFalcon
Contributor

FYI @adefazio.

@neggert Let's go for this behavior:

  • reduce each metric on validation_step only.

@magic282

magic282 commented Dec 7, 2019

Also having this issue. I'm temporarily disabling the DistributedSampler so that each process has the full validation set.
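For reference, that workaround amounts to returning a plain DataLoader for validation, something like the sketch below (self.val_dataset and the batch size are placeholders; depending on the Lightning version, the trainer may inject a DistributedSampler automatically, which would also need to be turned off):

from torch.utils.data import DataLoader

def val_dataloader(self):
    # plain DataLoader with no DistributedSampler, so every ddp process
    # iterates over the entire validation set rather than a shard
    return DataLoader(self.val_dataset, batch_size=32, shuffle=False)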

@brucemuller

Hello,

@neggert @williamFalcon did an update happen for this? I'm also trying to aggregate validation_end outputs across ddp GPU processes on a single node. I'm trying to use dist.all_gather or dist.all_gather_multigpu but am really unsure how to do it with PyTorch Lightning. I also think new users would expect guidance on how to aggregate results from multiple GPU processes.

@williamFalcon
Contributor

williamFalcon commented Feb 11, 2020

moving the discussion to #702
