
Gather all validation_step outputs on one machine #4175

Closed
cattaneod opened this issue Oct 15, 2020 · 16 comments
Labels: feature (Is an improvement or enhancement) · help wanted (Open to be worked on) · won't fix (This will not be worked on)

Comments

cattaneod commented Oct 15, 2020

🚀 Feature

I think it would be nice to provide a hook that gathers all the validation_step outputs on one machine, regardless of the backend.

Motivation

I'm trying to train a network for place recognition, so I need to gather the embeddings (the network's outputs) of all validation samples in a single process to build a KDTree, and this is a bit tricky when using ddp.

Alternatives

I was able to solve my problem by calling dist.all_gather inside validation_epoch_end:

import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs: list):
    # Concatenate the per-batch embeddings produced by validation_step on this rank.
    embs = torch.cat([out['embedding'] for out in outputs])

    # Gather the embedding tensor from every rank.
    world_size = dist.get_world_size()
    out_emb = [torch.zeros_like(embs) for _ in range(world_size)]
    dist.barrier()
    dist.all_gather(out_emb, embs)

    if dist.get_rank() == 0:
        # The distributed sampler assigns samples round-robin, so interleave the
        # gathered chunks to restore the original dataset order.
        interleaved_out = torch.empty((embs.shape[0] * world_size, embs.shape[1]), device=embs.device, dtype=embs.dtype)
        for current_rank in range(world_size):
            interleaved_out[current_rank::world_size] = out_emb[current_rank]
        # Drop the repeated samples the sampler added to make the split even.
        interleaved_out = interleaved_out[:len(dataset)]
        # Create KDTree and compute recall

The distributed sampler distributes the dataset in the following way (assuming 3 GPUs):
GPU0 will process samples [0, 3, 6, ...]
GPU1 will process samples [1, 4, 7, ...]
GPU2 will process samples [2, 5, 8, ...]

The interleaved_out tensor is thus needed to collect the outputs in the correct order.

However, this way I can only use ddp training, and I lose some nice PyTorch Lightning features.

EDIT: as a side note, when the number of samples in the dataset is not divisible by the number of GPUs, the distributed sampler adds repeated samples to split the dataset evenly; the line interleaved_out = interleaved_out[:len(dataset)] is therefore needed to remove the repeated outputs (see the short sketch below).
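
A minimal standalone sketch of the sampler behaviour described above (assuming a toy 10-sample dataset and 3 ranks; the toy dataset and the rank loop are illustrative, not part of my actual training code):

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 10 samples split across 3 "GPUs" (ranks).
dataset = TensorDataset(torch.arange(10))

for rank in range(3):
    sampler = DistributedSampler(dataset, num_replicas=3, rank=rank, shuffle=False)
    print(rank, list(sampler))
# 0 [0, 3, 6, 9]
# 1 [1, 4, 7, 0]   <- sample 0 repeated as padding
# 2 [2, 5, 8, 1]   <- sample 1 repeated as padding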

cattaneod added the feature and help wanted labels on Oct 15, 2020
github-actions (Contributor) commented:

Hi! Thanks for your contribution! Great first issue!

stale bot commented Nov 14, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Nov 14, 2020
cattaneod (Author) commented:

Any update on this?

stale bot removed the won't fix label on Nov 16, 2020
stale bot commented Dec 16, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Dec 16, 2020
blakedewey commented:

+1 to see this updated.

stale bot removed the won't fix label on Dec 17, 2020
stale bot commented Jan 16, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Jan 16, 2021
ouenal commented Jan 20, 2021

+1

stale bot removed the won't fix label on Jan 20, 2021
Arij-Aladel commented:

I need a solution, please. I have been stuck for three days, and no issue of this kind has been solved!

cattaneod (Author) commented:

> I need a solution, please. I have been stuck for three days, and no issue of this kind has been solved!

Did you try calling dist.all_gather inside validation_epoch_end?
As I said in the issue, I was able to solve it that way; however, it only works when using DistributedDataParallel as the backend.

Arij-Aladel commented Jan 29, 2021

It seems dist.all_gather works only with lists of tensors. In my case, the output from each validation_step is a dict like the following:

output = {"loss": float_num, "batch_length": int_num, "pred": text_pred, "answer": text_ans, "doc": text_doc}

so the input to validation_epoch_end is a list of dictionaries. I need to gather all outputs, whether they are tensors or not, and it seems I cannot do that with dist.all_gather. Any other suggestions? Yes, I am using ddp.

cattaneod (Author) commented:

The dict should be fine; you probably have to call dist.all_gather on every element of the dict.
However, I think you need to convert each element of the dict to a tensor in order to use all_gather.
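
A rough sketch of that per-key approach (assuming each rank produces the same number of batches and the values are numeric; the gather_metric helper and the use of self.device are illustrative, not an existing API):

import torch
import torch.distributed as dist

def gather_metric(values, device):
    # Turn this rank's values into a tensor and gather one tensor per rank.
    local = torch.as_tensor(values, device=device)
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    return torch.cat(gathered)

def validation_epoch_end(self, outputs: list):
    losses = gather_metric([o["loss"] for o in outputs], self.device)
    lengths = gather_metric([o["batch_length"] for o in outputs], self.device)
    # Text fields (pred / answer / doc) cannot be all_gather'ed as tensors;
    # they would need to be encoded first, or gathered with all_gather_object (see below).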

Arij-Aladel commented:

That is a long process; the dataset is huge. I hope to find an easier solution.

cattaneod (Author) commented:

@Arij-Aladel PyTorch 1.7 now supports all_gather for Python objects: pytorch/pytorch#42189

You should be able to solve your problem by updating to PyTorch 1.7.
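
For reference, a minimal sketch of what this looks like with torch.distributed.all_gather_object (available since PyTorch 1.7); the dict keys follow the example above, and the flattening step is illustrative:

import torch.distributed as dist

def validation_epoch_end(self, outputs: list):
    # Gather this rank's list of output dicts (tensors, strings, ints, ...) from every rank.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, outputs)

    if dist.get_rank() == 0:
        # gathered[r] is the full outputs list from rank r; flatten into one list.
        all_outputs = [out for rank_outputs in gathered for out in rank_outputs]
        # ... compute metrics over all_outputs ...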

stale bot commented Mar 15, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Mar 15, 2021
stale bot closed this as completed on Mar 22, 2021
danielyan86129 commented Sep 22, 2022

@cattaneod By the way, this gathers outputs from all machines, not just on one machine, correct?
Update: yes, it does.

SagiPolaczek commented Oct 6, 2022

+1
