
Gather all validation_step outputs on one machine #4175

Closed
cattaneod opened this issue Oct 15, 2020 · 16 comments
Labels: feature (Is an improvement or enhancement) · help wanted (Open to be worked on) · won't fix (This will not be worked on)

Comments

cattaneod commented Oct 15, 2020

🚀 Feature

I think it would be nice to provide a hook that gathers all the validation_step outputs on one machine, regardless of the backend.

Motivation

I'm trying to train a network for place recognition, so I need to gather the embeddings (the network's outputs) of all validation samples in a single process to build a KDTree, and this is a bit tricky when using ddp.

Alternatives

I was able to solve my problem by calling dist.all_gather inside validation_epoch_end:

import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs: list):
    # Concatenate the per-batch embeddings produced by validation_step on this rank.
    embs = torch.cat([out['embedding'] for out in outputs])

    # Gather the embedding tensor from every rank.
    world_size = dist.get_world_size()
    out_emb = [torch.zeros_like(embs) for _ in range(world_size)]
    dist.barrier()
    dist.all_gather(out_emb, embs)

    if dist.get_rank() == 0:
        # The distributed sampler assigns samples round-robin, so interleave the
        # gathered chunks to restore the original dataset order.
        interleaved_out = torch.empty((embs.shape[0] * world_size, embs.shape[1]), device=embs.device, dtype=embs.dtype)
        for current_rank in range(world_size):
            interleaved_out[current_rank::world_size] = out_emb[current_rank]
        # Drop the repeated samples the sampler added to make the split even.
        interleaved_out = interleaved_out[:len(dataset)]
        # Create KDTree and compute recall

The distributed sampler distributes the dataset in the following way (assuming 3 GPUs):
GPU0 will process samples [0, 3, 6, ...]
GPU1 will process samples [1, 4, 7, ...]
GPU2 will process samples [2, 5, 8, ...]

The interleaved_out tensor is thus needed to collect the outputs in the correct order.

However, this way I can only use ddp training, and I lose some nice PyTorch Lightning features.

EDIT: as a side note, when the number of samples in the dataset is not divisible by the number of GPUs, the distributed sampler adds repeated samples to split the dataset evenly; the line interleaved_out = interleaved_out[:len(dataset)] is therefore needed to remove the repeated outputs (see the short sketch below).
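
A minimal standalone sketch of the sampler behaviour described above (assuming a toy 10-sample dataset and 3 ranks; the toy dataset and the rank loop are illustrative, not part of my actual training code):

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 10 samples split across 3 "GPUs" (ranks).
dataset = TensorDataset(torch.arange(10))

for rank in range(3):
    sampler = DistributedSampler(dataset, num_replicas=3, rank=rank, shuffle=False)
    print(rank, list(sampler))
# 0 [0, 3, 6, 9]
# 1 [1, 4, 7, 0]   <- sample 0 repeated as padding
# 2 [2, 5, 8, 1]   <- sample 1 repeated as padding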

cattaneod added the feature and help wanted labels on Oct 15, 2020
github-actions (Contributor) commented:

Hi! Thanks for your contribution! Great first issue!

stale bot commented Nov 14, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Nov 14, 2020
cattaneod (Author) commented:

Any update on this?

stale bot removed the won't fix label on Nov 16, 2020
stale bot commented Dec 16, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Dec 16, 2020
blakedewey commented:

+1 to see this updated.

stale bot removed the won't fix label on Dec 17, 2020
stale bot commented Jan 16, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Jan 16, 2021
ouenal commented Jan 20, 2021

+1

stale bot removed the won't fix label on Jan 20, 2021
Arij-Aladel commented:

I need a solution, please. I have been stuck for three days, and no issue of this kind has been solved!

cattaneod (Author) commented:

> I need a solution, please. I have been stuck for three days, and no issue of this kind has been solved!

Did you try calling dist.all_gather inside validation_epoch_end?
As I said in the issue, I was able to solve it that way; however, it only works when using DistributedDataParallel as the backend.

Arij-Aladel commented Jan 29, 2021

It seems dist.all_gather works only with lists of tensors. In my case, the output from each validation_step is a dict like the following:

output = {"loss": float_num, "batch_length": int_num, "pred": text_pred, "answer": text_ans, "doc": text_doc}

so the input to validation_epoch_end is a list of dictionaries. I need to gather all outputs, whether they are tensors or not, and it seems I cannot do that with dist.all_gather. Any other suggestions? Yes, I am using ddp.

cattaneod (Author) commented:

The dict should be fine; you probably have to call dist.all_gather on every element of the dict.
However, I think you need to convert each element of the dict to a tensor in order to use all_gather.
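
A rough sketch of that per-key approach (assuming each rank produces the same number of batches and the values are numeric; the gather_metric helper and the use of self.device are illustrative, not an existing API):

import torch
import torch.distributed as dist

def gather_metric(values, device):
    # Turn this rank's values into a tensor and gather one tensor per rank.
    local = torch.as_tensor(values, device=device)
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    return torch.cat(gathered)

def validation_epoch_end(self, outputs: list):
    losses = gather_metric([o["loss"] for o in outputs], self.device)
    lengths = gather_metric([o["batch_length"] for o in outputs], self.device)
    # Text fields (pred / answer / doc) cannot be all_gather'ed as tensors;
    # they would need to be encoded first, or gathered with all_gather_object (see below).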

Arij-Aladel commented:

That is a long process; the dataset is huge. I hope to find an easier solution.

cattaneod (Author) commented:

@Arij-Aladel PyTorch 1.7 now supports all_gather for Python objects: pytorch/pytorch#42189

You should be able to solve your problem by updating to PyTorch 1.7.
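
For reference, a minimal sketch of what this looks like with torch.distributed.all_gather_object (available since PyTorch 1.7); the dict keys follow the example above, and the flattening step is illustrative:

import torch.distributed as dist

def validation_epoch_end(self, outputs: list):
    # Gather this rank's list of output dicts (tensors, strings, ints, ...) from every rank.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, outputs)

    if dist.get_rank() == 0:
        # gathered[r] is the full outputs list from rank r; flatten into one list.
        all_outputs = [out for rank_outputs in gathered for out in rank_outputs]
        # ... compute metrics over all_outputs ...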

stale bot commented Mar 15, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Mar 15, 2021
stale bot closed this as completed on Mar 22, 2021
danielyan86129 commented Sep 22, 2022

@cattaneod By the way, this gathers outputs from all machines, not just on one machine, correct?
Update: yes, it does.

SagiPolaczek commented Oct 6, 2022

+1
