Gather all validation_step outputs on one machine #4175
Comments
Hi! Thanks for your contribution, great first issue!
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
Any update on this?
+1 to see this updated.
+1
I need a solution, please. I have been stuck for three days, and no issue of this kind has been solved!
Did you try to call dist.all_gather inside validation_epoch_end?
It seems dist.all_gather works only with lists of tensors. In my case, the output from each validation_step is a dict like the following: `output = {"loss": float_num, "batch_length": int_num, "pred": text_pred, "answer": text_ans, "doc": text_doc}`
The dict should be fine; you probably have to call dist.all_gather on every element of the dict.
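For reference, a minimal sketch of that suggestion, assuming every rank produces a dict with the same keys and same-shaped tensor values; the helper name `gather_dict_tensors` is hypothetical, and non-tensor fields (such as strings) are left untouched because dist.all_gather only accepts tensors (see the all_gather_object sketch below for those):

```python
import torch
import torch.distributed as dist

def gather_dict_tensors(output: dict) -> dict:
    """Gather each tensor-valued entry of a per-step output dict
    from every process in the default process group."""
    world_size = dist.get_world_size()
    gathered = {}
    for key, value in output.items():
        if torch.is_tensor(value):
            buffers = [torch.zeros_like(value) for _ in range(world_size)]
            dist.all_gather(buffers, value)  # buffers[i] holds rank i's value
            gathered[key] = torch.stack(buffers)
        else:
            # dist.all_gather only accepts tensors; strings and other
            # Python objects need all_gather_object (PyTorch >= 1.7).
            gathered[key] = value
    return gathered
```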
That is a long process since the dataset is huge; I hope to find an easier solution.
@Arij-Aladel PyTorch 1.7 now supports all_gather for Python objects: pytorch/pytorch#42189. You should be able to solve your problem by updating to PyTorch 1.7.
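A minimal sketch of that approach, assuming an initialized process group and PyTorch >= 1.7; the helper name `gather_outputs` is hypothetical, and the gathered objects must be picklable:

```python
import torch.distributed as dist

def gather_outputs(output: dict) -> list:
    """Gather one picklable object (e.g. the per-step dict,
    string fields included) from every rank."""
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, output)
    return gathered  # gathered[i] is the dict produced by rank i
```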
@cattaneod by the way, this works across all machines, not just on one machine, correct?
+1
🚀 Feature
I think it would be nice to provide one hook that gathers all the validation_step outputs on one machine, regardless of the backend.
Motivation
I'm trying to train a network for place recognition, so I need to gather the embeddings (the network's outputs) of all validation samples in a single process to create a KDTree, and this is a bit tricky when using ddp.
Alternatives
I was able to solve my problem by calling dist.all_gather inside validation_epoch_end (a reconstruction is sketched at the end of this section). The distributed sampler distributes the dataset in the following way (supposing we have 3 GPUs):
GPU0 will process samples [0, 3, 6, ...]
GPU1 will process samples [1, 4, 7, ...]
GPU2 will process samples [2, 5, 8, ...]
The interleaved_out is thus needed to collect the outputs in the correct order. However, in this way I can only use ddp training, and I lose some nice PyTorch Lightning features.
EDIT: as a side note, when the number of samples in the dataset is not divisible by the number of GPUs, the distributed sampler adds repeated samples to split it evenly, so the line interleaved_out = interleaved_out[:len(dataset)] is needed to remove the repeated outputs.
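The code snippet referenced above did not survive into this copy of the issue, so what follows is only a hedged reconstruction of the described approach: gather the per-rank outputs, interleave them back into dataset order, and truncate the sampler's padding. The `"embedding"` key, the `val_dataset` attribute, and the `build_kdtree` step are assumptions filled in from the surrounding description.

```python
import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # Concatenate this rank's embeddings: (num_local_samples, dim).
    local_out = torch.cat([o["embedding"] for o in outputs], dim=0)

    # Gather every rank's embeddings into a list of equal-shaped tensors.
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_out) for _ in range(world_size)]
    dist.all_gather(gathered, local_out)

    # DistributedSampler assigns sample i to rank i % world_size, so
    # interleaving the per-rank chunks restores the dataset order:
    # rank 0 -> [0, 3, 6, ...], rank 1 -> [1, 4, 7, ...], ...
    interleaved_out = torch.stack(gathered, dim=1).reshape(-1, local_out.shape[-1])

    # The sampler pads the dataset to be divisible by world_size;
    # drop the repeated trailing samples.
    interleaved_out = interleaved_out[: len(self.val_dataset)]

    if dist.get_rank() == 0:
        self.build_kdtree(interleaved_out)  # hypothetical downstream step
```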