**gather_all_tensors_if_available** shares the same underlying storage for all GPUs #3253
Comments
@ShomyLiu good catch, would you be up for sending a PR? Please note that the function is not used anywhere yet, but it is there for future changes to the metrics package.
@SkafteNicki it would be my pleasure to send a PR; I will finish it as soon as possible. Yes, it's a new function to wrap the …
@ShomyLiu Yes, I agree that it is a common use case.
@SkafteNicki Hi, I have just sent a PR for your review: #3319
* Fix: gather_all_tensors across GPUs in metrics
* Add a test case for gather_all_tensors_ddp in #3253
🐛 Bug
Hi, one of the new features in #2528, `gather_all_tensors_if_available`, has a list copy bug that causes the tensors gathered from all GPUs to wrongly be identical to the tensor from a single GPU, since they all share the same underlying storage: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/metrics/converters.py#L304
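For illustration, here is a minimal standalone demonstration of the Python list-aliasing pitfall described above (the variable names are made up for the example):

```python
import torch

# `[t] * 3` repeats the *reference* three times, so every entry in the
# list is the same tensor object backed by one shared storage.
bufs = [torch.zeros(2)] * 3
bufs[0].fill_(7.0)
print(bufs[1])                                   # tensor([7., 7.]) -- mutated too
print(bufs[0].data_ptr() == bufs[1].data_ptr())  # True: shared storage
```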
The line should be changed into:
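A minimal sketch of the corrected function, assuming the buggy line built the gather buffer as `[torch.zeros_like(result)] * world_size` (consistent with the shared-storage symptom described above); the exact signature in `converters.py` may differ:

```python
import torch
import torch.distributed as dist

def gather_all_tensors_if_available(result: torch.Tensor, group=None):
    """Gather `result` from all processes; sketch of the fixed version."""
    if dist.is_available() and dist.is_initialized():
        if group is None:
            group = dist.group.WORLD
        world_size = dist.get_world_size(group)
        # Fixed: a list comprehension allocates a fresh tensor per rank.
        # The buggy version, `[torch.zeros_like(result)] * world_size`,
        # repeats one tensor object, so all_gather writes every rank's
        # result into the same storage and all entries come out identical.
        gathered_result = [torch.zeros_like(result) for _ in range(world_size)]
        dist.all_gather(gathered_result, result, group=group)
        result = gathered_result
    return result
```

The list comprehension is the standard way to get independent buffers here: `*` on a list only copies references, which is harmless for immutable items but not for tensors that `all_gather` writes into in place.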