
transfer_batch_to_device doesn't work under DP #2350

Closed
ZhaofengWu opened this issue Jun 24, 2020 · 19 comments · Fixed by #7343
Labels
3rd party (Related to a 3rd-party), docs (Documentation related), priority: 2 (Low priority task), strategy: dp (removed in pl) (DataParallel)

Comments

@ZhaofengWu
Contributor

🐛 Bug

This is discussed under #1756 and I'm opening a separate issue here for visibility.

In the training loop, for DP/DDP/DDP2, we do not move the data to devices ourselves but instead rely on the default scatter to transfer it. As a result, transfer_batch_to_device is never called.

https://github.com/PyTorchLightning/pytorch-lightning/blob/16a7326e5259a3cdd20a508c34a0f84806d88f8e/pytorch_lightning/trainer/training_loop.py#L736-L737
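
For illustration, here is a minimal sketch of the kind of override that gets silently skipped under DP. The class, tensor keys, and dict-style batch are assumptions made for this example, and the hook signature shown is the two-argument form from the 1.x docs:

```python
import pytorch_lightning as pl


class MyModel(pl.LightningModule):  # illustrative model, not from the report
    def transfer_batch_to_device(self, batch, device):
        # e.g. a custom batch structure that a plain `.to(device)` cannot handle
        batch["tokens"] = batch["tokens"].to(device)
        batch["mask"] = batch["mask"].to(device)
        return batch
```

Under single-GPU training this hook runs for every batch; with the DP/DDP/DDP2 accelerators the default torch scatter moves the data instead, so the override above is never invoked.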

Expected behavior

Ideally, we want transfer_batch_to_device to work in all settings. If it is not possible to override this behavior at all, at least a run-time warning and/or a note in the docs should be given.
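
As a rough illustration of the requested run-time warning, a hypothetical helper could look like the sketch below. This is not code from the repository; the accelerator-string check and the function itself are assumptions:

```python
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities import rank_zero_warn


def warn_if_hook_skipped_under_dp(model: LightningModule, accelerator: str) -> None:
    """Hypothetical check: warn when the hook is overridden but DP would bypass it."""
    overridden = (
        type(model).transfer_batch_to_device is not LightningModule.transfer_batch_to_device
    )
    if accelerator == "dp" and overridden:
        rank_zero_warn(
            "`transfer_batch_to_device` is overridden but will not be called under DP, "
            "because torch.nn.DataParallel scatters the batch itself."
        )
```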

@ZhaofengWu ZhaofengWu added bug Something isn't working help wanted Open to be worked on labels Jun 24, 2020
@williamFalcon
Contributor

Ummm... yeah, good point. I'm not sure we can add a hook here. Maybe @awaelchli can look into this.

@williamFalcon williamFalcon added feature Is an improvement or enhancement and removed bug Something isn't working labels Jun 25, 2020
@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Aug 24, 2020
@awaelchli awaelchli removed the won't fix This will not be worked on label Aug 24, 2020
@williamFalcon
Contributor

@edenafek @awaelchli did we add a hook for this now?

@williamFalcon williamFalcon self-assigned this Sep 23, 2020
@awaelchli
Member

No, it is not there yet. This would require a custom scatter/gather in LightningDataParallel/LightningDistributedDataParallel that the user defines. I am not sure what the recommended way to do that in Lightning is. Should the user subclass these classes and init them in the configure_ddp hook?

@stale

stale bot commented Oct 23, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Oct 23, 2020
@rohitgr7 rohitgr7 removed the won't fix This will not be worked on label Oct 23, 2020
@ZhaofengWu
Contributor Author

Not sure if the earlier label removal counts as new "activity" for the stale bot, so commenting here to indicate that this is not stale and still needs to be addressed.

@edenlightning
Contributor

@awaelchli is this still relevant?

@edenlightning edenlightning added bug Something isn't working and removed feature Is an improvement or enhancement labels Feb 16, 2021
@rohitgr7
Contributor

@edenlightning yes. Still not supported for DP AFAIK.

@edenlightning
Contributor

@awaelchli is it possible to support this?

@edenlightning edenlightning added this to the 1.2.x milestone Feb 16, 2021
@awaelchli
Member

awaelchli commented Feb 18, 2021

It is unlikely this can be supported in the near future. The blocker here is torch.nn.DataParallel, which by design cannot scatter and gather arbitrary objects. At this time, I don't see a non-hacky solution.

Users can still provide their own DataParallel wrapper. For example, torch-geometric users have to import the custom DP module from https://github.com/rusty1s/pytorch_geometric/blob/master/torch_geometric/nn/data_parallel.py and use that in Lightning (they would also have to do this without Lightning).
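
A rough sketch of such a wrapper is below. It assumes the batch is a list of samples that can be split by index, that kwargs are device-agnostic, and that the hook has the two-argument signature transfer_batch_to_device(batch, device); none of this is Lightning's or torch-geometric's actual implementation:

```python
import torch
from torch.nn import DataParallel


class HookAwareDataParallel(DataParallel):
    """Sketch of a DataParallel subclass whose scatter() routes each chunk
    through the wrapped module's transfer_batch_to_device hook."""

    def scatter(self, inputs, kwargs, device_ids):
        # Assumes forward() was called as wrapped(batch) with a list of samples.
        batch = inputs[0]
        n = len(device_ids)
        chunks = [batch[i::n] for i in range(n)]

        scattered_inputs, scattered_kwargs = [], []
        for chunk, device_id in zip(chunks, device_ids):
            device = torch.device("cuda", device_id)
            # self.module is the wrapped (Lightning) module defining the hook.
            scattered_inputs.append((self.module.transfer_batch_to_device(chunk, device),))
            scattered_kwargs.append(kwargs)
        return scattered_inputs, scattered_kwargs
```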

@rubencart

I am not sure if this belongs here, but it seems related. I can also open another issue if you prefer.

For me, when I use DDP, my (custom) LightningDataModule.transfer_batch_to_device fails because its batch argument is now a tuple that contains not only the batch but also the batch index and the optimizer index. When I don't use an accelerator, the batch argument contains only the batch, as I would expect.

With normal execution on 1 GPU and without an accelerator, the _step method in gpu_accelerator.py (line 59) calls args[0] = self.to_device(args[0]), where args contains the batch, its index, and the optimizer index. transfer_batch_to_device is then correctly called with only the batch (and the device) as arguments.

However, with DDP, the _step method in ddp_accelerator.py (line 173) calls self.ddp_plugin.on_before_forward(self.trainer.get_model(), *args), where args is again the batch, idx and optim idx. The ddp_plugin passes all arguments to transfer_batch_to_device without doing anything with them. It seems that this could thus maybe be fixed by changing line 173 into self.ddp_plugin.on_before_forward(self.trainer.get_model(), args[0])?

Or is this the thing you are talking about that cannot be fixed in the near future? If so, what would you recommend we do to solve this? Add a check to transfer_batch_to_device for which accelerator is used and handle the batch accordingly?
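
For reference, a hypothetical stopgap along those lines could look like the sketch below. It assumes the extra elements really are batch_idx and optimizer_idx as described, and it uses Lightning's generic move_data_to_device utility as the fallback; the refactor mentioned below makes this unnecessary:

```python
import pytorch_lightning as pl
from pytorch_lightning.utilities import move_data_to_device


class MyDataModule(pl.LightningDataModule):  # illustrative datamodule
    def transfer_batch_to_device(self, batch, device):
        # Under the DDP path described above, `batch` arrives as
        # (batch, batch_idx, optimizer_idx); unwrap it defensively.
        if isinstance(batch, tuple) and len(batch) == 3 and isinstance(batch[1], int):
            batch = batch[0]
        return move_data_to_device(batch, device)
```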

@awaelchli
Member

@rubencart This seems to be unrelated and should not be a problem for DDP. This part of the code has undergone a lot of changes lately. If you don't mind, would you send us a repro example in a new issue? If you ping me there and it is fixable, I will fix it.

@rohitgr7
Contributor

However, with DDP, the _step method in ddp_accelerator.py (line 173) calls self.ddp_plugin.on_before_forward(self.trainer.get_model(), *args), where args is again the batch, idx and optim idx. The ddp_plugin passes all arguments to transfer_batch_to_device without doing anything with them. It seems that this could thus maybe be fixed by changing line 173 into self.ddp_plugin.on_before_forward(self.trainer.get_model(), args[0])?

This has been updated in the recent refactors, so it won't be a problem now. You can try master; the new version will be officially released next week, I guess.

@rubencart

@rohitgr7 Okay thanks!

@rohitgr7 rohitgr7 changed the title from "transfer_batch_to_device doesn't work under DP/DDP/DDP2" to "transfer_batch_to_device doesn't work under DP" Feb 19, 2021
@edenlightning edenlightning added 3rd party Related to a 3rd-party won't fix This will not be worked on and removed help wanted Open to be worked on labels Feb 23, 2021
@stale stale bot removed the won't fix This will not be worked on label Feb 23, 2021
@rohitgr7 rohitgr7 reopened this Feb 23, 2021
@rohitgr7 rohitgr7 removed the distributed Generic distributed-related topic label Feb 23, 2021
@edenlightning
Contributor

@rohitgr7 is this issue still present on master?

@rohitgr7
Contributor

rohitgr7 commented Mar 2, 2021

@edenlightning Yes, it is, but only with DP, since PyTorch's DataParallel handles the splitting and transferring of batches to the devices internally.

@awaelchli
Member

@edenlightning Even though this issue was originally reported as a bug, it is really not a bug. We document that DP is not supported for this hook.

As already mentioned in #2350 (comment), users can provide their own DP wrapper, as they would have to anyway in regular PyTorch code. We could add documentation for how to do that in an FAQ section somewhere? What do you think?
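
Such FAQ documentation could show plain-PyTorch usage along these lines. A toy module and two GPUs are assumed, and it reuses the HookAwareDataParallel sketch from the earlier comment; this is not existing Lightning documentation:

```python
import torch
from torch import nn


class ToyModule(nn.Module):  # toy stand-in for a LightningModule
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 1)

    def transfer_batch_to_device(self, batch, device):
        # batch is a list of dict samples in this toy setup
        return [{k: v.to(device) for k, v in sample.items()} for sample in batch]

    def forward(self, batch):
        x = torch.stack([sample["features"] for sample in batch])
        return self.linear(x)


model = ToyModule().cuda()
wrapped = HookAwareDataParallel(model, device_ids=[0, 1])
batch = [{"features": torch.randn(8)} for _ in range(16)]
out = wrapped(batch)  # scatter() routes each chunk through transfer_batch_to_device
```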

@Borda Borda modified the milestones: 1.2.x, 1.3 Apr 18, 2021
@edenlightning edenlightning added priority: 2 Low priority task docs Documentation related and removed bug Something isn't working labels Apr 27, 2021
@edenlightning
Contributor

@awaelchli maybe in the multi_gpu guide under DP?

@edenlightning
Contributor

@SeanNaren maybe add to your PR?
