
benchmark subprocess vs spawn #5772

Closed · edenlightning opened this issue Feb 3, 2021 · 5 comments
Labels: distributed (Generic distributed-related topic)

@edenlightning (Contributor) commented Feb 3, 2021

A while back we replaced ddp spawn with the subprocess launcher due to issues when combining spawn with multiple worker processes in the dataloader: #2029

Are there still performance issues when using spawn? If they have been fixed, we can update the messaging in our docs (https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel-spawn)
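For reference, the two launch modes under discussion are picked via the Trainer's `accelerator` argument in the Lightning API of that era (it was later renamed). A minimal sketch of the two configurations, assuming a node with at least two GPUs:

```python
import pytorch_lightning as pl

# Subprocess-based DDP: launches one child process per GPU, each of which
# re-runs the training script (the mode the docs currently recommend).
trainer_ddp = pl.Trainer(gpus=2, accelerator="ddp")

# Spawn-based DDP: creates worker processes via torch.multiprocessing.spawn
# inside the current interpreter; this is the mode that had problems when
# combined with multiple DataLoader workers (#2029).
trainer_spawn = pl.Trainer(gpus=2, accelerator="ddp_spawn")
```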

@Borda added this to the 1.2.x milestone (Feb 4, 2021)
@BlockWaving

Using ddp_spawn (with multiple GPUs on one node), I observed that: a) the batch size cannot be set as large as with ddp; b) under ddp, GPU memory utilization can go as high as 90% (21 of 24 GB per GPU), but with ddp_spawn I always get a CUDA out-of-memory error once GPU memory exceeds 12 of 24 GB per GPU; c) with ddp_spawn, my training tends to crash after 7-8 epochs (num_workers=3); d) overall mid-training GPU utilization is lower, around 65% with ddp_spawn vs. 85% with ddp.

@justusschock (Member)

@BlockWaving Thanks for this information. Do you have a script to reproduce these benchmarks?

@BlockWaving

@justusschock I can't disclose the detailed script due to company policy, but you can see the Trainer setup snippets in the new thread I opened today. I have been forced to use ddp_spawn instead of ddp because of the replica errors in ddp mode.

This afternoon the training with ddp_spawn got stuck again at epoch 2, at 84%.

Please feel free to let me know if you have further questions.
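Since the actual training script can't be shared, here is a minimal, self-contained skeleton of the kind of comparison being discussed: a toy model and random data, not the reporter's setup, and the `accelerator` argument again assumes the 1.2-era Trainer API. Running it once with each mode on the same node would let the memory, utilization, and stability numbers above be compared.

```python
import sys

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Toy dataset standing in for the real data."""

    def __init__(self, size=10_000, dim=32):
        self.data = torch.randn(size, dim)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    """Tiny model whose only purpose is to exercise the DDP launchers."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # Pass "ddp" or "ddp_spawn" on the command line and compare GPU memory,
    # utilization, and stability between the two runs.
    mode = sys.argv[1] if len(sys.argv) > 1 else "ddp"
    trainer = pl.Trainer(gpus=2, accelerator=mode, max_epochs=10)
    trainer.fit(
        BoringModel(),
        DataLoader(RandomDataset(), batch_size=256, num_workers=3),
    )
```

Usage would be along the lines of `python benchmark_ddp.py ddp` followed by `python benchmark_ddp.py ddp_spawn` (the filename is hypothetical), watching `nvidia-smi` during each run.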

@BlockWaving

@justusschock check out #5894

@Borda modified the milestones: 1.2.x, 1.3 (Apr 18, 2021)
@kaushikb11 (Contributor)

Hi @justusschock, do we have any updates on this?

@edenlightning modified the milestones: v1.3, v1.4 (Apr 27, 2021)
@edenlightning added the distributed (Generic distributed-related topic) label (May 9, 2021)