Multi-GPU on AWS p2.8xlarge instance (ddp2 and ddp) #676
On inspection of the source code, it looks like ddp2 has to go through SLURM, whereas ddp can use multiprocessing. Let's try ddp - I will edit this accordingly.
Okay - so let's forget about ddp2 on AWS, as I don't have the expertise to set up a SLURM cluster. I switched my torchtext BucketIterator to a vanilla torch.utils.data.DataLoader.

The good news: DataParallel works fine without having to do anything (just set gpus=8 and distributed_backend='dp' in the trainer).

Bad news (a): DP seems to be quite slow (almost as slow as CPU!) and the 8 GPUs don't appear to be utilised much (https://app.wandb.ai/laksh/Siamese_SNLI/runs/qbsfjywe/system?workspace=). This might be because I set num_workers=0 in the DataLoader, so I will change this to 4 and see what happens tomorrow.

Bad news (b): DDP still doesn't work :( I get the following error:
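A minimal sketch of the DataLoader change described above, using a toy TensorDataset as a stand-in for the real SNLI data (the dataset, feature size, and batch size here are placeholders, not from the original post):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real dataset: 100 samples of 16 features, 3 classes
dataset = TensorDataset(torch.randn(100, 16), torch.randint(0, 3, (100,)))

# num_workers > 0 moves batch loading into worker processes, which is the
# suggested fix for the GPU under-utilisation mentioned above
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

batch_x, batch_y = next(iter(loader))
```

With num_workers=0 all batch preparation happens on the main process, which can starve the GPUs; raising it lets loading overlap with compute.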
Any help would be greatly appreciated, thanks so much PL community.
any ddp uses slurm unfortunately (at least at the moment).
sounds like you have a bottleneck. can you add some code and maybe your node config?
Okay, will do soon. Thanks. PL is really good!
Single-node DDP should work fine without slurm, right? We just spawn multiple processes ourselves. Never seen that error before. Maybe something weird in the image you're using?
I am using the ufoym/deepo image (the one which has all the deep learning libraries). It's quite popular, with 5k stars on GitHub.
We shall test it via #486
you may set for single node
Hi, I will try ddp2 while setting this env variable, cheers!
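The comment above doesn't name the environment variable. Assuming it refers to the standard torch.distributed rendezvous settings (MASTER_ADDR / MASTER_PORT - an assumption here, not stated in the thread), a single-node setup would look like:

```python
import os

# Assumption: the env vars meant above are MASTER_ADDR / MASTER_PORT,
# which torch.distributed uses for process rendezvous. On a single node
# all processes are local, so the master address can be loopback.
os.environ["MASTER_ADDR"] = "127.0.0.1"  # all 8 GPUs are on one machine
os.environ["MASTER_PORT"] = "29500"      # any free port works
```

These would need to be set before the trainer spawns its worker processes.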
As information, AWS p2.8xlarge has 8 K80s, all on the same node.

I have tried my model with gpus=1 and distributed_backend=None on an AWS p2.xlarge instance (1 K80) and it works. When I try gpus=8 and distributed_backend='ddp2' on an AWS p2.8xlarge, I get the following error: