
Multi-GPU on AWS p2.8xlarge instance (ddp2 and ddp) #676

Closed
aced125 opened this issue Jan 9, 2020 · 11 comments
Labels
bug Something isn't working

Comments

aced125 commented Jan 9, 2020

For context: an AWS p2.8xlarge has 8 K80s, all on the same node.

I have tried my model with gpus=1 and distributed_backend=None on an AWS p2.xlarge instance (1 K80), and it works.

When I try gpus=8 and distributed_backend='ddp2' on an AWS p2.8xlarge, I get the following error:

  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 335, in fit
task = int(os.environ['SLURM_LOCALID'])
File "/usr/lib/python3.6/os.py", line 669, in __getitem__
raise KeyError(key) from None
KeyError: 'SLURM_LOCALID'
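
For reference, a minimal sketch of the two Trainer configurations described above; the model class is a hypothetical placeholder, not taken from the original post:

    import pytorch_lightning as pl

    # Hypothetical LightningModule standing in for the poster's actual model.
    model = MySiameseModel()

    # Works on a single-GPU p2.xlarge:
    trainer = pl.Trainer(gpus=1, distributed_backend=None)

    # Raises KeyError: 'SLURM_LOCALID' on a p2.8xlarge with this version:
    trainer = pl.Trainer(gpus=8, distributed_backend='ddp2')

    trainer.fit(model)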
aced125 added the bug label Jan 9, 2020
aced125 (Author) commented Jan 9, 2020

On inspection of the source code, it looks like ddp2 has to go through SLURM, whereas ddp can use multiprocessing. I'll try ddp and edit this issue accordingly.
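
Roughly, the change being tried here; `model` is a hypothetical LightningModule instance, not from the original post:

    import pytorch_lightning as pl

    # ddp appeared (from reading the source) to spawn its worker processes via
    # multiprocessing rather than reading SLURM variables, so try it instead of ddp2.
    trainer = pl.Trainer(gpus=8, distributed_backend='ddp')
    trainer.fit(model)  # `model` is a placeholder LightningModule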

aced125 changed the title from "Multi-GPU on AWS p2.8xlarge instance" to "Multi-GPU on AWS p2.8xlarge instance (ddp2 and ddp)" Jan 9, 2020
aced125 (Author) commented Jan 10, 2020

Okay, let's forget about ddp2 on AWS for now, since I don't have the expertise to set up a SLURM cluster.

I switched my torchtext BucketIterator to a vanilla torch.utils.data.DataLoader.

The good news: DataParallel works fine without any extra setup (just set gpus=8 and distributed_backend='dp' in the Trainer).

Bad news (a): DP seems to be quite slow (almost as slow as CPU!), and the 8 GPUs don't appear to be heavily utilised (see https://app.wandb.ai/laksh/Siamese_SNLI/runs/qbsfjywe/system?workspace=).

This might be because I set num_workers=0 in the DataLoader, so I will change it to 4 (rough sketch below) and see what happens tomorrow.
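
A rough sketch of the DataLoader change described above; the dataset variable and batch size are hypothetical placeholders:

    from torch.utils.data import DataLoader

    # Vanilla DataLoader replacing the torchtext BucketIterator.
    train_loader = DataLoader(
        train_dataset,    # placeholder Dataset built from the SNLI data
        batch_size=64,    # placeholder value, not from the original post
        shuffle=True,
        num_workers=4,    # was 0, which can leave 8 GPUs waiting on data under dp
        pin_memory=True,  # speeds up host-to-GPU copies
    )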

Bad news (b): DDP still doesn't work :( I get the following error:


    02:22:54  wandb: Run `wandb off` to turn off syncing.
    02:24:25  snli_1.0.zip: 0% 0.00/94.6M [00:00<?, ?B/s] …
    02:24:25  downloading snli_1.0.zip
    02:24:25  extracting
    02:24:25  Starting Model
    02:24:26  /usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 2 leaked semaphores to clean up at shutdown
    02:24:26    len(cache))
    02:24:36  Bus error (core dumped)

Any help would be greatly appreciated. Thanks so much, PL community.

williamFalcon (Contributor) commented

Any ddp variant goes through SLURM, unfortunately (at least at the moment).

williamFalcon (Contributor) commented

Sounds like you have a bottleneck. Can you add some code and maybe your node config?

Laksh1997 commented

Okay, will do soon. Thanks! PL is really good!

neggert (Contributor) commented Jan 21, 2020

Single-node DDP should work fine without SLURM, right? We just spawn multiple processes ourselves.

Never seen that error before. Maybe something weird in the image you're using?
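
As a generic illustration of that per-process spawning on a single node (a plain torch.multiprocessing sketch, not Lightning's actual internals):

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # Each spawned process joins the process group and pins one GPU.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        # ... the per-rank training loop would go here ...
        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 8  # one process per K80 on a p2.8xlarge
        mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)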

Laksh1997 commented

I am using the ufoym/deepo image (the one that has all the deep learning libraries; it's quite popular, with 5k stars on GitHub).

Borda (Member) commented Feb 29, 2020

We shall test it via #486

Borda (Member) commented Apr 16, 2020

For a single node, you may set export SLURM_LOCALID=0.
Feel free to reopen if needed 🐰
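
The same workaround from Python, applied before building the Trainer; the Trainer arguments and `model` are placeholders:

    import os

    # Single-node workaround suggested above: provide the SLURM local id that
    # the ddp2 code path reads, so it no longer raises KeyError.
    os.environ.setdefault("SLURM_LOCALID", "0")

    import pytorch_lightning as pl

    trainer = pl.Trainer(gpus=8, distributed_backend='ddp2')  # placeholder config
    trainer.fit(model)  # `model` is a placeholder LightningModule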

Borda closed this as completed Apr 16, 2020
Laksh1997 commented

Hi, I will try ddp2 while setting this env variable. Cheers!
