
Multi-GPU on AWS p2.8xlarge instance (ddp2 and ddp) #676

Closed
aced125 opened this issue Jan 9, 2020 · 11 comments
Labels
bug Something isn't working

Comments

aced125 commented Jan 9, 2020

For context: an AWS p2.8xlarge has 8 K80s, all on the same node.

I have tried my model with gpus=1 and distributed_backend=None on an AWS p2.xlarge instance (1 K80), and it works.

When I try gpus=8 and distributed_backend='ddp2' on an AWS p2.8xlarge, I get the following error:

  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 335, in fit
task = int(os.environ['SLURM_LOCALID'])
File "/usr/lib/python3.6/os.py", line 669, in __getitem__
raise KeyError(key) from None
KeyError: 'SLURM_LOCALID'
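
For reference, a minimal sketch of the two Trainer configurations described above; the model class is a hypothetical placeholder, not taken from the original post:

    import pytorch_lightning as pl

    # Hypothetical LightningModule standing in for the poster's actual model.
    model = MySiameseModel()

    # Works on a single-GPU p2.xlarge:
    trainer = pl.Trainer(gpus=1, distributed_backend=None)

    # Raises KeyError: 'SLURM_LOCALID' on a p2.8xlarge with this version:
    trainer = pl.Trainer(gpus=8, distributed_backend='ddp2')

    trainer.fit(model)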
aced125 added the bug label Jan 9, 2020
aced125 (Author) commented Jan 9, 2020

On inspection of the source code, it looks like ddp2 has to go through SLURM, whereas ddp can use multiprocessing. I'll try ddp and edit this issue accordingly.
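
Roughly, the change being tried here; `model` is a hypothetical LightningModule instance, not from the original post:

    import pytorch_lightning as pl

    # ddp appeared (from reading the source) to spawn its worker processes via
    # multiprocessing rather than reading SLURM variables, so try it instead of ddp2.
    trainer = pl.Trainer(gpus=8, distributed_backend='ddp')
    trainer.fit(model)  # `model` is a placeholder LightningModule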

aced125 changed the title from "Multi-GPU on AWS p2.8xlarge instance" to "Multi-GPU on AWS p2.8xlarge instance (ddp2 and ddp)" Jan 9, 2020
aced125 (Author) commented Jan 10, 2020

Okay, let's forget about ddp2 on AWS for now, since I don't have the expertise to set up a SLURM cluster.

I switched my torchtext BucketIterator to a vanilla torch.utils.data.DataLoader.

The good news: DataParallel works fine without any extra setup (just set gpus=8 and distributed_backend='dp' in the Trainer).

Bad news (a): DP seems to be quite slow (almost as slow as CPU!), and the 8 GPUs don't appear to be heavily utilised (see https://app.wandb.ai/laksh/Siamese_SNLI/runs/qbsfjywe/system?workspace=).

This might be because I set num_workers=0 in the DataLoader, so I will change it to 4 (rough sketch below) and see what happens tomorrow.
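
A rough sketch of the DataLoader change described above; the dataset variable and batch size are hypothetical placeholders:

    from torch.utils.data import DataLoader

    # Vanilla DataLoader replacing the torchtext BucketIterator.
    train_loader = DataLoader(
        train_dataset,    # placeholder Dataset built from the SNLI data
        batch_size=64,    # placeholder value, not from the original post
        shuffle=True,
        num_workers=4,    # was 0, which can leave 8 GPUs waiting on data under dp
        pin_memory=True,  # speeds up host-to-GPU copies
    )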

Bad news (b): DDP still doesn't work :( I get the following error:


    02:22:54  wandb: Run `wandb off` to turn off syncing.
    02:24:25  snli_1.0.zip: 0% 0.00/94.6M [00:00<?, ?B/s] …
    02:24:25  downloading snli_1.0.zip
    02:24:25  extracting
    02:24:25  Starting Model
    02:24:26  /usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 2 leaked semaphores to clean up at shutdown
    02:24:26    len(cache))
    02:24:36  Bus error (core dumped)

Any help would be greatly appreciated. Thanks so much, PL community.

williamFalcon (Contributor) commented

Any ddp variant goes through SLURM, unfortunately (at least at the moment).

williamFalcon (Contributor) commented

Sounds like you have a bottleneck. Can you add some code and maybe your node config?

Laksh1997 commented

Okay, will do soon. Thanks! PL is really good!

neggert (Contributor) commented Jan 21, 2020

Single-node DDP should work fine without SLURM, right? We just spawn multiple processes ourselves.

Never seen that error before. Maybe something weird in the image you're using?
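
As a generic illustration of that per-process spawning on a single node (a plain torch.multiprocessing sketch, not Lightning's actual internals):

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # Each spawned process joins the process group and pins one GPU.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        # ... the per-rank training loop would go here ...
        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 8  # one process per K80 on a p2.8xlarge
        mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)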

Laksh1997 commented

I am using the ufoym/deepo image (the one that has all the deep learning libraries; it's quite popular, with 5k stars on GitHub).

Borda (Member) commented Feb 29, 2020

We shall test it via #486

Borda (Member) commented Apr 16, 2020

For a single node, you may set export SLURM_LOCALID=0.
Feel free to reopen if needed 🐰
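
The same workaround from Python, applied before building the Trainer; the Trainer arguments and `model` are placeholders:

    import os

    # Single-node workaround suggested above: provide the SLURM local id that
    # the ddp2 code path reads, so it no longer raises KeyError.
    os.environ.setdefault("SLURM_LOCALID", "0")

    import pytorch_lightning as pl

    trainer = pl.Trainer(gpus=8, distributed_backend='ddp2')  # placeholder config
    trainer.fit(model)  # `model` is a placeholder LightningModule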

Borda closed this as completed Apr 16, 2020
Laksh1997 commented

Hi, I will try ddp2 while setting this env variable. Cheers!
