NCCL error when using ddp with 2 gpus #3865

Closed
kekeblom opened this issue Oct 5, 2020 · 5 comments · Fixed by #4297
Labels: bug (Something isn't working) · distributed (Generic distributed-related topic) · priority: 0 (High priority task)
Milestone: 1.0.3

Comments


kekeblom commented Oct 5, 2020

🐛 Bug

I'm trying to run PyTorch Lightning using ddp with 2 GPUs. Running with one GPU works fine, and the error is the same with and without fp16. See the stack trace at the end of the post for the error. I also tried ddp2 and dp, but both of those fail with a different error.

To Reproduce

Not sure. Let me know what I can do to diagnose.

I'm running my code on a cluster where each GPU is locked to one process. I'm using NCCL version 2.4.8.

I tried pytorch-lightning versions 0.9.0, 0.9.1rc4, and 0.10.0rc1; all of them result in the same error. I'm running PyTorch 1.6.
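
In case it helps with diagnosis, here is a minimal sanity-check sketch (not part of my training code; the script name is made up) that prints the device/NCCL setup and whether the two GPUs report peer access to each other, since the "peer mapping resources exhausted" warning in the trace below points at CUDA IPC / P2P:

# check_p2p.py -- minimal CUDA/NCCL/P2P sanity check (diagnostic sketch only)
import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())
print("Device count:  ", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())

for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

# The NCCL warning below ("failed to open CUDA IPC handle ... peer mapping
# resources exhausted") suggests a P2P/IPC problem, so check peer access.
if torch.cuda.device_count() >= 2:
    print("0 -> 1 peer access:", torch.cuda.can_device_access_peer(0, 1))
    print("1 -> 0 peer access:", torch.cuda.can_device_access_peer(1, 0))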

Expected behavior

I expected training to start and run smoothly using both GPUs.

Environment

  • CUDA:
    - GPU:
    - GeForce GTX 1080
    - GeForce GTX 1080
    - available: True
    - version: 10.1
  • Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.6.0
    - pytorch-lightning: 0.10.0rc1
    - tqdm: 4.46.1
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor:
    - python: 3.7.7
    - version: #1 SMP Tue May 12 16:57:42 UTC 2020

Additional context

Stack trace and error:

initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO:lightning:initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
INFO:lightning:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Using native 16bit precision.
INFO:lightning:Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
INFO:lightning:initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
lo-s4-039:21587:21587 [0] NCCL INFO Bootstrap : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21587:21587 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
lo-s4-039:21587:21587 [0] NCCL INFO NET/IB : No device found.
lo-s4-039:21587:21587 [0] NCCL INFO NET/Socket : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
NCCL version 2.4.8+cuda10.1
lo-s4-039:21614:21614 [1] NCCL INFO Bootstrap : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21614:21614 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
lo-s4-039:21614:21614 [1] NCCL INFO NET/IB : No device found.
lo-s4-039:21614:21614 [1] NCCL INFO NET/Socket : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21587:21646 [0] NCCL INFO Setting affinity for GPU 0 to 1fd001fd
lo-s4-039:21614:21647 [1] NCCL INFO Setting affinity for GPU 1 to 1fd001fd
lo-s4-039:21587:21646 [0] NCCL INFO Channel 00 :    0   1
lo-s4-039:21587:21646 [0] NCCL INFO Ring 00 : 0[1] -> 1[2] via P2P/IPC
lo-s4-039:21614:21647 [1] NCCL INFO Ring 00 : 1[2] -> 0[1] via P2P/IPC

lo-s4-039:21587:21646 [0] transport/p2p.cc:574 NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:669 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:815 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:951 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO misc/group.cc:69 -> 1 [Async thread]

lo-s4-039:21614:21647 [1] transport/p2p.cc:574 NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:669 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:815 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:951 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO misc/group.cc:69 -> 1 [Async thread]
Traceback (most recent call last):
  File "tools/lightning.py", line 514, in <module>
Traceback (most recent call last):
  File "/cluster/home/user/tracking/tools/lightning.py", line 514, in <module>
    trainer.fit(model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 451, in fit
    results = self.accelerator_backend.train()
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 140, in train
    trainer.fit(model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 451, in fit
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 266, in ddp_train
    model = model.configure_ddp(model, device_ids)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 954, in configure_ddp
    results = self.accelerator_backend.train()
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 140, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 266, in ddp_train
    model, device_ids=device_ids, find_unused_parameters=True
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
    model = model.configure_ddp(model, device_ids)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 954, in configure_ddp
    self.broadcast_bucket_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    model, device_ids=device_ids, find_unused_parameters=True
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
    self.broadcast_bucket_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
kekeblom added the bug (Something isn't working) and help wanted (Open to be worked on) labels Oct 5, 2020
Contributor

github-actions bot commented Oct 5, 2020

Hi! thanks for your contribution!, great first issue!

awaelchli added the distributed (Generic distributed-related topic) label Oct 5, 2020
williamFalcon added this to the 1.0 milestone Oct 5, 2020
edenlightning added and then removed the help wanted (Open to be worked on) label Oct 5, 2020
edenlightning modified the milestones: 1.0, 1.1 Oct 7, 2020
edenlightning modified the milestones: 1.1, 1.0.3 Oct 19, 2020
edenlightning added the priority: 0 (High priority task) label Oct 20, 2020
@edenlightning
Contributor

Mind upgrading to 1.0.2? And try to reproduce using this model -> https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py

@awaelchli
Contributor

@kekeblom how do you launch your script? How do I reproduce it with the bug report template?

I'm running my code on a cluster where each gpu is locked to one process.

That should be fine; ddp launches one process per GPU.
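
Roughly, "one process per GPU" means each spawned ddp process pins one device and joins the NCCL process group, as in this simplified sketch (illustrative only, not Lightning's actual internals; it assumes a single node, so local rank equals global rank):

# Simplified sketch of ddp's one-process-per-GPU setup (illustrative only)
import torch
import torch.distributed as dist

def setup_process(local_rank: int, world_size: int):
    # Each process binds to exactly one GPU ...
    torch.cuda.set_device(local_rank)
    # ... and joins the NCCL process group used for gradient synchronization.
    # init_method="env://" expects MASTER_ADDR / MASTER_PORT to be set.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=local_rank,
        world_size=world_size,
    )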

@kekeblom
Author

Actually, I haven't seen this happen again, so I can't reproduce it. It might have been the version, or it's somehow related to which GPU gets scheduled.

Maybe it's worth closing the issue. I'll let you know if I re-encounter the problem.

Author

kekeblom commented Oct 23, 2020

Speak of the devil. @awaelchli

Tried it on 1.0.3 with 2 GPUs. I modified bug_report_model.py to pass gpus=2, distributed_backend="ddp" in the Trainer arguments.
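
For completeness, the change amounts to something like this (a sketch only; the remaining Trainer arguments follow the bug report template unchanged, and argument names are as in Lightning 1.0.x):

# Sketch of the modified Trainer construction in bug_report_model.py
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,                      # added: use both GPUs
    distributed_backend="ddp",   # added: one process per GPU via ddp
    # ... the remaining arguments follow the bug report template unchanged
)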

It seems to happen only on machines with GTX 1080 GPUs. Machines with GTX 1080 Ti or RTX 2080 cards do not appear to suffer from this issue.

Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:0D:00.0 Off |                  N/A |
| 28%   29C    P8     5W / 180W |      0MiB /  8119MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:0F:00.0 Off |                  N/A |
| 28%   31C    P8     5W / 180W |      0MiB /  8119MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is the stack trace:

Traceback (most recent call last):                                                                                                                            
  File "/cluster/home/user/bug_report.py", line 135, in <module>                                                                                              
Traceback (most recent call last):                                                                                                                            
  File "/cluster/home/user/bug_report.py", line 135, in <module>                                                                                              
    run_test()                                                                                                                                                
  File "/cluster/home/user/bug_report.py", line 130, in run_test                                                                                              
    run_test()                                                                                                                                                
  File "/cluster/home/user/bug_report.py", line 130, in run_test                                                                                              
    trainer.fit(model, train_data, val_data)                                                                                                                  
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit                          
    trainer.fit(model, train_data, val_data)                                                                                                                  
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit                          
    results = self.accelerator_backend.train()                                                                                                                
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train           
    results = self.accelerator_backend.train()                                                                                                                
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train           
    results = self.ddp_train(process_idx=self.task_idx, model=model)                                                                                          
    results = self.ddp_train(process_idx=self.task_idx, model=model)                                                                                          
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 272, in ddp_train       
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 272, in ddp_train       
    model = self.configure_ddp(model, device_ids)                                                                                                             
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 290, in configure_ddp   
    model = self.configure_ddp(model, device_ids)                                                                                                             
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 290, in configure_ddp   
    model, device_ids=device_ids, find_unused_parameters=True                                                                                                 
    model, device_ids=device_ids, find_unused_parameters=True                                                                                                 
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__                         
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__                         
    self.broadcast_bucket_size)                                                                                                                               
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced 
    self.broadcast_bucket_size)                                                                                                                               
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced 
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)                                                                                       
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)                                                                                       
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
