NCCL error when using ddp with 2 gpus #3865

Closed
kekeblom opened this issue Oct 5, 2020 · 5 comments · Fixed by #4297
Labels: bug (Something isn't working) · distributed (Generic distributed-related topic) · priority: 0 (High priority task)
Milestone: 1.0.3

Comments


kekeblom commented Oct 5, 2020

🐛 Bug

I'm trying to run PyTorch Lightning using ddp with 2 GPUs. Running with one GPU works fine, and the error is the same with and without fp16. See the stack trace at the end of the post for the error. I also tried ddp2 and dp, but both of those fail with a different error.

To Reproduce

Not sure. Let me know what I can do to diagnose.

I'm running my code on a cluster where each GPU is locked to one process. I'm using NCCL version 2.4.8.

I tried pytorch-lightning versions 0.9.0, 0.9.1rc4, and 0.10.0rc1; all of them result in the same error. I'm running PyTorch 1.6.
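
In case it helps with diagnosis, here is a minimal sanity-check sketch (not part of my training code; the script name is made up) that prints the device/NCCL setup and whether the two GPUs report peer access to each other, since the "peer mapping resources exhausted" warning in the trace below points at CUDA IPC / P2P:

# check_p2p.py -- minimal CUDA/NCCL/P2P sanity check (diagnostic sketch only)
import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())
print("Device count:  ", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())

for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

# The NCCL warning below ("failed to open CUDA IPC handle ... peer mapping
# resources exhausted") suggests a P2P/IPC problem, so check peer access.
if torch.cuda.device_count() >= 2:
    print("0 -> 1 peer access:", torch.cuda.can_device_access_peer(0, 1))
    print("1 -> 0 peer access:", torch.cuda.can_device_access_peer(1, 0))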

Expected behavior

I expected training to start and run smoothly using both GPUs.

Environment

  • CUDA:
    - GPU:
    - GeForce GTX 1080
    - GeForce GTX 1080
    - available: True
    - version: 10.1
  • Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.6.0
    - pytorch-lightning: 0.10.0rc1
    - tqdm: 4.46.1
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor:
    - python: 3.7.7
    - version: #1 SMP Tue May 12 16:57:42 UTC 2020

Additional context

Stack trace and error:

initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO:lightning:initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
INFO:lightning:LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Using native 16bit precision.
INFO:lightning:Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
INFO:lightning:initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
lo-s4-039:21587:21587 [0] NCCL INFO Bootstrap : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21587:21587 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
lo-s4-039:21587:21587 [0] NCCL INFO NET/IB : No device found.
lo-s4-039:21587:21587 [0] NCCL INFO NET/Socket : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
NCCL version 2.4.8+cuda10.1
lo-s4-039:21614:21614 [1] NCCL INFO Bootstrap : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21614:21614 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
lo-s4-039:21614:21614 [1] NCCL INFO NET/IB : No device found.
lo-s4-039:21614:21614 [1] NCCL INFO NET/Socket : Using [0]fabric:10.204.67.89<0> [1]enp129s0f0:10.204.3.89<0>
lo-s4-039:21587:21646 [0] NCCL INFO Setting affinity for GPU 0 to 1fd001fd
lo-s4-039:21614:21647 [1] NCCL INFO Setting affinity for GPU 1 to 1fd001fd
lo-s4-039:21587:21646 [0] NCCL INFO Channel 00 :    0   1
lo-s4-039:21587:21646 [0] NCCL INFO Ring 00 : 0[1] -> 1[2] via P2P/IPC
lo-s4-039:21614:21647 [1] NCCL INFO Ring 00 : 1[2] -> 0[1] via P2P/IPC

lo-s4-039:21587:21646 [0] transport/p2p.cc:574 NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:669 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:815 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO init.cc:951 -> 1
lo-s4-039:21587:21646 [0] NCCL INFO misc/group.cc:69 -> 1 [Async thread]

lo-s4-039:21614:21647 [1] transport/p2p.cc:574 NCCL WARN failed to open CUDA IPC handle : 711 peer mapping resources exhausted
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:669 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:815 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO init.cc:951 -> 1
lo-s4-039:21614:21647 [1] NCCL INFO misc/group.cc:69 -> 1 [Async thread]
Traceback (most recent call last):
  File "tools/lightning.py", line 514, in <module>
Traceback (most recent call last):
  File "/cluster/home/user/tracking/tools/lightning.py", line 514, in <module>
    trainer.fit(model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 451, in fit
    results = self.accelerator_backend.train()
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 140, in train
    trainer.fit(model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 451, in fit
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 266, in ddp_train
    model = model.configure_ddp(model, device_ids)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 954, in configure_ddp
    results = self.accelerator_backend.train()
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 140, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 266, in ddp_train
    model, device_ids=device_ids, find_unused_parameters=True
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
    model = model.configure_ddp(model, device_ids)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 954, in configure_ddp
    self.broadcast_bucket_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    model, device_ids=device_ids, find_unused_parameters=True
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
    self.broadcast_bucket_size)
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
kekeblom added the bug (Something isn't working) and help wanted (Open to be worked on) labels Oct 5, 2020
Contributor

github-actions bot commented Oct 5, 2020

Hi! thanks for your contribution!, great first issue!

awaelchli added the distributed (Generic distributed-related topic) label Oct 5, 2020
williamFalcon added this to the 1.0 milestone Oct 5, 2020
edenlightning added and then removed the help wanted (Open to be worked on) label Oct 5, 2020
edenlightning modified the milestones: 1.0, 1.1 Oct 7, 2020
edenlightning modified the milestones: 1.1, 1.0.3 Oct 19, 2020
edenlightning added the priority: 0 (High priority task) label Oct 20, 2020
@edenlightning
Contributor

Mind upgrading to 1.0.2? And try to reproduce using this model -> https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py

@awaelchli
Contributor

@kekeblom how do you launch your script? How do I reproduce it with the bug report template?

I'm running my code on a cluster where each gpu is locked to one process.

That should be fine; ddp launches one process per GPU.
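
Roughly, "one process per GPU" means each spawned ddp process pins one device and joins the NCCL process group, as in this simplified sketch (illustrative only, not Lightning's actual internals; it assumes a single node, so local rank equals global rank):

# Simplified sketch of ddp's one-process-per-GPU setup (illustrative only)
import torch
import torch.distributed as dist

def setup_process(local_rank: int, world_size: int):
    # Each process binds to exactly one GPU ...
    torch.cuda.set_device(local_rank)
    # ... and joins the NCCL process group used for gradient synchronization.
    # init_method="env://" expects MASTER_ADDR / MASTER_PORT to be set.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=local_rank,
        world_size=world_size,
    )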

@kekeblom
Author

Actually, I haven't seen this happen again, so I can't reproduce it. It might have been the version, or it's somehow related to which GPU gets scheduled.

Maybe it's worth closing the issue. I'll let you know if I re-encounter the problem.

Author

kekeblom commented Oct 23, 2020

Speak of the devil. @awaelchli

Tried it on 1.0.3 with 2 GPUs. I modified bug_report_model.py to pass gpus=2, distributed_backend="ddp" in the Trainer arguments.
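
For completeness, the change amounts to something like this (a sketch only; the remaining Trainer arguments follow the bug report template unchanged, and argument names are as in Lightning 1.0.x):

# Sketch of the modified Trainer construction in bug_report_model.py
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,                      # added: use both GPUs
    distributed_backend="ddp",   # added: one process per GPU via ddp
    # ... the remaining arguments follow the bug report template unchanged
)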

It seems to happen only on machines with GTX 1080 GPUs. Machines with GTX 1080 Ti or RTX 2080 cards do not appear to suffer from this issue.

Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:0D:00.0 Off |                  N/A |
| 28%   29C    P8     5W / 180W |      0MiB /  8119MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:0F:00.0 Off |                  N/A |
| 28%   31C    P8     5W / 180W |      0MiB /  8119MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is the stack trace:

Traceback (most recent call last):                                                                                                                            
  File "/cluster/home/user/bug_report.py", line 135, in <module>                                                                                              
Traceback (most recent call last):                                                                                                                            
  File "/cluster/home/user/bug_report.py", line 135, in <module>                                                                                              
    run_test()                                                                                                                                                
  File "/cluster/home/user/bug_report.py", line 130, in run_test                                                                                              
    run_test()                                                                                                                                                
  File "/cluster/home/user/bug_report.py", line 130, in run_test                                                                                              
    trainer.fit(model, train_data, val_data)                                                                                                                  
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit                          
    trainer.fit(model, train_data, val_data)                                                                                                                  
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit                          
    results = self.accelerator_backend.train()                                                                                                                
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train           
    results = self.accelerator_backend.train()                                                                                                                
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train           
    results = self.ddp_train(process_idx=self.task_idx, model=model)                                                                                          
    results = self.ddp_train(process_idx=self.task_idx, model=model)                                                                                          
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 272, in ddp_train       
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 272, in ddp_train       
    model = self.configure_ddp(model, device_ids)                                                                                                             
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 290, in configure_ddp   
    model = self.configure_ddp(model, device_ids)                                                                                                             
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 290, in configure_ddp   
    model, device_ids=device_ids, find_unused_parameters=True                                                                                                 
    model, device_ids=device_ids, find_unused_parameters=True                                                                                                 
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__                         
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__                         
    self.broadcast_bucket_size)                                                                                                                               
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced 
    self.broadcast_bucket_size)                                                                                                                               
  File "/cluster/home/user/miniconda3/envs/track/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced 
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)                                                                                       
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)                                                                                       
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629403081/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled cuda error, NCCL version 2.4.8
