

DDP error after upgrading to v1.0.1 #4171

Closed
sheffier opened this issue Oct 15, 2020 · 3 comments · Fixed by #4297
Assignees
Labels
bug Something isn't working distributed Generic distributed-related topic help wanted Open to be worked on priority: 0 High priority task
Milestone

Comments

@sheffier

🐛 Bug

I'll start by saying that before upgrading to v1.0.1, I used v0.9.0 with no apparent DDP issues.

After upgrading to v1.0.1, I'm having issues training with multiple GPUs using the DDP backend.

Initiating multi-GPU training in either of the following scenarios results in an error:

  1. The first GPU (i.e. ID 0) is not included in the GPU list. For example:
    python main.py --distributed_backend 'ddp' --gpus 1,2,3

  2. The GPU list is not sequential. For example:
    python main.py --distributed_backend 'ddp' --gpus 0,2,3

Either of the above results in the following error message:
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1595629427478/work/torch/csrc/cuda/Module.cpp:59

Initiating multi-GPU training in the following scenarios works as expected:

python main.py --distributed_backend 'ddp' --gpus 2
or
python main.py --distributed_backend 'ddp' --gpus 0,1,2

Environment

  • CUDA:
    - GPU:
    - GeForce GTX 1080 Ti
    - GeForce GTX 1080 Ti
    - GeForce GTX 1080 Ti
    - GeForce GTX 1080 Ti
    - available: True
    - version: 10.2
  • Packages:
    - numpy: 1.19.1
    - pyTorch_debug: False
    - pyTorch_version: 1.6.0
    - pytorch-lightning: 1.0.1
    - tqdm: 4.50.2
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.8
    - version: #119-Ubuntu SMP Tue Sep 8 12:30:01 UTC 2020
@sheffier sheffier added bug Something isn't working help wanted Open to be worked on labels Oct 15, 2020
@williamFalcon
Contributor

ok got it. yes, I was able to reproduce on my end. it's really hard to test these cases with only 2 GPUs haha. anyhow, this is a bit more involved, so it might take a few days. In the meantime, set CUDA_VISIBLE_DEVICES to the GPUs you need and pass in the number of GPUs.

sorry for the inconvenience!

once it's fixed, we'll ping you here
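The suggested workaround can be sketched as follows. This is only an illustration of the env-var mechanics, not Lightning's own code; the `main.py` invocation mirrors the one in the report and assumes it accepts an integer GPU count:

```python
import os

# Workaround sketch: expose only the desired physical GPUs *before* any CUDA
# initialization, then request a GPU count instead of explicit ids.
# Shell equivalent:
#   CUDA_VISIBLE_DEVICES=1,2,3 python main.py --distributed_backend 'ddp' --gpus 3
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

# CUDA renumbers the visible devices 0..N-1, so a count is all that's needed.
gpu_count = len(os.environ["CUDA_VISIBLE_DEVICES"].split(","))
print(gpu_count)  # → 3
```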

@sheffier
Author

Great, thanks!

@edenlightning edenlightning added this to the 1.0.3 milestone Oct 19, 2020
@edenlightning edenlightning added distributed Generic distributed-related topic priority: 0 High priority task and removed priority: 0 High priority task labels Oct 19, 2020
@awaelchli
Contributor

duplicate of #3791
I'm working on this, but no big breakthrough yet. I'm facing some difficulties because several global/environment variables determine the GPU selection. For ddp this is quite difficult to debug.
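A minimal illustration of why non-contiguous GPU lists can trip up device selection (a hypothetical helper, not Lightning's actual code): once CUDA_VISIBLE_DEVICES is set, CUDA renumbers the listed devices 0..N-1, so passing a raw physical id straight to something like `torch.cuda.set_device` can fall outside the visible range and raise "invalid device ordinal":

```python
def logical_ordinal(physical_id, visible_devices):
    """Map a physical GPU id to the logical ordinal CUDA exposes when
    CUDA_VISIBLE_DEVICES lists `visible_devices` (illustrative only)."""
    if physical_id not in visible_devices:
        # Mirrors CUDA's behavior: the requested device is simply not visible.
        raise RuntimeError("invalid device ordinal")
    return visible_devices.index(physical_id)

# With --gpus 1,2,3 (physical ids), the logical ordinals are 0,1,2:
print([logical_ordinal(g, [1, 2, 3]) for g in [1, 2, 3]])  # → [0, 1, 2]
```

Passing the physical id 3 while only 3 devices are visible (logical 0..2) is exactly the mismatch behind the runtime error in the report.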
