Using multi-GPU training reports errors #12213

Closed · jcluo1994 opened this issue Oct 10, 2023 · 5 comments
Labels: question (Further information is requested), Stale

Question

Traceback (most recent call last):
File "train.py", line 647, in
main(opt)
File "train.py", line 536, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 116, in train
with torch_distributed_zero_first(LOCAL_RANK):
File "/opt/conda/envs/train/lib/python3.8/contextlib.py", line 113, in enter
return next(self.gen)
File "/home/bml/yolov5/utils/torch_utils.py", line 92, in torch_distributed_zero_first
dist.barrier(device_ids=[local_rank])
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:445 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3e54353617 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f3e5430ea56 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x32c (0x7f3e852c536c in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f3e852c64f2 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x55 (0x7f3e852c6915 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xb2 (0x7f3e553460b2 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x203 (0x7f3e5534ba83 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0xf19257 (0x7f3e5535a257 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f3e5535bf01 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #13: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3a7 (0x7f3e5535db27 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #14: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0xb25 (0x7f3e5536f7d5 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #15: + 0x55786a2 (0x7f3e852716a2 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x5582cc0 (0x7f3e8527bcc0 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: + 0x5582dc5 (0x7f3e8527bdc5 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: + 0x4bae85b (0x7f3e848a785b in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #19: + 0x4bac83c (0x7f3e848a583c in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #20: + 0x1904688 (0x7f3e815fd688 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #21: + 0x558c284 (0x7f3e85285284 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #22: + 0x558d1ed (0x7f3e852861ed in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #23: + 0xc407b8 (0x7f3e9787e7b8 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #24: + 0x3ee82f (0x7f3e9702c82f in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #25: PyCFunction_Call + 0x52 (0x4f5572 in /opt/conda/envs/train/bin/python)
frame #26: _PyObject_MakeTpCall + 0x3bb (0x4e0e1b in /opt/conda/envs/train/bin/python)
frame #27: /opt/conda/envs/train/bin/python() [0x4f531d]
frame #28: _PyEval_EvalFrameDefault + 0x1153 (0x4d9263 in /opt/conda/envs/train/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #30: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #31: PyObject_Call + 0x34e (0x4f76ce in /opt/conda/envs/train/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2073 (0x4da183 in /opt/conda/envs/train/bin/python)
frame #33: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #34: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x1153 (0x4d9263 in /opt/conda/envs/train/bin/python)
frame #36: /opt/conda/envs/train/bin/python() [0x4fc29b]
frame #37: /opt/conda/envs/train/bin/python() [0x562b30]
frame #38: /opt/conda/envs/train/bin/python() [0x4e8cfb]
frame #39: _PyEval_EvalFrameDefault + 0x399 (0x4d84a9 in /opt/conda/envs/train/bin/python)
frame #40: _PyFunction_Vectorcall + 0x106 (0x4e81a6 in /opt/conda/envs/train/bin/python)
frame #41: /opt/conda/envs/train/bin/python() [0x4f5154]
frame #42: _PyEval_EvalFrameDefault + 0x2ab0 (0x4dabc0 in /opt/conda/envs/train/bin/python)
frame #43: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #44: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x399 (0x4d84a9 in /opt/conda/envs/train/bin/python)
frame #46: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #47: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x399 (0x4d84a9 in /opt/conda/envs/train/bin/python)
frame #49: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #50: PyEval_EvalCodeEx + 0x39 (0x585e29 in /opt/conda/envs/train/bin/python)
frame #51: PyEval_EvalCode + 0x1b (0x585deb in /opt/conda/envs/train/bin/python)
frame #52: /opt/conda/envs/train/bin/python() [0x5a5bd1]
frame #53: /opt/conda/envs/train/bin/python() [0x5a4bdf]
frame #54: /opt/conda/envs/train/bin/python() [0x45c538]
frame #55: PyRun_SimpleFileExFlags + 0x340 (0x45c0d9 in /opt/conda/envs/train/bin/python)
frame #56: /opt/conda/envs/train/bin/python() [0x44fe8f]
frame #57: Py_BytesMain + 0x39 (0x579e89 in /opt/conda/envs/train/bin/python)
frame #58: __libc_start_main + 0xf0 (0x7f3ed8afe840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #59: /opt/conda/envs/train/bin/python() [0x579d3d]
. This may indicate a possible application crash on rank 0 or a network set up issue.
[2023-10-10 14:56:33,514] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 83 closing signal SIGTERM
[2023-10-10 14:56:33,629] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 84) of binary: /opt/conda/envs/train/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/train/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/train/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 810, in
main()
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-10-10_14:56:33
host : 14064c861bcc
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 84)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Additional

I used the following command: python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1. Thank you for your help.

jcluo1994 added the question label Oct 10, 2023
github-actions bot (Contributor) commented:

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label Nov 10, 2023
glenn-jocher (Member) commented:

@jcluo1994 this issue seems to be related to the distributed training setup, specifically the NCCL communicator and the key-value store. You may want to ensure that the communication between the processes is set up correctly, and check the network setup to address any possible issues that may be causing these errors during training. Let me know if you need further assistance with this.
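As a concrete starting point, the sketch below is a hypothetical standalone script (not part of train.py or YOLOv5) that turns on NCCL's own logging and raises the process-group timeout, then runs the same barrier call that failed in utils/torch_utils.py. The NCCL_SOCKET_IFNAME value is an assumption; replace it with the interface your machine actually uses:

    # nccl_store_check.py -- hypothetical helper, a minimal sketch (not part of YOLOv5).
    # Verbose NCCL logs plus a longer timeout make a rank that cannot reach the
    # c10d key-value store fail with a more descriptive message than "Socket Timeout".
    import datetime
    import os

    import torch
    import torch.distributed as dist

    os.environ.setdefault("NCCL_DEBUG", "INFO")          # print NCCL setup/transport details
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # assumption: pin NCCL to a known interface

    if __name__ == "__main__":
        local_rank = int(os.environ["LOCAL_RANK"])       # set by torch.distributed.run
        torch.cuda.set_device(local_rank)
        dist.init_process_group(
            backend="nccl",
            timeout=datetime.timedelta(minutes=60),      # give rank 0 more time to publish the NCCL id
        )
        dist.barrier(device_ids=[local_rank])            # the call that raised the timeout above
        if dist.get_rank() == 0:
            print("all ranks reached the barrier")
        dist.destroy_process_group()

Launched with the same command pattern (python -m torch.distributed.run --nproc_per_node 2 nccl_store_check.py), the NCCL INFO output typically shows which interface each rank picked and which rank never checks in.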

github-actions bot removed the Stale label Nov 15, 2023
github-actions bot (Contributor) commented:

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label Dec 15, 2023
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 25, 2023
ANYMS-A commented Jun 26, 2024

I ran into the same issue when using DeepSpeed to fine-tune an LLM. Have you found a solution for this?

glenn-jocher (Member) commented:

Hello @ANYMS-A,

Thank you for reaching out and sharing your experience. It seems like you're encountering a similar issue with the NCCL communicator and key-value store during multi-GPU training. Let's work together to resolve this.

To better assist you, could you please provide a minimum reproducible code example? This will help us understand the exact setup and conditions under which the issue occurs. You can refer to our guide on creating a minimum reproducible example. This step is crucial for us to reproduce and investigate the bug effectively.

Additionally, please ensure that you are using the latest versions of torch and the YOLOv5 repository. Sometimes, issues are resolved in newer releases, and updating might solve the problem without further troubleshooting.

Here's a quick checklist to help you get started:

  1. Update YOLOv5 and PyTorch:

    git pull https://github.com/ultralytics/yolov5
    pip install --upgrade torch
  2. Verify your command:
    Ensure you are using the recommended DistributedDataParallel (DDP) mode for multi-GPU training:

    python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1
  3. Check network setup:
    Since the error involves network communication, ensure that your network configuration allows for proper communication between the GPUs (a minimal connectivity check is sketched right after this list).
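A quick way to separate a YOLOv5 problem from an environment problem is a bare torch.distributed sanity check. The sketch below is a hypothetical standalone file (here called ddp_sanity_check.py, not part of YOLOv5) that only initializes the NCCL process group and runs a single all_reduce across the GPUs; if it hangs or hits the same Socket Timeout, the fault is in the drivers, NCCL, or host networking rather than in train.py:

    # ddp_sanity_check.py -- hypothetical minimal NCCL connectivity check (not part of YOLOv5).
    import os

    import torch
    import torch.distributed as dist

    def main():
        local_rank = int(os.environ["LOCAL_RANK"])  # provided by torch.distributed.run
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")

        # Each rank contributes a tensor of ones; after all_reduce every rank
        # should read back the world size as the summed value.
        t = torch.ones(1, device=f"cuda:{local_rank}")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        print(f"rank {dist.get_rank()}: all_reduce -> {t.item()} (expected {dist.get_world_size()})")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Run it with the same launcher, e.g. python -m torch.distributed.run --nproc_per_node 2 ddp_sanity_check.py; if both ranks print 2.0, NCCL communication itself is healthy and the problem is more likely in the training setup.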

If you have already tried these steps and the issue persists, please share the details of your setup and any additional logs or error messages you encounter. This information will be invaluable in diagnosing and resolving the issue.

Thank you for your patience and cooperation. We're here to help!
