Using multi-GPU training reports errors #12213

Closed · jcluo1994 opened this issue Oct 10, 2023 · 5 comments
Labels: question (Further information is requested), Stale

Question

Traceback (most recent call last):
File "train.py", line 647, in
main(opt)
File "train.py", line 536, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 116, in train
with torch_distributed_zero_first(LOCAL_RANK):
File "/opt/conda/envs/train/lib/python3.8/contextlib.py", line 113, in enter
return next(self.gen)
File "/home/bml/yolov5/utils/torch_utils.py", line 92, in torch_distributed_zero_first
dist.barrier(device_ids=[local_rank])
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:445 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3e54353617 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f3e5430ea56 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x32c (0x7f3e852c536c in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f3e852c64f2 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x55 (0x7f3e852c6915 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f3e8527e161 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xb2 (0x7f3e553460b2 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x203 (0x7f3e5534ba83 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0xf19257 (0x7f3e5535a257 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f3e5535bf01 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #13: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3a7 (0x7f3e5535db27 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #14: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0xb25 (0x7f3e5536f7d5 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #15: + 0x55786a2 (0x7f3e852716a2 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x5582cc0 (0x7f3e8527bcc0 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: + 0x5582dc5 (0x7f3e8527bdc5 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: + 0x4bae85b (0x7f3e848a785b in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #19: + 0x4bac83c (0x7f3e848a583c in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #20: + 0x1904688 (0x7f3e815fd688 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #21: + 0x558c284 (0x7f3e85285284 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #22: + 0x558d1ed (0x7f3e852861ed in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #23: + 0xc407b8 (0x7f3e9787e7b8 in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #24: + 0x3ee82f (0x7f3e9702c82f in /opt/conda/envs/train/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #25: PyCFunction_Call + 0x52 (0x4f5572 in /opt/conda/envs/train/bin/python)
frame #26: _PyObject_MakeTpCall + 0x3bb (0x4e0e1b in /opt/conda/envs/train/bin/python)
frame #27: /opt/conda/envs/train/bin/python() [0x4f531d]
frame #28: _PyEval_EvalFrameDefault + 0x1153 (0x4d9263 in /opt/conda/envs/train/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #30: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #31: PyObject_Call + 0x34e (0x4f76ce in /opt/conda/envs/train/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2073 (0x4da183 in /opt/conda/envs/train/bin/python)
frame #33: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #34: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x1153 (0x4d9263 in /opt/conda/envs/train/bin/python)
frame #36: /opt/conda/envs/train/bin/python() [0x4fc29b]
frame #37: /opt/conda/envs/train/bin/python() [0x562b30]
frame #38: /opt/conda/envs/train/bin/python() [0x4e8cfb]
frame #39: _PyEval_EvalFrameDefault + 0x399 (0x4d84a9 in /opt/conda/envs/train/bin/python)
frame #40: _PyFunction_Vectorcall + 0x106 (0x4e81a6 in /opt/conda/envs/train/bin/python)
frame #41: /opt/conda/envs/train/bin/python() [0x4f5154]
frame #42: _PyEval_EvalFrameDefault + 0x2ab0 (0x4dabc0 in /opt/conda/envs/train/bin/python)
frame #43: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #44: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x399 (0x4d84a9 in /opt/conda/envs/train/bin/python)
frame #46: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #47: _PyFunction_Vectorcall + 0x19c (0x4e823c in /opt/conda/envs/train/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x399 (0x4d84a9 in /opt/conda/envs/train/bin/python)
frame #49: _PyEval_EvalCodeWithName + 0x2f1 (0x4d70d1 in /opt/conda/envs/train/bin/python)
frame #50: PyEval_EvalCodeEx + 0x39 (0x585e29 in /opt/conda/envs/train/bin/python)
frame #51: PyEval_EvalCode + 0x1b (0x585deb in /opt/conda/envs/train/bin/python)
frame #52: /opt/conda/envs/train/bin/python() [0x5a5bd1]
frame #53: /opt/conda/envs/train/bin/python() [0x5a4bdf]
frame #54: /opt/conda/envs/train/bin/python() [0x45c538]
frame #55: PyRun_SimpleFileExFlags + 0x340 (0x45c0d9 in /opt/conda/envs/train/bin/python)
frame #56: /opt/conda/envs/train/bin/python() [0x44fe8f]
frame #57: Py_BytesMain + 0x39 (0x579e89 in /opt/conda/envs/train/bin/python)
frame #58: __libc_start_main + 0xf0 (0x7f3ed8afe840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #59: /opt/conda/envs/train/bin/python() [0x579d3d]
. This may indicate a possible application crash on rank 0 or a network set up issue.
[2023-10-10 14:56:33,514] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 83 closing signal SIGTERM
[2023-10-10 14:56:33,629] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 84) of binary: /opt/conda/envs/train/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/train/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/train/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 810, in
main()
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-10-10_14:56:33
host : 14064c861bcc
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 84)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Additional

I used the following command: python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1. Thank you for your help.

jcluo1994 added the question label Oct 10, 2023
github-actions bot (Contributor) commented:

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label Nov 10, 2023
glenn-jocher (Member) commented:

@jcluo1994 this issue seems to be related to the distributed training setup, specifically the NCCL communicator and the key-value store. You may want to ensure that the communication between the processes is set up correctly, and check the network setup to address any possible issues that may be causing these errors during training. Let me know if you need further assistance with this.
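As a concrete starting point, the sketch below is a hypothetical standalone script (not part of train.py or YOLOv5) that turns on NCCL's own logging and raises the process-group timeout, then runs the same barrier call that failed in utils/torch_utils.py. The NCCL_SOCKET_IFNAME value is an assumption; replace it with the interface your machine actually uses:

    # nccl_store_check.py -- hypothetical helper, a minimal sketch (not part of YOLOv5).
    # Verbose NCCL logs plus a longer timeout make a rank that cannot reach the
    # c10d key-value store fail with a more descriptive message than "Socket Timeout".
    import datetime
    import os

    import torch
    import torch.distributed as dist

    os.environ.setdefault("NCCL_DEBUG", "INFO")          # print NCCL setup/transport details
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # assumption: pin NCCL to a known interface

    if __name__ == "__main__":
        local_rank = int(os.environ["LOCAL_RANK"])       # set by torch.distributed.run
        torch.cuda.set_device(local_rank)
        dist.init_process_group(
            backend="nccl",
            timeout=datetime.timedelta(minutes=60),      # give rank 0 more time to publish the NCCL id
        )
        dist.barrier(device_ids=[local_rank])            # the call that raised the timeout above
        if dist.get_rank() == 0:
            print("all ranks reached the barrier")
        dist.destroy_process_group()

Launched with the same command pattern (python -m torch.distributed.run --nproc_per_node 2 nccl_store_check.py), the NCCL INFO output typically shows which interface each rank picked and which rank never checks in.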

github-actions bot removed the Stale label Nov 15, 2023
github-actions bot (Contributor) commented:

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label Dec 15, 2023
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 25, 2023
ANYMS-A commented Jun 26, 2024

I ran into the same issue when using DeepSpeed to fine-tune an LLM. Have you found a solution for this?

glenn-jocher (Member) commented:

Hello @ANYMS-A,

Thank you for reaching out and sharing your experience. It seems like you're encountering a similar issue with the NCCL communicator and key-value store during multi-GPU training. Let's work together to resolve this.

To better assist you, could you please provide a minimum reproducible code example? This will help us understand the exact setup and conditions under which the issue occurs. You can refer to our guide on creating a minimum reproducible example. This step is crucial for us to reproduce and investigate the bug effectively.

Additionally, please ensure that you are using the latest versions of torch and the YOLOv5 repository. Sometimes, issues are resolved in newer releases, and updating might solve the problem without further troubleshooting.

Here's a quick checklist to help you get started:

  1. Update YOLOv5 and PyTorch:

    git pull https://github.com/ultralytics/yolov5
    pip install --upgrade torch
  2. Verify your command:
    Ensure you are using the recommended DistributedDataParallel (DDP) mode for multi-GPU training:

    python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1
  3. Check network setup:
    Since the error involves network communication, ensure that your network configuration allows for proper communication between the GPUs (a minimal connectivity check is sketched right after this list).
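A quick way to separate a YOLOv5 problem from an environment problem is a bare torch.distributed sanity check. The sketch below is a hypothetical standalone file (here called ddp_sanity_check.py, not part of YOLOv5) that only initializes the NCCL process group and runs a single all_reduce across the GPUs; if it hangs or hits the same Socket Timeout, the fault is in the drivers, NCCL, or host networking rather than in train.py:

    # ddp_sanity_check.py -- hypothetical minimal NCCL connectivity check (not part of YOLOv5).
    import os

    import torch
    import torch.distributed as dist

    def main():
        local_rank = int(os.environ["LOCAL_RANK"])  # provided by torch.distributed.run
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")

        # Each rank contributes a tensor of ones; after all_reduce every rank
        # should read back the world size as the summed value.
        t = torch.ones(1, device=f"cuda:{local_rank}")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        print(f"rank {dist.get_rank()}: all_reduce -> {t.item()} (expected {dist.get_world_size()})")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Run it with the same launcher, e.g. python -m torch.distributed.run --nproc_per_node 2 ddp_sanity_check.py; if both ranks print 2.0, NCCL communication itself is healthy and the problem is more likely in the training setup.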

If you have already tried these steps and the issue persists, please share the details of your setup and any additional logs or error messages you encounter. This information will be invaluable in diagnosing and resolving the issue.

Thank you for your patience and cooperation. We're here to help!
