
CUDA error: the launch timed out and was terminated #4851

Closed
coallar opened this issue Sep 18, 2021 · 2 comments
Labels: bug (Something isn't working), Stale

Comments


coallar commented Sep 18, 2021

SYSTEM: Ubuntu 20.04

driver info:

CUDA:0 (NVIDIA GeForce GTX TITAN X, 12204.4375MB)
CUDA:1 (NVIDIA GeForce GTX TITAN X, 12212.875MB)

torch & CUDA info:
torch.__version__ ==> '1.8.0+cu111'
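For reference, a quick way to confirm these values (a one-liner sketch, not part of the original report):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"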

Command: python3 -m torch.distributed.launch --nproc_per_node 2 train.py --batch 32 --data coco.yaml --weights yolov5x.pt --device 0,1 --imgsz 560 --cfg yolov5x.yaml

error:

Image sizes 576 train, 576 val
Using 8 dataloader workers
Logging results to runs/train/exp5
Starting training for 60 epochs...

 Epoch   gpu_mem       box       obj       cls    labels  img_size
  0/59     10.6G    0.1132   0.02993         0        29       576:   3%|██▎                                                                                      | 1/39 [00:10<06:48, 10.75s/it]Reducer buckets have been rebuilt in this iteration.
  0/59     10.6G   0.09944   0.03163         0        19       576: 100%|████████████████████████████████████████████████████████████████████████████████████████| 39/39 [01:38<00:00,  2.54s/it]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:   3%|█▊                                                                       | 1/39 [00:05<03:43,  5.89s/it]Traceback (most recent call last):

File "train.py", line 620, in <module>
main(opt)
File "train.py", line 518, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 312, in train
pred = model(imgs) # forward
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/yolo.py", line 123, in forward
return self.forward_once(x, profile, visualize) # single-scale inference, train
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/yolo.py", line 155, in forward_once
x = m(x) # run
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/common.py", line 137, in forward
return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/common.py", line 103, in forward
return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/datafountain/yolov5/yolov5-new/models/common.py", line 45, in forward
return self.act(self.bn(self.conv(x)))
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 113, in forward
self.num_batches_tracked = self.num_batches_tracked + 1 # type: ignore
RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa2a2d962f2 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fa2a2d9367b in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7fa2a2fee1f9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fa2a2d7e3a4 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7fa316ec0ac9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7fa316eb5a8a in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fa316edcd22 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fa316818df6 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa2201f (0x7fa316ee001f in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369f00 (0x7fa316827f00 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b16e (0x7fa31682916e in /home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0xfa96c (0x560dc73e296c in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #12: <unknown function> + 0x18f2f5 (0x560dc74772f5 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #13: <unknown function> + 0xfaef8 (0x560dc73e2ef8 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #14: <unknown function> + 0xfd538 (0x560dc73e5538 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #15: <unknown function> + 0xfd5d9 (0x560dc73e55d9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #16: <unknown function> + 0xfd5d9 (0x560dc73e55d9 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #17: PyDict_SetItemString + 0x401 (0x560dc74893d1 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #18: PyImport_Cleanup + 0xa4 (0x560dc75574e4 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #19: Py_FinalizeEx + 0x7a (0x560dc7557a9a in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #20: Py_RunMain + 0x1b8 (0x560dc755c5c8 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #21: Py_BytesMain + 0x39 (0x560dc755c939 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)
frame #22: __libc_start_main + 0xf3 (0x7fa31e2ce0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: <unknown function> + 0x1e8f39 (0x560dc74d0f39 in /home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3)

Killing subprocess 160871
Killing subprocess 160872
Traceback (most recent call last):
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/zgj/anaconda3/envs/torch1.8.0py3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zgj/anaconda3/envs/torch1.8.0py3.8/bin/python3', '-u', 'train.py', '--local_rank=1', '--batch', '32', '--data', 'coco.yaml', '--weights', 'yolov5x.pt', '--device', '0,1', '--imgsz', '560', '--cfg', 'yolov5x.yaml']' died with <Signals.SIGABRT: 6>.

How can I solve this problem?

coallar added the bug label on Sep 18, 2021
glenn-jocher (Member) commented Sep 18, 2021

@coallar your command seems fine, though --cfg yolov5x.yaml is redundant with your --weights. For best Multi-GPU performance we always recommend training DDP inside our Docker Image.
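For example, dropping the redundant flag gives (a sketch of the same run; all other arguments are unchanged from the report above):

python3 -m torch.distributed.launch --nproc_per_node 2 train.py --batch 32 --data coco.yaml --weights yolov5x.pt --device 0,1 --imgsz 560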

Environments

YOLOv5 may be run in any of the Ultralytics up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), for example the official Docker image.
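A minimal sketch of the Docker route mentioned above, assuming the ultralytics/yolov5 image on Docker Hub and a host with the NVIDIA container toolkit installed:

sudo docker pull ultralytics/yolov5:latest
sudo docker run --ipc=host --gpus all -it ultralytics/yolov5:latest

Once inside the container, the same torch.distributed.launch command can be run unchanged.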

Status

If the CI CPU testing badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

github-actions bot commented Oct 19, 2021

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
