
RuntimeError: CUDA error: unspecified launch failure #1752

Closed
jerryWTMH opened this issue Dec 22, 2020 · 4 comments
Labels: question (Further information is requested), Stale

Comments


jerryWTMH commented Dec 22, 2020

❔Question

I trained my data in DDP mode, but every run fails before reaching 30 epochs with RuntimeError: CUDA error: unspecified launch failure. I have tried training on a single GPU and even re-downloaded the yolov5 repository, but the problem persists.

Here is the information about my training:
training data: 35000 images
validation data: 9100 images
Python: 3.8.3
torch: 1.7.1
CUDA: 10.2

Additional context

The full error output:

Traceback (most recent call last):
  File "yolov5/train.py", line 511, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "yolov5/train.py", line 336, in train
    results, maps, times = test.test(opt.data,
  File "/root/ultrasound_project/Testing/yolov5/test.py", line 110, in test
    t0 += time_synchronized() - t
  File "/root/ultrasound_project/Testing/yolov5/utils/torch_utils.py", line 74, in time_synchronized
    torch.cuda.synchronize() if torch.cuda.is_available() else None
  File "/root/miniconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 380, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: unspecified launch failure
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f758b5708b2 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f758b7c2952 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f758b55bb7d in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x312 (0x7f75da243842 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10d::Reducer::~Reducer() + 0x342 (0x7f75da242122 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f75da215fd2 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f75d9c0c926 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x8c1d7f (0x7f75da217d7f in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2c2b90 (0x7f75d9c18b90 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x2c3cfe (0x7f75d9c19cfe in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1285b5 (0x56179230d5b5 in /root/miniconda3/bin/python)
frame #11: <unknown function> + 0x1c0d74 (0x5617923a5d74 in /root/miniconda3/bin/python)
frame #12: <unknown function> + 0x11c5a6 (0x5617923015a6 in /root/miniconda3/bin/python)
frame #13: <unknown function> + 0x128ac6 (0x56179230dac6 in /root/miniconda3/bin/python)
frame #14: <unknown function> + 0x128a7c (0x56179230da7c in /root/miniconda3/bin/python)
frame #15: PyDict_SetItem + 0x2ac (0x56179235635c in /root/miniconda3/bin/python)
frame #16: PyDict_SetItemString + 0x4f (0x561792356a4f in /root/miniconda3/bin/python)
frame #17: PyImport_Cleanup + 0x9b (0x561792430dfb in /root/miniconda3/bin/python)
frame #18: Py_FinalizeEx + 0x83 (0x5617924311b3 in /root/miniconda3/bin/python)
frame #19: Py_RunMain + 0x110 (0x5617924336a0 in /root/miniconda3/bin/python)
frame #20: Py_BytesMain + 0x39 (0x561792433829 in /root/miniconda3/bin/python)
frame #21: __libc_start_main + 0xe7 (0x7f75e1691b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x1deb33 (0x5617923c3b33 in /root/miniconda3/bin/python)
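
For context, the helper where the error is raised (utils/torch_utils.py, line 74 in the traceback above) is just a timing wrapper around torch.cuda.synchronize(). A rough sketch of it, reconstructed from the traceback, with comments on why the failure surfaces there:

import time
import torch

def time_synchronized():
    # CUDA kernel launches are asynchronous, so a kernel that failed earlier in the
    # forward pass is only reported when the stream is flushed. synchronize() does
    # exactly that, which is why the "unspecified launch failure" is raised here
    # even though it most likely originated in an earlier op.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()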
jerryWTMH added the question label on Dec 22, 2020
jerryWTMH (Author) commented

Here is the error for single GPU training:

File "yolov5/train.py", line 511, in <module>
train(hyp, opt, device, tb_writer, wandb)
File "yolov5/train.py", line 336, in train
results, maps, times = test.test(opt.data,
File "/root/ultrasound_project/Testing/yolov5/test.py", line 110, in test
t0 += time_synchronized() - t
File "/root/ultrasound_project/Testing/yolov5/utils/torch_utils.py", line 74, in time_synchronized
torch.cuda.synchronize() if torch.cuda.is_available() else None
File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/cuda/__init__.py", line 380, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: unspecified launch failure

I think the problem comes from the torch.cuda.synchronize() call.
Has anybody run into the same problem?
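
Since the synchronize() call is usually just the point where an asynchronous launch failure gets reported, one general CUDA debugging step (not specific to yolov5) is to make kernel launches synchronous, so the traceback points at the op that actually failed:

# Set this before any CUDA work, e.g. at the very top of train.py,
# or export CUDA_LAUNCH_BLOCKING=1 in the shell before launching training.
# With launch blocking enabled, every kernel launch waits for completion,
# so the error is raised at the failing op instead of at the next
# torch.cuda.synchronize(). Training runs noticeably slower; use it only to debug.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import after setting the variable so CUDA picks it up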

github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


cbiras commented Apr 21, 2021

Hello @jerryWTMH! Did you find a solution to this error?

pederismo commented

Hi, I also encounter this issue randomly when trying to run experiments on GPUs that already have processes running on them, but in my case it happens here:

Traceback (most recent call last):
  File "../main_swav.py", line 405, in <module>
    main()
  File "../main_swav.py", line 186, in main
    process_group = apex.parallel.create_syncbn_process_group(args.syncbn_process_group_size)
  File "/home2/michele.guerra/VirtualEnvironments/semi-swav/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/parallel/__init__.py", line 89, in create_syncbn_process_group
    cur_group = torch.distributed.new_group(ranks=group_ids)
  File "/home2/michele.guerra/VirtualEnvironments/semi-swav/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2048, in new_group
    barrier()
  File "/home2/michele.guerra/VirtualEnvironments/semi-swav/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1967, in barrier
    work.wait()
RuntimeError: CUDA error: unspecified launch failure
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687

I think the problem comes from the fact that we are all using CUDA 10.x, and I don't think they will come back to this issue because it is outdated. We should probably update CUDA to a more recent version.
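
Before upgrading, it may be worth confirming exactly which CUDA runtime, cuDNN version, and GPU each failing machine is actually using (the installed driver version shows up in nvidia-smi). A quick check with standard PyTorch calls:

import torch

# CUDA runtime that PyTorch was built against (e.g. '10.2') and cuDNN version
print("torch:", torch.__version__, "| CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

# Per-GPU details; the installed NVIDIA driver must be new enough for the runtime above
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB, "
          f"compute capability {props.major}.{props.minor}")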
