
RuntimeError: CUDA error: unspecified launch failure #1752

Closed
jerryWTMH opened this issue Dec 22, 2020 · 4 comments
Labels: question (Further information is requested), Stale

Comments


jerryWTMH commented Dec 22, 2020

❔Question

I trained my data in DDP mode, but every run fails before reaching 30 epochs with RuntimeError: CUDA error: unspecified launch failure. I have tried training on a single GPU and even re-downloaded the yolov5 repository, but the problem persists.

Here is the information about my training:
training data: 35000 images
validation data: 9100 images
Python: 3.8.3
torch: 1.7.1
CUDA: 10.2

Additional context

The full error output:

Traceback (most recent call last):
  File "yolov5/train.py", line 511, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "yolov5/train.py", line 336, in train
    results, maps, times = test.test(opt.data,
  File "/root/ultrasound_project/Testing/yolov5/test.py", line 110, in test
    t0 += time_synchronized() - t
  File "/root/ultrasound_project/Testing/yolov5/utils/torch_utils.py", line 74, in time_synchronized
    torch.cuda.synchronize() if torch.cuda.is_available() else None
  File "/root/miniconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 380, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: unspecified launch failure
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f758b5708b2 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f758b7c2952 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f758b55bb7d in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x312 (0x7f75da243842 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10d::Reducer::~Reducer() + 0x342 (0x7f75da242122 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f75da215fd2 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f75d9c0c926 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x8c1d7f (0x7f75da217d7f in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2c2b90 (0x7f75d9c18b90 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x2c3cfe (0x7f75d9c19cfe in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1285b5 (0x56179230d5b5 in /root/miniconda3/bin/python)
frame #11: <unknown function> + 0x1c0d74 (0x5617923a5d74 in /root/miniconda3/bin/python)
frame #12: <unknown function> + 0x11c5a6 (0x5617923015a6 in /root/miniconda3/bin/python)
frame #13: <unknown function> + 0x128ac6 (0x56179230dac6 in /root/miniconda3/bin/python)
frame #14: <unknown function> + 0x128a7c (0x56179230da7c in /root/miniconda3/bin/python)
frame #15: PyDict_SetItem + 0x2ac (0x56179235635c in /root/miniconda3/bin/python)
frame #16: PyDict_SetItemString + 0x4f (0x561792356a4f in /root/miniconda3/bin/python)
frame #17: PyImport_Cleanup + 0x9b (0x561792430dfb in /root/miniconda3/bin/python)
frame #18: Py_FinalizeEx + 0x83 (0x5617924311b3 in /root/miniconda3/bin/python)
frame #19: Py_RunMain + 0x110 (0x5617924336a0 in /root/miniconda3/bin/python)
frame #20: Py_BytesMain + 0x39 (0x561792433829 in /root/miniconda3/bin/python)
frame #21: __libc_start_main + 0xe7 (0x7f75e1691b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x1deb33 (0x5617923c3b33 in /root/miniconda3/bin/python)
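
For context, the helper where the error is raised (utils/torch_utils.py, line 74 in the traceback above) is just a timing wrapper around torch.cuda.synchronize(). A rough sketch of it, reconstructed from the traceback, with comments on why the failure surfaces there:

import time
import torch

def time_synchronized():
    # CUDA kernel launches are asynchronous, so a kernel that failed earlier in the
    # forward pass is only reported when the stream is flushed. synchronize() does
    # exactly that, which is why the "unspecified launch failure" is raised here
    # even though it most likely originated in an earlier op.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()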
jerryWTMH added the question label on Dec 22, 2020
jerryWTMH (Author) commented

Here is the error for single GPU training:

File "yolov5/train.py", line 511, in <module>
train(hyp, opt, device, tb_writer, wandb)
File "yolov5/train.py", line 336, in train
results, maps, times = test.test(opt.data,
File "/root/ultrasound_project/Testing/yolov5/test.py", line 110, in test
t0 += time_synchronized() - t
File "/root/ultrasound_project/Testing/yolov5/utils/torch_utils.py", line 74, in time_synchronized
torch.cuda.synchronize() if torch.cuda.is_available() else None
File "/root/miniconda3/envs/test/lib/python3.8/site-packages/torch/cuda/__init__.py", line 380, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: unspecified launch failure

I think the problem comes from the torch.cuda.synchronize() call.
Has anybody run into the same problem?
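
Since the synchronize() call is usually just the point where an asynchronous launch failure gets reported, one general CUDA debugging step (not specific to yolov5) is to make kernel launches synchronous, so the traceback points at the op that actually failed:

# Set this before any CUDA work, e.g. at the very top of train.py,
# or export CUDA_LAUNCH_BLOCKING=1 in the shell before launching training.
# With launch blocking enabled, every kernel launch waits for completion,
# so the error is raised at the failing op instead of at the next
# torch.cuda.synchronize(). Training runs noticeably slower; use it only to debug.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import after setting the variable so CUDA picks it up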

github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


cbiras commented Apr 21, 2021

Hello @jerryWTMH! Did you find a solution to this error?

pederismo commented

Hi, I also encounter this issue randomly when trying to run experiments on GPUs that already have processes running on them, but in my case it happens here:

Traceback (most recent call last):
  File "../main_swav.py", line 405, in <module>
    main()
  File "../main_swav.py", line 186, in main
    process_group = apex.parallel.create_syncbn_process_group(args.syncbn_process_group_size)
  File "/home2/michele.guerra/VirtualEnvironments/semi-swav/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/parallel/__init__.py", line 89, in create_syncbn_process_group
    cur_group = torch.distributed.new_group(ranks=group_ids)
  File "/home2/michele.guerra/VirtualEnvironments/semi-swav/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2048, in new_group
    barrier()
  File "/home2/michele.guerra/VirtualEnvironments/semi-swav/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1967, in barrier
    work.wait()
RuntimeError: CUDA error: unspecified launch failure
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: unspecified launch failure
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687

I think the problem comes from the fact that we are all using CUDA 10.x, and I don't think they will come back to this issue because it is outdated. We should probably update CUDA to a more recent version.
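
Before upgrading, it may be worth confirming exactly which CUDA runtime, cuDNN version, and GPU each failing machine is actually using (the installed driver version shows up in nvidia-smi). A quick check with standard PyTorch calls:

import torch

# CUDA runtime that PyTorch was built against (e.g. '10.2') and cuDNN version
print("torch:", torch.__version__, "| CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

# Per-GPU details; the installed NVIDIA driver must be new enough for the runtime above
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB, "
          f"compute capability {props.major}.{props.minor}")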
