Train Bug #43

Open
wsy-yjys opened this issue Dec 19, 2023 · 6 comments

@wsy-yjys

I am already using PyTorch 1.8.0 but still hit this bug during training. Could you help me? Thank you.

/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [301,0,0], thread: [60,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [301,0,0], thread: [61,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [301,0,0], thread: [62,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [301,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
20231218_233328 edgeyolo.train.loss:371 - error msg: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa857cfe2f2 in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fa857cfb67b in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7fa857f561f9 in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fa857ce63a4 in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e44ca (0x7fa8cbb0a4ca in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6e4561 (0x7fa8cbb0a561 in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #6: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x509306]
frame #7: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4f0360]
frame #8: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4f0427]
frame #9: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4f0427]
frame #10: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4f0427]
frame #11: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x5023c9]
frame #12: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x502019]
frame #13: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x501fdd]
frame #14: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4df468]
frame #15: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x5c8443]
frame #16: _PyEval_EvalFrameDefault + 0x4b37 (0x4ec567 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #17: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4e6b2a]
frame #18: _PyFunction_Vectorcall + 0xd4 (0x4f7e54 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x685 (0x4e80b5 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #20: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4f8123]
frame #21: _PyEval_EvalFrameDefault + 0x3c7 (0x4e7df7 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #22: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4e6b2a]
frame #23: _PyFunction_Vectorcall + 0xd4 (0x4f7e54 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1231 (0x4e8c61 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #25: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4e6b2a]
frame #26: _PyEval_EvalCodeWithName + 0x47 (0x4e67b7 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #27: PyEval_EvalCodeEx + 0x39 (0x4e6769 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #28: PyEval_EvalCode + 0x1b (0x59466b in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #29: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x5c1dc7]
frame #30: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x5bddd0]
frame #31: PyRun_StringFlags + 0x9b (0x5b59eb in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #32: PyRun_SimpleStringFlags + 0x3b (0x5b56cb in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #33: Py_RunMain + 0x25c (0x5b4f0c in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #34: Py_BytesMain + 0x39 (0x588719 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #35: __libc_start_main + 0xe7 (0x7fa8ce156c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x5885ce]

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f464b8172f2 in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f464b81467b in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f464ba6f1f9 in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f464b7ff3a4 in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e44ca (0x7f46bf6234ca in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6e4561 (0x7f46bf623561 in /home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #6: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x509306]
frame #7: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4f0360]
frame #8: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4f0427]
frame #9: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4f0427]
frame #10: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4f0427]
frame #11: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x5023c9]
frame #12: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x502019]
frame #13: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x501fdd]
frame #14: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4df468]
frame #15: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x5c8443]
frame #16: _PyEval_EvalFrameDefault + 0x4b37 (0x4ec567 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #17: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4e6b2a]
frame #18: _PyFunction_Vectorcall + 0xd4 (0x4f7e54 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x685 (0x4e80b5 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #20: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4f8123]
frame #21: _PyEval_EvalFrameDefault + 0x3c7 (0x4e7df7 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #22: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4e6b2a]
frame #23: _PyFunction_Vectorcall + 0xd4 (0x4f7e54 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1231 (0x4e8c61 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #25: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x4e6b2a]
frame #26: _PyEval_EvalCodeWithName + 0x47 (0x4e67b7 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #27: PyEval_EvalCodeEx + 0x39 (0x4e6769 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #28: PyEval_EvalCode + 0x1b (0x59466b in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #29: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x5c1dc7]
frame #30: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x5bddd0]
frame #31: PyRun_StringFlags + 0x9b (0x5b59eb in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #32: PyRun_SimpleStringFlags + 0x3b (0x5b56cb in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #33: Py_RunMain + 0x25c (0x5b4f0c in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #34: Py_BytesMain + 0x39 (0x588719 in /home/wsy/anaconda3/envs/pytorch1.8/bin/python)
frame #35: __libc_start_main + 0xe7 (0x7f46c1c6fc87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: /home/wsy/anaconda3/envs/pytorch1.8/bin/python() [0x5885ce]

Traceback (most recent call last):
  File "/home/wsy/paper/Edgeyolo-231206/train.py", line 16, in <module>
    train("DEFAULT" if args.default else args.cfg)
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/launch_train.py", line 101, in launch
    mp.start_processes(
  File "/home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/loss.py", line 355, in get_losses
    ) = self.get_assignments(  # noqa
  File "/home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/loss.py", line 553, in get_assignments
    ) = self.dynamic_k_matching(cost, pair_wise_ious, gt_classes, num_gt, fg_mask)
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/loss.py", line 668, in dynamic_k_matching
    cost[gt_idx], k=dynamic_ks[gt_idx].item(), largest=False
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/launch_train.py", line 73, in train_single
    trainer.train()
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/trainer.py", line 499, in train
    train_one_epoch()
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/trainer.py", line 485, in train_one_epoch
    train_one_iter()
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/trainer.py", line 460, in train_one_iter
    train_in_iter()
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/trainer.py", line 410, in train_in_iter
    outputs = self.loss(outputs, (targets, mask_edge))
  File "/home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/loss.py", line 241, in forward
    loss, bbox_loss, confidence_loss, class_loss, l1_loss, num_fg = self.get_losses(
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/loss.py", line 389, in get_losses
    ) = self.get_assignments(  # noqa
  File "/home/wsy/anaconda3/envs/pytorch1.8/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wsy/paper/Edgeyolo-231206/edgeyolo/train/loss.py", line 492, in get_assignments
    gt_bboxes_per_image = gt_bboxes_per_image.cpu().float()
RuntimeError: CUDA error: device-side assert triggered
@wsy-yjys
Author

These are my torch and torchvision versions:

torch                   1.8.0+cu111
torchvision             0.9.0+cu111

@LSH9832
Owner

LSH9832 commented Dec 19, 2023

@LSH9832
Owner

LSH9832 commented Dec 19, 2023

I'll try to fix this later. In the meantime, you can try fixing the code yourself as described above.
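
For anyone debugging this in the meantime: the `input_val >= zero && input_val <= one` assertion in `Loss.cu` is raised by the binary cross-entropy CUDA kernel when an input value lies outside [0, 1] or is NaN. Because CUDA kernels run asynchronously, the Python lines shown in the traceback (`topk`, `.cpu()`) are probably only where the error surfaced, not where it started. The following is a minimal sketch, not the maintainer's fix: it forces synchronous kernel launches and scans the values that would be passed to the loss. The helper name `bce_input_problems` is hypothetical and not part of EdgeYOLO.

```python
import math
import os

# Assumption: this runs before the first CUDA call in the process.
# With synchronous launches, the Python traceback points at the
# operation that actually triggered the device-side assert.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def bce_input_problems(values):
    """Return (index, value) pairs that binary cross-entropy would reject.

    `values` is any iterable of Python floats, for example the result of
    tensor.detach().float().flatten().tolist() on the predictions.
    A value is rejected if it is NaN or lies outside [0, 1].
    """
    bad = []
    for index, value in enumerate(values):
        if math.isnan(value) or value < 0.0 or value > 1.0:
            bad.append((index, value))
    return bad


if __name__ == "__main__":
    # Hypothetical sample values, only to show the shape of the result.
    sample = [0.2, 1.5, float("nan"), -0.1, 1.0]
    for index, value in bce_input_problems(sample):
        print(f"invalid BCE input at flat index {index}: {value}")
```

Called as `bce_input_problems(pred.detach().float().flatten().tolist())` just before the binary cross-entropy call, where `pred` stands for whatever tensor the loss receives, a non-empty result would point to NaN or unclamped predictions. Converting a tensor to a list is slow, so this is meant for debugging only.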

@wsy-yjys
Author

ok, thank you~

@wsy-yjys
Author

RuntimeError: CUDA error: no kernel image is available for execution on the device 

Hi, I've run into a new problem. What does this error mean?
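
This error generally indicates that the installed PyTorch binary contains no kernels compiled for the GPU's compute capability. A rough way to check is sketched below, assuming `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` are available in the installed version. The helper names `arch_is_supported` and `check_current_device` are hypothetical, and the matching rule is a simplification of CUDA's compatibility rules (same-major `sm_` binaries, or older `compute_` PTX).

```python
def arch_is_supported(capability, arch_list):
    """Rough check whether a build with `arch_list` can run on `capability`.

    capability: (major, minor) tuple, e.g. (8, 6)
    arch_list:  strings such as "sm_70" or "compute_70"
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, version = arch.partition("_")
        if not version.isdigit():
            continue  # skip entries this simple parser does not understand
        arch_major, arch_minor = int(version[:-1]), int(version[-1])
        # A compiled binary (sm_XY) runs on the same major version
        # with an equal or higher minor version.
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        # PTX (compute_XY) can be JIT-compiled for any newer device.
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


def check_current_device(device_index=0):
    """Print and return whether the local GPU is covered by this PyTorch build."""
    import torch  # imported lazily so the helper above works without torch

    if not torch.cuda.is_available():
        print("CUDA is not available in this environment")
        return None
    capability = torch.cuda.get_device_capability(device_index)
    arch_list = torch.cuda.get_arch_list()
    supported = arch_is_supported(capability, arch_list)
    print(f"device capability: {capability}, build arch list: {arch_list}")
    print("supported" if supported else "not supported by this build")
    return supported
```

If `check_current_device()` reports that the device is not supported, installing a PyTorch build that targets the GPU's architecture (a different CUDA variant or a newer release) is the usual remedy.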

@wsy-yjys
Author

/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [301,0,0], thread: [60,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [301,0,0], thread: [61,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [301,0,0], thread: [62,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [301,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
20231218_233328 edgeyolo.train.loss:371 - error msg: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):

Hello, has this training bug been fixed yet?
