
RuntimeError: reduce failed to synchronize: device-side assert triggered #263

Closed
YourGc opened this issue May 6, 2019 · 9 comments

Comments


YourGc commented May 6, 2019

Sorry to bother you, but I am facing this issue while training my custom dataset.
CUDA: 10.0
I am using the latest version. Here is my error information:

Model Summary: 222 layers, 6.1626e+07 parameters, 6.1626e+07 gradients

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   0/272      0/1266      7.04      2.11       302      17.3       328        16      2.03
   0/272      1/1266      7.24      2.27       302      17.5       329        16     0.713
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "train.py", line 313, in <module>
    multi_scale=opt.multi_scale,
  File "train.py", line 190, in train
    loss, loss_items = compute_loss(pred, targets, model)
  File "/home/star/Wayne/transportation/ygc/yolov3/utils/utils.py", line 278, in compute_loss
    lxy += (k * h['xy']) * MSE(torch.sigmoid(pi[..., 0:2]), txy[i])  # xy loss
  File "/home/star/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/star/anaconda3/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 443, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "/home/star/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2257, in mse_loss
    ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: reduce failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f2ff22a8441 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f2ff22a7d7a in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x13652 (0x7f2ff01e5652 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x50 (0x7f2ff2298ce0 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x30facb (0x7f2ff0b81acb in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #5: <unknown function> + 0x1423ab (0x7f30315b93ab in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6c0a41 (0x7f3031b37a41 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f3031b37b82 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0xa2 (0x7f3031595d82 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x6b598b (0x7f3031b2c98b in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x12fe67 (0x7f30315a6e67 in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x1300be (0x7f30315a70be in /home/star/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xf0 (0x7f30411be830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

I really need your help, thanks.


YourGc commented May 6, 2019

I train my dataset with the following command:
python train.py --cfg cfg/yolov3.cfg --data data/coco.data --multi-scale

I have modified yolov3.cfg and coco.data.


YourGc commented May 6, 2019

I'm very confused: when I run python train.py without specifying a cfg file (coco.data is the default, and I have modified it), training works, but as soon as I only modify the classes in the cfg file, I hit this issue.


glenn-jocher commented May 6, 2019

Hello, thank you for your interest in our work! It sounds like you have incorrectly configured your cfg file. Please note that most technical problems are due to:

  • Your changes to the default repository. If your issue is not reproducible in a fresh git clone version of this repository we can not debug it. Before going further run this code and ensure your issue persists:
sudo rm -rf yolov3  # remove existing repo
git clone https://github.com/ultralytics/yolov3 && cd yolov3 # git clone latest
python3 detect.py  # verify detection
python3 train.py  # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE
  • Your custom data. If your issue is not reproducible with COCO data we can not debug it. Visit our Custom Training Tutorial for exact details on how to format your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data, and see the label-range sketch at the end of this comment.
  • Your environment. If your issue is not reproducible in a GCP Quickstart Guide VM we can not debug it. Ensure you meet the requirements specified in the README: Unix, MacOS, or Windows with Python >= 3.7, Pytorch >= 1.0, etc.

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
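
As a quick additional check, here is a minimal standalone sketch (not part of this repo) that scans darknet-style label files, assuming the usual one "class x_center y_center width height" line per object; NUM_CLASSES and LABEL_DIR are placeholders you would adapt to your own .data/.cfg setup. It reports any class index that falls outside [0, classes):

import glob

NUM_CLASSES = 80           # assumption: set to the 'classes' value from your .cfg/.data
LABEL_DIR = 'data/labels'  # assumption: directory holding your darknet-format *.txt labels

bad = []
for path in glob.glob(LABEL_DIR + '/**/*.txt', recursive=True):
    with open(path) as f:
        for line_num, line in enumerate(f, 1):
            parts = line.split()
            if not parts:
                continue
            cls = int(float(parts[0]))  # first column is the class index
            if not 0 <= cls < NUM_CLASSES:
                bad.append((path, line_num, cls))

print('out-of-range labels:', bad[:20])

Any hit here would trip the `t >= 0 && t < n_classes` assert shown in the traceback above.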


ibatra commented May 21, 2019

I'm facing a similar issue, but when computing mAP.

glenn-jocher commented May 21, 2019

This seems to be a PyTorch error related to out-of-bound values passed to loss functions, e.g. a value outside the 0-1 range passed to BCE:
pytorch/pytorch#5560
pytorch/pytorch#14519

The same error also appears multiple times in this repository. I will reopen this issue since it does not seem to be resolved.
#139
#157
#166
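
For reference, a minimal standalone sketch (not code from this repo) that reproduces the same device-side assert: CrossEntropyLoss goes through the ClassNLLCriterion kernel named in the traceback, and a target index outside [0, n_classes) trips its assert. Because CUDA kernels run asynchronously, the Python exception often surfaces at a later, unrelated op (here the MSE xy loss); running with CUDA_LAUNCH_BLOCKING=1 makes the error point at the real call.

import torch
import torch.nn as nn

logits = torch.randn(4, 3, device='cuda')            # 4 samples, 3 classes
targets = torch.tensor([0, 1, 2, 5], device='cuda')  # 5 is out of range for 3 classes

loss = nn.CrossEntropyLoss()(logits, targets)  # launches the kernel that asserts
print(loss.item())                             # synchronizing here surfaces "Assertion `t >= 0 && t < n_classes` failed"

On CPU the same code fails immediately with a readable error message naming the bad target, which is another quick way to localize an offending label.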

glenn-jocher reopened this May 21, 2019

glenn-jocher commented May 21, 2019

The two PyTorch issues seem to stem from BCELoss inputs falling outside of the required 0-1 range; however, this repo does not use BCELoss, so I am quite mystified (it uses BCEWithLogitsLoss, which is not input-constrained). I've also never encountered this error myself.
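
To illustrate the distinction with a standalone sketch (not code from this repo): BCELoss expects probabilities already in [0, 1] and asserts on CUDA when given anything else, while BCEWithLogitsLoss applies the sigmoid internally and accepts raw, unbounded logits.

import torch
import torch.nn as nn

logits = torch.tensor([[2.5, -3.0, 0.1]], device='cuda')  # raw scores, not in [0, 1]
targets = torch.tensor([[1.0, 0.0, 1.0]], device='cuda')

ok = nn.BCEWithLogitsLoss()(logits, targets)             # fine: sigmoid applied internally
also_ok = nn.BCELoss()(torch.sigmoid(logits), targets)   # fine: inputs squashed into [0, 1]
# bad = nn.BCELoss()(logits, targets)                    # would assert: inputs outside [0, 1]

Note that the assert in the original traceback comes from ClassNLLCriterion (cross-entropy), which suggests an out-of-range class index rather than a BCE input-range problem.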


lzl4525 commented May 30, 2019

I face the same issue when computing mAP.


glenn-jocher commented May 31, 2019 via email


zhangming8 commented Jun 3, 2019 via email
