After interrupting training, load weights/last.pt to continue training #1368

Closed
winnerCR7 opened this issue Jul 3, 2020 · 3 comments
Labels: bug (Something isn't working), Stale

Comments

@winnerCR7

I stopped training after 95 epochs. To continue, I ran python train.py --batch-size 16 --img-size 416 --weights weights/last.pt --data data/bdd100k/bdd100k.data --cfg cfg/yolov3-spp-bdd100k.cfg in the terminal, and after one epoch on the training set it raised the error below. However, I can continue training normally in VS Code with the same arguments. Why is this?

BTW, after resuming training, how should the TensorBoard training record be restored? When I open the URL it still shows only the records from before the interruption; it does not seem to be updating.

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='cfg/yolov3-spp-bdd100k.cfg', data='data/bdd100k/bdd100k.data', device='', epochs=300, evolve=False, freeze_layers=False, img_size=[416], multi_scale=False, name='', nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='weights/last.pt')
Using CUDA Apex device0 _CudaDeviceProperties(name='GeForce RTX 2070', total_memory=7979MB)
device1 _CudaDeviceProperties(name='GeForce GTX 1060 6GB', total_memory=6078MB)

Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
Model Summary: 225 layers, 6.26218e+07 parameters, 6.26218e+07 gradients
Optimizer groups: 76 .bias, 76 Conv2d.weight, 73 other
Caching labels data/bdd100k/labels/train.npy (69863 found, 0 missing, 0 empty, 1 duplicate, for 69863 images): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 69863/69863 [00:02<00:00, 26837.25it/s]
Caching labels data/bdd100k/labels/val.npy (10000 found, 0 missing, 0 empty, 0 duplicate, for 10000 images): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 26744.37it/s]
Image sizes 416 - 416 train, 416 test
Using 8 dataloader workers
Starting training for 300 epochs...

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
96/299     4.62G      1.88      1.16     0.383      3.42       329       416: 100%|█████████▉| 4366/4367 [23:23<00:00,  3.17it/s]
Traceback (most recent call last):

File "train.py", line 497, in
train(hyp) # train normally
File "train.py", line 322, in train
scaled_loss.backward()
File "/home/cr7/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/cr7/anaconda3/lib/python3.7/site-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.56 GiB (GPU 0; 7.79 GiB total capacity; 2.27 GiB already allocated; 1.57 GiB free; 4.26 GiB reserved in total by PyTorch) (malloc at /opt/conda/conda-bld/pytorch_1591914880026/work/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f1a3129ab5e in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1f39d (0x7f1a314e639d in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_alloc(unsigned long) + 0x5b (0x7f1a314e098b in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xd767c6 (0x7f1a3246a7c6 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd7af6d (0x7f1a3246ef6d in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xd6dc9a (0x7f1a32461c9a in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd6f07f (0x7f1a3246307f in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd72cd0 (0x7f1a32466cd0 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution_backward_weight(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0x49 (0x7f1a32466f29 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0xdd9880 (0x7f1a324cd880 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xe1daf8 (0x7f1a32511af8 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #11: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x2fc (0x7f1a32467bdc in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #12: <unknown function> + 0xdd958b (0x7f1a324cd58b in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #13: <unknown function> + 0xe1db54 (0x7f1a32511b54 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #14: <unknown function> + 0x29dee26 (0x7f1a5b288e26 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x2a2e634 (0x7f1a5b2d8634 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x378 (0x7f1a5aea0ff8 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x2ae7df5 (0x7f1a5b391df5 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f1a5b38f0f3 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f1a5b38fed2 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f1a5b388549 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f1a5e8d8638 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #22: <unknown function> + 0xc819d (0x7f1a613f919d in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #23: <unknown function> + 0x76db (0x7f1a7a1d86db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #24: clone + 0x3f (0x7f1a79f0188f in /lib/x86_64-linux-gnu/libc.so.6)

winnerCR7 added the bug (Something isn't working) label Jul 3, 2020

github-actions bot commented Jul 3, 2020

Hello @winnerCR7, thank you for your interest in our work! Ultralytics has open-sourced YOLOv5 at https://github.com/ultralytics/yolov5, featuring faster, lighter and more accurate object detection. YOLOv5 is recommended for all new projects.

To continue with this repo, please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:

  • Cloud-based AI systems operating on hundreds of HD video streams in realtime.
  • Edge AI integrated into custom iOS and Android apps for realtime 30 FPS video inference.
  • Custom data training, hyperparameter evolution, and model exportation to any destination.

For more information please visit https://www.ultralytics.com.


github-actions bot commented Aug 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label Aug 3, 2020
github-actions bot closed this as completed Aug 8, 2020
@glenn-jocher
Member

@winnerCR7 it looks like you're hitting a CUDA out-of-memory error when resuming training after stopping at epoch 95. This can often be resolved by reducing the batch size or image size for the resumed run. Since the same arguments work when you launch from VS Code, it is also worth checking whether another process (or a leftover run) is still holding GPU memory when you start from the terminal.
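
For example, resuming with a smaller batch size could look like the following (batch size 8 is just an illustrative value; keep your own data and cfg paths):

python train.py --batch-size 8 --img-size 416 --weights weights/last.pt --data data/bdd100k/bdd100k.data --cfg cfg/yolov3-spp-bdd100k.cfg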

As for the TensorBoard records after resuming, point TensorBoard at the same log directory as the original run and it will pick up the new event files as training writes them; TensorBoard reloads data periodically, and you can adjust --reload_interval if it seems slow to refresh.
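
For example, assuming the resumed run writes under the same runs/ directory shown in your log (the 30-second reload interval is just an illustrative value):

tensorboard --logdir=runs --reload_interval 30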

Feel free to adjust the batch size or image size as needed to prevent the CUDA out of memory issue, and best of luck with your continued training. The YOLO community and the Ultralytics team are here to support you throughout the process.
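
As a quick sanity check before resuming, you can also watch GPU memory with nvidia-smi while training starts; the query below uses standard nvidia-smi options and refreshes every 2 seconds:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 2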
