
How to save weights & resume training? #1117

Closed
whoafridi opened this issue Oct 11, 2020 · 15 comments
Labels
question Further information is requested

Comments

@whoafridi

❔Question

The question is very straightforward: how do I save weights to Google Drive and then resume training from previously trained weights (as with the earlier YOLOv3/v4)?
Is this possible with YOLOv5?

Additional context

I can't find any clue. Please point me to resources; it would be very helpful.

@whoafridi added the question label Oct 11, 2020
@github-actions
Contributor

github-actions bot commented Oct 11, 2020

Hello @whoafridi, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook (Open in Colab), Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:

  • Cloud-based AI systems operating on hundreds of HD video streams in realtime.
  • Edge AI integrated into custom iOS and Android apps for realtime 30 FPS video inference.
  • Custom data training, hyperparameter evolution, and model exportation to any destination.

For more information please visit https://www.ultralytics.com.

@glenn-jocher
Member

@whoafridi when you start training with any command your experiment is saved in yolov5/runs/exp.... If your training is interrupted for any reason, the following command will resume your partially completed training from the most recently updated experiment:

python train.py --resume

or from a specific experiment:

python train.py --resume runs/exp17/weights/last.pt

@whoafridi
Author

Great. But I don't have a GPU, so is there any option to save the weights to Google Drive instead of that particular folder?
It would be very helpful. Thank you @glenn-jocher

@glenn-jocher
Member

Your hardware is irrelevant for logging. See the train.py argparser for logging to arbitrary destinations:

yolov5/train.py, line 403 (commit 10c85bf):

parser.add_argument('--logdir', type=str, default='runs/', help='logging directory')
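
For example (a minimal Colab sketch; the Drive path and dataset arguments are illustrative, and newer YOLOv5 versions replace --logdir with --project/--name):

# Colab cell: mount Google Drive so checkpoints survive the runtime being recycled
from google.colab import drive
drive.mount('/content/drive')

# log the whole run (including weights/) straight to Drive -- paths here are illustrative
!python train.py --data coco128.yaml --weights yolov5s.pt --epochs 100 --logdir /content/drive/MyDrive/yolov5_runs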

@whoafridi
Author

Sure, thanks for this @glenn-jocher. Thank you again.

@glenn-jocher
Member

@whoafridi see #640 (comment) for a specific example of checkpointing to Google Drive from a Colab notebook.

@whoafridi
Author

Sure.

@alicera

alicera commented Oct 27, 2020

How do I resume with "python -m torch.distributed.launch --nproc_per_node 2 train.py --resume runs/exp4/weights/last.pt"?

The log:
"""
23 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
24 [17, 20, 23] 1 16182 models.yolo.Detect [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 191 layers, 7.25509e+06 parameters, 7.25509e+06 gradients

Transferred 370/370 items from runs/exp4/weights/last.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other

Traceback (most recent call last):
  File "train.py", line 460, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 138, in train
    shutil.copytree(wdir, wdir.parent / f'weights_backup_epoch{start_epoch - 1}')  # save previous weights
  File "/opt/conda/lib/python3.6/shutil.py", line 321, in copytree
    os.makedirs(dst)
  File "/opt/conda/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: 'runs/exp4/weights_backup_epoch54'
"""

I deleted weights_backup_epoch54, but it gets created again.

@Pluto1314

@alicera Hi, I have the same problem.

@Xunius

Xunius commented Aug 16, 2021

@whoafridi when you start training with any command your experiment is saved in yolov5/runs/exp.... If your training is interrupted for any reason, the following command will resume your partially completed training from the most recently updated experiment:

python train.py --resume

or from a specific experiment:

python train.py --resume runs/exp17/weights/last.pt

@glenn-jocher Thanks for the info. Does python train.py --resume runs/exp17/weights/last.pt also pick up the latest value of the learning rate, or do we start from the initial LR again?

@glenn-jocher
Member

glenn-jocher commented Aug 16, 2021

@Xunius 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you cannot modify the number of epochs once training has started.

[LR curve plots]

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint
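
If you're curious what --resume actually restores, you can inspect a last.pt checkpoint directly. A minimal sketch (run it from inside the yolov5/ repo so the pickled model classes can be imported; the exact keys depend on your YOLOv5 version, and the path below is illustrative):

import torch

ckpt = torch.load('path/to/last.pt', map_location='cpu')

# Typical YOLOv5 checkpoints carry more than the weights: epoch counter, optimizer state, EMA, etc.,
# which is what lets --resume continue the LR schedule where it left off
print(list(ckpt.keys()))
print('last completed epoch:', ckpt.get('epoch'))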

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck and let us know if you have any other questions!

@aegonwolf

@glenn-jocher
I hope it's ok if I pick this up even though I'm not OP

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck and let us know if you have any other questions!

When I use the above to continue training on a model that I've already fully trained, it works with the weights of a YOLOv5s model, but with a YOLOv5m model I get a shape mismatch.
As far as I can tell this shouldn't happen, since train.py just loads the model. Is there some other argument I have to specify so that pretrained training does not assume YOLOv5s?

Sorry if this is a really silly question!

@glenn-jocher
Member

glenn-jocher commented Feb 12, 2022

@aegonwolf 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible to produce the problem
  • Complete – Provide all parts someone else needs to reproduce the problem
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

For Ultralytics to provide assistance your code should also be:

  • Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
  • Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

@wtjasmine

Hello @glenn-jocher,

I'm wondering if it's feasible to manually choose and save a checkpoint weight during training. For instance, if the model is trained for 100 epochs, and I specifically want to save the model weight of epoch 85, even if it's not the best or final epoch. Appreciate your assistance. Thank you.

@glenn-jocher
Member

@wtjasmine 👋 Yes, it is definitely possible to manually save model checkpoints during training. To achieve this, you can add a callback function to save the model's weights at the end of each epoch.

Here's an example of how you can achieve this using PyTorch:

import torch

# Define your YOLOv5 model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

# Set up your optimizer and scheduler
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Train your model (train() and validate() are placeholders for your own training and validation loops)
for epoch in range(100):
    train(...)
    val_loss = validate(...)

    # Save the model's weights at the specific epoch you want to keep (here, epoch 85)
    if epoch == 85:
        torch.save(model.state_dict(), f"epoch_{epoch}_weights.pt")

    # Adjust the learning rate
    scheduler.step()

In this example, the model's weights will be saved at the end of the 85th epoch. You can adjust the condition to save the weights at any epoch you desire.

Remember to refer to the PyTorch documentation for more details on how to implement callbacks and save model weights.

Let me know if you need further assistance!
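
Depending on your YOLOv5 version, train.py may also expose a --save-period argument that writes an extra checkpoint every N epochs without any custom code; check python train.py --help in your copy of the repo before relying on it. An illustrative command, assuming the flag is available:

python train.py --data coco128.yaml --weights yolov5s.pt --epochs 100 --save-period 5  # keeps periodic epoch checkpoints alongside last.pt and best.pt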
