
How to save weights & resume training? #1117

Closed
whoafridi opened this issue Oct 11, 2020 · 15 comments
Labels
question Further information is requested

Comments

@whoafridi

❔Question

The question is very straightforward: how do I save weights to Google Drive and then resume training from previously trained weights (as with the earlier YOLOv3/v4)?
Is this possible with YOLOv5?

Additional context

I can't find any clue. Please point me to resources; it would be very helpful.

@whoafridi added the question label Oct 11, 2020
@github-actions
Contributor

github-actions bot commented Oct 11, 2020

Hello @whoafridi, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook (Open in Colab), Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:

  • Cloud-based AI systems operating on hundreds of HD video streams in realtime.
  • Edge AI integrated into custom iOS and Android apps for realtime 30 FPS video inference.
  • Custom data training, hyperparameter evolution, and model exportation to any destination.

For more information please visit https://www.ultralytics.com.

@glenn-jocher
Member

@whoafridi when you start training with any command your experiment is saved in yolov5/runs/exp.... If your training is interrupted for any reason, the following command will resume your partially completed training from the most recently updated experiment:

python train.py --resume

or from a specific experiment:

python train.py --resume runs/exp17/weights/last.pt

@whoafridi
Author

Great. But I don't have a GPU, so is there any option to save the weights to Google Drive instead of that particular folder?
It would be very helpful. Thank you @glenn-jocher

@glenn-jocher
Member

Your hardware is irrelevant for logging. See the train.py argparser for logging to arbitrary destinations:

yolov5/train.py, line 403 (commit 10c85bf):

parser.add_argument('--logdir', type=str, default='runs/', help='logging directory')
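
For example (a minimal Colab sketch; the Drive path and dataset arguments are illustrative, and newer YOLOv5 versions replace --logdir with --project/--name):

# Colab cell: mount Google Drive so checkpoints survive the runtime being recycled
from google.colab import drive
drive.mount('/content/drive')

# log the whole run (including weights/) straight to Drive -- paths here are illustrative
!python train.py --data coco128.yaml --weights yolov5s.pt --epochs 100 --logdir /content/drive/MyDrive/yolov5_runs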

@whoafridi
Author

Sure, thanks for this @glenn-jocher. Thank you again.

@glenn-jocher
Member

@whoafridi see #640 (comment) for a specific example of checkpointing to Google Drive from a Colab notebook.

@whoafridi
Author

Sure.

@alicera

alicera commented Oct 27, 2020

How do I resume with "python -m torch.distributed.launch --nproc_per_node 2 train.py --resume runs/exp4/weights/last.pt"?

The log:
"""
23 -1 1 1248768 models.common.BottleneckCSP [512, 512, 1, False]
24 [17, 20, 23] 1 16182 models.yolo.Detect [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 191 layers, 7.25509e+06 parameters, 7.25509e+06 gradients

Transferred 370/370 items from runs/exp4/weights/last.pt
Optimizer groups: 62 .bias, 70 conv.weight, 59 other

Traceback (most recent call last):
  File "train.py", line 460, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 138, in train
    shutil.copytree(wdir, wdir.parent / f'weights_backup_epoch{start_epoch - 1}')  # save previous weights
  File "/opt/conda/lib/python3.6/shutil.py", line 321, in copytree
    os.makedirs(dst)
  File "/opt/conda/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: 'runs/exp4/weights_backup_epoch54'
"""

I deleted weights_backup_epoch54, but it gets created again.

@Pluto1314

@alicera Hi, I have the same problem.

@Xunius

Xunius commented Aug 16, 2021

@whoafridi when you start training with any command your experiment is saved in yolov5/runs/exp.... If your training is interrupted for any reason, the following command will resume your partially completed training from the most recently updated experiment:

python train.py --resume

or from a specific experiment:

python train.py --resume runs/exp17/weights/last.pt

@glenn-jocher Thanks for the info. Does python train.py --resume runs/exp17/weights/last.pt also pick up the latest value of the learning rate, or do we start from the initial LR again?

@glenn-jocher
Member

glenn-jocher commented Aug 16, 2021

@Xunius 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you cannot modify the number of epochs once training has started.

[LR curve plots]

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint
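
If you're curious what --resume actually restores, you can inspect a last.pt checkpoint directly. A minimal sketch (run it from inside the yolov5/ repo so the pickled model classes can be imported; the exact keys depend on your YOLOv5 version, and the path below is illustrative):

import torch

ckpt = torch.load('path/to/last.pt', map_location='cpu')

# Typical YOLOv5 checkpoints carry more than the weights: epoch counter, optimizer state, EMA, etc.,
# which is what lets --resume continue the LR schedule where it left off
print(list(ckpt.keys()))
print('last completed epoch:', ckpt.get('epoch'))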

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck and let us know if you have any other questions!

@aegonwolf

@glenn-jocher
I hope it's ok if I pick this up even though I'm not OP

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck and let us know if you have any other questions!

When I use the above to continue training on a model that I've already fully trained, it works with the weights of a YOLOv5s model, but with a YOLOv5m model I get a shape mismatch.
As far as I can tell this shouldn't happen, since train.py just loads the model. Is there some other argument I have to specify so that pretrained training does not assume YOLOv5s?

Sorry if this is a really silly question!

@glenn-jocher
Member

glenn-jocher commented Feb 12, 2022

@aegonwolf 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible to produce the problem
  • Complete – Provide all parts someone else needs to reproduce the problem
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

For Ultralytics to provide assistance your code should also be:

  • Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
  • Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

@wtjasmine

Hello @glenn-jocher,

I'm wondering if it's feasible to manually choose and save a checkpoint weight during training. For instance, if the model is trained for 100 epochs, and I specifically want to save the model weight of epoch 85, even if it's not the best or final epoch. Appreciate your assistance. Thank you.

@glenn-jocher
Member

@wtjasmine 👋 Yes, it is definitely possible to manually save model checkpoints during training. To achieve this, you can add a callback function to save the model's weights at the end of each epoch.

Here's an example of how you can achieve this using PyTorch:

import torch

# Define your YOLOv5 model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

# Set up your optimizer and scheduler
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Train your model (train() and validate() are placeholders for your own training and validation loops)
for epoch in range(100):
    train(...)
    val_loss = validate(...)

    # Save the model's weights at the specific epoch you want to keep (here, epoch 85)
    if epoch == 85:
        torch.save(model.state_dict(), f"epoch_{epoch}_weights.pt")

    # Adjust the learning rate
    scheduler.step()

In this example, the model's weights will be saved at the end of the 85th epoch. You can adjust the condition to save the weights at any epoch you desire.

Remember to refer to the PyTorch documentation for more details on how to implement callbacks and save model weights.

Let me know if you need further assistance!
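
Depending on your YOLOv5 version, train.py may also expose a --save-period argument that writes an extra checkpoint every N epochs without any custom code; check python train.py --help in your copy of the repo before relying on it. An illustrative command, assuming the flag is available:

python train.py --data coco128.yaml --weights yolov5s.pt --epochs 100 --save-period 5  # keeps periodic epoch checkpoints alongside last.pt and best.pt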
