How to save weights & resume training? #1117
Hello @whoafridi, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook. If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue; otherwise we cannot help you. If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients. For more information please visit https://www.ultralytics.com.
@whoafridi when you start training with any command, your experiment is saved under the `runs/` directory. You can resume the latest run with `python train.py --resume`, or resume from a specific experiment: `python train.py --resume runs/exp17/weights/last.pt`
Great. But I don't have a GPU, so is there any option to save the weights to Google Drive instead of that particular folder?
Your hardware is irrelevant for logging. See the train.py argparser for logging to arbitrary destinations: Line 403 in 10c85bf
Sure, thanks for this. @glenn-jocher Thank you again!
@whoafridi see #640 (comment) for a specific example of checkpointing to Google Drive from a Colab notebook.
Sure.
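As a rough illustration of the approach in the linked comment, here is a minimal sketch of copying the latest checkpoint from a Colab run into a mounted Google Drive folder. The function name and paths are assumptions for illustration, not code from the thread or from YOLOv5; it assumes Drive is already mounted at `/content/drive`.

```python
# Hypothetical sketch: back up the latest training checkpoint to Google Drive.
# Paths and function name are illustrative assumptions, not YOLOv5 code.
import shutil
from pathlib import Path

def backup_checkpoint(run_dir="runs/exp17/weights",
                      drive_dir="/content/drive/MyDrive/yolov5_backups"):
    """Copy last.pt from a training run into a Drive folder, if it exists.

    Returns True when a checkpoint was copied, False when none was found.
    """
    src = Path(run_dir) / "last.pt"
    dst = Path(drive_dir)
    dst.mkdir(parents=True, exist_ok=True)
    if src.exists():
        shutil.copy2(src, dst / "last.pt")  # copy2 preserves the mtime
        return True
    return False
```

Calling this periodically (e.g. at the end of each epoch, or from a separate cell) keeps a Drive copy of `last.pt` that survives a Colab disconnect, after which training can be resumed with `--resume path/to/last.pt`.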
How do I resume with `python -m torch.distributed.launch --nproc_per_node 2 train.py --resume runs/exp4/weights/last.pt`? The log shows `Transferred 370/370 items from runs/exp4/weights/last.pt` followed by `Traceback (most recent call last):`. I deleted `weights_backup_epoch54` and it gets created again.
@alicera Hi, I have the same problem.
@glenn-jocher Thanks for the info. Does
@Xunius 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of epochs set at training start, so they are not designed to be extended past that number. If your training was interrupted for any reason, you may continue where you left off using the `--resume` argument.

**Resume Single-GPU**

You may not change settings when resuming, and no additional arguments other than `--resume` are allowed:

```
python train.py --resume                  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint
```

**Resume Multi-GPU**

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

```
python -m torch.distributed.run --nproc_per_node 8 train.py --resume                  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint
```

**Start from Pretrained**

If you would like to start training from a fully trained model, use the `--weights` argument:

```
python train.py --weights path/to/best.pt  # start from pretrained model
```

Good luck and let us know if you have any other questions!
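For intuition about the "automatically find latest checkpoint" behavior described above, here is a hypothetical sketch, not YOLOv5's actual implementation, of how such a search can work: pick the most recently modified `last.pt` under a runs directory. The function name and default path are assumptions.

```python
# Hypothetical sketch of a "find latest checkpoint" search; this is an
# illustration of the idea, not the code YOLOv5 actually uses.
from pathlib import Path

def find_latest_checkpoint(search_dir="runs"):
    """Return the most recently modified last.pt under search_dir, or None."""
    candidates = list(Path(search_dir).rglob("last.pt"))
    if not candidates:
        return None
    # The file with the newest modification time is the latest checkpoint.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

This is why a bare `--resume` can pick up the right run without an explicit path, while passing `--resume path/to/last.pt` skips the search entirely.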
@glenn-jocher When I use the above to continue training on a model that I've already fully trained, it works with the weights of a yolov5s model, but with a yolov5m model I get a shape mismatch. Sorry if this is a really silly question!
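To illustrate why weights from one model size cannot be loaded into another, here is a toy sketch of detecting shape mismatches between a model and a checkpoint. The helper function and the layer shapes are illustrative assumptions, not YOLOv5's real layer names or sizes; the point is only that a wider variant (like yolov5m) has differently shaped tensors than a smaller one (like yolov5s).

```python
# Hypothetical illustration: state dicts represented as name -> shape mappings.
# Shapes are made up; real yolov5s/yolov5m layers differ analogously in width.

def find_shape_mismatches(model_shapes, ckpt_shapes):
    """Return parameter names whose shapes differ between model and checkpoint."""
    return [
        name
        for name, shape in model_shapes.items()
        if name in ckpt_shapes and ckpt_shapes[name] != shape
    ]

small  = {"conv1.weight": (32, 3, 6, 6), "conv2.weight": (64, 32, 3, 3)}
medium = {"conv1.weight": (48, 3, 6, 6), "conv2.weight": (96, 48, 3, 3)}

print(find_shape_mismatches(small, medium))  # → ['conv1.weight', 'conv2.weight']
```

When every shared parameter mismatches like this, `load_state_dict` in PyTorch raises a size-mismatch error, which is consistent with the behavior reported above when mixing yolov5s and yolov5m weights.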
@aegonwolf 👋 Hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

**How to create a Minimal, Reproducible Example**

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be minimal, complete, and verifiable.

For Ultralytics to provide assistance your code should also be current and unmodified, i.e. reproducible against the latest master without changes to the repository code.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem. Thank you! 😃
Hello @glenn-jocher, I'm wondering if it's feasible to manually choose and save a checkpoint weight during training. For instance, if the model is trained for 100 epochs, and I specifically want to save the model weight of epoch 85, even if it's not the best or final epoch. Appreciate your assistance. Thank you. |
@wtjasmine 👋 Yes, it is definitely possible to manually save model checkpoints during training. To achieve this, you can add a callback function to save the model's weights at the end of each epoch. Here's an example of how you can achieve this using PyTorch:

```python
import torch

# Define your YOLOv5 model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

# Set up your optimizer and scheduler
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Train your model
for epoch in range(100):
    train(...)                # your training step (placeholder)
    val_loss = validate(...)  # your validation step (placeholder)

    # Save the model's weights at a chosen epoch
    if epoch == 85:
        torch.save(model.state_dict(), f"epoch_{epoch}_weights.pt")

    # Adjust learning rate
    scheduler.step()
```

In this example, the model's weights will be saved at the end of the 85th epoch. You can adjust the condition to save the weights at any epoch you desire. Remember to refer to the PyTorch documentation for more details on how to implement callbacks and save model weights. Let me know if you need further assistance!
❔ Question

The question is very straightforward: how do I save weights to Drive and resume training from previously trained weights (as with the previous yolov3/v4)? Is it possible with yolov5?

Additional context

I can't find any clue; please point me to resources. It'll be very helpful for me.