Unable to Train from 'best.pt' weights #11974

aravindchakravarti · 2023-08-11T06:14:54Z

Sorry for asking the same question again. I have read all the previous posts, but I have the same problem and have not been able to solve it.

I am training YOLOv5 in Google Colab. Since session time in Colab is limited, and since I don't know in advance how many epochs I will need to reach the desired performance, I want to train step by step:

1. I started training the yolov5 (nano) model from scratch. I trained for 100 epochs and saved the results and weights in Google Drive:
!python train.py --img 640 --epochs 100 --data data.yaml --weights '' --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25
2. Then I used the best.pt from the step above to train the network for another 100 epochs:
!python train.py --img 640 --epochs 100 --data data.yaml --weights path/to/best.pt --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25
3. Repeat step 2 until I get the desired mAP.

Unfortunately, whenever I continue training (i.e., using the previous best.pt), YOLOv5 appears to start training from scratch. Can you please help?
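The staged workflow above can be scripted so that every Colab session reuses the same arguments and only the starting weights change. This is a minimal sketch, not part of YOLOv5: `build_train_command` is a hypothetical helper, and it only assembles the flags already used in the commands above.

```python
import shlex


def build_train_command(weights: str, epochs: int = 100) -> list[str]:
    """Assemble the train.py invocation used in the steps above.

    `weights` is '' for a from-scratch run, or the path to a previous best.pt.
    """
    return [
        "python", "train.py",
        "--img", "640",
        "--epochs", str(epochs),
        "--data", "data.yaml",
        "--weights", weights,
        "--cfg", "yolov5n.yaml",
        "--batch-size", "90",
        "--name", "person_detection",
        "--project", "/content/drive/MyDrive/YoloV5_Ultralytics/results/",
        "--patience", "25",
    ]


first_stage = build_train_command(weights="")
next_stage = build_train_command(weights="path/to/best.pt")

# shlex.join quotes the empty string as '' so each line can be pasted after "!" in Colab.
print(shlex.join(first_stage))
print(shlex.join(next_stage))
```

This only removes copy-paste drift between stages; it does not by itself change how train.py treats the loaded weights.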

Originally posted by @aravindchakravarti in #7343 (comment)


github-actions · 2023-08-11T06:15:25Z

👋 Hello @aravindchakravarti, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

glenn-jocher · 2023-08-11T08:23:45Z

Hi @aravindchakravarti,

I understand your frustration. When training YOLOv5, if you want to continue training from a previously saved checkpoint (best.pt in your case), you need to provide the --resume flag along with the checkpoint path.

Here's an example command to continue training:

!python train.py --img 640 \
                --epochs 100 \
                --data data.yaml \
                --weights path/to/best.pt \
                --cfg yolov5n.yaml \
                --batch-size 90 \
                --name person_detection \
                --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ \
                --patience 25 \
                --resume

With the --resume flag, YOLOv5 will load the checkpoint and continue training from where it left off. Make sure to specify the correct path to the best.pt file.

I hope this helps! Let me know if you have any further questions.

aravindchakravarti · 2023-08-11T09:30:30Z

@glenn-jocher Thanks for the quick response.

Actually not a frustration! It's a pleasure to use production ready code!!

--resume only works when the configured number of epochs has not already been completed, right? For example, say I wanted to train for 500 epochs and, partway through training, the server disconnects or the power shuts down. Then I can use --resume to continue from the epoch at which the interruption happened.

My only problem is that I do not know upfront how many epochs I will need to train. For example, in a few projects I reached the desired mAP in as few as 150 epochs, while in another project it took more than 1500 epochs.

So,

1. Let's assume that I start training with --epochs 100:
!python train.py --img 640 --epochs 100 --data data.yaml --weights '' --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25
2. After 100 epochs, I may not be satisfied with the results (i.e., mAP), so I decide to train for another 100 epochs. Naturally, I would like to use best.pt from above (basically, load YOLOv5 with best.pt from the previous run) and continue training from there.

When I do step 2, using the command
!python train.py --img 640 --epochs 100 --data data.yaml --weights path/to/best.pt --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25

the mAP resets to close to zero. Maybe this is an issue with a higher learning rate? I read that --resume restores the exact learning rate that was previously calculated, whereas the command in step 2 may restart with a higher learning rate, treating the run as a fresh training.
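The learning-rate hypothesis can be made concrete with a toy schedule. The sketch below assumes a linear decay from `lr0` to `lr0 * lrf` over the run, with illustrative values `lr0 = 0.01` and `lrf = 0.01`. These names and numbers are assumptions for illustration and are not read from the run described here, and the sketch ignores warmup and optimizer state.

```python
def linear_lr(epoch: int, total_epochs: int, lr0: float = 0.01, lrf: float = 0.01) -> float:
    """Learning rate at `epoch` for a linear decay from lr0 down to lr0 * lrf."""
    factor = (1 - epoch / total_epochs) * (1.0 - lrf) + lrf
    return lr0 * factor


# Last epoch of a 100-epoch run versus the first epoch of a new 100-epoch run
# whose schedule starts over from epoch 0 (assumed behaviour for a non-resumed run).
end_of_first_run = linear_lr(epoch=99, total_epochs=100)
start_of_second_run = linear_lr(epoch=0, total_epochs=100)

print(f"end of first run:    {end_of_first_run:.6f}")
print(f"start of second run: {start_of_second_run:.6f}")
print(f"ratio:               {start_of_second_run / end_of_first_run:.1f}x")
```

Under these assumptions, the second run would begin with a much larger step size than the one the weights were last trained with, which is one plausible reason for a temporary drop in mAP.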

Also, I tried your suggested command:
!python train.py --img 640 --epochs 100 --data data.yaml --weights path/to/best.pt --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25 --resume
and got: AssertionError: runs/train/..../last.pt training to 100 epochs finished, nothing to resume
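The error is consistent with a check of the following kind: a run that has already reached its epoch target has no remaining epochs, so there is nothing to resume. This is a simplified model written for illustration. `check_resumable` and its arguments are hypothetical and do not reproduce YOLOv5's actual implementation.

```python
def check_resumable(last_completed_epoch: int, target_epochs: int, checkpoint: str) -> int:
    """Return the epoch to resume from, or raise if the run already finished.

    Epochs are counted from 0, so a 100-epoch run finishes after epoch 99.
    """
    start_epoch = last_completed_epoch + 1
    if start_epoch >= target_epochs:
        raise AssertionError(
            f"{checkpoint} training to {target_epochs} epochs finished, nothing to resume"
        )
    return start_epoch


# Interrupted run: epoch 41 of 100 was the last one saved, so resuming is possible.
print(check_resumable(41, 100, "last.pt"))

# Completed run: all 100 epochs are done, so the check fails.
try:
    check_resumable(99, 100, "last.pt")
except AssertionError as err:
    print(err)
```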

glenn-jocher · 2023-08-11T11:54:40Z

@aravindchakravarti Thanks for the clarification!

I apologize for the confusion. You are correct that --resume is typically used when training is interrupted and you want to continue training from the last saved checkpoint.

To achieve what you want, you can use the --resume flag along with the --epochs flag set to the desired total number of training epochs. This way, if you want to train for a specific number of epochs (e.g., 500), you can start training with a smaller number of epochs (e.g., 100), and then resume training and continue for the remaining epochs.

Here's an example command:

!python train.py --img 640 \
                --epochs 500 \
                --data data.yaml \
                --weights '' \
                --cfg yolov5n.yaml \
                --batch-size 90 \
                --name person_detection \
                --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ \
                --patience 25

After training for the initial 100 epochs, you can then continue training with the same command but with the --resume flag added:

!python train.py --img 640 \
                --epochs 500 \
                --data data.yaml \
                --weights path/to/best.pt \
                --cfg yolov5n.yaml \
                --batch-size 90 \
                --name person_detection \
                --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ \
                --patience 25 \
                --resume

Regarding the issue you encountered with AssertionError: runs/train/..../last.pt training to 100 epochs finished, nothing to resume, this can happen if the training process was not interrupted and completed the specified number of epochs without being stopped. In that case, there is no need to use --resume since there is no interrupted training session.
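When a Colab session is cut off partway through a long run, the checkpoint to resume from first has to be located on Drive. The sketch below is one way to do that, under two assumptions that should be checked against your own results folder: that each run stores its checkpoint as `<project>/<name>/weights/last.pt`, and that `--resume` accepts a checkpoint path as its value. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(project_dir: str) -> Optional[Path]:
    """Return the most recently modified last.pt under project_dir, or None."""
    candidates = list(Path(project_dir).glob("**/weights/last.pt"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def build_resume_command(checkpoint: Path) -> list[str]:
    """Command that asks train.py to resume from the given checkpoint (assumed usage)."""
    return ["python", "train.py", "--resume", str(checkpoint)]


project = "/content/drive/MyDrive/YoloV5_Ultralytics/results/"
latest = find_latest_checkpoint(project)
if latest is None:
    print("No last.pt found; start a new run instead.")
else:
    print(" ".join(build_resume_command(latest)))
```

Resuming from last.pt rather than best.pt is a deliberate choice in this sketch: last.pt reflects the point at which the run stopped, whereas best.pt may come from an earlier epoch.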

I hope this helps! Let me know if you have any further questions.

aravindchakravarti · 2023-08-11T11:57:25Z

Thanks @glenn-jocher for clarifying!!!

glenn-jocher · 2023-08-11T13:24:24Z

Hi @aravindchakravarti,

I'm glad I could provide some clarification for you! If you have any more questions or need further assistance, feel free to ask. Happy training with YOLOv5!

Muhammad-ismail786 · 2024-02-22T11:01:52Z

I am training YOLOv8 in Google Colab. Since session time in Colab is limited, and since I don't know how many epochs I will need, I want to continue training from the previous epochs. Is there a flag for this in YOLOv8?

glenn-jocher · 2024-03-15T02:52:18Z

@Muhammad-ismail786 hi there!

To continue training from a previous checkpoint in YOLOv8, pass your last saved checkpoint as the model and ask the yolo command to resume when starting the training process again. YOLOv8 ships in the ultralytics package and is run through the yolo command rather than a train.py script.

For example:

!yolo train resume model=/path/to/your/last_checkpoint.pt

This resumes training from where it left off, using the state stored in the specified checkpoint.
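For reference, the ultralytics package that provides YOLOv8 documents resuming an interrupted run through its `yolo` command or through the `YOLO` Python class. The sketch below shows both forms. The helper names are hypothetical, the checkpoint path is the placeholder from the example above, and resuming is intended for runs that stopped before reaching their epoch target.

```python
def yolov8_resume_command(checkpoint: str) -> list[str]:
    """CLI form: yolo train resume model=<checkpoint>."""
    return ["yolo", "train", "resume", f"model={checkpoint}"]


def resume_in_python(checkpoint: str):
    """Python form; requires `pip install ultralytics` and an interrupted run."""
    # Imported lazily so the CLI helper above works without the package installed.
    from ultralytics import YOLO

    model = YOLO(checkpoint)  # load the partially trained checkpoint
    return model.train(resume=True)  # continue with the settings stored in it


print(" ".join(yolov8_resume_command("/path/to/your/last_checkpoint.pt")))
```

In Colab, the printed line can be run by prefixing it with `!`, or `resume_in_python` can be called directly from a notebook cell.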

Happy training! 😊

aravindchakravarti closed this as completed Aug 11, 2023
