Unable to Train from 'best.pt' weights #11974

Closed
aravindchakravarti opened this issue Aug 11, 2023 · 8 comments

Comments

@aravindchakravarti

Hi @glenn-jocher,

Sorry, I am asking the same question as before. I read all the previous posts, but I have the same problem and have been unable to solve it.

I am training YOLOv5 in Google Colab. Since session time in Google Colab is limited, and since I don't know how many epochs I will need to reach the desired performance, I want to train step by step.

  • I started training the YOLOv5 (nano) model from scratch. I trained for 100 epochs and saved the results and weights to Google Drive:
    !python train.py --img 640 --epochs 100 --data data.yaml --weights '' --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25
  • Then I used the best.pt from the step above to train the network for another 100 epochs:
    !python train.py --img 640 --epochs 100 --data data.yaml --weights path/to/best.pt --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25
  • Repeat step 2 until I get the desired mAP.
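
The staged workflow in the steps above can be sketched as a small Python helper that assembles one train.py command per stage. The flags are the ones used in this post; the helper name, the COMMON list, and the per-stage run names are hypothetical, and the best.pt path is left as the placeholder from the post. This only builds the commands; it does not change how train.py treats --weights.

```python
import shlex

# Flags shared by every stage, copied from the commands in this post.
COMMON = [
    "--img", "640", "--epochs", "100", "--data", "data.yaml",
    "--cfg", "yolov5n.yaml", "--batch-size", "90",
    "--project", "/content/drive/MyDrive/YoloV5_Ultralytics/results/",
    "--patience", "25",
]


def stage_command(stage, prev_best=None):
    """Return the argv list for one 100-epoch stage (hypothetical helper)."""
    # An empty --weights value is what the post uses to train from scratch.
    weights = prev_best if prev_best else ""
    # A distinct run name per stage is an illustrative choice, not a requirement.
    name = f"person_detection_stage{stage}"
    return ["python", "train.py", *COMMON, "--weights", weights, "--name", name]


first = stage_command(1)
second = stage_command(2, prev_best="path/to/best.pt")

# shlex.join quotes the empty string so the command can be pasted into a shell.
print(shlex.join(first))
print(shlex.join(second))
```

In Colab, each printed line could be run by prefixing it with `!`.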

Unfortunately, whenever I try to continue training (i.e., using the previous best.pt), YOLOv5 starts training from scratch. Can you please help?

Originally posted by @aravindchakravarti in #7343 (comment)

@github-actions
Contributor

👋 Hello @aravindchakravarti, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

@glenn-jocher
Member

Hi @aravindchakravarti,

I understand your frustration. When training YOLOv5, if you want to continue training from a previously saved checkpoint (best.pt in your case), you need to provide the --resume flag along with the checkpoint path.

Here's an example command to continue training:

!python train.py --img 640 \
                --epochs 100 \
                --data data.yaml \
                --weights path/to/best.pt \
                --cfg yolov5n.yaml \
                --batch-size 90 \
                --name person_detection \
                --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ \
                --patience 25 \
                --resume

With the --resume flag, YOLOv5 will load the checkpoint and continue training from where it left off. Make sure to specify the correct path to the best.pt file.

I hope this helps! Let me know if you have any further questions.

@aravindchakravarti
Author

@glenn-jocher Thanks for the quick response.

Actually not a frustration! It's a pleasure to use production ready code!!

--resume only works when the configured epochs have not already been completed, right? For example, if I wanted to train for 500 epochs and the server disconnected or the power went out partway through, I could use --resume to continue from the epoch at which the interruption happened.

My only problem is that I don't know upfront how many epochs I will need. For example, in a few projects I reached the desired mAP in as few as 150 epochs, while in another project it took more than 1500 epochs.

So,

  1. Let's assume that I start training with --epochs 100:
    !python train.py --img 640 --epochs 100 --data data.yaml --weights '' --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25

  2. After 100 epochs, I may not be satisfied with the results (i.e., mAP), so I decide to train for another 100 epochs. Naturally, I would like to load YOLOv5 with best.pt from the previous run and continue training from there.
    For step 2, I use the command:
    !python train.py --img 640 --epochs 100 --data data.yaml --weights path/to/best.pt --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25

When I run step 2, the mAP resets to close to zero. Maybe this is caused by a higher learning rate? I read that --resume restores the exact learning rate that was previously calculated, whereas the command in step 2 may restart with a higher learning rate, treating it as a fresh training run.
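
The learning-rate hypothesis can be illustrated with a small sketch. It assumes a linear schedule that decays from lr0 to lr0 * lrf over the configured epochs, and it uses lr0 = 0.01 and lrf = 0.01; the schedule shape and both values are assumptions for illustration, not something confirmed in this thread, and warmup is ignored. The function name is hypothetical.

```python
# Assumed hyperparameters (not confirmed in this thread).
LR0 = 0.01  # initial learning rate
LRF = 0.01  # final learning rate, expressed as a fraction of LR0


def linear_lr(epoch, epochs, lr0=LR0, lrf=LRF):
    """Learning rate at `epoch` for a linear decay from lr0 to lr0 * lrf."""
    factor = (1 - epoch / epochs) * (1.0 - lrf) + lrf
    return lr0 * factor


# Last epoch of a finished 100-epoch run versus the first epoch of a new run
# that starts its schedule again from epoch 0.
end_of_first_run = linear_lr(99, 100)
start_of_new_run = linear_lr(0, 100)

print(f"last epoch of run 1:  {end_of_first_run:.6f}")
print(f"first epoch of run 2: {start_of_new_run:.6f}")
print(f"ratio: {start_of_new_run / end_of_first_run:.1f}x")
```

Under these assumptions, a run that restarts its schedule begins at a learning rate roughly fifty times higher than the one the previous run ended on, which is consistent with the concern above but does not by itself prove that this is the cause of the mAP drop.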

Also, I tried your suggested command:
!python train.py --img 640 --epochs 100 --data data.yaml --weights path/to/best.pt --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25 --resume
and got:
AssertionError: runs/train/..../last.pt training to 100 epochs finished, nothing to resume

@glenn-jocher
Member

@aravindchakravarti Thanks for the clarification!

I apologize for the confusion. You are correct that --resume is typically used when training is interrupted and you want to continue training from the last saved checkpoint.

To achieve what you want, you can use the --resume flag along with the --epochs flag set to the desired total number of training epochs. This way, if you want to train for a specific number of epochs (e.g., 500), you can start training with a smaller number of epochs (e.g., 100), and then resume training and continue for the remaining epochs.

Here's an example command:

!python train.py --img 640 \
                --epochs 500 \
                --data data.yaml \
                --weights '' \
                --cfg yolov5n.yaml \
                --batch-size 90 \
                --name person_detection \
                --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ \
                --patience 25

After training for the initial 100 epochs, you can then continue training with the same command but with the --resume flag added:

!python train.py --img 640 \
                --epochs 500 \
                --data data.yaml \
                --weights path/to/best.pt \
                --cfg yolov5n.yaml \
                --batch-size 90 \
                --name person_detection \
                --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ \
                --patience 25 \
                --resume

Regarding the issue you encountered with AssertionError: runs/train/..../last.pt training to 100 epochs finished, nothing to resume, this can happen if the training process was not interrupted and completed the specified number of epochs without being stopped. In that case, there is no need to use --resume since there is no interrupted training session.
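
The behaviour behind that assertion can be sketched roughly as follows. This is a simplified illustration, not the actual YOLOv5 code: it assumes that a checkpoint stores the index of the last completed epoch and that this index is reset to -1 once a run finishes all of its epochs. The function name and arguments are hypothetical.

```python
def resume_state(ckpt_epoch, epochs, resume):
    """Return the epoch to start from, or fail if there is nothing to resume.

    ckpt_epoch: epoch index stored in the checkpoint; assumed to be -1 for a
                checkpoint from a run that already completed all its epochs.
    epochs:     total number of epochs the run was configured for.
    resume:     True when resuming an interrupted run.
    """
    start_epoch = ckpt_epoch + 1
    if resume and start_epoch <= 0:
        raise AssertionError(
            f"training to {epochs} epochs finished, nothing to resume"
        )
    return start_epoch


# Interrupted after epoch 41 of 500: resuming continues at epoch 42.
print(resume_state(41, 500, resume=True))

# Finished 100-epoch run: resuming is rejected.
try:
    resume_state(-1, 100, resume=True)
except AssertionError as err:
    print(f"AssertionError: {err}")
```

Under this assumption, a run that is allowed to finish its configured epochs cannot be resumed, which matches the error reported above.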

I hope this helps! Let me know if you have any further questions.

@aravindchakravarti
Author

Thanks @glenn-jocher for clarifying!!!

@glenn-jocher
Member

Hi @aravindchakravarti,

I'm glad I could provide some clarification for you! If you have any more questions or need further assistance, feel free to ask. Happy training with YOLOv5!

@Muhammad-ismail786

I am training YOLOv8 in Google Colab. Since session time in Google Colab is limited and I don't know how many epochs I will need, I want to continue training from the previous epochs. Is there a flag for this in YOLOv8?

@glenn-jocher
Member

@Muhammad-ismail786 hi there!

To continue training from a previous checkpoint in YOLOv8, you can simply use the --weights flag pointing to your last saved checkpoint when starting the training process again.

For example:

!python train.py --img 640 --batch 16 --epochs 100 --data your_dataset.yaml --weights /path/to/your/last_checkpoint.pt

This will automatically resume training from where it left off, using the specified checkpoint. No additional flags are needed for this operation in YOLOv8.

Happy training! 😊
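
For comparison, the ultralytics package (installed with pip install ultralytics, as mentioned earlier in this thread) is driven through a `yolo` command and a Python API rather than a train.py script. Its documentation shows resuming an interrupted run as `yolo train resume model=path/to/last.pt`, or as `YOLO("path/to/last.pt").train(resume=True)` in Python. A minimal sketch that only assembles the CLI form is shown here; the helper name is hypothetical and the checkpoint path is a placeholder.

```python
import shlex


def yolo_resume_command(checkpoint):
    """Build the `yolo` CLI call that resumes a run from `checkpoint`."""
    return ["yolo", "train", "resume", f"model={checkpoint}"]


cmd = yolo_resume_command("path/to/last.pt")
print(shlex.join(cmd))
```

As with --resume in YOLOv5, this is meant for continuing an interrupted run; whether it fits a run that has already completed its configured epochs should be checked against the package documentation.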
