
Continue training on best.pt after running all epochs successfully #7343

Closed
1 task done
dhruvildave22 opened this issue Apr 8, 2022 · 24 comments
Labels
question Further information is requested

Comments

@dhruvildave22

Search before asking

Question

Hello!
Can anybody resolve my confusion about continuing training on best.pt after all epochs have run successfully?

I trained a yolov5s.pt model on COCO plus a custom dataset for 65 epochs, and it created a best.pt model. I want to run more epochs on it to get better precision and recall, so I started training from best.pt, which already reflects those 65 epochs. I initially thought that training from best.pt would continue from the 66th epoch, but after seeing the precision and recall values, I assume it starts fresh again.
I tried --resume, but it can only be used if the training was interrupted.
Is there any way I can continue training from where I finished with best.pt?

Thanks
Dhruvil Dave

Additional

No response

@dhruvildave22 dhruvildave22 added the question Further information is requested label Apr 8, 2022
@github-actions
Contributor

github-actions bot commented Apr 8, 2022

👋 Hello @dhruvildave22, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue; otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

glenn-jocher commented Apr 8, 2022

@dhruvildave22 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you can not modify the number of epochs once training has started.

LR Curves
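For intuition, here is a minimal sketch of a linear LR schedule of this kind built with PyTorch's LambdaLR. It is illustrative only, not the exact YOLOv5 implementation, and the lrf value is an assumption; the point is that the curve is tied to the total --epochs chosen at training start:

import torch

model = torch.nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

epochs, lrf = 300, 0.01  # total epochs and final LR fraction (assumed values)
lf = lambda x: (1 - x / epochs) * (1.0 - lrf) + lrf  # multiplier decays linearly from 1.0 to lrf
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)

for epoch in range(epochs):
    # ... train one epoch ...
    scheduler.step()  # the LR only reaches its minimum on the final epoch of the planned schedule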

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck 🍀 and let us know if you have any other questions!

@dhruvildave22
Author

@glenn-jocher So, if I train a model for 70 epochs and after a successful run it creates a best.pt, but the precision is not very good, I have to train more on best.pt.
My question here is:
If I have to run 70 more epochs, will best.pt already contain all the training from the earlier 70 epochs? And when I restart training on it, should I pass --epochs 70 --weights best.pt, or do I have to pass --epochs 140?

Thanks
Dhruvil Dave

@VYRION-Ai

@glenn-jocher So, if I train a model for 70 epochs and after a successful run it creates a best.pt, but the precision is not very good, I have to train more on best.pt. My question here is: if I have to run 70 more epochs, will best.pt already contain all the training from the earlier 70 epochs? And when I restart training on it, should I pass --epochs 70 --weights best.pt, or do I have to pass --epochs 140?

Thanks Dhruvil Dave

--epochs 70 --weights best.pt
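For context, a complete command along those lines might look like the following sketch (the dataset YAML, image size and weights path are placeholder assumptions):

python train.py --data data.yaml --img 640 --epochs 70 --weights runs/train/exp/weights/best.pt  # fine-tune best.pt for 70 more epochs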

@javiercalle97

Is it possible to restart a training run that stopped early due to patience? I set patience to 20, but looking at the mAP@0.5:0.95 curve, I think it was still improving and just needed more than 20 epochs without improvement.

I have tried to resume the training after increasing the patience value, but this error appears:

assert start_epoch > 0, f'{weights} training to {epochs} epochs is finished, nothing to resume.'
AssertionError: runs/train/exp6/weights/last.pt training to 300 epochs is finished, nothing to resume.

but the last epoch was 160.

@glenn-jocher
Member

glenn-jocher commented Apr 8, 2022

@javiercalle97 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you can not modify the number of epochs once training has started.

LR Curves

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck 🍀 and let us know if you have any other questions!

@dhruvildave22
Author

Thanks @glenn-jocher,
Closing the issue 🙂

@619732457

Hello, I want to add 500 more epochs to a model that has already completed 500 epochs of training, without restarting a new training run. I read the information above carefully, but I still don't quite understand how to do it.
For example, I now have epochs 0-499 and I want to get epochs 500-999.

@glenn-jocher
Member

glenn-jocher commented May 8, 2022

@619732457 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you can not modify the number of epochs once training has started.

LR Curves

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck 🍀 and let us know if you have any other questions!

@Transigent
Contributor

This is confusing, because there is more than one answer on StackOverflow, like this one https://stackoverflow.com/a/72922281/521342, that says the run can be resumed for more epochs by changing the number of epochs in the opt.yaml file and using the --resume syntax.

@glenn-jocher indicates in the response above that the LR schedulers are designed to taper off to a minimum at the epoch count specified in the original training command.

It is still not clear what the impact on the scheduler values is if the opt.yaml file is modified and the run is then resumed.
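To make the scheduler concern concrete, here is a small illustrative calculation using the same linear-decay assumption as the sketch earlier in this thread (lrf = 0.01 is an assumed value): the LR multiplier at a given epoch depends on the total epoch count, so editing epochs in opt.yaml before resuming changes the LR the run will use from that point on.

lrf = 0.01  # assumed final LR fraction
lf = lambda x, epochs: (1 - x / epochs) * (1.0 - lrf) + lrf  # linear decay multiplier

epoch = 160
print(lf(epoch, 300))  # ~0.47 -> multiplier if the original 300-epoch schedule is kept
print(lf(epoch, 600))  # ~0.74 -> higher multiplier if opt.yaml is edited to 600 epochs before resuming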

@arielkantorovich

I don't see a concrete answer here. If I continue training, for example for another 200 epochs from the best.pt weights, the mAP starts very low again. Is there no option to continue training for more epochs once the run has finished? Since the program completed, I can only use --resume if the training happened to be interrupted.

@glenn-jocher
Member

@arielkantorovich hello! I understand the confusion regarding resuming training for additional epochs in YOLOv5.
I apologize for any ambiguity in my previous response.

The number of epochs in the opt.yaml file is only used during the initial training process when python train.py is first executed. Once the training process finishes and you have the best.pt or last.pt weights file, you cannot simply resume training with additional epochs by modifying the opt.yaml file. The LR schedulers in YOLOv5 are designed to taper off to a minimum value at the epoch count specified in the original training command, and it is not recommended to modify this value.

You can resume training from the most recently saved checkpoint file using --resume path/to/last.pt. However, this only resumes training from where it left off and does not add any additional epochs. If you want to add more epochs to an already-trained model, you can start a new training run from best.pt by using the --weights path/to/best.pt argument.

I hope this clears up any confusion. Let us know if you have any further questions or concerns!

@arielkantorovich

Thank you for your reply. When I train from weights/last.pt, I see that my mAP is very low and does not really continue to increase from the last results. Why is this happening?
I expected it to continue around the same results. To be more concrete: I trained for 200 epochs and got a last.pt at around 0.251 mAP; now, when I train again with --weights last.pt, my training starts with an mAP of 0.0447.

@glenn-jocher
Member

@arielkantorovich hello!

Thank you for reaching out. It is not uncommon to see a lower mAP at the start when you launch a new training run from a last.pt checkpoint with --weights. In that case only the model weights are carried over: the epoch counter, optimizer state, EMA and LR schedule all start fresh, so the first epochs run at a relatively high learning rate (including warmup), which temporarily perturbs the already-converged weights and can make the mAP drop before it recovers.

You could try lowering the initial learning rate (lr0) or increasing warmup_epochs in the hyperparameter YAML used for the new run. Another option is to start the new training run from the best.pt checkpoint file instead of last.pt by using the --weights path/to/best.pt argument, which initializes training with the best weights achieved during the previous sessions.
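If it helps, here is a minimal sketch of that kind of fine-tuning setup. It assumes the default hyperparameter file lives at data/hyps/hyp.scratch-low.yaml; the path, file name and values may differ between YOLOv5 versions and are illustrative only:

cp data/hyps/hyp.scratch-low.yaml data/hyps/hyp.finetune-custom.yaml  # copy the default hyperparameters
# edit lr0 (e.g. 0.01 -> 0.001) and warmup_epochs in the copied file, then pass it to train.py:
python train.py --weights path/to/best.pt --hyp data/hyps/hyp.finetune-custom.yaml --data data.yaml --epochs 100 --img 640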

I hope this helps. Let us know if you have any other questions or concerns.

@aravindchakravarti

Hi @glenn-jocher,

Sorry, I am asking the same question as above. I have read all the posts here; I have the same problem and have been unable to solve it.

I am training YOLOv5 in Google Colab. Since time in Colab is limited, and since I don't know how many epochs I will need to reach the desired performance, I want to train step by step.

  • I started training a YOLOv5 (nano) model from scratch, trained for 100 epochs, and saved the results + weights to Google Drive:
    !python train.py --img 640 --epochs 100 --data data.yaml --weights ' ' --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25
  • Then used the best.pt from the step above to train the network again for another 100 epochs:
    !python train.py --img 640 --epochs 100 --data data.yaml --weights path/to/best.pt --cfg yolov5n.yaml --batch-size 90 --name person_detection --project /content/drive/MyDrive/YoloV5_Ultralytics/results/ --patience 25
  • Repeat step 2 until the desired mAP is reached.

Unfortunately, whenever I continue training (i.e., from the previous best.pt), I see that YOLOv5 training starts from scratch. Can you please help?


@glenn-jocher
Member

@aravindchakravarti 👋

Thank you for detailing your steps. What you are seeing is expected behaviour for --weights: when you start a new run with --weights path/to/best.pt, YOLOv5 initializes the model from those weights, but the epoch counter, optimizer state and LR schedule start fresh, so the logs look like a brand-new training run rather than a continuation of the previous one.

However, it's essential to ensure that the best.pt file is being saved correctly. You could verify this by inspecting the file size and modified timestamps of the best.pt file to confirm that it is being updated at each checkpoint.
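One quick way to check is to inspect the checkpoint file directly. The sketch below assumes the checkpoint is a dict containing an 'epoch' entry, as YOLOv5 checkpoints typically are; the path is a placeholder, and it should be run from inside the yolov5/ repo so the pickled model class can be found:

import os, torch

ckpt_path = 'runs/train/person_detection/weights/best.pt'  # hypothetical path
print(os.path.getsize(ckpt_path), os.path.getmtime(ckpt_path))  # file size and last-modified time

ckpt = torch.load(ckpt_path, map_location='cpu')
print(ckpt.get('epoch'))         # epoch at which this checkpoint was saved (often -1 once a run has fully finished)
print(ckpt.get('best_fitness'))  # best fitness recorded so far, if present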

In other words, a new run started from best.pt does benefit from the previously learned weights even though its metrics restart from epoch 0. If you find that your mAP is not improving when fine-tuning from best.pt, you could consider lowering the initial learning rate (lr0 in your hyperparameter YAML) and adjusting other hyperparameters to fine-tune the training process.

I hope this information helps. Please feel free to reach out if you have any further questions or issues!

@Varun3713-creator

@glenn-jocher hello,
I have trained a YOLOv9 model on my dataset using:
!python train.py \
--batch 16 --epochs 50 --img 640 --device 0 --min-items 0 --close-mosaic 15 \
--data /content/drive/MyDrive/mango_readyallclasses/data.yaml \
--weights {HOME}/weights/gelan-c.pt \
--cfg models/detect/gelan-c.yaml \
--hyp hyp.scratch-high.yaml

Now I want to add more epochs to get a higher mAP. Can you share the command to resume the training from here for YOLOv9? Please reply as soon as possible.

@glenn-jocher
Member

Hello @Varun3713-creator,

Thank you for reaching out! It looks like you're interested in resuming your training to add more epochs and improve your mAP. To do this, you can use the --resume flag with the path to your last checkpoint. Here’s how you can do it:

  1. Ensure you have the latest version: First, make sure you are using the latest versions of torch and the YOLOv5 repository. You can update your repository with the following commands:

    git pull
    pip install -r requirements.txt
  2. Resume Training: To resume training from your last checkpoint, you can use the --resume flag. This will continue training from where it left off, using the most recent checkpoint saved in your training directory. Here’s an example command:

    !python train.py --resume /content/drive/MyDrive/mango_readyallclasses/weights/last.pt

If you encounter any issues or if the training does not resume as expected, please provide a minimum reproducible code example so we can investigate further. You can find more information on how to create a minimum reproducible example here. This will help us better understand the problem and provide a more accurate solution.

Feel free to reach out if you have any other questions or need further assistance. Happy training! 🚀

@Varun3713-creator

Varun3713-creator commented Jun 28, 2024 via email

@glenn-jocher
Member

Hello @Varun3713-creator,

Thank you for your follow-up question! I understand the confusion regarding specifying the number of epochs when resuming training.

When you use the --resume flag, YOLOv5 automatically continues training from the last saved checkpoint, including the number of epochs left to train. You do not need to specify the number of epochs again. The training will resume from where it left off, continuing until it reaches the total number of epochs specified in the original training command.

For example, if you initially set --epochs 50 and the training was interrupted at epoch 30, using --resume will continue training from epoch 30 to epoch 50.

If you want to extend the total number of epochs beyond the original setting, you can modify the opt.yaml file or specify the new total number of epochs directly in the command line. Here’s how you can do it:

  1. Modify the Total Epochs: If you initially trained for 50 epochs and now want to train for an additional 50 epochs (total 100 epochs), you can specify the new total number of epochs when resuming:
    !python train.py --resume /content/drive/MyDrive/mango_readyallclasses/weights/last.pt --epochs 100

This command will continue training from the last checkpoint and extend the total number of epochs to 100.

Please ensure you are using the latest versions of torch and the YOLOv5 repository to avoid any compatibility issues. If you encounter any further issues, providing a minimum reproducible code example will help us investigate and provide a more accurate solution. You can find more information on creating a minimum reproducible example here.

Feel free to reach out if you have any other questions or need further assistance. Happy training! 🚀

@Varun3713-creator

Varun3713-creator commented Jul 1, 2024 via email

@Varun3713-creator

Varun3713-creator commented Jul 2, 2024 via email

@Glenn

Glenn commented Jul 2, 2024

Can you guys please stop tagging me; I'm not the right Glenn, and I keep having to unsubscribe from your threads!

@glenn-jocher
Member

Hello @Glenn,

Thank you for bringing this to our attention. We apologize for the inconvenience caused by the incorrect tagging. We'll ensure to be more careful in the future to avoid tagging the wrong person.

To the community: please note that @Glenn is not the Glenn Jocher involved in this project, so please avoid tagging that account. Let's focus on addressing the issue at hand.


@Varun3713-creator,

Thank you for your patience. The error you're encountering, "local variable 'epoch' referenced before assignment," suggests there might be an issue with the way the resume functionality is being handled in your specific setup.

To help us investigate further, could you please provide a minimum reproducible code example? This will allow us to better understand the context and pinpoint the issue. You can find guidance on creating a minimum reproducible example here.

In the meantime, please ensure you are using the latest versions of torch and the YOLOv5 repository. You can update your repository with the following commands:

git pull
pip install -r requirements.txt

If the issue persists, you might want to try the following workaround to manually specify the number of epochs when resuming training:

  1. Manually Adjust Epochs: Instead of using the --resume flag, you can manually load the weights and specify the total number of epochs. Here’s an example command:
    !python train.py --weights /content/drive/MyDrive/mango_readyallclasses/weights/last.pt --epochs 100 --data /content/drive/MyDrive/mango_readyallclasses/data.yaml --cfg models/detect/gelan-c.yaml --hyp hyp.scratch-high.yaml --batch 16 --img 640 --device 0 --min-items 0 --close-mosaic 15

This command will start a new training session using the weights from last.pt and continue training for a total of 100 epochs.

Please let us know if this resolves your issue or if you need further assistance. We're here to help! 😊
