
What hyperparams do I need to tune when I want to continue a previous training? #9257

Closed
1 task done
haimat opened this issue Sep 2, 2022 · 6 comments
Labels
question (Further information is requested), Stale

Comments

@haimat
Contributor

haimat commented Sep 2, 2022

Search before asking

Question

Sometimes I want to continue training using the best.pt model from a previous YOLOv5 training run. However, every time I do so, after only 2 or 3 epochs of the new training the model's performance drops considerably, often to nearly below 0.1, even though it was 0.5 in best.pt from the previous training.

I assume that is because the learning rate is too high. But this way I lose nearly all the training progress stored in best.pt, which is obviously not what I want. So I guess I need to tweak the hyperparameters for the second training.

Could you please advise which hyperparameters in particular I would need to tweak, and in which direction (up or down), when I want to fine-tune a model, i.e. continue from the best.pt file of a previous training session?

Additional

As an example, let's have a look at the following training performance, showing the mAP value of my model during 500 epochs:

[Image: mAP of the model over 500 training epochs]

Looking at the trend line, it seems the mAP of this model could be improved even further, say over another 500 training epochs. However, every time I continue training from best.pt of the run shown above, mAP drops to around 0.05 within the first 3-5 epochs, and it then takes a few hundred more epochs to climb back up. In the end, after 500 training epochs, I am roughly back to where I was at the end of the first training.

Thus I am basically starting over from scratch and losing many, many training epochs. So how can I start from the good mAP value of the first training run and continue from there?

@haimat haimat added the question (Further information is requested) label Sep 2, 2022
@glenn-jocher
Member

glenn-jocher commented Sep 2, 2022

@haimat 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you cannot modify the number of epochs once training has started.

[Screenshot: YOLOv5 LR scheduler curves over the training epochs]
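
To make the shape of that curve concrete, here is a minimal sketch (plain Python, not the actual YOLOv5 code) of a one-cycle cosine decay from lr0 down to lr0 * lrf over a fixed number of epochs, assuming the default values lr0=0.01, lrf=0.01 and epochs=300:

import math

lr0, lrf, epochs = 0.01, 0.01, 300  # defaults assumed for illustration

def lr_at(epoch):
    # cosine factor that falls from 1.0 at epoch 0 to lrf at the final epoch
    factor = ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1
    return lr0 * factor

print(lr_at(0))    # 0.01     -> starts at lr0
print(lr_at(150))  # ~0.00505 -> roughly halfway down the curve
print(lr_at(300))  # 0.0001   -> ends at lr0 * lrf

The schedule is tied to the epoch count fixed at training start, which is why --resume (which picks up the saved epoch and optimizer state) continues where the curve left off, while a fresh run with --weights restarts the curve from the top.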

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming: no arguments other than --resume should be passed, optionally followed by the path to the checkpoint you'd like to resume from. If no checkpoint is passed, the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck 🍀 and let us know if you have any other questions!

@haimat
Contributor Author

haimat commented Sep 2, 2022

@glenn-jocher Thanks, but this does not answer my question. I know what you wrote, but what I don't know is how exactly the hyperparameters influence the LR. As described in my use case, my question is: which hyperparameters do I need to modify, and in which way, if I want to do a second training using the best.pt file from a previous training?

@glenn-jocher
Member

glenn-jocher commented Sep 4, 2022

@haimat you don't need to modify anything: you can start a second training on any dataset from weights previously trained on any other dataset.

You can choose to experiment with hyperparameter variations, but of course I can't advise on this; the experimentation is on you. If you want an automated way of evolving hyperparameters, see our Hyperparameter Evolution tutorial below.

If you're just asking how to modify the LR, these values are here:

lr0: 0.01 # initial learning rate (SGD=1E-2, Adam=1E-3)
lrf: 0.01 # final OneCycleLR learning rate (lr0 * lrf)
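
As a quick illustration (plain arithmetic, not YOLOv5 code), with those defaults the schedule starts at lr0 and ends at lr0 * lrf:

lr0 = 0.01        # initial learning rate
lrf = 0.01        # final LR expressed as a fraction of lr0
print(lr0 * lrf)  # 0.0001 -> learning rate the scheduler decays to by the last epoch

# If the early drop in a second training really is caused by a too-large LR,
# lowering lr0 (e.g. to 0.001) is the natural value to experiment with.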

Tutorials

Good luck 🍀 and let us know if you have any other questions!

@haimat
Contributor Author

haimat commented Sep 4, 2022

@glenn-jocher Hi Glenn, thanks for your response. In particular I would be interested to know how the first few parameters influence training:

lr0: 0.01  # initial learning rate (SGD=1E-2, Adam=1E-3)
lrf: 0.01  # final OneCycleLR learning rate (lr0 * lrf)
momentum: 0.937  # SGD momentum/Adam beta1
weight_decay: 0.0005  # optimizer weight decay 5e-4
warmup_epochs: 3.0  # warmup epochs (fractions ok)
warmup_momentum: 0.8  # warmup initial momentum
warmup_bias_lr: 0.1  # warmup initial bias lr

I see their comments, but they are very brief. Is there more documentation on them anywhere?

@github-actions
Contributor

github-actions bot commented Oct 5, 2022

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@github-actions github-actions bot added the Stale label Oct 5, 2022
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 16, 2022
@glenn-jocher
Member

@haimat Yes, the hyperparameters you provided play crucial roles in the training process. Here's a brief overview:

  • lr0: This is the initial learning rate, and its value largely depends on the optimizer being used. For Stochastic Gradient Descent (SGD), the typical value is 1E-2, while for Adam, it is 1E-3. This parameter determines the size of the steps taken to update the model's weights during training.

  • lrf: This is the final OneCycleLR learning rate factor. The learning rate at the end of training is lr0 * lrf, so with the defaults (0.01 * 0.01) the schedule decays to 1e-4.

  • momentum: For SGD, this parameter represents the momentum, which determines the contribution of the accumulated gradient in updating the weights. The typical value is 0.937 for SGD and 0.9 for Adam.

  • weight_decay: This is the optimizer weight decay, and it helps in preventing the model from overfitting to the training data. The typical value is 5e-4.

  • warmup_epochs: This parameter sets the number of warmup epochs. During warmup, the learning rate is gradually increased from a small value to its scheduled value. Fractional values are also accepted.

  • warmup_momentum: This is the initial momentum used during warmup.

  • warmup_bias_lr: This parameter sets the initial learning rate for bias parameters during warmup (see the short sketch after this list).
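
As a rough sketch of how the three warmup values interact (simplified Python, not the actual YOLOv5 training loop, which works per-iteration and also applies the cosine schedule), assuming the defaults listed above:

import numpy as np

lr0, momentum = 0.01, 0.937
warmup_epochs, warmup_momentum, warmup_bias_lr = 3.0, 0.8, 0.1

def warmup_values(epoch):
    # During warmup, most parameter groups ramp their LR from 0 up to lr0,
    # bias parameters ramp down from warmup_bias_lr to lr0,
    # and momentum ramps from warmup_momentum up to its nominal value.
    x = min(epoch, warmup_epochs)
    lr = np.interp(x, [0.0, warmup_epochs], [0.0, lr0])
    bias_lr = np.interp(x, [0.0, warmup_epochs], [warmup_bias_lr, lr0])
    mom = np.interp(x, [0.0, warmup_epochs], [warmup_momentum, momentum])
    return float(lr), float(bias_lr), float(mom)

print(warmup_values(0.0))  # (0.0, 0.1, 0.8)     -> start of warmup
print(warmup_values(3.0))  # (0.01, 0.01, 0.937) -> warmup finished, nominal values

After warmup ends, all parameter groups follow the normal LR schedule, so these three values only shape the first few epochs of training.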

For more details and advanced guidance on these hyperparameters and their effects on training, you can refer to our documentation for YOLOv5.

I hope this provides a clearer understanding of how these hyperparameters influence the training process. Let me know if you have any more questions!
