
Self-contained checkpoint --resume #8839

Merged 14 commits into master on Aug 3, 2022
Conversation

@glenn-jocher (Member) commented Aug 2, 2022

Make checkpoints fully contained, allow resume directly from any checkpoint.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved checkpoint handling and enhanced URL checks in training scripts.

📊 Key Changes

  • Hyperparameters (hyp) are now saved with the checkpoint for improved reproducibility.
  • Added is_url function check to verify if a given string is a valid URL.
  • Simplified argument parsing in parse_opt function.
  • Enhanced resume function to better handle interrupted training sessions and to deal with URL data sources correctly.
  • Updated error messages and logging for clarity when resuming training.
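The new is_url check could look something like the following sketch (a guess at the behavior described, not necessarily the repo's exact implementation), requiring a parsed string to have both a scheme and a host:

```python
from urllib.parse import urlparse

def is_url(url):
    """Return True if the string parses as a URL with a scheme and host,
    e.g. 'https://example.com/last.pt', but not a local path like 'runs/last.pt'."""
    try:
        result = urlparse(str(url))
        return all([result.scheme, result.netloc])
    except Exception:
        return False
```

A check like this lets the resume logic distinguish remote data sources from local run directories before deciding how to handle them.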

🎯 Purpose & Impact

  • The ability to save hyperparameters directly with the checkpoints ensures that all necessary information for reproducing a training session is contained within the checkpoint itself. 🔄
  • The improved URL check (is_url) provides more robust validation and ensures that training data can be correctly accessed from remote sources. 🌍
  • Simplifying the argument parser makes the code more readable and maintainable. 🧹
  • Better resume functionality aids users in restarting interrupted training sessions without hassle, thereby saving time and computational resources. ⏱
  • Clearer error messaging and logging enhance the user's experience by providing better guidance and debugging information. 📝
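As a rough illustration of the "self-contained checkpoint" idea (the objects and dict keys here are illustrative stand-ins, not YOLOv5's exact schema), the hyperparameter dict is simply stored alongside the weights so one file carries everything:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the real training objects
model = nn.Linear(4, 2)
hyp = {"lr0": 0.01, "momentum": 0.937}  # hyperparameter dict

# Bundle everything needed to reproduce/resume into one file
ckpt = {
    "epoch": 5,                   # last completed epoch
    "model": model.state_dict(),  # weights
    "hyp": hyp,                   # hyperparameters now travel with the checkpoint
}
torch.save(ckpt, "last.pt")

# Resuming later needs nothing beyond the file itself
resumed = torch.load("last.pt")
model.load_state_dict(resumed["model"])
```

Because hyp is inside the file, a downloaded checkpoint no longer depends on a matching local hyperparameter YAML to resume correctly.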

@glenn-jocher glenn-jocher self-assigned this Aug 2, 2022
@glenn-jocher (Member, Author):

@kalenmike ok, I think this has all we need. The idea is that you can take one of these new fully-contained, partially-trained checkpoints, download it, and then run python train.py --resume path/to/partially_trained_checkpoint.pt.

It will create save_dir and resume training from where it left off automatically. For HUB, I think what we need on the server/API side is that when ultralytics.start(key) sees an existing partially trained model in the database, it downloads its last.pt and then runs python train.py --resume last.pt.

So for the user the command to start a new training or resume an existing training would be the same, and it would be up to us to handle them correctly depending on the model status in the database (fully or partially trained).
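The resume side of that flow could be sketched as follows (check_resume and the checkpoint keys are hypothetical stand-ins for whatever train.py actually does, under the assumption that epoch and hyp are stored in the file as described above):

```python
import torch
from pathlib import Path

def check_resume(resume_path):
    """Hypothetical helper: given --resume <path>, recover everything needed
    to continue training from the checkpoint file alone."""
    last = Path(resume_path)
    if not last.is_file():
        raise FileNotFoundError(f"Resume checkpoint not found: {last}")
    ckpt = torch.load(last, map_location="cpu")
    start_epoch = ckpt["epoch"] + 1  # continue after the last completed epoch
    hyp = ckpt["hyp"]                # hyperparameters come from the file, not the CLI
    return start_epoch, hyp
```

With a helper like this, the user-facing command is identical for fresh and resumed trainings; only the server-side decision of which file to hand to --resume differs.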

@glenn-jocher glenn-jocher changed the title Single checkpoint resume Self-contained checkpoint --resume Aug 2, 2022
@glenn-jocher glenn-jocher merged commit a75a110 into master Aug 3, 2022
@glenn-jocher glenn-jocher deleted the update/resume_from_checkpoint branch August 3, 2022 19:28
ctjanuhowski pushed a commit to ctjanuhowski/yolov5 that referenced this pull request Sep 8, 2022
* Single checkpoint resume

* Update train.py

* Add hyp

* Add hyp

* Add hyp

* FIX

* avoid resume on url data

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid resume on url data

* avoid resume on url data

* Update

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>