
Implement --save-period locally #5047

Merged
merged 1 commit into master from update/save_period on Oct 5, 2021
Conversation

glenn-jocher (Member) commented on Oct 5, 2021

This PR adds a new training argument `--save-period` to save training checkpoints every `x` epochs. To save a checkpoint every 50 epochs, for example:

```
python train.py --save-period 50  # saves epoch50.pt, epoch100.pt, epoch150.pt, ... etc.
```

This saves checkpoints in addition to the existing last.pt and best.pt checkpoints and does not affect their behavior. The default value is -1, i.e. disabled.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Implements periodic checkpoint saving during model training in YOLOv5.

📊 Key Changes

  • Added code to save model checkpoints periodically, controlled by a new command-line argument called --save-period (see the sketch after this list).
  • Organized Weights & Biases (W&B) related arguments under a separate comment in the argument parser for better readability.
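
A minimal sketch of the periodic-save logic, using a hypothetical save_periodic_checkpoint helper (the actual change is inline in train.py):

```
from pathlib import Path

import torch


def save_periodic_checkpoint(ckpt: dict, epoch: int, save_dir: Path, save_period: int) -> None:
    """Write epoch{N}.pt every `save_period` epochs; last.pt/best.pt handling is unchanged."""
    # save_period <= 0 (the default -1) disables periodic saving
    if save_period > 0 and epoch > 0 and epoch % save_period == 0:
        torch.save(ckpt, save_dir / f'epoch{epoch}.pt')  # e.g. epoch50.pt, epoch100.pt, ...
```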

🎯 Purpose & Impact

  • This change provides flexibility by allowing users to save model checkpoints at specified intervals (--save-period), improving resource management and providing intermediate points to resume training if necessary.
  • Cleanup and organization of W&B related code enhances maintainability and clarity for developers working with these options.
  • Users can avoid large gaps between recovery points during long training runs, reducing the risk of losing significant progress to unexpected interruptions. 🛡️

glenn-jocher linked an issue on Oct 5, 2021 that may be closed by this pull request
glenn-jocher (Member, Author)

@AyushExel FYI this PR applies the W&B --save_period argument (renamed to --save-period, without the underscore, to conform to the rest of the YOLO arguments; argparse converts it back to an underscore internally) to local checkpointing as well. It should not affect W&B ops.
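
For reference, argparse maps the dash in an option name to an underscore in the parsed attribute, so `--save-period` on the command line is read back as `opt.save_period` in code. A standalone illustration (not YOLOv5 code):

```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--save-period', type=int, default=-1, help='save checkpoint every x epochs (disabled if < 1)')
opt = parser.parse_args(['--save-period', '50'])
print(opt.save_period)  # -> 50; the attribute name uses an underscore
```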

glenn-jocher (Member, Author)

Verified correct operation in Colab.

NOTE: 'Epoch' checkpoints are not optimizer-stripped, so they will be larger than last.pt and best.pt upon training completion. This is intended behavior, allowing them to be used to --resume a run, i.e. python train.py --resume path/to/epoch50.pt.
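
For context on the size difference: the optimizer state inside a checkpoint is only needed to resume training, and dropping it shrinks the file. A minimal sketch of that stripping, assuming the checkpoint dict stores its optimizer state under an 'optimizer' key (illustrative only; YOLOv5 applies its own strip_optimizer utility to last.pt and best.pt when training finishes):

```
import torch


def strip_for_inference(src: str, dst: str) -> None:
    """Drop optimizer state to shrink a checkpoint; the result can no longer be used with --resume."""
    ckpt = torch.load(src, map_location='cpu')
    ckpt['optimizer'] = None  # assumed key; optimizer state is only needed for resuming
    torch.save(ckpt, dst)
```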

(Screenshot: Colab verification, Oct 4, 2021)

glenn-jocher merged commit 5afc9c2 into master on Oct 5, 2021
glenn-jocher deleted the update/save_period branch on Oct 5, 2021 at 01:48
AyushExel (Contributor)

@glenn-jocher awesome. Now both cloud-based and local checkpointing are aligned!
Just to confirm, last.pt will still be the latest checkpoint, updated after every epoch, right?

glenn-jocher (Member, Author)

@AyushExel yes that's correct. last.pt and best.pt are not affected by this change.

BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
Successfully merging this pull request may close these issues.

Periodical Weight Save
2 participants