Fix learning rate gap on resume #9468

Merged
merged 10 commits into main from fix-scheduler
Apr 2, 2024
Conversation

@Laughing-q (Member) commented Apr 1, 2024

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Improvements in training scheduling and more informative logging.

πŸ“Š Key Changes

  • Optimizer step warning suppression moved and simplified within the training loop.
  • Scheduling based on time now occurs directly after an epoch ends rather than inside a warning suppression block.
  • Introduction of a more descriptive logging prefix (EarlyStopping: ) for better clarity on training halts due to lack of improvement.

🎯 Purpose & Impact

  • Enhanced Efficiency πŸƒβ€β™‚οΈ: Moving the learning rate scheduler step outside of a warning suppression block streamlines the training process, potentially reducing confusion and improving code maintainability.
  • Time-Based Scheduling πŸ•’: Adjusting the scheduler after each epoch based on the average epoch duration and desired training time allows for more precise control over the training duration, making it easier to fit training into specific time slots.
  • Improved User Experience πŸ“ˆ: Introducing a colored "EarlyStopping:" prefix to logs makes it immediately clear when training has been halted due to no observed improvement, enhancing user understanding and control over the training process.

These changes aim to simplify the training loop for better performance, more precise time management, and clearer communication of important training events, all of which contribute to a more efficient and user-friendly training experience.
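For readers unfamiliar with the logging change, here is a minimal sketch of what a prefixed early-stop message could look like, assuming the ultralytics LOGGER and colorstr helpers; the wording, colors, and numbers are illustrative, not copied from the diff:

```python
# Illustrative sketch only -- not the exact lines from this PR.
from ultralytics.utils import LOGGER, colorstr


def log_early_stop(best_epoch: int, patience: int) -> None:
    """Log an early-stop message with a colored, easily searchable prefix."""
    prefix = colorstr("bold", "red", "EarlyStopping: ")
    LOGGER.info(
        f"{prefix}Training stopped early as no improvement was observed in the last "
        f"{patience} epochs. Best results observed at epoch {best_epoch}."
    )


log_early_stop(best_epoch=10, patience=50)
```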


codecov bot commented Apr 1, 2024

Codecov Report

Attention: Patch coverage is 45.45455%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 76.67%. Comparing base (e5f4f5c) to head (02a652e).

❗ Current head 02a652e differs from the pull request's most recent head e8cfdd6. Consider uploading reports for the commit e8cfdd6 to get more accurate results.

Files                              Patch %   Lines
ultralytics/engine/trainer.py      44.44%    5 Missing ⚠️
ultralytics/utils/torch_utils.py   50.00%    1 Missing ⚠️
Additional details and impacted files
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #9468       +/-   ##
===========================================
+ Coverage   37.99%   76.67%   +38.67%     
===========================================
  Files         121      120        -1     
  Lines       15277    15175      -102     
===========================================
+ Hits         5805    11635     +5830     
+ Misses       9472     3540     -5932     
Flag         Coverage Δ
Benchmarks   36.30% <9.09%> (?)
GPU          38.23% <45.45%> (+0.23%) ⬆️
Tests        71.93% <45.45%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

β˜” View full report in Codecov by Sentry.
πŸ“’ Have feedback on the report? Share it here.

@Laughing-q (Member Author) commented Apr 1, 2024

@glenn-jocher this PR fixes the LR issue when resuming. I tested a training run locally, interrupting it at epoch 5/10 and then resuming.
(screenshot of the interrupted/resumed training test)

I also tested timed training with:

yolo detect train data=runs/data/coco.yaml time=0.01
yolo detect train data=runs/data/coco.yaml time=0.02
yolo detect train data=runs/data/coco.yaml time=0.05

For quick verification, I manually set the train set to coco/val2017, which contains 5000 images. All three trainings behave exactly the same before and after this fix.

  • yolo detect train data=runs/data/coco.yaml time=0.01 finished training in the middle of the first epoch.
  • yolo detect train data=runs/data/coco.yaml time=0.02 finished training right after 1 epoch.
  • yolo detect train data=runs/data/coco.yaml time=0.05 finished training right after 2 epochs.
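The same timed-training checks can also be launched from the Python API; this is a small sketch in which the weights file and dataset path are illustrative rather than what was actually used above:

```python
# Sketch of the Python-API equivalent of the CLI commands above.
# `time` is a training-time budget in hours; values and paths are examples only.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(data="runs/data/coco.yaml", time=0.02)  # mirrors `yolo detect train ... time=0.02`
```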

@Laughing-q (Member Author)
@glenn-jocher please take a look and check whether I broke anything. Thanks!

@glenn-jocher glenn-jocher changed the title attempt to fix lr issue when resuming Attempt to fix lr issue when resuming Apr 1, 2024
@glenn-jocher (Member)
@Laughing-q it would be super nice if we could get rid of all that timed-training logic, but I think I had to include it for a reason. One test for timed training is that we need to run with both less and more time than the default epochs count would take. I'll run some tests on this.

@glenn-jocher (Member) commented Apr 1, 2024

Oh, I think I remember now: we need to update the LR scheduler so it hits lrf at the final (epochs-th) epoch, and I think this logic handles that since epochs keeps updating during training.

If we don't update the LR scheduler then timed training won't be optimal as the LR scheduler will be fixed to hit lrf on the default epochs instead of the updated epochs.

So timed training does 3 things at the end of each epoch (sketched in code after this list):

  1. Check the time that's been spent on the past epochs and figure out an average time per epoch.
  2. Compute new epochs to end training perfectly on the requested time.
  3. Update the LR scheduler to hit final lr on the new final epochs.
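A rough sketch of those three steps, assuming a trainer object exposing args.time (hours), args.lrf, and a standard LambdaLR scheduler; the attribute and function names here are placeholders, not the actual trainer fields:

```python
import math
import time

from torch.optim import lr_scheduler


def adjust_schedule_for_time(trainer, epoch: int) -> None:
    """Sketch of timed training: re-estimate total epochs and retarget the LR schedule."""
    # 1. Average time spent per epoch so far.
    mean_epoch_time = (time.time() - trainer.train_time_start) / (epoch + 1)

    # 2. New total epoch count so training ends close to the requested time budget.
    trainer.epochs = max(1, math.floor(trainer.args.time * 3600 / mean_epoch_time))

    # 3. Rebuild the LR lambda so it decays to lrf exactly at the new final epoch,
    #    then recreate the scheduler and fast-forward it to the current epoch.
    lrf = trainer.args.lrf
    trainer.lf = lambda x: max(1 - x / trainer.epochs, 0) * (1.0 - lrf) + lrf
    trainer.scheduler = lr_scheduler.LambdaLR(trainer.optimizer, lr_lambda=trainer.lf)
    trainer.scheduler.last_epoch = epoch  # so the next step() applies the LR for epoch + 1
```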

@Burhan-Q Burhan-Q added the enhancement New feature or request label Apr 1, 2024
Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
@glenn-jocher (Member)
@Laughing-q ok, I moved the scheduler.step() line to the beginning of the train loop instead of the end. The new LR will always be 1 step ahead of the previous one, but I think this is fine. Resume looks good now. I CTRL+C'd a run 3 times and resumed 3 times here:

(screenshots of the three interrupted-and-resumed runs: 2024-04-01 23:37:32, 23:38:01, 23:38:33)

What do you think?
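For anyone following along, here is a simplified sketch of the change being described, with placeholder train_one_epoch/validate callables standing in for the real trainer internals:

```python
import warnings


def train(model, optimizer, scheduler, train_one_epoch, validate, start_epoch: int, epochs: int) -> None:
    """Sketch of an epoch loop that steps the LR scheduler first, then trains and validates."""
    for epoch in range(start_epoch, epochs):
        with warnings.catch_warnings():
            # Quiet PyTorch's "lr_scheduler.step() before optimizer.step()" warning,
            # which would otherwise fire on the very first epoch.
            warnings.simplefilter("ignore")
            scheduler.step()

        # Stepping first means a resumed run immediately uses the LR for the epoch
        # it is about to train, instead of repeating the LR of the interrupted epoch.
        train_one_epoch(model, optimizer)  # placeholder for the actual batch loop
        validate(model)                    # placeholder for validation
```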

@Laughing-q (Member Author)
@glenn-jocher Looks good to me!
Actually this is interesting: moving scheduler.step() to the beginning seems to be the correct approach. I launched four training runs to test the LR change:

  • main branch, default settings vs warmup_epochs=0 (to check what the LR looks like if the scheduler starts working from the very beginning):
    (screenshot)

  • current branch, default settings vs warmup_epochs=0, which does scheduler.step() at the beginning:
    (screenshot)
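For anyone wanting to reproduce the comparison, the pairs of runs differ only in warmup_epochs; the commands would look roughly like this (dataset path and epoch count are illustrative):

yolo detect train data=runs/data/coco.yaml epochs=10
yolo detect train data=runs/data/coco.yaml epochs=10 warmup_epochs=0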

@glenn-jocher (Member)
Nice!!

Eunchan24 and others added 3 commits April 2, 2024 11:52
Co-authored-by: Lakshantha Dissanayake <lakshanthad@yahoo.com>
Co-authored-by: RizwanMunawar <chr043416@gmail.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: UltralyticsAssistant <web@ultralytics.com>
Co-authored-by: gs80140 <gs80140@users.noreply.github.com>
@glenn-jocher glenn-jocher changed the title Attempt to fix lr issue when resuming ultralytics 8.1.42 attempt to fix lr issue when resuming Apr 2, 2024
@glenn-jocher glenn-jocher changed the title ultralytics 8.1.42 attempt to fix lr issue when resuming ultralytics 8.1.42 learning-rate resume fix Apr 2, 2024
@glenn-jocher glenn-jocher changed the title ultralytics 8.1.42 learning-rate resume fix Fix learning rate gap on resume Apr 2, 2024
@glenn-jocher glenn-jocher merged commit 1e547e6 into main Apr 2, 2024
10 checks passed
@glenn-jocher glenn-jocher deleted the fix-scheduler branch April 2, 2024 09:55
@glenn-jocher (Member)
@Laughing-q PR merged!

hmurari pushed a commit to hmurari/ultralytics that referenced this pull request Apr 17, 2024
Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: UltralyticsAssistant <web@ultralytics.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: EunChan Kim <eunchan@hanyang.ac.kr>
Co-authored-by: Lakshantha Dissanayake <lakshanthad@yahoo.com>
Co-authored-by: RizwanMunawar <chr043416@gmail.com>
Co-authored-by: gs80140 <gs80140@users.noreply.github.com>
gkinman pushed a commit to Octasic/ultralytics that referenced this pull request May 30, 2024
Signed-off-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: UltralyticsAssistant <web@ultralytics.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: EunChan Kim <eunchan@hanyang.ac.kr>
Co-authored-by: Lakshantha Dissanayake <lakshanthad@yahoo.com>
Co-authored-by: RizwanMunawar <chr043416@gmail.com>
Co-authored-by: gs80140 <gs80140@users.noreply.github.com>