
Implementation of Early Stopping for DDP training #8345

Merged: 10 commits into ultralytics:master from giacomoguiduzzi:early_stopping_fix, Jun 29, 2022

Conversation

@giacomoguiduzzi (Contributor) commented Jun 26, 2022

This edit correctly uses the broadcast_object_list() function to send the slave processes a bool so that they end the training phase when the variable is True, thus allowing the master process to destroy the process group and terminate.
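
A minimal sketch of the pattern, assuming torch.distributed has been initialized and the usual train.py names (RANK, stopper, fi) are in scope; start_epoch and epochs stand in for the real loop bounds:

import torch.distributed as dist

for epoch in range(start_epoch, epochs):
    ...  # train and evaluate; rank 0 (the master) computes the fitness fi
    stop = False
    if RANK in (-1, 0):  # master process, or a single-GPU run
        stop = stopper(epoch=epoch, fitness=fi)  # early-stopping decision
    if RANK != -1:  # DDP: share the master's decision with every process
        broadcast_list = [stop] if RANK == 0 else [None]
        dist.broadcast_object_list(broadcast_list, 0)  # rank 0 -> all ranks
        if RANK != 0:
            stop = broadcast_list[0]  # slaves adopt the master's decision
    if stop:
        break  # every rank exits together, so the process group can be destroyed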

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Implementation of a unified early stopping mechanism for both single-GPU and DDP training in the YOLOv5 model.

πŸ“Š Key Changes

  • πŸ€– Added a single stop flag alongside the existing EarlyStopping object for better early stop control.
  • πŸ”„ The early stopping check now updates the stop flag which is used consistently across the training loop.
  • πŸš€ Optimized the training loop to use the new stop flag, removing previous early stop code for single-GPU and DDP.
  • πŸ“‘ Implemented broadcasting the stop flag in DDP (Distributed Data Parallel) training to ensure all processes stop simultaneously.

🎯 Purpose & Impact

  • πŸŽ“ Provides a cleaner, more consistent early stopping implementation for developers and researchers.
  • 🌐 Enhances the integrity of distributed training by ensuring all training processes stop at the same time when the criteria are met.
  • πŸƒβ€β™‚οΈ May speed up training by preventing unnecessary epochs, leading to faster experimentation cycles for users.
  • πŸ” Ensures more robust training across different training setups, potentially improving the reliability of model training for users.

@github-actions bot commented
πŸ‘‹ Hello @giacomoguiduzzi, thank you for submitting a YOLOv5 πŸš€ PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • βœ… Verify your PR is up-to-date with upstream/master. If your PR is behind upstream/master an automatic GitHub Actions merge may be attempted by writing /rebase in a new comment, or by running the following code, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
# git checkout feature  # <--- replace 'feature' with local branch name
git merge upstream/master
git push -u origin -f
  • βœ… Verify all Continuous Integration (CI) checks are passing.
  • βœ… Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@glenn-jocher (Member) commented Jun 27, 2022

@giacomoguiduzzi thanks for the PR! Is there any way to drop some or all of this code into the stopper() method itself? The class can access the global RANK variable by defining it in utils/torch_utils if required, i.e.:

RANK = int(os.getenv('RANK', -1))
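
For reference, a sketch of where that definition would live (module scope in utils/torch_utils.py, so the class can read it; the placement is an assumption):

# utils/torch_utils.py, at module level
import os

RANK = int(os.getenv('RANK', -1))  # -1 outside DDP; 0 is the master rank under DDP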

@giacomoguiduzzi (Contributor, Author) commented

> @giacomoguiduzzi thanks for the PR! Is there any way to drop some or all of this code into the stopper() method itself? The class can access the global RANK variable by defining it in utils/torch_utils if required, i.e.:
>
> RANK = int(os.getenv('RANK', -1))

Hi @glenn-jocher, no problem! I think if you wanted to drop this code into the EarlyStopper __call__ method, you'd need every device to call that function. At the moment, on line 438 the instruction stop = stopper(epoch=epoch, fitness=fi) is inside an if branch that is executed only by the master process. If you move this call outside of the if branch, so that every device calls the stopper() method, then every device can execute the broadcast_object_list() function inside the EarlyStopper class. Looking at the definition of the stopper variable on line 299, every process already defines it. I can see, though, that the EarlyStopper class uses the fi variable, which is currently computed only by the master device because it is the result of the evaluation phase.

Summing up, I think it is possible, but it would be necessary to broadcast the fi variable in the __call__ method so that every process can compute its own stop boolean value. Once that is done, broadcasting the stop variable is no longer necessary: the slave processes are synchronized, each of them will have stop = True when the stopping criterion is met, and every process will terminate together with the master.
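
A hypothetical sketch of that variant (the class shape only approximates YOLOv5's EarlyStopping; the broadcast of fitness replaces the broadcast of stop):

import os

import torch.distributed as dist

RANK = int(os.getenv('RANK', -1))

class EarlyStopping:
    def __init__(self, patience=30):
        self.best_fitness = 0.0
        self.best_epoch = 0
        self.patience = patience or float('inf')  # falsy patience disables early stopping

    def __call__(self, epoch, fitness):
        # Hypothetical: rank 0 computed `fitness` during evaluation, the other
        # ranks pass None; sharing it lets every rank derive the same decision.
        if RANK != -1:
            broadcast_list = [fitness] if RANK == 0 else [None]
            dist.broadcast_object_list(broadcast_list, 0)  # rank 0 -> all ranks
            fitness = broadcast_list[0]
        if fitness >= self.best_fitness:  # new best result
            self.best_epoch = epoch
            self.best_fitness = fitness
        return (epoch - self.best_epoch) >= self.patience  # True -> stop training

With this, every rank calls the stopper and reaches the same stop value locally, so no second broadcast is needed.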

Let me know if you want me to look into it.

@glenn-jocher (Member) commented

@giacomoguiduzzi I've cleaned up the PR a bit while maintaining the functionality, I think. Can you test on your side to verify that everything still works correctly? If it all looks good after your review I will proceed to merge. Thanks!

@glenn-jocher self-assigned this Jun 28, 2022
This cleans up the definition of broadcast_list and removes the requirement for clear() afterward.
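
As a sketch (not the exact diff), the cleaned-up shape condenses the list construction so a fresh single-element list is built at every check, leaving nothing to reset afterward:

broadcast_list = [stop if RANK == 0 else None]  # rebuilt each epoch, so no clear() needed
dist.broadcast_object_list(broadcast_list, 0)  # rank 0 sends stop to every rank
if RANK != 0:
    stop = broadcast_list[0]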
@glenn-jocher (Member) commented

@giacomoguiduzzi further cleaned up in 58bc763. I think this is OK, but I have not tested it with DDP early stopping.

@giacomoguiduzzi (Contributor, Author) commented

Hi @glenn-jocher, I've just tested your edits and the early stopping feature is working as intended. You're right, my code wasn't very pythonic...

@glenn-jocher merged commit 6935a54 into ultralytics:master Jun 29, 2022
@glenn-jocher removed the TODO label Jun 29, 2022
@glenn-jocher (Member) commented

@giacomoguiduzzi got it! PR is merged. Thank you for your contributions to YOLOv5 πŸš€ and Vision AI ⭐

@giacomoguiduzzi deleted the early_stopping_fix branch June 29, 2022 10:45
@giacomoguiduzzi restored the early_stopping_fix branch June 29, 2022 10:45
ctjanuhowski pushed a commit to ctjanuhowski/yolov5 that referenced this pull request Sep 8, 2022
* Implementation of Early Stopping for DDP training

This edit correctly uses the broadcast_object_list() function to send the slave processes a boolean so that they end the training phase when the variable is True, thus allowing the master process to destroy the process group and terminate.

* Update train.py

* Update train.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update train.py

* Update train.py

* Update train.py

* Further cleanup

This cleans up the definition of broadcast_list and removes the requirement for clear() afterward.

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>