
Implementation of Early Stopping for DDP training #8345

Merged: 10 commits into ultralytics:master from giacomoguiduzzi:early_stopping_fix, Jun 29, 2022

Conversation

@giacomoguiduzzi (Contributor) commented Jun 26, 2022

This edit correctly uses the broadcast_object_list() function to send the slave processes a bool so that they end the training phase when the variable is True, thus allowing the master process to destroy the process group and terminate.
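
A minimal sketch of the pattern, assuming torch.distributed has been initialized and the usual train.py names (RANK, stopper, fi) are in scope; start_epoch and epochs stand in for the real loop bounds:

import torch.distributed as dist

for epoch in range(start_epoch, epochs):
    ...  # train and evaluate; rank 0 (the master) computes the fitness fi
    stop = False
    if RANK in (-1, 0):  # master process, or a single-GPU run
        stop = stopper(epoch=epoch, fitness=fi)  # early-stopping decision
    if RANK != -1:  # DDP: share the master's decision with every process
        broadcast_list = [stop] if RANK == 0 else [None]
        dist.broadcast_object_list(broadcast_list, 0)  # rank 0 -> all ranks
        if RANK != 0:
            stop = broadcast_list[0]  # slaves adopt the master's decision
    if stop:
        break  # every rank exits together, so the process group can be destroyed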

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Implementation of a unified early stopping mechanism for both single-GPU and DDP training in the YOLOv5 model.

πŸ“Š Key Changes

  • πŸ€– Added a single stop flag alongside the existing EarlyStopping object for better early stop control.
  • πŸ”„ The early stopping check now updates the stop flag which is used consistently across the training loop.
  • πŸš€ Optimized the training loop to use the new stop flag, removing previous early stop code for single-GPU and DDP.
  • πŸ“‘ Implemented broadcasting the stop flag in DDP (Distributed Data Parallel) training to ensure all processes stop simultaneously.

🎯 Purpose & Impact

  • πŸŽ“ Provides a cleaner, more consistent early stopping implementation for developers and researchers.
  • 🌐 Enhances the integrity of distributed training by ensuring all training processes stop at the same time when the criteria are met.
  • πŸƒβ€β™‚οΈ May speed up training by preventing unnecessary epochs, leading to faster experimentation cycles for users.
  • πŸ” Ensures more robust training across different training setups, potentially improving the reliability of model training for users.

@github-actions bot commented
πŸ‘‹ Hello @giacomoguiduzzi, thank you for submitting a YOLOv5 πŸš€ PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • βœ… Verify your PR is up-to-date with upstream/master. If your PR is behind upstream/master an automatic GitHub Actions merge may be attempted by writing /rebase in a new comment, or by running the following code, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
# git checkout feature  # <--- replace 'feature' with local branch name
git merge upstream/master
git push -u origin -f
  • βœ… Verify all Continuous Integration (CI) checks are passing.
  • βœ… Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@glenn-jocher (Member) commented Jun 27, 2022

@giacomoguiduzzi thanks for the PR! Is there any way to drop some or all of this code into the stopper() method itself? The class can access the global RANK variable by defining it in utils/torch_utils if required, i.e.:

RANK = int(os.getenv('RANK', -1))
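
For reference, a sketch of where that definition would live (module scope in utils/torch_utils.py, so the class can read it; the placement is an assumption):

# utils/torch_utils.py, at module level
import os

RANK = int(os.getenv('RANK', -1))  # -1 outside DDP; 0 is the master rank under DDP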

@giacomoguiduzzi (Contributor, Author) commented

> @giacomoguiduzzi thanks for the PR! Is there any way to drop some or all of this code into the stopper() method itself? The class can access the global RANK variable by defining it in utils/torch_utils if required, i.e.:
>
> RANK = int(os.getenv('RANK', -1))

Hi @glenn-jocher, no problem! I think if you wanted to drop this code into the EarlyStopper __call__ method, you'd need every device to call that function. At the moment, on line 438 the instruction stop = stopper(epoch=epoch, fitness=fi) is inside an if branch that is executed only by the master process. If you move this call outside of the if branch, so that every device calls the stopper() method, then every device can execute the broadcast_object_list() function inside the EarlyStopper class. Looking at the definition of the stopper variable on line 299, every process already defines it. I can see, though, that the EarlyStopper class uses the fi variable, which is currently computed only by the master device because it is the result of the evaluation phase.

Summing up, I think it is possible, but it would be necessary to broadcast the fi variable in the __call__ method so that every process can compute its own stop boolean value. Once that is done, broadcasting the stop variable is no longer necessary: the slave processes are synchronized, each of them will have stop = True when the stopping criterion is met, and every process will terminate together with the master.
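
A hypothetical sketch of that variant (the class shape only approximates YOLOv5's EarlyStopping; the broadcast of fitness replaces the broadcast of stop):

import os

import torch.distributed as dist

RANK = int(os.getenv('RANK', -1))

class EarlyStopping:
    def __init__(self, patience=30):
        self.best_fitness = 0.0
        self.best_epoch = 0
        self.patience = patience or float('inf')  # falsy patience disables early stopping

    def __call__(self, epoch, fitness):
        # Hypothetical: rank 0 computed `fitness` during evaluation, the other
        # ranks pass None; sharing it lets every rank derive the same decision.
        if RANK != -1:
            broadcast_list = [fitness] if RANK == 0 else [None]
            dist.broadcast_object_list(broadcast_list, 0)  # rank 0 -> all ranks
            fitness = broadcast_list[0]
        if fitness >= self.best_fitness:  # new best result
            self.best_epoch = epoch
            self.best_fitness = fitness
        return (epoch - self.best_epoch) >= self.patience  # True -> stop training

With this, every rank calls the stopper and reaches the same stop value locally, so no second broadcast is needed.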

Let me know if you want me to look into it.

@glenn-jocher (Member) commented

@giacomoguiduzzi I've cleaned up the PR a bit while maintaining the functionality, I think. Can you test on your side to verify that everything still works correctly? If it all looks good after your review I will proceed to merge. Thanks!

@glenn-jocher self-assigned this Jun 28, 2022
This cleans up the definition of broadcast_list and removes the requirement for clear() afterward.
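
As a sketch (not the exact diff), the cleaned-up shape condenses the list construction so a fresh single-element list is built at every check, leaving nothing to reset afterward:

broadcast_list = [stop if RANK == 0 else None]  # rebuilt each epoch, so no clear() needed
dist.broadcast_object_list(broadcast_list, 0)  # rank 0 sends stop to every rank
if RANK != 0:
    stop = broadcast_list[0]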
@glenn-jocher (Member) commented

@giacomoguiduzzi further cleaned up in 58bc763. I think this is OK, but I have not tested it with DDP early stopping.

@giacomoguiduzzi (Contributor, Author) commented

Hi @glenn-jocher, I've just tested your edits and the early stopping feature is working as intended. You're right, my code wasn't very pythonic...

@glenn-jocher merged commit 6935a54 into ultralytics:master Jun 29, 2022
@glenn-jocher removed the TODO label Jun 29, 2022
@glenn-jocher (Member) commented

@giacomoguiduzzi got it! PR is merged. Thank you for your contributions to YOLOv5 πŸš€ and Vision AI ⭐

@giacomoguiduzzi deleted the early_stopping_fix branch June 29, 2022 10:45
@giacomoguiduzzi restored the early_stopping_fix branch June 29, 2022 10:45
ctjanuhowski pushed a commit to ctjanuhowski/yolov5 that referenced this pull request Sep 8, 2022
* Implementation of Early Stopping for DDP training

This edit correctly uses the broadcast_object_list() function to send the slave processes a boolean so that they end the training phase when the variable is True, thus allowing the master process to destroy the process group and terminate.

* Update train.py

* Update train.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update train.py

* Update train.py

* Update train.py

* Further cleanup

This cleans up the definition of broadcast_list and removes the requirement for clear() afterward.

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>