Add tensor hooks and 10.0 gradient clipping #8598

Merged (16 commits) on Aug 1, 2022

Conversation

@UnglvKitDe (Contributor) commented Jul 16, 2022

Addresses the stability issues described in #8578.

πŸ› οΈ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improves training stability with gradient clipping and NaN mitigation.

📊 Key Changes

  • 🧊 Added a hook to convert NaN gradient values to zero during training, reinforcing model stability.
  • ✂️ Introduced gradient clipping to prevent excessively large gradients, which can corrupt model weights.

🎯 Purpose & Impact

  • πŸ›‘οΈ The NaN conversion to zero avoids potential crashes or instability in training by handling undefined numerical values gracefully.
  • πŸ’ͺ Gradient clipping safeguards the training process by keeping gradients within a manageable range, promoting healthier and more stable weight updates.
  • βš™οΈ These enhancements are geared towards making the YOLOv5 training process more robust and reliable, beneficial for users looking to train models without encountering numerical issues.

@glenn-jocher linked an issue on Jul 16, 2022 that may be closed by this pull request.
@UnglvKitDe (Contributor Author)

@glenn-jocher I need to change the part with the hook; I noticed it isn't quite right. Sorry.

@glenn-jocher (Member)

> @glenn-jocher I need to change the part with the hook; I noticed it isn't quite right. Sorry.

Is this PR ready to test or do you still need to make changes?

@UnglvKitDe (Contributor Author)

@glenn-jocher Yes, it is ready now. I removed the unnecessary line of code; it would not have changed the result, but it was not needed. Thanks a lot!

@UnglvKitDe (Contributor Author)

@glenn-jocher A few insights: I've been running ~50 trainings on my private dataset with this new branch over the last few days. So far this version looks much more stable, and the random seed plays a less important role. No unstable trainings as described above have occurred so far, whereas previously sometimes 3 out of 9 trainings were unstable. I know that is just my dataset and not a general test :)

@glenn-jocher (Member)

@UnglvKitDe I'd like to close this out, but we need a few changes. If this improves default trainings with little cost/resource penalty, then we should enable it by default. We want to avoid adding argparser arguments if at all possible; we already have too many. Can you profile training with this branch and with master to compare mAP, training time, and CUDA memory utilization?

@glenn-jocher (Member)

@UnglvKitDe Also, why do we have to unscale the optimizer's gradients first? This seems like a costly operation; wouldn't it make more sense to scale the gradient clipping value by the scaler settings?

@UnglvKitDe (Contributor Author)

> @UnglvKitDe I'd like to close this out, but we need a few changes. If this improves default trainings with little cost/resource penalty, then we should enable it by default. We want to avoid adding argparser arguments if at all possible; we already have too many. Can you profile training with this branch and with master to compare mAP, training time, and CUDA memory utilization?

@glenn-jocher At the moment I unfortunately don't have the resources to run a full COCO training. My own dataset has only a few hundred images; COCO has ~120k. Sorry! Can you test this? If I can help in any other way, please let me know :)

@UnglvKitDe (Contributor Author)

> @UnglvKitDe Also, why do we have to unscale the optimizer's gradients first? This seems like a costly operation; wouldn't it make more sense to scale the gradient clipping value by the scaler settings?

@glenn-jocher As described in the linked reference, it is necessary to unscale the gradients first, because otherwise the clipping would operate on the wrong (still-scaled) values. I did not notice any significantly longer training on my dataset, but (as said above) my dataset is small. It may also improve training and lead to better results through more stable updates.
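
As a point of reference, below is a minimal, hedged sketch of the unscale-before-clip ordering that the PyTorch AMP (torch.cuda.amp) documentation recommends; the toy model, tensors, and hyperparameters are illustrative stand-ins, not the actual train.py code.

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x, y = torch.randn(8, 10, device=device), torch.randn(8, 1, device=device)
with torch.cuda.amp.autocast(enabled=(device == "cuda")):
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()

# Gradients still carry the loss-scale factor here, so clipping them against a
# fixed threshold such as 10.0 would be meaningless; unscale them first.
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)

# scaler.step() detects that unscale_() was already called and does not unscale twice.
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

Unscaling first (rather than scaling the clip threshold) also matters because GradScaler adjusts its scale factor dynamically during training, so a fixed threshold applied to scaled gradients would not correspond to a consistent gradient norm.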

@glenn-jocher (Member)

@UnglvKitDe thanks, I'll review the link and run some tests this weekend.

@glenn-jocher (Member)

@UnglvKitDe It looks like the PR does two separate things: first it replaces NaN with 0.0, and then it separately clips gradients. Do you know which change resulted in the improvements you saw? Are you able to determine which was more important?

@glenn-jocher (Member) commented Aug 1, 2022

@UnglvKitDe I'm testing in Colab now. I see the same training time but slightly worse mAP for the PR. I wonder if the clipping is too tight. We have 10.0 now; maybe try 30.0 or 100.0?

!git clone https://github.com/UnglvKitDe/yolov5-1 -b fix/grad_inf_nan  # clone
%cd yolov5-1
%pip install -qr requirements.txt  # install
!python train.py --img 640 --batch 16 --epochs 30 --data coco128.yaml --weights yolov5s.pt --cache

%cd ..
!git clone https://github.com/ultralytics/yolov5  # clone master for comparison
%cd yolov5
!python train.py --img 640 --batch 16 --epochs 30 --data coco128.yaml --weights yolov5s.pt --cache

@glenn-jocher (Member)

@UnglvKitDe I confirm that raising the clip max to 100.0 solves the lower-mAP issue. Do you think this might be too high?

@glenn-jocher removed the TODO label on Aug 1, 2022
@UnglvKitDe (Contributor Author)

> @UnglvKitDe It looks like the PR does two separate things: first it replaces NaN with 0.0, and then it separately clips gradients. Do you know which change resulted in the improvements you saw? Are you able to determine which was more important?

@glenn-jocher For me, both in combination improved the results.

@UnglvKitDe (Contributor Author)

@glenn-jocher About your changes: I don't know to what extent the parameters can actually be None; strictly speaking that makes no sense. It was advised once in a forum, but it may relate to an old, since-fixed bug. I have seen values for max_value from 2.0 to 5.0; 10 was already a very high choice on my part. I find 100 very, very high...

@UnglvKitDe (Contributor Author)

@glenn-jocher I think the value that is needed depends on the context, so I wanted to keep it configurable. I agree with you, though, that we already have too many argparser arguments. Maybe we could put this in a config file (a hyp*.yaml or the model *.yaml?).

@glenn-jocher changed the title from "Add tensor hooks and gradient clipping #8578" to "Add tensor hooks and 10.0 gradient clipping" on Aug 1, 2022
@glenn-jocher merged commit 0669f1b into ultralytics:master on Aug 1, 2022
@glenn-jocher (Member)

@UnglvKitDe I reduced the gradient clipping max to 10.0 after some experiments found not much difference from master.

PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐

@UnglvKitDe (Contributor Author)

@glenn-jocher Thx :)

@glenn-jocher (Member)

@UnglvKitDe I observed erratic training behavior (green line) with the nan_to_num hook in the classifier branch (I added it there as well), so I'm going to remove it from master.

[Screenshot: training-metrics plot (Aug 1, 2022) showing the erratic run as the green line]

@UnglvKitDe (Contributor Author) commented Aug 4, 2022

@glenn-jocher Hm, interesting; I have never seen anything like this. I see you kept my code except for the None check. I don't know whether that is the reason, but actually none of the weights should be None, and if they are, an exception should be thrown...

ctjanuhowski pushed a commit to ctjanuhowski/yolov5 that referenced this pull request Sep 8, 2022
* Add tensor hooks and gradient clipping ultralytics#8578

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove retain_grad(), because its not necessary

* Update train.py

* Simplify

* Update train.py

* Update train.py

* Update train.py

* Update train.py

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
@glenn-jocher (Member)

@UnglvKitDe I agree, the None checks should not be necessary, and the issue should be investigated further to ensure all weights are properly initialized and exceptions are handled appropriately. Thank you for bringing this to my attention.

Successfully merging this pull request may close the linked issue: NaNs and INFs in gradient values (#8578).