NaNs and INFs in gradient values #8578
Comments
@UnglvKitDe thanks for the bug report! So you are saying that …
@glenn-jocher Yes, in general such steps should be skipped when the gradients contain Infs or NaN values. However, it makes the training very unstable. An example from training on my private dataset: a training without AMP (same split, same random seeds) works without problems. While investigating, I looked at the gradients and noticed that the training starts to become unstable as soon as Infs appear in the gradients (the NaNs are mostly just a consequence of cuDNN when there are Infs earlier in the gradients, at least that is my analysis so far). Then I also looked (with the code mentioned above) at a from-scratch training on COCO128, and there you get Infs right at the very first gradient calculation. That cannot be correct, though I do not yet fully understand why it happens.
@glenn-jocher One more piece of info: the results from above are ~10% worse than with a stable training. Sometimes even changing the learning rate minimally is enough.
@UnglvKitDe ok, so these are numerical issues due to AMP effects. Maybe the initial lr0 is too high. We have a warmup that starts the two non-bias group lrs at 0.0, but the bias lr starts very high; maybe we should lower it: yolov5/train.py, lines 337 to 338 in f8722b4.
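(The code excerpt GitHub embedded at this point was not preserved. As a hedged illustration of the schedule being discussed, reconstructed rather than copied verbatim from f8722b4: during warmup the bias group's lr is interpolated down from warmup_bias_lr while the other groups ramp up from 0.0 to lr0.)

```python
import numpy as np

# Hypothetical values mirroring hyp.scratch-low.yaml; nw = warmup iterations.
warmup_bias_lr, lr0, nw = 0.1, 0.01, 1000

for ni in (0, 250, 500, 1000):  # ni = integrated batch number
    bias_lr = np.interp(ni, [0, nw], [warmup_bias_lr, lr0])  # falls 0.1 -> 0.01
    other_lr = np.interp(ni, [0, nw], [0.0, lr0])            # rises 0.0 -> 0.01
    print(f"ni={ni:4d}  bias lr={bias_lr:.4f}  other lr={other_lr:.4f}")
```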
Alternatively, maybe we should put a ceiling on the loss values; a rough sketch of the idea is below.
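(A minimal sketch of what such a ceiling could look like with plain torch.clamp; LOSS_CEILING is a hypothetical value, not something from this thread. Note the caveat in the comments.)

```python
import torch

LOSS_CEILING = 1e4  # hypothetical cap, not a value from this thread

def capped(loss: torch.Tensor) -> torch.Tensor:
    # Clamp the loss so a single pathological batch cannot explode the
    # gradients. Caveat: wherever the cap is active, the gradient is zeroed.
    return loss.clamp(max=LOSS_CEILING)

x = torch.tensor(3.0, requires_grad=True)
capped(x * 1e5).backward()  # stand-in loss of 3e5, above the cap
print(x.grad)               # tensor(0.) because the clamp was active
```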
yolov5/data/hyps/hyp.scratch-low.yaml, line 12 in f8722b4 (presumably the warmup_bias_lr: 0.1 setting referenced above).
@glenn-jocher Yes, it will be due to AMP; there are also numerous upstream issues about it, e.g. pytorch/pytorch#40497. I'm currently trying to see if gradient clipping with backward hooks helps. Lowering the learning rate at the beginning could also help; I'll try it out afterwards.
@UnglvKitDe got it! Let me know your results after you run your experiments.
@glenn-jocher I have done some tests. Adding gradient clipping (as is also recommended here) makes the training much more stable. Additionally, you can use tensor hooks and replace the Inf values, e.g. with 0; a minimal sketch of both measures follows below. I can create a PR, and then we can discuss the details there? Reducing the learning rate did not improve it.
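(A minimal sketch of the two measures described, using standard PyTorch APIs; the actual PR may differ in details such as the clipping threshold.)

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in for the YOLOv5 model

# 1. Tensor hooks: replace non-finite gradient values (NaN/Inf) with 0
#    as the gradients are computed during backward().
for p in model.parameters():
    p.register_hook(lambda g: torch.nan_to_num(g, nan=0.0, posinf=0.0, neginf=0.0))

loss = model(torch.randn(8, 10)).sum()
loss.backward()

# 2. Gradient clipping before the optimizer step. With AMP, call
#    scaler.unscale_(optimizer) first so the clip sees unscaled gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # threshold illustrative
```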
@UnglvKitDe got it! Yes, please submit a PR. Did you notice any speed reductions from the gradient clipping code?
@glenn-jocher On my private dataset it led to a 0.5% improvement (besides the generally more stable training). Unfortunately I can't test it on the original COCO dataset; I don't have the resources for that at the moment... Sorry!
@UnglvKitDe oh don't worry, I can test it on COCO once you submit the PR!
@glenn-jocher Thx :)
* Add tensor hooks and gradient clipping #8578
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Remove retain_grad(), because it's not necessary
* Update train.py
* Simplify
* Update train.py
* Update train.py
* Update train.py
* Update train.py

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
@UnglvKitDe you're welcome! If you have any more questions or need further assistance, feel free to ask. Good luck with the PR!
Search before asking
YOLOv5 Component
Training, Validation
Bug
The problem is that NaN and Inf values appear in the gradients, which makes the training unstable.
To reproduce the problem, you can set torch.autograd.set_detect_anomaly(True) before training starts and inspect the gradients, e.g. via hooks, or scan the gradients with a for-loop by inserting a check like the one sketched below before the optimizer step:
yolov5/train.py, line 364 in f8722b4
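(The excerpt embedded at that line was not preserved either; a sketch of the kind of check meant, where model is the model variable in train.py, might look like this.)

```python
# Enable once, before the training loop starts:
torch.autograd.set_detect_anomaly(True)

# ...then, just before the optimizer/scaler step:
for name, param in model.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"non-finite values in the gradient of {name}")
```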
I hit the problem on my private dataset, but to reproduce the error it is enough to train on the first mini-batch of coco128.
Environment
YOLOv5 with torch 1.11 (CUDA 11.3) and torch 1.12 (CUDA 11.6)
Minimal Reproducible Example
python train.py --batch-size 8 --weights '' --hyp data/hyps/hyp.scratch-high.yaml --cfg yolov5l6.yaml
Additional
No response
Are you willing to submit a PR?