
NaNs and INFs in gradient values #8578

Closed
2 tasks done
UnglvKitDe opened this issue Jul 14, 2022 · 13 comments · Fixed by #8598 or #8826
Labels
bug Something isn't working

Comments

@UnglvKitDe
Contributor

UnglvKitDe commented Jul 14, 2022

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training, Validation

Bug

The problem is that NaN and Inf values appear in the gradients, which makes training unstable.
To reproduce the problem, you can set torch.autograd.set_detect_anomaly(True) before training starts and inspect the gradients, e.g. via hooks, or you can search the gradients with a for-loop by inserting this:

scaler.unscale_(optimizer)
for name, p in model.named_parameters():
    if p.grad is not None and torch.logical_or(p.grad.isnan().any(), p.grad.isinf().any()):
        raise ValueError(f'Found error in parameters in input {name}'
                         f'\n-->Inf: {p.grad.isinf().any()}\n-->NaN: {p.grad.isnan().any()}'
                         f'\n-->Grad_type: {p.grad.dtype}\n-->Grad: {p.grad[p.grad.isinf()]}')

before the optimizer step:

yolov5/train.py

Line 364 in f8722b4

scaler.step(optimizer) # optimizer.step

I see the problem on my private dataset, but to reproduce the error it is enough to train on the first mini-batch of coco128.
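
For reference, here is a minimal sketch of the hook-based alternative mentioned above. The helper name is illustrative and model is assumed to be the YOLOv5 model created in train.py:

import torch

torch.autograd.set_detect_anomaly(True)  # make backward() raise with a traceback when an op produces NaN

def make_grad_check_hook(name):
    # hypothetical helper: report NaN/Inf gradients as soon as they are computed during backward()
    def hook(grad):
        if grad.isnan().any() or grad.isinf().any():
            print(f'NaN/Inf gradient in {name} (dtype={grad.dtype})')
        return grad
    return hook

for name, p in model.named_parameters():
    if p.requires_grad:
        p.register_hook(make_grad_check_hook(name))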

Environment

YOLOv5 torch 1.11 (cuda 11.3) and 1.12 (cuda 11.6)

Minimal Reproducible Example

python train.py --batch-size 8 --weights '' --hyp data/hyps/hyp.scratch-high.yaml --cfg yolov5l6.yaml

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@UnglvKitDe UnglvKitDe added the bug Something isn't working label Jul 14, 2022
@glenn-jocher
Member

@UnglvKitDe thanks for the bug report! So you are saying that the scaler does not solve this? I thought that NaNs and Infs are unrecoverable, i.e. the loss will eventually become entirely NaN if they are introduced anywhere. Is this what you are seeing? Or are you saying that there are NaNs in the loss but training still proceeds normally?

@UnglvKitDe
Contributor Author

@glenn-jocher Yes, in general the scaler should skip steps whose gradients contain Inf or NaN values. However, this makes training very unstable. An example from training on my private dataset:
[Image: results plot from the unstable training run]

Training without AMP (same split, same random seeds) works without problems. While investigating, I looked at the gradients and noticed that training starts to become unstable as soon as Infs appear in the gradients (the NaNs are mostly just a consequence of cuDNN once Infs appear earlier in the gradients, at least that is my current analysis). I then looked (also with the code above) at a from-scratch training on COCO128, and there you get Infs right at the very first gradient computation. That cannot be correct, but I do not yet fully understand why.
Maybe the initial values of the weights are too low.

@UnglvKitDe
Contributor Author

@glenn-jocher One more piece of information: the results above are ~10% worse than those from a stable training run. Sometimes even a minimal change to the learning rate is enough to make the difference.

@glenn-jocher
Member

@UnglvKitDe ok, so these are numerical issues due to AMP effects. Maybe the initial lr0 is too high. We have a warmup that starts the two lr0 groups at 0.0, but the bias lr starts very high; maybe we should lower it:

yolov5/train.py

Lines 337 to 338 in f8722b4

# bias lr falls from 0.1 to lr0, all other lrs rise from 0.0 to lr0
x['lr'] = np.interp(ni, xi, [hyp['warmup_bias_lr'] if j == 0 else 0.0, x['initial_lr'] * lf(epoch)])

Alternatively, maybe we should put a ceiling on the loss values.
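
For illustration, a standalone sketch of how that warmup interpolation behaves. The hyperparameter values mirror the hyp defaults quoted in this thread; nw (warmup iterations) is an assumption and the lf(epoch) factor from train.py is dropped for simplicity:

import numpy as np

lr0 = 0.01            # hyp['lr0'] (assumed default)
warmup_bias_lr = 0.1  # hyp['warmup_bias_lr']
nw = 1000             # number of warmup iterations (assumption)
xi = [0, nw]

for ni in (0, 250, 500, 750, 1000):  # ni = integrated batch count, as in train.py
    bias_lr = np.interp(ni, xi, [warmup_bias_lr, lr0])  # bias lr falls from 0.1 to lr0
    other_lr = np.interp(ni, xi, [0.0, lr0])            # all other lrs rise from 0.0 to lr0
    print(f'ni={ni:4d}  bias lr={bias_lr:.4f}  other lrs={other_lr:.4f}')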

@glenn-jocher
Member

warmup_bias_lr: 0.1 # warmup initial bias lr

@UnglvKitDe
Contributor Author

@glenn-jocher Yes, it is likely due to AMP; there are also numerous issues about it, e.g. pytorch/pytorch#40497. I'm currently trying to see whether gradient clipping with backward hooks helps. Lowering the learning rate at the beginning could also help; I'll try that afterwards.

@glenn-jocher
Member

@UnglvKitDe got it! Let me know your results after you run your experiments.

@UnglvKitDe
Contributor Author

@glenn-jocher I have done some tests. Adding gradient clipping makes training much more stable (as is also recommended here). Additionally, you can use tensor hooks to replace the Inf values, e.g. with 0. Shall I create a PR so we can discuss this in detail? Reducing the learning rate did not improve things.
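
A minimal sketch of that combination, assuming a GradScaler training loop like the one in train.py; the max_norm value and the hook details here are illustrative, not necessarily what the PR uses:

import torch

# tensor hooks: zero out non-finite gradient entries as they are produced in backward()
for p in model.parameters():
    if p.requires_grad:
        p.register_hook(lambda grad: torch.nan_to_num(grad, nan=0.0, posinf=0.0, neginf=0.0))

# ... forward pass, loss computation, scaler.scale(loss).backward() ...

# gradient clipping: unscale first so the clip threshold applies to the true gradients
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # max_norm=10.0 is an assumption
scaler.step(optimizer)
scaler.update()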

@glenn-jocher
Member

@UnglvKitDe got it! Yes please submit a PR.

Did you notice any speed reductions for the gradient clipping code?

@UnglvKitDe
Contributor Author

UnglvKitDe commented Jul 16, 2022

@glenn-jocher On my private dataset it led to a 0.5% improvement (besides the generally more stable training). Unfortunately I can't test it on the full COCO dataset, I don't have the resources for that at the moment... Sorry!

@glenn-jocher
Member

@UnglvKitDe oh don't worry, I can test it on COCO once you submit the PR!

@UnglvKitDe
Contributor Author

@glenn-jocher Thx :)

@glenn-jocher glenn-jocher linked a pull request Jul 16, 2022 that will close this issue
@glenn-jocher glenn-jocher removed the TODO label Jul 30, 2022
glenn-jocher added a commit that referenced this issue Aug 1, 2022
* Add tensor hooks and gradient clipping #8578

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove retain_grad(), because its not necessary

* Update train.py

* Simplify

* Update train.py

* Update train.py

* Update train.py

* Update train.py

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
@glenn-jocher glenn-jocher reopened this Aug 1, 2022
@glenn-jocher glenn-jocher linked a pull request Aug 1, 2022 that will close this issue
@glenn-jocher glenn-jocher changed the title NaNs and INFs in gradienten values NaNs and INFs in gradient values Aug 1, 2022
ctjanuhowski pushed a commit to ctjanuhowski/yolov5 that referenced this issue Sep 8, 2022
@glenn-jocher
Member

@UnglvKitDe you're welcome! If you have any more questions or need further assistance, feel free to ask. Good luck with the PR!
