
NaNs and INFs in gradient values #8578

Closed
2 tasks done
UnglvKitDe opened this issue Jul 14, 2022 · 13 comments · Fixed by #8598 or #8826
Labels
bug Something isn't working

Comments

@UnglvKitDe
Contributor

UnglvKitDe commented Jul 14, 2022

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training, Validation

Bug

The problem is that NaN and Inf values appear in the gradients, which makes training unstable.
To reproduce the problem, you can set torch.autograd.set_detect_anomaly(True) before training starts and inspect the gradients, e.g. via hooks, or you can search the gradients with a for-loop by inserting this:

scaler.unscale_(optimizer)
for name, p in model.named_parameters():
    if p.grad is not None and torch.logical_or(p.grad.isnan().any(), p.grad.isinf().any()):
        raise ValueError(f'Found error in parameters in input {name}'
                         f'\n-->Inf: {p.grad.isinf().any()}\n-->NaN: {p.grad.isnan().any()}'
                         f'\n-->Grad_type: {p.grad.dtype}\n-->Grad: {p.grad[p.grad.isinf()]}')

before the optimizer step:

yolov5/train.py

Line 364 in f8722b4

scaler.step(optimizer) # optimizer.step

I see the problem on my private dataset, but to reproduce the error it is enough to train on the first mini-batch of coco128.
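
For reference, here is a minimal sketch of the hook-based alternative mentioned above. The helper name is illustrative and model is assumed to be the YOLOv5 model created in train.py:

import torch

torch.autograd.set_detect_anomaly(True)  # make backward() raise with a traceback when an op produces NaN

def make_grad_check_hook(name):
    # hypothetical helper: report NaN/Inf gradients as soon as they are computed during backward()
    def hook(grad):
        if grad.isnan().any() or grad.isinf().any():
            print(f'NaN/Inf gradient in {name} (dtype={grad.dtype})')
        return grad
    return hook

for name, p in model.named_parameters():
    if p.requires_grad:
        p.register_hook(make_grad_check_hook(name))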

Environment

YOLOv5 torch 1.11 (cuda 11.3) and 1.12 (cuda 11.6)

Minimal Reproducible Example

python train.py --batch-size 8 --weights '' --hyp data/hyps/hyp.scratch-high.yaml --cfg yolov5l6.yaml

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@UnglvKitDe UnglvKitDe added the bug Something isn't working label Jul 14, 2022
@glenn-jocher
Member

@UnglvKitDe thanks for the bug report! So you are saying that the scaler does not solve this? I thought that NaNs and Infs are unrecoverable, i.e. the loss will eventually become entirely NaN if they are introduced anywhere. Is this what you are seeing? Or are you saying that there are NaNs in the loss but training still proceeds normally?

@UnglvKitDe
Contributor Author

@glenn-jocher Yes, in general the scaler should skip steps whose gradients contain Inf or NaN values. However, this makes training very unstable. An example from training on my private dataset:
[Image: results plot from the unstable training run]

Training without AMP (same split, same random seeds) works without problems. While investigating, I looked at the gradients and noticed that training starts to become unstable as soon as Infs appear in the gradients (the NaNs are mostly just a consequence of cuDNN once Infs appear earlier in the gradients, at least that is my current analysis). I then looked (also with the code above) at a from-scratch training on COCO128, and there you get Infs right at the very first gradient computation. That cannot be correct, but I do not yet fully understand why.
Maybe the initial values of the weights are too low.

@UnglvKitDe
Contributor Author

@glenn-jocher One more piece of information: the results above are ~10% worse than those from a stable training run. Sometimes even a minimal change to the learning rate is enough to make the difference.

@glenn-jocher
Member

@UnglvKitDe ok, so these are numerical issues due to AMP effects. Maybe the initial lr0 is too high. We have a warmup that starts the two lr0 groups at 0.0, but the bias lr starts very high; maybe we should lower it:

yolov5/train.py

Lines 337 to 338 in f8722b4

# bias lr falls from 0.1 to lr0, all other lrs rise from 0.0 to lr0
x['lr'] = np.interp(ni, xi, [hyp['warmup_bias_lr'] if j == 0 else 0.0, x['initial_lr'] * lf(epoch)])

Alternatively, maybe we should put a ceiling on the loss values.
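
For illustration, a standalone sketch of how that warmup interpolation behaves. The hyperparameter values mirror the hyp defaults quoted in this thread; nw (warmup iterations) is an assumption and the lf(epoch) factor from train.py is dropped for simplicity:

import numpy as np

lr0 = 0.01            # hyp['lr0'] (assumed default)
warmup_bias_lr = 0.1  # hyp['warmup_bias_lr']
nw = 1000             # number of warmup iterations (assumption)
xi = [0, nw]

for ni in (0, 250, 500, 750, 1000):  # ni = integrated batch count, as in train.py
    bias_lr = np.interp(ni, xi, [warmup_bias_lr, lr0])  # bias lr falls from 0.1 to lr0
    other_lr = np.interp(ni, xi, [0.0, lr0])            # all other lrs rise from 0.0 to lr0
    print(f'ni={ni:4d}  bias lr={bias_lr:.4f}  other lrs={other_lr:.4f}')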

@glenn-jocher
Member

warmup_bias_lr: 0.1 # warmup initial bias lr

@UnglvKitDe
Contributor Author

@glenn-jocher Yes, it is likely due to AMP; there are also numerous issues about it, e.g. pytorch/pytorch#40497. I'm currently trying to see whether gradient clipping with backward hooks helps. Lowering the learning rate at the beginning could also help; I'll try that afterwards.

@glenn-jocher
Member

@UnglvKitDe got it! Let me know your results after you run your experiments.

@UnglvKitDe
Contributor Author

@glenn-jocher I have done some tests. Adding gradient clipping makes training much more stable (as is also recommended here). Additionally, you can use tensor hooks to replace the Inf values, e.g. with 0. Shall I create a PR so we can discuss this in detail? Reducing the learning rate did not improve things.
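
A minimal sketch of that combination, assuming a GradScaler training loop like the one in train.py; the max_norm value and the hook details here are illustrative, not necessarily what the PR uses:

import torch

# tensor hooks: zero out non-finite gradient entries as they are produced in backward()
for p in model.parameters():
    if p.requires_grad:
        p.register_hook(lambda grad: torch.nan_to_num(grad, nan=0.0, posinf=0.0, neginf=0.0))

# ... forward pass, loss computation, scaler.scale(loss).backward() ...

# gradient clipping: unscale first so the clip threshold applies to the true gradients
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # max_norm=10.0 is an assumption
scaler.step(optimizer)
scaler.update()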

@glenn-jocher
Member

@UnglvKitDe got it! Yes please submit a PR.

Did you notice any speed reductions for the gradient clipping code?

@UnglvKitDe
Contributor Author

UnglvKitDe commented Jul 16, 2022

@glenn-jocher On my private dataset it led to a 0.5% improvement (besides the generally more stable training). Unfortunately I can't test it on the full COCO dataset, I don't have the resources for that at the moment... Sorry!

@glenn-jocher
Member

@UnglvKitDe oh don't worry, I can test it on COCO once you submit the PR!

@UnglvKitDe
Contributor Author

@glenn-jocher Thx :)

@glenn-jocher glenn-jocher linked a pull request Jul 16, 2022 that will close this issue
@glenn-jocher glenn-jocher removed the TODO label Jul 30, 2022
glenn-jocher added a commit that referenced this issue Aug 1, 2022
* Add tensor hooks and gradient clipping #8578

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove retain_grad(), because its not necessary

* Update train.py

* Simplify

* Update train.py

* Update train.py

* Update train.py

* Update train.py

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
@glenn-jocher glenn-jocher reopened this Aug 1, 2022
@glenn-jocher glenn-jocher linked a pull request Aug 1, 2022 that will close this issue
@glenn-jocher glenn-jocher changed the title NaNs and INFs in gradienten values NaNs and INFs in gradient values Aug 1, 2022
ctjanuhowski pushed a commit to ctjanuhowski/yolov5 that referenced this issue Sep 8, 2022
@glenn-jocher
Member

@UnglvKitDe you're welcome! If you have any more questions or need further assistance, feel free to ask. Good luck with the PR!
