When I was training, the loss value changed to nan #329

Open
AllenGitHub1 opened this issue Apr 7, 2024 · 3 comments

Comments

@AllenGitHub1

When I was training, the initial evaluation metrics were all normal, but after about 30 epochs the loss value became NaN, and mAP and the other metrics dropped to 0. May I ask what is causing this? I used YOLOv9-c.
[two image attachments]

@AllenGitHub1
Author

Is this caused by a gradient explosion? Would it be better to use a smaller model like YOLOv9-s?

@ikranergiz

Same here.

@pomoron

pomoron commented Apr 30, 2024

I was switching between YOLOv8 and v9 and ran into similar issues there:
ultralytics/ultralytics#280
It boils down to two problems (that I have come across so far):

  • Automatic Mixed Precision (AMP) clashing with CUDA 11.x (#280 of YOLOv8)
    The solution, I guess, is to disable it. In YOLOv9 I don't think you can pass --amp False to turn it off.
    My (not so wise) workaround is to go to def check_amp(model): in <path>/utils/general.py.
    Around line 587 it returns True or False based on a check; I force it to return False to shut AMP down (see the sketch right after this list).
    This takes more VRAM to train your model (because AMP is designed to reduce memory use), but hopefully it at least solves one problem...

  • The inf loss due to no predictions and a divide by zero
    In the loss function (YOLOv8 #490 and #1618) there is a variable target_scores_sum that can become 0 when nothing is predicted.
    It is then used as the denominator for the cls loss and beyond, so you get a divide by zero.
    Once one loss goes to infinity, the overall loss goes to infinity and the subsequent loss calculations become pointless...
    In <path>/utils/segment/loss_tal.py, at line 209, I made this change:

        # target_scores_sum = target_scores.sum()
        target_scores_sum = max(target_scores.sum(), 1)

At least now the loss is divided by 1 instead of 0.
This fixes loss_tal.py, which is used by segment/train.py, but not loss_tal_dual.py, which is used by segment/train_dual.py.
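
For anyone who wants to try the AMP workaround from the first bullet, a minimal sketch is below. It assumes check_amp(model) in <path>/utils/general.py behaves like the YOLOv5-style check (returning True or False after a quick sanity comparison); the body shown here is only illustrative, and the forced return False is the only real change:

    def check_amp(model):
        # ... the original AMP sanity checks would normally run here and
        # return True when mixed precision looks safe to use ...
        # Workaround: always report False so the trainer never enables AMP.
        # Training uses more VRAM, but avoids the AMP/CUDA 11.x NaN losses.
        return False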

Conclusion: I can train both models with the single (non-auxiliary) training scripts. For YOLOv9 I still can't train the variants with auxiliary branches, because the loss blowing up to infinity there isn't solved yet (a minimal illustration of the divide-by-zero behaviour is sketched at the end of this comment).
Admin... please help if you can
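
To make the divide-by-zero point concrete, here is a small self-contained illustration (not repository code) of why a batch with no assigned targets can turn the loss into NaN/inf, and why clamping the denominator to at least 1 avoids it. The same one-line clamp would presumably apply wherever target_scores_sum is computed in loss_tal_dual.py as well, though I haven't verified that file:

    import torch

    # Pretend the assigner matched nothing in this batch: all target scores are zero.
    target_scores = torch.zeros(4, 10)
    cls_loss = torch.tensor(0.0)           # summed BCE over all-zero targets

    bad = cls_loss / target_scores.sum()            # 0 / 0 -> nan (non-zero / 0 -> inf)
    good = cls_loss / max(target_scores.sum(), 1)   # denominator clamped to at least 1

    print(bad)   # tensor(nan)
    print(good)  # tensor(0.)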
