When I was training, the loss value changed to nan #329

Open
AllenGitHub1 opened this issue Apr 7, 2024 · 3 comments

Comments

@AllenGitHub1

When I was training, the initial evaluation metrics were all normal, but after about 30 epochs the loss value became NaN, and mAP and the other metrics dropped to 0. May I ask what is causing this? I used YOLOv9-c.
[two image attachments]

@AllenGitHub1
Author

Is this caused by a gradient explosion? Would it be better to use a smaller model like YOLOv9-s?

@ikranergiz

Same here.

@pomoron

pomoron commented Apr 30, 2024

I was switching between YOLOv8 and v9 and ran into similar issues there:
ultralytics/ultralytics#280
It boils down to two problems (that I have come across so far):

  • Automatic Mixed Precision (AMP) clashing with CUDA 11.x (#280 of YOLOv8)
    The solution, I guess, is to disable it. In YOLOv9 I don't think you can pass --amp False to turn it off.
    My (not so wise) workaround is to go to def check_amp(model): in <path>/utils/general.py.
    Around line 587 it returns True or False based on a check; I force it to return False to shut AMP down (see the sketch right after this list).
    This takes more VRAM to train your model (because AMP is designed to reduce memory use), but hopefully it at least solves one problem...

  • The inf loss due to no predictions and a divide by zero
    In the loss function (YOLOv8 #490 and #1618) there is a variable target_scores_sum that can become 0 when nothing is predicted.
    It is then used as the denominator for the cls loss and beyond, so you get a divide by zero.
    Once one loss goes to infinity, the overall loss goes to infinity and the subsequent loss calculations become pointless...
    In <path>/utils/segment/loss_tal.py, at line 209, I made this change:

        # target_scores_sum = target_scores.sum()
        target_scores_sum = max(target_scores.sum(), 1)

At least now the loss is divided by 1 instead of 0.
This fixes loss_tal.py, which is used by segment/train.py, but not loss_tal_dual.py, which is used by segment/train_dual.py.
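
For anyone who wants to try the AMP workaround from the first bullet, a minimal sketch is below. It assumes check_amp(model) in <path>/utils/general.py behaves like the YOLOv5-style check (returning True or False after a quick sanity comparison); the body shown here is only illustrative, and the forced return False is the only real change:

    def check_amp(model):
        # ... the original AMP sanity checks would normally run here and
        # return True when mixed precision looks safe to use ...
        # Workaround: always report False so the trainer never enables AMP.
        # Training uses more VRAM, but avoids the AMP/CUDA 11.x NaN losses.
        return False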

Conclusion: I can train both models with the single (non-auxiliary) training scripts. For YOLOv9 I still can't train the variants with auxiliary branches, because the loss blowing up to infinity there isn't solved yet (a minimal illustration of the divide-by-zero behaviour is sketched at the end of this comment).
Admin... please help if you can
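
To make the divide-by-zero point concrete, here is a small self-contained illustration (not repository code) of why a batch with no assigned targets can turn the loss into NaN/inf, and why clamping the denominator to at least 1 avoids it. The same one-line clamp would presumably apply wherever target_scores_sum is computed in loss_tal_dual.py as well, though I haven't verified that file:

    import torch

    # Pretend the assigner matched nothing in this batch: all target scores are zero.
    target_scores = torch.zeros(4, 10)
    cls_loss = torch.tensor(0.0)           # summed BCE over all-zero targets

    bad = cls_loss / target_scores.sum()            # 0 / 0 -> nan (non-zero / 0 -> inf)
    good = cls_loss / max(target_scores.sum(), 1)   # denominator clamped to at least 1

    print(bad)   # tensor(nan)
    print(good)  # tensor(0.)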
