
fix nan/inf loss #490

Merged
merged 2 commits into updates on Jan 19, 2023

Conversation

@Laughing-q (Member) commented Jan 19, 2023

@AyushExel @glenn-jocher this PR fixes the nan loss issue. The cause is that the target_scores_sum we use in the loss calculation can be 0 when there are no objects in the targets (empty labels, background-only images).

test command:

yolo detect train model=yolov8n.pt data=custom.yaml imgsz=640 rect=True

before fix (losses are nan and mAP stays at 0.15):
(screenshot)
after fix:
(screenshot)

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced stability in loss calculation during model training.

📊 Key Changes

  • Modified the calculation of target_scores_sum by ensuring it's never less than 1.

🎯 Purpose & Impact

  • 🛠 Prevents division by zero errors when calculating loss, increasing the robustness of training.
  • ⚖️ Contributors and users of the Ultralytics framework can expect more stable training sessions, particularly in edge cases with sparse detection targets.
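The change described above amounts to clamping the normalization denominator at 1. A minimal sketch of the idea in plain Python (the function name and signature here are illustrative, not the exact code from the Ultralytics source):

```python
def normalize_loss(loss_sum: float, target_scores_sum: float) -> float:
    """Divide a summed loss by the total target score, flooring the
    denominator at 1 so batches with no targets (empty labels,
    background-only images) cannot divide by zero and produce nan."""
    return loss_sum / max(target_scores_sum, 1)
```

With an empty-target batch, normalize_loss(0.0, 0) returns 0.0 instead of producing nan, while typical batches with large score sums are numerically unaffected by the clamp.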

@hdnh2006 (Contributor)

@Laughing-q I am still getting nan in my training, though it seems to be fixed for validation:
(screenshot)

After running pip install --upgrade ultralytics I get the following:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: ultralytics in ********/.virtualenvs/ultralytics/lib/python3.8/site-packages (8.0.10)

@AyushExel (Contributor)

@hdnh2006 wait for a while until this gets merged. We'll release the updated package later today.

@CoderYiFei

Well done @Laughing-q! I always get nan values on both the training and validation sets. Thanks very much. I am waiting for the merged code that fixes the nan value bug.

@glenn-jocher (Member) commented Jan 19, 2023

@Laughing-q do you know what a typical value of target_scores_sum is? Could this change affect COCO training results?

Should we add a smaller value for a protected divide, i.e. x / (target_scores_sum + eps) where eps might be something like 1e-6 or 1e-3?
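The two guards under discussion behave differently only for small denominators. A sketch of both (function names and the eps default are illustrative assumptions, not code from the repository):

```python
def clamp_divide(x: float, denom: float) -> float:
    # Guard adopted in this PR: floor the denominator at 1.
    return x / max(denom, 1)

def protected_divide(x: float, denom: float, eps: float = 1e-6) -> float:
    # Alternative raised here: add a small epsilon to the denominator.
    return x / (denom + eps)
```

For large denominators (e.g. target_scores_sum around 1000, as measured on COCO128 below) the two are practically identical; they diverge only when the denominator falls below 1, where the clamp scales the loss down while the epsilon guard leaves it nearly unchanged.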

@glenn-jocher (Member)

@Laughing-q I debugged this value on COCO128 and it's very large, i.e. target_scores_sum = 1000 at batch-size 16, so I think this is fine to max at 1.

@glenn-jocher glenn-jocher changed the base branch from main to updates January 19, 2023 13:23
@glenn-jocher glenn-jocher merged commit a94656d into updates Jan 19, 2023
@glenn-jocher glenn-jocher deleted the fix_nan_loss branch January 19, 2023 13:23
@Laughing-q (Member, Author)

@glenn-jocher Oh yes, I was just about to tell you this.
(screenshot)
