
fix nan/inf loss #490

Merged
merged 2 commits into updates on Jan 19, 2023

Conversation

@Laughing-q (Member) commented Jan 19, 2023

@AyushExel @glenn-jocher this PR fixes the nan loss issue. The cause is that the target_scores_sum we use in the loss calculation can be 0 when there are no objects in the targets (empty labels, background-only images).

test command:

yolo detect train model=yolov8n.pt data=custom.yaml imgsz=640 rect=True

before fix (losses are nan and mAP stays at 0.15):
(screenshot)
after fix:
(screenshot)

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced stability in loss calculation during model training.

📊 Key Changes

  • Modified the calculation of target_scores_sum by ensuring it's never less than 1.

🎯 Purpose & Impact

  • 🛠 Prevents division by zero errors when calculating loss, increasing the robustness of training.
  • ⚖️ Contributors and users of the Ultralytics framework can expect more stable training sessions, particularly in edge cases with sparse detection targets.
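The change described above amounts to clamping the normalization denominator at 1. A minimal sketch of the idea in plain Python (the function name and signature here are illustrative, not the exact code from the Ultralytics source):

```python
def normalize_loss(loss_sum: float, target_scores_sum: float) -> float:
    """Divide a summed loss by the total target score, flooring the
    denominator at 1 so batches with no targets (empty labels,
    background-only images) cannot divide by zero and produce nan."""
    return loss_sum / max(target_scores_sum, 1)
```

With an empty-target batch, normalize_loss(0.0, 0) returns 0.0 instead of producing nan, while typical batches with large score sums are numerically unaffected by the clamp.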

@hdnh2006 (Contributor)

@Laughing-q I am still getting nan in my training, though it seems to be fixed for validation:
(screenshot)

After running pip install --upgrade ultralytics I get the following:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: ultralytics in ********/.virtualenvs/ultralytics/lib/python3.8/site-packages (8.0.10)

@AyushExel (Contributor)

@hdnh2006 wait for a while until this gets merged. We'll release the updated package later today.

@CoderYiFei

Well done @Laughing-q! I always get nan values on both the training and validation sets. Thanks very much. I am waiting for the merged code that fixes the nan value bug.

@glenn-jocher (Member) commented Jan 19, 2023

@Laughing-q do you know what a typical value of target_scores_sum is? Could this change affect COCO training results?

Should we add a smaller value for a protected divide, i.e. x / (target_scores_sum + eps) where eps might be something like 1e-6 or 1e-3?
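The two guards under discussion behave differently only for small denominators. A sketch of both (function names and the eps default are illustrative assumptions, not code from the repository):

```python
def clamp_divide(x: float, denom: float) -> float:
    # Guard adopted in this PR: floor the denominator at 1.
    return x / max(denom, 1)

def protected_divide(x: float, denom: float, eps: float = 1e-6) -> float:
    # Alternative raised here: add a small epsilon to the denominator.
    return x / (denom + eps)
```

For large denominators (e.g. target_scores_sum around 1000, as measured on COCO128 below) the two are practically identical; they diverge only when the denominator falls below 1, where the clamp scales the loss down while the epsilon guard leaves it nearly unchanged.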

@glenn-jocher (Member)

@Laughing-q I debugged this value on COCO128 and it's very large, i.e. target_scores_sum = 1000 at batch-size 16, so I think this is fine to max at 1.

@glenn-jocher glenn-jocher changed the base branch from main to updates January 19, 2023 13:23
@glenn-jocher glenn-jocher merged commit a94656d into updates Jan 19, 2023
@glenn-jocher glenn-jocher deleted the fix_nan_loss branch January 19, 2023 13:23
@Laughing-q (Member, Author)

@glenn-jocher Oh yes, I was just about to tell you this.
(screenshot)
