
Understanding the recall and precision curve #12627

Closed
mansi-aggarwal-2504 opened this issue Jan 15, 2024 · 6 comments
Labels: question (Further information is requested), Stale

Comments

mansi-aggarwal-2504 commented Jan 15, 2024

Question

Hello!

Thank you for the great architecture.

I wanted to understand the ap_per_class in the metrics.py module, particularly the following snippet:

# Recall
recall = tpc / (n_l + eps)  # recall curve
r[ci] = np.interp(-px, -conf[i], recall[:, 0], left=0)  # negative x, xp because xp decreases

# Precision
precision = tpc / (tpc + fpc)  # precision curve
p[ci] = np.interp(-px, -conf[i], precision[:, 0], left=1)  # p at pr_score

Could someone help me understand the interpolation? If I want the absolute (overall?) recall and precision, can I simply use the recall and precision arrays directly?

Additional

I have just one class in my custom dataset and multiple instances of that class in each image. Say there are 100 images and 10 instances (arbitrary numbers) of the class in each image; I just want to know how many of these 1,000 instances got detected, how many of the detections are correct, what was missed, etc. Basically, I want the recall and precision of the model for a given dataset (at a given confidence).

Thanks.

mansi-aggarwal-2504 added the question label on Jan 15, 2024
@glenn-jocher (Member)

@mansi-aggarwal-2504 hello!

Thank you for your kind words and for reaching out with your question.

The interpolation in the ap_per_class function is used to estimate the precision and recall values at each confidence level. This is necessary because during evaluation, we calculate precision and recall at discrete confidence thresholds, but we want to estimate these metrics across all possible thresholds.

For your use case, if you're interested in the overall recall and precision of your model on the dataset at a specific confidence threshold, you can indeed use the recall and precision arrays directly. These arrays give you the recall and precision at each confidence level that was used during evaluation.

To get the overall metrics for your dataset, you would look at the values in these arrays corresponding to your chosen confidence threshold. If you want to calculate these metrics at a specific confidence threshold that wasn't directly evaluated during testing, you can use the interpolated values, which is what the code snippet you provided is doing.
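To make that concrete, here is a minimal, self-contained sketch of the same idea for a single class and a single IoU threshold (in metrics.py, tp has one column per IoU threshold, which is why the snippet indexes recall[:, 0]). The names confs, tp_flags, n_labels and conf_thres are illustrative placeholders, not variables from metrics.py:

import numpy as np

# Hypothetical per-detection data for one class, sorted by descending confidence
confs = np.array([0.95, 0.90, 0.80, 0.60, 0.40, 0.30])   # detection confidences
tp_flags = np.array([1, 1, 0, 1, 0, 1], dtype=float)      # 1 = matched a ground-truth box
n_labels = 8                                              # number of ground-truth instances
eps = 1e-16

tpc = tp_flags.cumsum()          # cumulative true positives as the threshold is lowered
fpc = (1 - tp_flags).cumsum()    # cumulative false positives
recall = tpc / (n_labels + eps)  # recall curve
precision = tpc / (tpc + fpc)    # precision curve

# Read the curves at an arbitrary confidence threshold. np.interp needs increasing xp,
# and confs is decreasing, so both x and xp are negated (the same trick as above).
conf_thres = 0.5
r_at_t = np.interp(-conf_thres, -confs, recall, left=0)
p_at_t = np.interp(-conf_thres, -confs, precision, left=1)
print(f"recall={r_at_t:.3f}, precision={p_at_t:.3f} at conf={conf_thres}")

If you only need the dataset-level numbers at one fixed confidence, the last few lines are all that's required; the px grid inside ap_per_class mainly exists so the full curves (and AP) can be computed on a common x-axis across classes.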

For more detailed information on how to interpret and use these metrics, please refer to our documentation at https://docs.ultralytics.com/yolov5/.

I hope this helps, and if you have any more questions, feel free to ask. Happy detecting! 😊

mansi-aggarwal-2504 (Author) commented Jan 30, 2024

Hello,
I have a follow up question to this. I noticed that my detections in test.py are getting capped at 300, so I changed the NMS settings.
Here are the old results for a test dataset (test dataset 1):
[screenshot]

And when I increased the max detections (test dataset 1):
[screenshot]

Here, I see a jump in recall but not a significant change in F1. Is the confidence threshold the major driver here? How could I interpret this better, and enhance my overall F1 score?

(Side note and query: these results are from a model trained on more data. I increased the dataset because I wasn't getting good recall, thinking that more data would improve feature extraction and help increase recall. However, the confidence threshold of the recall curve doesn't increase; the highest recall is still achieved at a low confidence.)

EDIT:
I also found that max_det doesn't necessarily act only as an upper bound: changing it from 300 to 1000 makes the model predict exactly 1000 objects, which is also a caveat, since for some test datasets I have more objects in an image (~800) but for others I don't (I have only 1 class).
Number of objects in ground truth images vs predictions when I set max det to 1000 (test dataset 2):

[screenshot]

Results when it was default 300 max det (test dataset 2):
[screenshot]

Results when I updated max det to 1000 (test dataset 2):

[screenshot]

My recall increased, but why is my precision intact even though it is picking up a lot more particles? Could someone help me understand this behaviour?

@glenn-jocher (Member)

Hello again @mansi-aggarwal-2504,

It's great to see you're diving deep into the performance of your model!

When you increase the max_det (maximum detections), you're allowing the model to output more predictions per image. This can lead to an increase in recall because the model has the opportunity to detect more true positives that it might have missed with a lower max_det. However, if the additional detections are mostly correct, your precision may not drop significantly, which is why you might not see a large change in the F1 score.

The confidence threshold is indeed a major driver in determining precision and recall. A lower threshold generally increases recall (more detections are considered), but it can also decrease precision (more false positives). Conversely, a higher threshold can increase precision (fewer false positives) but decrease recall (fewer detections overall).

To enhance your overall F1 score, you could:

  1. Adjust the confidence threshold to find a better balance between precision and recall (see the sketch after this list).
  2. Improve the quality of your training data or use data augmentation to help the model generalize better.
  3. Experiment with different model architectures or hyperparameters.
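
On point 1, a rough sketch of how you might sweep candidate thresholds is shown below. It reuses the same kind of cumulative recall/precision curves as in ap_per_class, but with illustrative placeholder data (confs, tp_flags, n_labels are not real YOLOv5 outputs):

import numpy as np

# Hypothetical per-detection confidences and TP flags, sorted by descending confidence
confs = np.array([0.95, 0.90, 0.80, 0.60, 0.40, 0.30])
tp_flags = np.array([1, 1, 0, 1, 0, 1], dtype=float)
n_labels = 8
eps = 1e-16

tpc, fpc = tp_flags.cumsum(), (1 - tp_flags).cumsum()
recall = tpc / (n_labels + eps)
precision = tpc / (tpc + fpc)

# Evaluate P, R and F1 on a grid of candidate confidence thresholds and pick the best F1
thresholds = np.linspace(0.05, 0.95, 19)
r = np.interp(-thresholds, -confs, recall, left=0)
p = np.interp(-thresholds, -confs, precision, left=1)
f1 = 2 * p * r / (p + r + eps)
best = f1.argmax()
print(f"best threshold ~{thresholds[best]:.2f}: P={p[best]:.3f}, R={r[best]:.3f}, F1={f1[best]:.3f}")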

Regarding the side note, adding more data can indeed improve recall if the additional data helps the model learn to detect objects it previously missed. However, the highest recall at a low confidence threshold suggests that your model may be outputting many detections with low confidence that happen to be correct. This could be a sign that your model is uncertain and could benefit from further training or data.

For the EDIT part, the max_det setting is a hard upper limit on the number of detections the model can output. If you set it to 1000 and there are fewer than 1000 objects, the model will not necessarily output 1000 detections; it will output as many detections as it finds up to that limit. If you have more objects in an image than the max_det, some objects may not be detected because of this limit.

If your precision remains unchanged after increasing max_det, it suggests that the additional detections are mostly correct. However, if you're seeing exactly 1000 detections for every image, it could be an indication that the max_det limit is being reached and the model would predict even more detections if allowed. This can distort your recall and precision metrics, since the model is not being allowed to make all the predictions it would naturally make.
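
One common reason for hitting the cap on every image is that the confidence threshold used during evaluation is very low (val.py uses a very low default conf-thres so the full PR curve can be built), so far more than max_det candidates survive thresholding and the cap always binds. Conceptually (this is not the actual NMS code, just a toy illustration):

import numpy as np

# Toy example: many low-confidence candidates survive a very low confidence threshold,
# so the max_det cap is hit on every image and exactly max_det boxes are returned.
rng = np.random.default_rng(0)
candidate_confs = np.sort(rng.random(2500))[::-1]   # candidates after NMS, high to low

conf_thres, max_det = 0.001, 1000
kept = candidate_confs[candidate_confs >= conf_thres][:max_det]
print(len(kept))   # 1000: the cap, not the true number of objects

At a higher confidence threshold (the one you would actually deploy with), the number of kept detections would normally fall well below the cap.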

To better understand your model's behavior, you might want to look at the distribution of confidence scores for the detections and see if there's a natural cutoff point that could inform a more appropriate max_det setting or confidence threshold for your use case.
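
As a quick way to look at that distribution, you could histogram the confidences of all detections across the test set. Here all_confs is an assumed 1-D array of per-detection confidences (with YOLOv5-style (x1, y1, x2, y2, conf, cls) outputs that would be column 4 of each detection tensor); random placeholder data stands in for it below:

import numpy as np

# Placeholder for the confidences of every detection gathered across the test set
all_confs = np.random.beta(2, 5, size=5000)

counts, edges = np.histogram(all_confs, bins=10, range=(0, 1))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.1f}-{hi:.1f}: {c}")

A sharp drop-off in the histogram can suggest a natural confidence cutoff for your use case.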

I hope this clarifies your questions. Keep up the good work! 😊

ashwin-999 commented Feb 29, 2024

Hello @glenn-jocher thanks for this awesome yolov5 repo!

I am working on a custom dataset and hoping I can reuse the ap_per_class function for evaluation. However, I'm a bit confused when interpreting the logged metrics. After searching through the list of issues, this one seemed like the best place to ask.

Below are 3 screenshots of metric logs. I can wrap my head around the scenario where P, R are both 0 (case 1).

I am a bit confused when the metrics say P=1 and R=0 (cases 2 & 3). Also, in cases 2 and 3, how should I interpret the mAP being zero vs non-zero?

case 1: [screenshot of metric logs]

case 2: [screenshot of metric logs]

case 3: [screenshot of metric logs]

github-actions bot (Contributor)

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

@glenn-jocher (Member)

Hello @ashwin-999!

Thank you for your kind words and for reaching out with your evaluation questions 🙌.

In evaluating object detection models like YOLOv5, Precision (P), Recall (R), and mAP (mean Average Precision) are important metrics. Here's a brief overview to help interpret your cases:

  • Case 1: P=0, R=0: This typically means that the model did not predict any objects, or all predicted objects were incorrect. Therefore, both precision and recall are 0.

  • Cases 2 & 3: P=1, R=0: This situation occurs when your model's predictions are all correct (hence P=1), but it failed to detect many or all of the actual objects present (thus R=0). It's a sign of very selective predictions: the few that are made are accurate, but many true positives are missed (see the numeric example after this list).

  • mAP Differences: mAP considers both precision and recall across multiple thresholds. A zero mAP in this context could indicate that, despite the correct predictions at specific confidence levels (leading to P=1), the overall performance across all thresholds is poor, possibly due to missing many true positives (hence R=0). Non-zero mAP suggests some balance of correct detections and misses over the range of thresholds, but still indicates room for improvement, especially in detecting more true positives to increase recall.
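
To make the corner cases concrete, here is a tiny numeric example using the standard definitions of precision, recall and F1 from raw TP/FP/FN counts (an illustrative helper, not a function from metrics.py):

eps = 1e-16

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (illustrative only)."""
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f1 = 2 * p * r / (p + r + eps)
    return round(p, 3), round(r, 3), round(f1, 3)

# Case 1: no correct predictions at all -> P=0, R=0
print(prf(tp=0, fp=3, fn=10))    # (0.0, 0.0, 0.0)

# Cases 2 & 3: a few confident, correct predictions but almost everything missed
# -> P=1 while R is (close to) 0, and F1 stays near 0
print(prf(tp=2, fp=0, fn=998))   # (1.0, 0.002, 0.004)

Whether the reported mAP is zero or non-zero then depends on how precision and recall trade off over the whole confidence range, not just at the single reported operating point.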

Interpreting these metrics effectively depends on analyzing them together and considering the balance (or imbalance) between detecting correctly (precision) and detecting most or all true objects (recall). Both high precision with low recall or vice versa usually indicate an area for model improvement.

The key is finding a balance that suits your application's needs, where sometimes detecting all objects (higher recall) may be more critical than being highly accurate in a fewer number of detections (precision), and other times, the reverse might be true.

I hope this clarifies the metrics a bit! Keep experimenting and fine-tuning your model for the best balance that fits your use case. Happy detecting! 😊
