Possible Evaluation Error in val.py #4251
@Johnathan-Xie this is good investigative work! I'll definitely look into this. Yes, the AP vector should remain unchanged in your experiment; if it is changing, then something is wrong. We have a second competition related to this mAP misalignment between pycocotools and YOLOv5 mAP in #2258, and if your discovery narrows the gap you may be eligible for prize money.
@Johnathan-Xie if I try to reproduce your results I also see differences in the AP vector:
@Johnathan-Xie I experimented with creating a new process_batch() function based on our confusion matrix code. This function returns identical APs across the vector no matter how the vector is constructed, but it results in lower overall mAP, which can't be right since we need to raise our mAP to meet pycocotools mAP.
```python
import numpy as np
import torch

from utils.metrics import box_iou  # YOLOv5 IoU helper


def process_batch_new(detections, labels, iouv):
    """
    Return a correct-prediction matrix, matching each label to at most one detection.
    Both sets of boxes are expected to be in (x1, y1, x2, y2) format.
    Arguments:
        detections (Array[N, 6]), x1, y1, x2, y2, conf, class
        labels (Array[M, 5]), class, x1, y1, x2, y2
    Returns:
        correct (Array[N, 10]), for 10 IoU levels
    """
    correct = torch.zeros(detections.shape[0], iouv.shape[0], dtype=torch.bool, device=iouv.device)
    gt_classes = labels[:, 0].int()
    detection_classes = detections[:, 5].int()
    iou = box_iou(labels[:, 1:], detections[:, :4])
    good = (iou > iouv[0]) & (gt_classes.view(-1, 1) == detection_classes)  # IoU above lowest threshold, class match
    x = torch.where(good)
    if x[0].shape[0]:
        matches = torch.cat((torch.stack(x, 1), iou[x[0], x[1]][:, None]), 1).cpu().numpy()  # [label, detection, iou]
        if x[0].shape[0] > 1:
            matches = matches[matches[:, 2].argsort()[::-1]]  # sort by IoU, descending
            matches = matches[np.unique(matches[:, 1], return_index=True)[1]]  # at most one label per detection
            matches = matches[matches[:, 2].argsort()[::-1]]
            matches = matches[np.unique(matches[:, 0], return_index=True)[1]]  # at most one detection per label
        matches = torch.Tensor(matches).to(iouv.device)
        m0, m1, iou = matches.T  # label_index, detection_index, iou
        correct[m1.long()] = iou[:, None] > iouv
    return correct
```
@glenn-jocher Hi, why did you comment out line 72 of process_batch in the newest val.py?
@wingvortex hi, thanks for the question. When we uncomment this line we see lower mAP, but we want to move in the opposite direction: our repo mAP consistently trails pycocotools mAP (see #2258), indicating that we are underestimating the true mAP we report on custom datasets.
* Improve mAP0.5-0.95

  Two changes provided:
  1. Added a limit on the maximum number of detections for each image, as in pycocotools.
  2. Reworked the process_batch function.

  Change 2 solves issue #4251. I also independently encountered the problem described in #4251, where the values for the same thresholds do not match when changing the limits in the torch.linspace function. These changes solve this problem. Validating the yolov5x.pt model currently gives the following results:

  from yolov5 validation:
  ```
  Class  Images  Labels      P      R  mAP@.5  mAP@.5:.95: 100%|██████████| 157/157 [01:07<00:00, 2.33it/s]
    all    5000   36335  0.743  0.626   0.682       0.506
  ```
  from pycocotools:
  ```
  Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.505
  Average Precision (AP) @[ IoU=0.50      | area= all | maxDets=100 ] = 0.685
  ```

  These results are very close, although they do not completely close the gap in competition issue #2258. I think the remaining difference comes from false-positive boxes that match pycocotools' ignore criteria, but this is not relevant for custom datasets and does not require an additional fix.

* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Remove line to retain pycocotools results
* Update val.py
* Update val.py
* Remove to device op
* Higher precision int conversion
* Update val.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
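The first change, capping detections per image the way pycocotools does with maxDets=100, can be sketched roughly as follows. This is an illustrative sketch, not the actual PR code: `limit_detections` and the assumed (N, 6) prediction layout (x1, y1, x2, y2, conf, class) are placeholders.

```python
import numpy as np

MAX_DETS = 100  # pycocotools evaluates with maxDets=100 for the headline AP

def limit_detections(preds, max_dets=MAX_DETS):
    """Keep only the top-`max_dets` highest-confidence detections for one image."""
    order = np.argsort(-preds[:, 4])  # sort by confidence (column 4), descending
    return preds[order[:max_dets]]

preds = np.random.rand(150, 6)            # 150 dummy detections for one image
print(limit_detections(preds).shape)      # (100, 6)
```

Applying the same cap on both sides removes one systematic source of disagreement between the two evaluators.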
I believe there is a slight error in the current validation scoring that may be lowering mAP@0.5:0.95, so fixing it should give a small boost to that main metric.
Proof of error
When cloning the official repository and running `python val.py --data coco.yaml --img 640 --conf 0.001 --iou 0.65 --weights yolov5s.pt`, with a print statement added to show the AP at each of the 10 IoU thresholds, I get:
[ 0.54585 0.51816 0.49107 0.45745 0.42016 0.3753 0.31443 0.23312 0.13052 0.021785]
Now if I run the exact same command, but restrict the IoU thresholds to only 6 points, 0.7 to 0.95, I receive this output:
[ 0.44055 0.38887 0.32283 0.23714 0.13167 0.021903]
If the code were correct, the last 6 values of the 10-point vector and the values of the 6-point vector should match; instead we see slightly higher values for the 6-point metric.
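To see why the two runs should agree at the shared thresholds, note that the last 6 entries of the 10-point threshold vector coincide exactly with the 6-point vector (shown here with NumPy, analogous to the torch.linspace call in val.py):

```python
import numpy as np

iouv10 = np.linspace(0.5, 0.95, 10)  # 0.50, 0.55, ..., 0.95
iouv6 = np.linspace(0.7, 0.95, 6)    # 0.70, 0.75, ..., 0.95

# The last 6 thresholds of the 10-point grid equal the 6-point grid,
# so the AP computed at each shared threshold should be identical.
print(np.allclose(iouv10[4:], iouv6))  # True
```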
Explanation of Error
Lines 63-72 of val.py contain the matching code in question.
The code selects all predictions above a certain IoU for a given class and then iterates through them to record which targets they match. While this gives an accurate measurement for AP@0.5, I believe it chooses matches in an effectively random order. If two predictions with IoU > 0.5, say 0.6 and 0.7, both match the same target, the 0.6 prediction could be chosen over the 0.7 one; since the target would then be in detected_set, the 0.7 prediction would not replace it. When computing mAP@0.5:0.95, the result may therefore be slightly lower than it should be, because a lower-IoU match is occasionally kept even though a higher-IoU prediction exists for that specific target.
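The order-dependence described above can be demonstrated with a toy sketch (this is not the val.py code; `match` is a hypothetical, simplified matcher for one shared target):

```python
import numpy as np

def match(ious, thresh, sort_by_iou=False):
    """Greedily assign one shared target to the first qualifying prediction.

    Returns the IoU of the match that was kept, or None if no prediction
    clears the threshold.
    """
    order = np.argsort(ious)[::-1] if sort_by_iou else range(len(ious))
    for i in order:
        if ious[i] > thresh:
            return ious[i]  # target is taken; later predictions are skipped
    return None

ious = [0.6, 0.7]                         # two predictions overlap the same target
kept_unsorted = match(ious, 0.5)          # 0.6 -- first in list order wins
kept_sorted = match(ious, 0.5, True)      # 0.7 -- best IoU wins

# Evaluated at the 0.65 threshold, only the IoU-sorted match still counts:
print(kept_unsorted > 0.65, kept_sorted > 0.65)  # False True
```

Sorting candidate matches by descending IoU before deduplicating (as the confusion-matrix-style process_batch above does) removes this order dependence.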
I am not certain of the best way to fix this, as I'm not too familiar with how a fix would integrate with the rest of the testing code. Also, I'm not sure whether this affects official results, since those seem to be evaluated with pycocotools.