
Possible Evaluation Error in val.py #4251

Closed
Johnathan-Xie opened this issue Aug 1, 2021 · 5 comments · Fixed by #4260
Labels: bug, TODO

Comments

@Johnathan-Xie

I believe there is a small error in the current validation code that may be slightly lowering mAP@0.5:0.95, so fixing it should give a slight boost to that main metric.

Proof of error
After cloning the official repository and running "python val.py --data coco.yaml --img 640 --conf 0.001 --iou 0.65 --weights yolov5s.pt", I added a print statement to output the AP at each of the 10 IoU thresholds, which gives:
[ 0.54585 0.51816 0.49107 0.45745 0.42016 0.3753 0.31443 0.23312 0.13052 0.021785]

Now if I run the exact same command but change the IoU thresholds to only 6 points, 0.7 to 0.95, I receive this output:
[ 0.44055 0.38887 0.32283 0.23714 0.13167 0.021903]
If the code were correct, the last 6 values of the 10-point vector would match the 6-point vector; instead, the 6-point values are slightly higher.
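
For reference, the print is roughly of this form (a sketch only; it assumes ap is the per-class AP array of shape (num_classes, num_iou_thresholds) returned by ap_per_class() in val.py, before it is averaged over thresholds):

# Sketch: print the class-averaged AP at each IoU threshold
print(ap.mean(0))  # e.g. 10 values when iouv = torch.linspace(0.5, 0.95, 10)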

Explanation of Error
Lines 63-72 of val.py show:

        ious, i = box_iou(predictions[pi, 0:4], labels[ti, 1:5]).max(1)  # best ious, indices
        detected_set = set()
        for j in (ious > iouv[0]).nonzero():
            d = ti[i[j]]  # detected label
            if d.item() not in detected_set:
                detected_set.add(d.item())
                detected.append(d)  # append detections
                correct[pi[j]] = ious[j] > iouv  # iou_thres is 1xn
                if len(detected) == nl:  # all labels already located in image
                    break

The code selects all predictions above a given IoU threshold for a class and then iterates through them to record which targets they match. While this gives an accurate measurement for AP@0.5, the matches are chosen in an effectively arbitrary order. If two predictions both have IoU > 0.5 with the same target, say 0.6 and 0.7, the 0.6 prediction can be matched first; because the target is then in detected_set, the 0.7 prediction is never considered. As a result, mAP@0.5:0.95 can come out slightly lower than it should, since a lower-IoU match is occasionally kept when a higher-IoU prediction exists for that target.
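
A toy sketch of this ordering dependence (hypothetical IoU values, mirroring the loop above):

import torch

# Predictions A and B both overlap the same target, with IoU 0.62 and 0.72.
# The greedy loop processes them in index order, so A claims the target first
# via detected_set and B is never considered.
iouv = torch.linspace(0.5, 0.95, 10)   # the 10 evaluation thresholds
ious = torch.tensor([0.62, 0.72])      # best IoU of predictions A and B with the single target
correct = torch.zeros(2, 10, dtype=torch.bool)
detected_set = set()
for j in (ious > iouv[0]).nonzero():
    d = 0                              # both predictions match target 0
    if d not in detected_set:
        detected_set.add(d)
        correct[j] = ious[j] > iouv    # only A (IoU 0.62) is credited
print(correct.any(0))  # credit stops at the 0.60 threshold; B's 0.72 would have covered 0.65 and 0.70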

I am not certain of the best way to fix this, as I'm not too familiar with how it would integrate with the rest of the testing code. I'm also not sure whether this affects official results, since those seem to be evaluated using COCO tools.

Johnathan-Xie added the bug label Aug 1, 2021
@glenn-jocher (Member)

@Johnathan-Xie this is good investigative work! I'll definitely look into this. Yes, the AP vector should remain unchanged in your experiment; if it is changing then something is wrong.

We also have a competition related to this mAP misalignment between pycocotools and YOLOv5 in #2258; if your discovery narrows the gap you may be eligible for prize money.

@glenn-jocher (Member) commented Aug 1, 2021

@Johnathan-Xie if I try to reproduce your results I also see differences in the AP vector:

# DEFAULT
!python val.py --weights yolov5s.pt --data coco128.yaml --img 640 --iou 0.65 --half

[    0.78668     0.75755     0.71975     0.66421     0.60833     0.53343     0.44166     0.33731     0.20313    0.020986] # linspace(0.5, 0.95, 10)
[                                                    0.64106     0.55289     0.45326     0.34195     0.20313    0.020986] # linspace(0.7, 0.95, 6)

@glenn-jocher (Member) commented Aug 1, 2021

@Johnathan-Xie I experimented with creating a new process_batch() function based on our confusion matrix code. This function returns identical APs across the vector no matter how the vector is constructed, but it results in a lower overall mAP, which can't be right, since we need to raise our mAP to meet the pycocotools mAP.

# PROPOSED
[    0.69652     0.68366     0.67644      0.6617     0.62043     0.55521     0.45455     0.34195     0.20313    0.020986] # linspace(0.5, 0.95, 10)
[                                                    0.62043     0.55521     0.45455     0.34195     0.20313    0.020986] # linspace(0.7, 0.95, 6)
def process_batch_new(detections, labels, iouv):
    """
    Return intersection-over-union (Jaccard index) of boxes.
    Both sets of boxes are expected to be in (x1, y1, x2, y2) format.
    Arguments:
        detections (Array[N, 6]), x1, y1, x2, y2, conf, class
        labels (Array[M, 5]), class, x1, y1, x2, y2
    Returns:
        correct (Array[N, 10]), for 10 IoU levels
    """
    correct = torch.zeros(detections.shape[0], iouv.shape[0], dtype=torch.bool, device=iouv.device)
    gt_classes = labels[:, 0].int()
    detection_classes = detections[:, 5].int()
    iou = box_iou(labels[:, 1:], detections[:, :4])

    good = (iou > iouv[0]) & (gt_classes.view(-1, 1) == detection_classes)  # IoU above lowest threshold and classes match
    x = torch.where(good)
    if x[0].shape[0]:
        matches = torch.cat((torch.stack(x, 1), iou[x[0], x[1]][:, None]), 1).cpu().numpy()  # [label, detection, iou]
        if x[0].shape[0] > 1:
            matches = matches[matches[:, 2].argsort()[::-1]]  # sort by IoU, descending
            matches = matches[np.unique(matches[:, 1], return_index=True)[1]]  # keep each detection's best label
            matches = matches[matches[:, 2].argsort()[::-1]]  # re-sort by IoU, descending
            matches = matches[np.unique(matches[:, 0], return_index=True)[1]]  # keep each label's best detection
        matches = torch.Tensor(matches).to(iouv.device)
        m0, m1, iou = matches.T  # label_index, detection_index, iou
        correct[m1.long()] = iou[:, None] > iouv  # mark matched detections at each IoU threshold
    return correct
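
For context, a minimal usage sketch of the function above (hypothetical boxes; torchvision.ops.box_iou stands in for the repo's box_iou helper, and the imports the function relies on are repeated here):

import numpy as np
import torch
from torchvision.ops import box_iou  # stand-in for the repo's box_iou helper

iouv = torch.linspace(0.5, 0.95, 10)                      # 10 IoU thresholds, 0.5:0.95
detections = torch.tensor([[0., 0., 10., 10., 0.9, 0.]])  # one prediction: x1, y1, x2, y2, conf, class
labels = torch.tensor([[0., 1., 1., 10., 10.]])           # one label: class, x1, y1, x2, y2
correct = process_batch_new(detections, labels, iouv)     # (1, 10) boolean matrix
print(correct)  # True at the first 7 thresholds (IoU = 0.81 > 0.50..0.80), False at 0.85 and above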

@glenn-jocher glenn-jocher linked a pull request Aug 1, 2021 that will close this issue
@wingvortex

@glenn-jocher Hi, why is line 72 of process_batch commented out in the newest val.py:
# matches = matches[matches[:, 2].argsort()[::-1]]
It seems this line needs to be uncommented to sort the matches once more; otherwise it is possible to discard the higher-IoU prediction.

@glenn-jocher (Member)

@wingvortex hi, thanks for the question. When we uncomment this line we see lower mAP, but we want to move in the opposite direction: our repo mAP consistently trails the pycocotools mAP (see #2258), which indicates we are underestimating the true mAP we report on custom datasets.
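
For reference, a small NumPy sketch of the case in question (hypothetical values; columns follow the [label_index, detection_index, iou] layout used in process_batch):

import numpy as np

# Hypothetical matches: label 0 overlaps detection 0 with IoU 0.60 and detection 1 with IoU 0.90.
matches = np.array([[0, 0, 0.60],
                    [0, 1, 0.90]])
matches = matches[matches[:, 2].argsort()[::-1]]                   # sort by IoU, descending
matches = matches[np.unique(matches[:, 1], return_index=True)[1]]  # keep each detection's best label; rows now ordered by detection index
# Without the (commented-out) re-sort by IoU here, the next unique() keeps the first row per label in detection-index order:
matches = matches[np.unique(matches[:, 0], return_index=True)[1]]
print(matches)  # [[0. 0. 0.6]] -> label 0 is credited at IoU 0.60; the 0.90 match is discarded

With the re-sort uncommented, the surviving row would be [0, 1, 0.90] instead.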

lebedevdes pushed a commit to lebedevdes/yolov5 that referenced this issue Feb 26, 2022
Two changes provided:
1. Added a limit on the maximum number of detections per image, like pycocotools (see the sketch after this commit message)
2. Reworked the process_batch function

Change 2 solves issue ultralytics#4251. I also independently encountered the problem described in issue ultralytics#4251: the values for the same thresholds do not match when the limits of the torch.linspace call are changed. These changes solve that problem.

Validating the yolov5x.pt model currently gives the following results:
from yolov5 validation
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 157/157 [01:07<00:00,  2.33it/s]
                 all       5000      36335      0.743      0.626      0.682      0.506
from pycocotools
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.505
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.685

These results are very close, although they do not completely close the gap in the competition issue ultralytics#2258. I think the remaining difference comes from false-positive boxes that match pycocotools' ignore criteria, but this is not relevant for custom datasets and does not require an additional fix.
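
For context, a minimal sketch of the per-image detection cap described in change 1 (a hypothetical helper, not the actual implementation; rows assumed as [x1, y1, x2, y2, conf, cls]):

import torch

def limit_detections(pred, max_det=100):
    # Keep only the max_det highest-confidence detections for one image,
    # mirroring pycocotools' default maxDets=100.
    if pred.shape[0] > max_det:
        pred = pred[pred[:, 4].argsort(descending=True)[:max_det]]
    return pred
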
glenn-jocher added a commit that referenced this issue May 20, 2022
* Improve mAP0.5-0.95

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove line to retain pycocotools results

* Update val.py

* Update val.py

* Remove to device op

* Higher precision int conversion

* Update val.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>