
Possible Evaluation Error in val.py #4251

Closed
Johnathan-Xie opened this issue Aug 1, 2021 · 5 comments · Fixed by #4260
Labels: bug, TODO

Comments

@Johnathan-Xie

I believe there is a small error in the current validation code that may be slightly lowering mAP@0.5:0.95, so fixing it should give a slight boost to that main metric.

Proof of error
After cloning the official repository and running "python val.py --data coco.yaml --img 640 --conf 0.001 --iou 0.65 --weights yolov5s.pt", I added a print statement to output the AP at each of the 10 IoU thresholds, which gives:
[ 0.54585 0.51816 0.49107 0.45745 0.42016 0.3753 0.31443 0.23312 0.13052 0.021785]

Now if I run the exact same command but change the IoU thresholds to only 6 points, 0.7 to 0.95, I receive this output:
[ 0.44055 0.38887 0.32283 0.23714 0.13167 0.021903]
If the code were correct, the last 6 values of the 10-point vector would match the 6-point vector; instead, the 6-point values are slightly higher.
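
For reference, the print is roughly of this form (a sketch only; it assumes ap is the per-class AP array of shape (num_classes, num_iou_thresholds) returned by ap_per_class() in val.py, before it is averaged over thresholds):

# Sketch: print the class-averaged AP at each IoU threshold
print(ap.mean(0))  # e.g. 10 values when iouv = torch.linspace(0.5, 0.95, 10)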

Explanation of Error
Lines 63-72 of val.py show:

        ious, i = box_iou(predictions[pi, 0:4], labels[ti, 1:5]).max(1)  # best ious, indices
        detected_set = set()
        for j in (ious > iouv[0]).nonzero():
            d = ti[i[j]]  # detected label
            if d.item() not in detected_set:
                detected_set.add(d.item())
                detected.append(d)  # append detections
                correct[pi[j]] = ious[j] > iouv  # iou_thres is 1xn
                if len(detected) == nl:  # all labels already located in image
                    break

The code selects all predictions above a given IoU threshold for a class and then iterates through them to record which targets they match. While this gives an accurate measurement for AP@0.5, the matches are chosen in an effectively arbitrary order. If two predictions both have IoU > 0.5 with the same target, say 0.6 and 0.7, the 0.6 prediction can be matched first; because the target is then in detected_set, the 0.7 prediction is never considered. As a result, mAP@0.5:0.95 can come out slightly lower than it should, since a lower-IoU match is occasionally kept when a higher-IoU prediction exists for that target.
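
A toy sketch of this ordering dependence (hypothetical IoU values, mirroring the loop above):

import torch

# Predictions A and B both overlap the same target, with IoU 0.62 and 0.72.
# The greedy loop processes them in index order, so A claims the target first
# via detected_set and B is never considered.
iouv = torch.linspace(0.5, 0.95, 10)   # the 10 evaluation thresholds
ious = torch.tensor([0.62, 0.72])      # best IoU of predictions A and B with the single target
correct = torch.zeros(2, 10, dtype=torch.bool)
detected_set = set()
for j in (ious > iouv[0]).nonzero():
    d = 0                              # both predictions match target 0
    if d not in detected_set:
        detected_set.add(d)
        correct[j] = ious[j] > iouv    # only A (IoU 0.62) is credited
print(correct.any(0))  # credit stops at the 0.60 threshold; B's 0.72 would have covered 0.65 and 0.70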

I am not certain of the best way to fix this, as I'm not too familiar with how it would integrate with the rest of the testing code. I'm also not sure whether this affects official results, since those seem to be evaluated using COCO tools.

Johnathan-Xie added the bug label Aug 1, 2021
@glenn-jocher (Member)

@Johnathan-Xie this is good investigative work! I'll definitely look into this. Yes, the AP vector should remain unchanged in your experiment; if it is changing then something is wrong.

We also have a competition related to this mAP misalignment between pycocotools and YOLOv5 in #2258; if your discovery narrows the gap you may be eligible for prize money.

@glenn-jocher (Member) commented Aug 1, 2021

@Johnathan-Xie if I try to reproduce your results I also see differences in the AP vector:

# DEFAULT
!python val.py --weights yolov5s.pt --data coco128.yaml --img 640 --iou 0.65 --half

[    0.78668     0.75755     0.71975     0.66421     0.60833     0.53343     0.44166     0.33731     0.20313    0.020986] # linspace(0.5, 0.95, 10)
[                                                    0.64106     0.55289     0.45326     0.34195     0.20313    0.020986] # linspace(0.7, 0.95, 6)

@glenn-jocher (Member) commented Aug 1, 2021

@Johnathan-Xie I experimented with creating a new process_batch() function based on our confusion matrix code. This function returns identical APs across the vector no matter how the vector is constructed, but it results in a lower overall mAP, which can't be right, since we need to raise our mAP to meet the pycocotools mAP.

# PROPOSED
[    0.69652     0.68366     0.67644      0.6617     0.62043     0.55521     0.45455     0.34195     0.20313    0.020986] # linspace(0.5, 0.95, 10)
[                                                    0.62043     0.55521     0.45455     0.34195     0.20313    0.020986] # linspace(0.7, 0.95, 6)
def process_batch_new(detections, labels, iouv):
    """
    Return intersection-over-union (Jaccard index) of boxes.
    Both sets of boxes are expected to be in (x1, y1, x2, y2) format.
    Arguments:
        detections (Array[N, 6]), x1, y1, x2, y2, conf, class
        labels (Array[M, 5]), class, x1, y1, x2, y2
    Returns:
        correct (Array[N, 10]), for 10 IoU levels
    """
    correct = torch.zeros(detections.shape[0], iouv.shape[0], dtype=torch.bool, device=iouv.device)
    gt_classes = labels[:, 0].int()
    detection_classes = detections[:, 5].int()
    iou = box_iou(labels[:, 1:], detections[:, :4])

    good = (iou > iouv[0]) & (gt_classes.view(-1, 1) == detection_classes)  # IoU above lowest threshold and classes match
    x = torch.where(good)
    if x[0].shape[0]:
        matches = torch.cat((torch.stack(x, 1), iou[x[0], x[1]][:, None]), 1).cpu().numpy()  # [label, detection, iou]
        if x[0].shape[0] > 1:
            matches = matches[matches[:, 2].argsort()[::-1]]  # sort by IoU, descending
            matches = matches[np.unique(matches[:, 1], return_index=True)[1]]  # keep each detection's best label
            matches = matches[matches[:, 2].argsort()[::-1]]  # re-sort by IoU, descending
            matches = matches[np.unique(matches[:, 0], return_index=True)[1]]  # keep each label's best detection
        matches = torch.Tensor(matches).to(iouv.device)
        m0, m1, iou = matches.T  # label_index, detection_index, iou
        correct[m1.long()] = iou[:, None] > iouv  # mark matched detections at each IoU threshold
    return correct
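
For context, a minimal usage sketch of the function above (hypothetical boxes; torchvision.ops.box_iou stands in for the repo's box_iou helper, and the imports the function relies on are repeated here):

import numpy as np
import torch
from torchvision.ops import box_iou  # stand-in for the repo's box_iou helper

iouv = torch.linspace(0.5, 0.95, 10)                      # 10 IoU thresholds, 0.5:0.95
detections = torch.tensor([[0., 0., 10., 10., 0.9, 0.]])  # one prediction: x1, y1, x2, y2, conf, class
labels = torch.tensor([[0., 1., 1., 10., 10.]])           # one label: class, x1, y1, x2, y2
correct = process_batch_new(detections, labels, iouv)     # (1, 10) boolean matrix
print(correct)  # True at the first 7 thresholds (IoU = 0.81 > 0.50..0.80), False at 0.85 and above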

@glenn-jocher glenn-jocher linked a pull request Aug 1, 2021 that will close this issue
@wingvortex

@glenn-jocher Hi, why is line 72 of process_batch commented out in the newest val.py:
# matches = matches[matches[:, 2].argsort()[::-1]]
It seems this line needs to be uncommented to sort the matches once more; otherwise it is possible to discard the higher-IoU prediction.

@glenn-jocher (Member)

@wingvortex hi, thanks for the question. When we uncomment this line we see lower mAP, but we want to move in the opposite direction: our repo mAP consistently trails the pycocotools mAP (see #2258), which indicates we are underestimating the true mAP we report on custom datasets.
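
For reference, a small NumPy sketch of the case in question (hypothetical values; columns follow the [label_index, detection_index, iou] layout used in process_batch):

import numpy as np

# Hypothetical matches: label 0 overlaps detection 0 with IoU 0.60 and detection 1 with IoU 0.90.
matches = np.array([[0, 0, 0.60],
                    [0, 1, 0.90]])
matches = matches[matches[:, 2].argsort()[::-1]]                   # sort by IoU, descending
matches = matches[np.unique(matches[:, 1], return_index=True)[1]]  # keep each detection's best label; rows now ordered by detection index
# Without the (commented-out) re-sort by IoU here, the next unique() keeps the first row per label in detection-index order:
matches = matches[np.unique(matches[:, 0], return_index=True)[1]]
print(matches)  # [[0. 0. 0.6]] -> label 0 is credited at IoU 0.60; the 0.90 match is discarded

With the re-sort uncommented, the surviving row would be [0, 1, 0.90] instead.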

lebedevdes pushed a commit to lebedevdes/yolov5 that referenced this issue Feb 26, 2022
Two changes provided:
1. Added a limit on the maximum number of detections per image, like pycocotools (see the sketch after this commit message)
2. Reworked the process_batch function

Change 2 solves issue ultralytics#4251. I also independently encountered the problem described in issue ultralytics#4251: the values for the same thresholds do not match when the limits of the torch.linspace call are changed. These changes solve that problem.

Validating the yolov5x.pt model currently gives the following results:
from yolov5 validation
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 157/157 [01:07<00:00,  2.33it/s]
                 all       5000      36335      0.743      0.626      0.682      0.506
from pycocotools
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.505
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.685

These results are very close, although they do not completely close the gap in the competition issue ultralytics#2258. I think the remaining difference comes from false-positive boxes that match pycocotools' ignore criteria, but this is not relevant for custom datasets and does not require an additional fix.
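
For context, a minimal sketch of the per-image detection cap described in change 1 (a hypothetical helper, not the actual implementation; rows assumed as [x1, y1, x2, y2, conf, cls]):

import torch

def limit_detections(pred, max_det=100):
    # Keep only the max_det highest-confidence detections for one image,
    # mirroring pycocotools' default maxDets=100.
    if pred.shape[0] > max_det:
        pred = pred[pred[:, 4].argsort(descending=True)[:max_det]]
    return pred
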
glenn-jocher added a commit that referenced this issue May 20, 2022
* Improve mAP0.5-0.95

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove line to retain pycocotools results

* Update val.py

* Update val.py

* Remove to device op

* Higher precision int conversion

* Update val.py

Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>