mAP Computation vs Pycocotools #7
Thanks for the feedback. I opened issue #5 about this earlier. Currently only one precision-recall curve is generated per image in test.py, whereas, as you say, I believe we want one for each class in each image, with the average of those APs giving the mAP for that image. I can try to make this correction myself, or we could use an off-the-shelf solution, though that would require more imports. I was studying this link to learn more: http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
After reviewing more examples, I think I can copy the methods in this repo: There is another main difference: the mAP should be calculated across all images at once, rather than once per image as it is now. So I'm going to try to fully replace the mAP code with new code that accumulates TP and FP vectors for each class, then produces 80 precision and recall curves for all the objects in the 5000 validation images at once.
Yeah, the evaluation code in that repo is correct. Looking forward to your updates!
It looks like the original code for AP from recall-precision is fine (Line 129 in c43be7b), so I left it alone and created a new function to call it once per class in commit c43be7b (Line 82):
```python
# Find unique classes
unique_classes = np.unique(np.concatenate((pred_cls, target_cls), 0))

# Create Precision-Recall curve and compute AP for each class
ap = []
for c in unique_classes:
    i = pred_cls == c
    n_gt = sum(target_cls == c)  # Number of ground truth objects

    if sum(i) == 0:
        ap.append(0)
    else:
        # Accumulate FPs and TPs
        fpa = np.cumsum(1 - tp[i])
        tpa = np.cumsum(tp[i])

        # Recall
        recall = tpa / (n_gt + 1e-16)

        # Precision
        precision = tpa / (tpa + fpa)

        # AP from recall-precision curve
        ap.append(compute_ap(recall, precision))
```

With this method, however, mAP drops from 58.1 to 56.7 when I re-evaluate (Darknet reports 57.9). I currently combine true and predicted classes into the list of classes evaluated per image; perhaps I should only be using one or the other. I will have to experiment some more.
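For reference, `compute_ap` itself is not shown in the thread. A common implementation of AP from a recall-precision curve is the all-point interpolation used in the py-faster-rcnn VOC evaluation; the sketch below follows that method, which may differ in detail from the repo's exact version:

```python
import numpy as np

def compute_ap(recall, precision):
    """AP from recall/precision curves, using the all-point interpolation
    method (as in the py-faster-rcnn VOC evaluation code)."""
    # Append sentinel values at both ends of the curves
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))

    # Make the precision envelope monotonically decreasing
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

    # Indices where recall changes value
    i = np.where(mrec[1:] != mrec[:-1])[0]

    # Sum (delta recall) * precision over those segments
    return np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
```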
Maybe you should perform per-class rank ordering instead of per-image rank ordering. Taking VOC as an example, the evaluation code first produces a per-class prediction list over the whole test set in the format (image_id, score, x0, y0, x1, y1), like:

```
0000.jpg 0.98 100 100 200 200  # the 1st instance of image 0000.jpg
```

It then performs rank ordering over all instances, and then computes and accumulates TPs and FPs. This way, the mAP should be higher than with per-image rank ordering (and no doubt the authors of YOLO said mAP is screwed up xD).
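The per-class ranking described above can be sketched as follows; the detections, scores, and ground-truth count are hypothetical, with `correct=1` meaning the detection matched a previously unmatched ground-truth box of that class:

```python
import numpy as np

# Hypothetical per-class prediction list pooled over the WHOLE test set,
# in (image_id, score, correct) form
dets = [("0000.jpg", 0.98, 1),   # the 1st instance of image 0000.jpg
        ("0001.jpg", 0.90, 0),
        ("0000.jpg", 0.75, 1)]
n_gt = 2  # total ground-truth instances of this class in the test set

# 1) rank ALL instances of the class across the test set by score
dets.sort(key=lambda d: -d[1])

# 2) accumulate TPs and FPs down the ranked list
tp = np.array([d[2] for d in dets], dtype=float)
tpc, fpc = np.cumsum(tp), np.cumsum(1.0 - tp)

recall = tpc / n_gt               # [0.5, 0.5, 1.0]
precision = tpc / (tpc + fpc)     # [1.0, 0.5, 2/3]
```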
@xyutao I'm updating the mAP code, both to correct the repo mAP calculation and to output a COCO JSON file so that the cocoapi can compute the official mAP. Earlier you recommended switching from per-image to per-class rank ordering. Is this still your recommendation? Do you know if this is how COCO computes mAP?
I think the ordering is performed for each class in each image independently. It is performed in the following lines |
@xyutao @okanlv I just noticed that the pycocotools demo notebook selects a subset of the entire validation set, just as we want for yolov3, since darknet only validates on the 5000 images: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocoEvalDemo.ipynb
@xyutao, have you figured this out? Under the traditional mAP calculation it should indeed be as you said: detect all candidate objects across all images, sort them by confidence, take them in turn as positives to compute recall and precision, and finally compute the AP of each class. But after reading this repo author's replies and the detailed cocoapi code, I wonder whether computing the AP of each image and then averaging might actually be COCO's mAP method. I haven't yet seen an official document that precisely describes COCO's mAP calculation; at least, every Chinese-language description of detection mAP I could find describes the traditional method.
@guagen COCO's API also uses the traditional method. It first calls the evaluateImg function to match each detection box against the ground truth for every class in a single image, then calls the accumulate function to merge the match results of that class across all images, sorts them by detection score in descending order, and finally computes precision-recall in one pass. For the return value of evaluateImg, see: https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L302 The code that merges the detection scores of all images for a single class: The code that sorts detection scores in descending order:
@glenn-jocher The per-image evaluation just matches the detections and ground truth for each category, as shown in the evaluateImg function: The recall and precision are computed per category, by accumulating the matches and ranking the detection scores across all images of the category. See:
@xyutao OK, this is giving me a headache. It seems the author of this code got it wrong, and with his approach the resulting mAP does not equal the sum of the per-class APs divided by the number of classes.
@glenn-jocher Here I paste the key code from the COCO API for accumulating detections. In this fragment, k_list stores the category ids, a_list the area ranges, m_list the max detections, and i_list the image ids. The outer for-loop iterates per category, while the inner for-loop accumulates the matching results of each image:
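Schematically, the accumulation structure described above looks like the sketch below. This is a paraphrase, not the actual cocoeval.py code; the `eval_imgs` dict and its `"scores"`/`"matched"` fields are invented stand-ins for COCO's per-image evaluation results:

```python
import numpy as np

def accumulate(eval_imgs, k_list, a_list, m_list, i_list):
    """COCO-style accumulation sketch: the outer loops run per category (k),
    area range (a), and maxDets (m); the inner loop gathers each image's match
    results for that category, which are then ranked jointly by score."""
    results = {}
    for k in k_list:                      # per-category iteration (outer loop)
        for a in a_list:
            for m in m_list:
                # gather this category's match results from every image
                e = [eval_imgs.get((k, a, i)) for i in i_list]
                e = [x for x in e if x is not None]
                if not e:
                    continue
                scores = np.concatenate([x["scores"][:m] for x in e])
                matched = np.concatenate([x["matched"][:m] for x in e])
                # rank all detections of the category by descending score
                order = np.argsort(-scores)
                results[(k, a, m)] = matched[order]
    return results
```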
I think it should be a fairly straightforward change to test.py to calculate the mAP averaged per class rather than averaged per image. I can try to implement this in the next few days. Luckily the new

We probably want to lower the default
@glenn-jocher |
All, I created a new https://github.com/ultralytics/yolov3/tree/map_update branch to test mAP updates. I converted from image-averaging to class-averaging. The result is 0.519 mAP now vs 0.550 pycocotools mAP using official

This mainly affects custom data, as for COCO data we can simply use

Another item is that the previous mAP calculation could operate at a reasonable

```shell
rm -rf yolov3 && git clone -b map_update --depth 1 https://github.com/ultralytics/yolov3 yolov3
python3 test.py --conf-thres 0.001 --save-json
```

```
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
100%|████████████████████████████████████████████████████████████████████████████| 157/157 [06:34<00:00,  1.93s/it]
       5000       5000     0.0865      0.727      0.519

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.550
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.142
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.336
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.455
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.267
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.408
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.432
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.240
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.470
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590
```
UPDATE: difference narrowed down to 0.531 (repo calculation) vs 0.551 (pycocotools). The

```shell
rm -rf yolov3 && git clone -b map_update --depth 1 https://github.com/ultralytics/yolov3 yolov3
python3 test.py --conf-thres 0.001 --save-json
```

```
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 157/157 [07:00<00:00,  2.09s/it]
       5000       5000     0.0865      0.727      0.531

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.308
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.551
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.308
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.143
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.334
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.455
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.267
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.407
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.432
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.240
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.470
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590
```
Final results are in, and PR #176 is complete. Repo mAP now aligns with COCO mAP to within 1% under most circumstances, and the mAP output now exceeds the published yolov3 darknet results. I will close the issue unless there are any other questions.
```shell
sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3
# bash yolov3/data/get_coco_dataset.sh
sudo rm -rf cocoapi && git clone https://github.com/cocodataset/cocoapi && cd cocoapi/PythonAPI && make && cd ../.. && cp -r cocoapi/PythonAPI/pycocotools yolov3
cd yolov3
python3 test.py --save-json --conf-thres 0.001 --img-size 416
```

```
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
Calculating mAP: 100%|█████████████████████████████████| 157/157 [08:34<00:00,  2.53s/it]
       5000       5000     0.0896      0.756      0.555

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.312
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.317
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.145
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.343
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.452
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.411
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.435
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.244
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.477
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.587
```

```shell
python3 test.py --save-json --conf-thres 0.001 --img-size 608 --batch-size 16
```

```
Namespace(batch_size=16, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=608, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
Calculating mAP: 100%|█████████████████████████████████| 313/313 [08:54<00:00,  1.55s/it]
       5000       5000     0.0966      0.786      0.579

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.331
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.582
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.344
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.198
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.362
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.427
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.281
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.437
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.463
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.309
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.494
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.577
```
The mAP computation code is similar to https://github.com/eriklindernoren/PyTorch-YOLOv3/blob/959e0ff43f5b82bdacef87f4240bae8415eac45b/test.py#L69
It is incorrect to average the AP over images, because AP is defined per class. The right way is to rank all detected instances of each object class across the whole test set, compute the AP for each class, and then average those APs.
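A minimal sketch of that procedure follows; the scores, match flags, and ground-truth counts are made up for illustration, and the PR curve is integrated with simple rectangles rather than COCO's 101-point interpolation:

```python
import numpy as np

def average_precision(scores, correct, n_gt):
    """AP for one class: rank ALL detections of that class across the whole
    test set by score, accumulate TP/FP, then integrate the PR curve."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(correct, dtype=float)[order]
    tpc, fpc = np.cumsum(tp), np.cumsum(1.0 - tp)
    recall = tpc / n_gt
    precision = tpc / (tpc + fpc)
    # simple rectangular integration of the PR curve
    mrec = np.concatenate(([0.0], recall))
    return float(np.sum((mrec[1:] - mrec[:-1]) * precision))

# two hypothetical classes, detections pooled over the whole test set
ap_per_class = [
    average_precision([0.9, 0.8, 0.3], [1, 0, 1], n_gt=2),  # class 0
    average_precision([0.7, 0.6],      [1, 1],    n_gt=2),  # class 1
]
mAP = float(np.mean(ap_per_class))  # mean of per-class APs, NOT per-image APs
```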