Diverged metrics in PatchCore: dropping from 0.99 to 0.44 and 0.03 is rather critical? #74

Closed
samet-akcay opened this issue Jan 17, 2022 · 6 comments · Fixed by #73

Comments

@samet-akcay
Contributor

diverged metrics: dropping from 0.99 to 0.44 and 0.03 is rather critical?
log images: nice :)
PaDiM also dropped in performance, but not as drastically.
Here's a PatchCore result on "good" parts:
[image: PatchCore result on sample 008]

This is PaDiM:
DATALOADER:0 TEST RESULTS
{'image_AUROC': 0.7589669823646545,
'image_F1': 0.8787878751754761,
'pixel_AUROC': 0.9781586527824402,
'pixel_F1': 0.22379672527313232}
[image: PaDiM result on sample 008]

Originally posted by @sequoiagrove in #67 (comment)

@samet-akcay
Contributor Author

@sequoiagrove, @blakshma worked on the PatchCore results and managed to improve the performance to the following:

DATALOADER:0 TEST RESULTS
{'image_AUROC': 0.9524492025375366,
 'image_F1': 0.9551020860671997,
 'pixel_AUROC': 0.9894225597381592,
 'pixel_F1': 0.3562552034854889}

To reproduce the numbers, you could use this branch, which will be merged into development soon after this PR.

Here are the qualitative results for the ones you shared above (screw/test/good/009.png):
[image: qualitative PatchCore result on screw/test/good/009]

@dk-teknologisk-mlnn

I get those numbers too now, but I wonder whether the computation of the metrics is also wrong, because when I look through the result images, it only detects about 1/4 to 1/2 of the defects in the different categories. The pixel F1 = 0.35 seems to describe the actual performance best, though I know it is low partly because correctly detected defects don't have to overlap the ground truth perfectly to still be a good result.

@dk-teknologisk-mlnn commented Jan 17, 2022

How can AUC/F1 be 1.0 when it finds defects in only 3 out of 5 bad examples?

DATALOADER:0 TEST RESULTS
{'image_AUROC': 1.0,
'image_F1': 1.0,
'pixel_AUROC': 0.967517614364624,
'pixel_F1': 0.6376267075538635}


@blakshma
Contributor

@sequoiagrove unfortunately, the classification results are independent of the segmentation results. Hence, the algorithm might have a very good classification result while the segmentation results are poor in some cases, as you have pointed out. We will investigate this.
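As an illustration of why this can happen (a minimal sketch on made-up data, not anomalib's actual code): if the image-level score is taken as the maximum of the pixel anomaly map, an image can be flagged correctly (perfect image AUROC/F1) even when the predicted mask barely overlaps the ground-truth defect.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy example with two images: one normal, one containing a 4x4 defect.
gt_masks = np.zeros((2, 8, 8), dtype=int)
gt_masks[1, 2:6, 2:6] = 1

# The model "lights up" only a single pixel inside the defect.
pred_maps = np.zeros((2, 8, 8))
pred_maps[1, 5, 5] = 0.9

# Image-level score = max of the anomaly map (the usual convention in PatchCore/PaDiM-style models).
image_labels = gt_masks.reshape(2, -1).max(axis=1)
image_scores = pred_maps.reshape(2, -1).max(axis=1)
print("image AUROC:", roc_auc_score(image_labels, image_scores))  # 1.0

# Pixel-level F1 after thresholding: poor, because the mask barely overlaps the defect.
pred_masks = (pred_maps > 0.5).astype(int)
print("pixel F1:", f1_score(gt_masks.ravel(), pred_masks.ravel()))  # ~0.12
```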

@dk-teknologisk-mlnn

A pixel AUROC of 0.97 sounds like good performance, but looking at the masks it is really not useful in a real system, and AUC is a bad metric for quantifying performance.
So I guess I should always look at pixel F1. I think that's about the same as Dice, right? I use Dice in segmentation tasks.
Too bad this leaderboard only shows the unrealistically positive AUROC numbers:
https://paperswithcode.com/sota/anomaly-detection-on-mvtec-ad
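For binary masks, pixel F1 and the Dice coefficient are indeed the same quantity, 2·TP / (2·TP + FP + FN). A quick sanity check on toy data:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy binary masks (flattened) to check that pixel F1 equals the Dice coefficient.
gt = np.array([1, 1, 0, 0, 1, 0, 1, 1])
pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

dice = 2 * np.logical_and(gt, pred).sum() / (gt.sum() + pred.sum())
print(dice, f1_score(gt, pred))  # both ~0.667
```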

@samet-akcay
Contributor Author

Yeah, AUC is widely used in academia, but it is usually not a good metric for industrial applications since it can be misleading. Finding the best threshold from the AUC is not easy, even though we implemented an adaptive thresholding mechanism. This is why we also added the F1 score to our evaluations.

Regarding always looking at the pixel F1 score for evaluation, there is room for improvement there. We haven't optimised the heatmaps from which the predicted masks are generated. Once we do, I agree that pixel F1 would become the standard metric for evaluating performance.
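For context, the adaptive threshold can be thought of as sweeping the precision-recall curve and picking the threshold that maximises F1. A rough sketch of the idea on hypothetical scores (not necessarily the exact anomalib implementation):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def adaptive_threshold(labels: np.ndarray, scores: np.ndarray) -> float:
    """Return the score threshold that maximises F1 on the given labels/scores."""
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / (precision + recall + 1e-10)
    # precision_recall_curve returns one more (precision, recall) pair than thresholds,
    # so drop the final entry before indexing into `thresholds`.
    return float(thresholds[np.argmax(f1[:-1])])

# Hypothetical image-level labels and anomaly scores.
labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.90])
print(adaptive_threshold(labels, scores))  # 0.35: catches every defect at the cost of one false positive
```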
