
[Question] Recommended KNN/ANN index for large datasets #127

Closed
misotrnka opened this issue Feb 1, 2021 · 7 comments

@misotrnka

I would like to use CropResistantHash to quickly find near-duplicates from a large set of reference images.

With other hash functions I would normally use some kind of approximate nearest neighbor index, such as NMSLib or Annoy. The challenge is that CropResistantHash is variable length and cannot be compared using one of the standard distance functions (Angular, Hamming, Manhattan, ...).

Can anyone point me to an alternative solution? How do you use this with large datasets?
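
For context, this is roughly what I do today with a fixed-length hash such as phash and Annoy's Hamming metric (a minimal sketch; `build_phash_index` is just an illustrative name). It is this part that does not carry over directly to the variable-length CropResistantHash:

```python
from PIL import Image
import imagehash
from annoy import AnnoyIndex

HASH_BITS = 64  # phash with the default hash_size=8 -> 8x8 = 64 bits

def build_phash_index(paths):
    index = AnnoyIndex(HASH_BITS, "hamming")
    for i, path in enumerate(paths):
        bits = imagehash.phash(Image.open(path)).hash.flatten()
        index.add_item(i, [int(b) for b in bits])
    index.build(20)  # 20 trees
    return index

# Querying: nearest neighbours of a query hash by Hamming distance, e.g.
# index.get_nns_by_vector(query_bits, 10, include_distances=True)
```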

@JohannesBuchner
Owner

JohannesBuchner commented Feb 1, 2021

I suppose you can store it in a database with a 1:N image-hash mapping and then do an equality query? I think some databases support Hamming distances etc.

It is probably more performant to limit the bits so that you can do an equality test rather than going for a nearest-neighbour search.
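
A minimal sketch of that idea, assuming SQLite, illustrative table/column names, and the `segment_hashes` attribute of the ImageMultiHash returned by `crop_resistant_hash`:

```python
import sqlite3

con = sqlite3.connect("hashes.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS segment_hash (
        image_id INTEGER NOT NULL,
        hash_hex TEXT NOT NULL
    )
""")
con.execute("CREATE INDEX IF NOT EXISTS idx_hash ON segment_hash (hash_hex)")

def add_image(image_id, multi_hash):
    # 1:N mapping: one row per segment hash of the image.
    rows = [(image_id, str(seg)) for seg in multi_hash.segment_hashes]
    con.executemany("INSERT INTO segment_hash VALUES (?, ?)", rows)
    con.commit()

def candidate_images(multi_hash):
    # Equality query: images sharing at least one identical segment hash.
    hashes = [str(seg) for seg in multi_hash.segment_hashes]
    placeholders = ",".join("?" * len(hashes))
    cur = con.execute(
        f"SELECT DISTINCT image_id FROM segment_hash WHERE hash_hex IN ({placeholders})",
        hashes,
    )
    return [row[0] for row in cur]
```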

@misotrnka
Author

I'm not sure if checking for equality would be sufficient for our use case, as we need a certain level of precision.

I'm thinking of maybe using an ANN index over the individual region hashes and then using that to narrow down the candidates for the full difference scan.
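
Roughly something like the sketch below: every segment hash goes into an Annoy Hamming index, the index shortlists candidate images, and only those are compared with the full ImageMultiHash comparison. Names like `SegmentIndex` are illustrative, and 64-bit segment hashes and `matches()` with its default cutoffs are assumptions:

```python
from annoy import AnnoyIndex

SEGMENT_BITS = 64  # assumes the default 8x8 segment hashes

class SegmentIndex:
    def __init__(self):
        self.index = AnnoyIndex(SEGMENT_BITS, "hamming")
        self.owner = []        # annoy item id -> image id
        self.full_hashes = {}  # image id -> ImageMultiHash

    def add(self, image_id, multi_hash):
        self.full_hashes[image_id] = multi_hash
        for seg in multi_hash.segment_hashes:
            item_id = len(self.owner)
            self.index.add_item(item_id, [int(b) for b in seg.hash.flatten()])
            self.owner.append(image_id)

    def build(self):
        self.index.build(20)  # must be called once, after all add() calls

    def query(self, multi_hash, per_segment=20):
        # Stage 1: shortlist images owning a segment close to any query segment.
        candidates = set()
        for seg in multi_hash.segment_hashes:
            bits = [int(b) for b in seg.hash.flatten()]
            for item_id in self.index.get_nns_by_vector(bits, per_segment):
                candidates.add(self.owner[item_id])
        # Stage 2: full CropResistantHash comparison only against the shortlist.
        return [
            image_id for image_id in candidates
            if self.full_hashes[image_id].matches(multi_hash)
        ]
```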

@JohannesBuchner
Owner

Still, you could store the 1:N image-hash mapping (segment_hashes in CropResistantHash) in a database table and search using database-supported functions such as Hamming distance, which should be fast.
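
For example, SQLite has no built-in Hamming distance, but a custom function can be registered from Python. A sketch, assuming the segment hashes are additionally stored as 64-bit integers in an illustrative `hash_int` column rather than the hex strings above:

```python
import sqlite3

def hamming(a, b):
    # Hamming distance between two 64-bit integer hashes.
    return bin(a ^ b).count("1")

con = sqlite3.connect("hashes.db")
con.create_function("hamming", 2, hamming)

def images_within(hash_int, max_distance):
    cur = con.execute(
        "SELECT DISTINCT image_id FROM segment_hash "
        "WHERE hamming(hash_int, ?) <= ?",
        (hash_int, max_distance),
    )
    return [row[0] for row in cur]
```

Note that this still scans the whole table rather than using an index.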

@misotrnka
Author

I can try that, but I believe that any distance function in a DB would rely on a full sequential scan of the table. We are talking about hundreds of millions of rows here, so I think some kind of index is necessary to narrow the options down a bit. But thank you for the idea, I'll explore it.

@msminhas93

@misotrnka were you able to perform deduplication at 100M+ scale? I'm also trying to do something similar. If you could share any pointers, those would be valuable. I am thinking about inserting the perceptual hash into a DB and doing a distinct select. We would be okay with a certain loss of images in the process, so long as it is not outrageously wrong.
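
Roughly what I have in mind (a sketch with SQLite and phash; exact-match dedup only, so near-duplicates differing by a few bits would survive it):

```python
import sqlite3
from PIL import Image
import imagehash

con = sqlite3.connect("dedup.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS image_hash (
        image_path TEXT NOT NULL,
        phash_hex TEXT NOT NULL
    )
""")

def ingest(paths):
    rows = [(p, str(imagehash.phash(Image.open(p)))) for p in paths]
    con.executemany("INSERT INTO image_hash VALUES (?, ?)", rows)
    con.commit()

def unique_images():
    # Keep one representative image per distinct hash value ("distinct select").
    cur = con.execute(
        "SELECT phash_hex, MIN(image_path) FROM image_hash GROUP BY phash_hex"
    )
    return dict(cur.fetchall())
```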
