
[Question] Recommended KNN/ANN index for large datasets #127

Closed
misotrnka opened this issue Feb 1, 2021 · 7 comments

@misotrnka

I would like to use CropResistantHash to quickly find near-duplicates from a large set of reference images.

With other hash functions I would normally use some kind of approximate nearest neighbor index, such as NMSLib or Annoy. The challenge is that CropResistantHash is variable length and cannot be compared using one of the standard distance functions (Angular, Hamming, Manhattan, ...).

Can anyone point me to an alternative solution? How do you use this with large datasets?
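
For context, this is roughly what I do today with a fixed-length hash such as phash and Annoy's Hamming metric (a minimal sketch; `build_phash_index` is just an illustrative name). It is this part that does not carry over directly to the variable-length CropResistantHash:

```python
from PIL import Image
import imagehash
from annoy import AnnoyIndex

HASH_BITS = 64  # phash with the default hash_size=8 -> 8x8 = 64 bits

def build_phash_index(paths):
    index = AnnoyIndex(HASH_BITS, "hamming")
    for i, path in enumerate(paths):
        bits = imagehash.phash(Image.open(path)).hash.flatten()
        index.add_item(i, [int(b) for b in bits])
    index.build(20)  # 20 trees
    return index

# Querying: nearest neighbours of a query hash by Hamming distance, e.g.
# index.get_nns_by_vector(query_bits, 10, include_distances=True)
```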

@JohannesBuchner
Owner

JohannesBuchner commented Feb 1, 2021

I suppose you can store it in a database with a 1:N image-hash mapping and then do an equality query? I think some databases support Hamming distances etc.

It is probably more performant to limit the bits so that you can do an equality test rather than going for a nearest-neighbour search.
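
A minimal sketch of that idea, assuming SQLite, illustrative table/column names, and the `segment_hashes` attribute of the ImageMultiHash returned by `crop_resistant_hash`:

```python
import sqlite3

con = sqlite3.connect("hashes.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS segment_hash (
        image_id INTEGER NOT NULL,
        hash_hex TEXT NOT NULL
    )
""")
con.execute("CREATE INDEX IF NOT EXISTS idx_hash ON segment_hash (hash_hex)")

def add_image(image_id, multi_hash):
    # 1:N mapping: one row per segment hash of the image.
    rows = [(image_id, str(seg)) for seg in multi_hash.segment_hashes]
    con.executemany("INSERT INTO segment_hash VALUES (?, ?)", rows)
    con.commit()

def candidate_images(multi_hash):
    # Equality query: images sharing at least one identical segment hash.
    hashes = [str(seg) for seg in multi_hash.segment_hashes]
    placeholders = ",".join("?" * len(hashes))
    cur = con.execute(
        f"SELECT DISTINCT image_id FROM segment_hash WHERE hash_hex IN ({placeholders})",
        hashes,
    )
    return [row[0] for row in cur]
```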

@misotrnka
Author

I'm not sure if checking for equality would be sufficient for our use case, as we need a certain level of precision.

I'm thinking of maybe using an ANN index over the individual region hashes and then using that to narrow down the candidates for the full difference scan.
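
Roughly something like the sketch below: every segment hash goes into an Annoy Hamming index, the index shortlists candidate images, and only those are compared with the full ImageMultiHash comparison. Names like `SegmentIndex` are illustrative, and 64-bit segment hashes and `matches()` with its default cutoffs are assumptions:

```python
from annoy import AnnoyIndex

SEGMENT_BITS = 64  # assumes the default 8x8 segment hashes

class SegmentIndex:
    def __init__(self):
        self.index = AnnoyIndex(SEGMENT_BITS, "hamming")
        self.owner = []        # annoy item id -> image id
        self.full_hashes = {}  # image id -> ImageMultiHash

    def add(self, image_id, multi_hash):
        self.full_hashes[image_id] = multi_hash
        for seg in multi_hash.segment_hashes:
            item_id = len(self.owner)
            self.index.add_item(item_id, [int(b) for b in seg.hash.flatten()])
            self.owner.append(image_id)

    def build(self):
        self.index.build(20)  # must be called once, after all add() calls

    def query(self, multi_hash, per_segment=20):
        # Stage 1: shortlist images owning a segment close to any query segment.
        candidates = set()
        for seg in multi_hash.segment_hashes:
            bits = [int(b) for b in seg.hash.flatten()]
            for item_id in self.index.get_nns_by_vector(bits, per_segment):
                candidates.add(self.owner[item_id])
        # Stage 2: full CropResistantHash comparison only against the shortlist.
        return [
            image_id for image_id in candidates
            if self.full_hashes[image_id].matches(multi_hash)
        ]
```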

@JohannesBuchner
Owner

Still, you could store the 1:N image-hash mapping (segment_hashes in CropResistantHash) in a database table and search using database-supported functions such as Hamming distance, which should be fast.
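
For example, SQLite has no built-in Hamming distance, but a custom function can be registered from Python. A sketch, assuming the segment hashes are additionally stored as 64-bit integers in an illustrative `hash_int` column rather than the hex strings above:

```python
import sqlite3

def hamming(a, b):
    # Hamming distance between two 64-bit integer hashes.
    return bin(a ^ b).count("1")

con = sqlite3.connect("hashes.db")
con.create_function("hamming", 2, hamming)

def images_within(hash_int, max_distance):
    cur = con.execute(
        "SELECT DISTINCT image_id FROM segment_hash "
        "WHERE hamming(hash_int, ?) <= ?",
        (hash_int, max_distance),
    )
    return [row[0] for row in cur]
```

Note that this still scans the whole table rather than using an index.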

@misotrnka
Author

I can try that, but I believe that any distance function in a DB would rely on a full sequential scan of the table. We are talking about hundreds of millions of rows here, so I think some kind of index is necessary to narrow the options down a bit. But thank you for the idea, I'll explore it.

@msminhas93

@misotrnka were you able to perform deduplication at 100M+ scale? I'm also trying to do something similar. If you could share any pointers, those would be valuable. I am thinking about inserting the perceptual hash into a DB and doing a distinct select. We would be okay with a certain loss of images in the process, so long as it is not outrageously wrong.
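
Roughly what I have in mind (a sketch with SQLite and phash; exact-match dedup only, so near-duplicates differing by a few bits would survive it):

```python
import sqlite3
from PIL import Image
import imagehash

con = sqlite3.connect("dedup.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS image_hash (
        image_path TEXT NOT NULL,
        phash_hex TEXT NOT NULL
    )
""")

def ingest(paths):
    rows = [(p, str(imagehash.phash(Image.open(p)))) for p in paths]
    con.executemany("INSERT INTO image_hash VALUES (?, ?)", rows)
    con.commit()

def unique_images():
    # Keep one representative image per distinct hash value ("distinct select").
    cur = con.execute(
        "SELECT phash_hex, MIN(image_path) FROM image_hash GROUP BY phash_hex"
    )
    return dict(cur.fetchall())
```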
