weighted min hash - minhash_many function #195

Open
dopc opened this issue Jan 19, 2023 · 3 comments

dopc commented Jan 19, 2023

Hey, thanks for this great project.
I want to use MinHash for my text embedding vectors, which contain both negative and positive numbers.
I searched the issues and found that weighted MinHash can be used for that.
I tried it and it actually works well.

My problem is with the minhash_many function: its result is different from that of the minhash function.
Below is minimal code to reproduce it, plus a screenshot showing the output so you don't have to run the code.

I want to use minhash_many since it is faster than a for loop.
So is this normal, or is it unexpected?
Thanks.

from time import perf_counter as pc

import numpy as np
from datasketch import WeightedMinHashGenerator

# 20,000 random vectors of dimension 100, with values in [-1, 1)
vectors = np.random.uniform(-1, 1, (20000, 100))

# generator for dimension 100 with sample size 32
mg = WeightedMinHashGenerator(vectors.shape[1], 32)

# batch version: hash all vectors with a single minhash_many call
t0 = pc()
many_result = np.array([wm.digest() for wm in mg.minhash_many(vectors)])
print(f'shape many: {many_result.shape}')
print(f'time many: {pc()-t0:.3f}')
print(f'many_result[0][:10]:\n{many_result[0][:10]}\n')

# loop version: hash each vector with a separate minhash call
t0 = pc()
for_result = np.array([mg.minhash(x).digest() for x in vectors])
print(f'shape for: {for_result.shape}')
print(f'time for: {pc()-t0:.3f}')
print(f'for_result[0][:10]:\n{for_result[0][:10]}')

[image: screenshot of the output of the code above]
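For reference when judging any of these digests: weighted MinHash estimates the weighted Jaccard similarity, which is defined over non-negative weights as the sum of element-wise minima divided by the sum of element-wise maxima. The sketch below computes that exact value in plain Python so it can serve as a ground truth. The `split_signed` helper is a hypothetical illustration, not something from this thread or from datasketch: it assumes one possible way to turn a signed embedding into a non-negative vector, by separating positive and negative parts into two halves.

```python
from typing import List, Sequence


def split_signed(vector: Sequence[float]) -> List[float]:
    """Hypothetical helper: map a signed vector to a non-negative one.

    Each component x contributes max(0, x) to the first half and
    max(0, -x) to the second half, so the output is twice as long
    as the input and has no negative entries.
    """
    positive = [max(0.0, x) for x in vector]
    negative = [max(0.0, -x) for x in vector]
    return positive + negative


def weighted_jaccard(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact weighted Jaccard: sum of minima over sum of maxima."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if any(x < 0 for x in a) or any(x < 0 for x in b):
        raise ValueError("weights must be non-negative")
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        raise ValueError("at least one vector must be non-zero")
    return sum(min(x, y) for x, y in zip(a, b)) / denominator


# Made-up example vectors, chosen only to keep the arithmetic easy to follow.
u = split_signed([0.5, -0.25, 0.0])
v = split_signed([0.25, 0.25, -1.0])
print(weighted_jaccard(u, v))  # 0.25 / 2.0 = 0.125
```

Whether such a split preserves the notion of similarity you care about for embeddings is a separate question; the point here is only to have an exact value to compare estimates against.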

Owner

ekzhu commented Jan 24, 2023

@jroose-jv is this expected behavior? My understanding is that minhash_many is a batch version of minhash.

Owner

ekzhu commented Jan 30, 2023

Sorry for the late response. If you want consistent results across all of your weighted MinHashes, I recommend picking either minhash or minhash_many, but not both.

Author

dopc commented Feb 1, 2023

I want to use minhash_many, but as far as I understand, its result does not have any meaning. In the code above, I used both functions only to show the difference between them.
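One way to check whether digests from a single method are meaningful, without comparing minhash output to minhash_many output element by element, is to look at the similarity estimates they produce among themselves. The sketch below shows how such an estimate can be formed from two digests: count the sample positions where both digests hold the same row. It assumes each digest is a sequence of two-integer rows; that layout is an assumption about what `digest()` returns, so check it against your datasketch version. The toy digests are made up for illustration. In practice, the `jaccard` method on the WeightedMinHash objects themselves is the more natural tool.

```python
from typing import Sequence, Tuple

# Assumed layout: one (k, t) pair of integers per sample position.
Digest = Sequence[Tuple[int, int]]


def estimate_jaccard(digest_a: Digest, digest_b: Digest) -> float:
    """Fraction of sample positions where both digests hold the same pair."""
    if len(digest_a) != len(digest_b):
        raise ValueError("digests must have the same sample size")
    if len(digest_a) == 0:
        raise ValueError("digests must not be empty")
    matches = sum(
        1
        for pair_a, pair_b in zip(digest_a, digest_b)
        if tuple(pair_a) == tuple(pair_b)
    )
    return matches / len(digest_a)


# Made-up digests with sample size 4; positions 0 and 2 agree.
digest_x = [(3, 7), (1, -2), (4, 0), (0, 5)]
digest_y = [(3, 7), (1, -1), (4, 0), (2, 5)]
print(estimate_jaccard(digest_x, digest_y))  # 2 matches out of 4 = 0.5
```

If estimates computed this way between pairs of minhash_many digests track the exact weighted Jaccard of the corresponding non-negative vectors, the batch output is usable on its own, even though it does not match minhash digest for digest.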

ekzhu added a commit that referenced this issue Feb 19, 2023
* #195 add doc note

* edit actions