Support list query for explorer #1087
I left a minor comment. LGTM!
src/datumaro/components/algorithms/hash_key_inference/explorer.py
if isinstance(query, list):
    topk_for_query = int(topk // len(query)) * 2 if not len(query) == 1 else topk
    query_hash_key_list = []
    result_list = []
    for query_ in query:
        if isinstance(query_, DatasetItem):
            query_key = self._get_hash_key_from_item_query(query_)
            query_hash_key_list.append(query_key)
        elif isinstance(query_, str):
            query_key = self._get_hash_key_from_text_query(query_)
            query_hash_key_list.append(query_key)
        else:
            raise MediaTypeError(
                "Unexpected media type of query '%s'. "
                "Expected 'DatasetItem' or 'string', actual'%s'" % (query_, type(query_))
            )

    for query_key in query_hash_key_list:
        unpacked_key = np.unpackbits(query_key.hash_key, axis=-1)
        logits = calculate_hamming(unpacked_key, database_keys)
        ind = np.argsort(logits)

        item_list = np.array(self._item_list)[ind]
        result_list.extend(item_list[:topk_for_query].tolist())
    return np.random.choice(result_list, topk)
Why is there a non-deterministic process here? Last time you said that the N-to-M top-k similarity search 1) finds the k most similar items for each of the N queries (N*k candidates) and 2) picks the top-k items from them. Is that right?
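To make the concern concrete, here is a small sketch of what `np.random.choice(result_list, topk)` does to a candidate list. The helper name `sample_candidates` and the toy candidate values are hypothetical and are not part of the PR. The point is only that the returned items depend on the random state, and that sampling happens with replacement by default, so the same candidate can be returned more than once.

```python
import numpy as np


def sample_candidates(candidates, topk, seed):
    """Hypothetical stand-in for `np.random.choice(result_list, topk)`.

    A seeded RandomState is used here only so the behaviour can be inspected;
    the snippet under review uses the unseeded global generator.
    """
    rng = np.random.RandomState(seed)
    # `choice` samples WITH replacement by default (replace=True),
    # so the similarity ranking is ignored and duplicates are possible.
    return rng.choice(candidates, topk).tolist()


if __name__ == "__main__":
    candidates = ["a", "b", "c"]  # toy candidate ids, not real DatasetItems
    picked = sample_candidates(candidates, topk=5, seed=0)
    # Five draws from three candidates must repeat at least one of them.
    print(picked, "unique:", len(set(picked)))
```

Because the draw ignores the Hamming distances entirely, the nearest candidates are not guaranteed to appear in the result.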
I updated the process for this. It first builds N * topk (or N * topk_for_query) candidates and their logits for the N queries and sorts the logits by value. It then reorders the candidates by the sorted logit indices and picks the top-k items from them.
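The updated process described above could look roughly like the following sketch. It is not the code from the pushed commit: `hamming_distances` is a hypothetical stand-in for `calculate_hamming`, the function name `topk_for_queries` and the toy data are assumptions, and duplicates across queries are not removed, since the comment does not mention deduplication.

```python
import numpy as np


def hamming_distances(query_bits, database_bits):
    """Hypothetical stand-in for `calculate_hamming`: differing bits per database row."""
    return np.count_nonzero(query_bits != database_bits, axis=-1)


def topk_for_queries(query_keys, database_bits, items, topk):
    """Deterministic N-to-M top-k search over packed uint8 hash keys."""
    n = len(query_keys)
    # Same per-query budget as the snippet under review.
    topk_for_query = topk if n == 1 else (topk // n) * 2

    cand_idx, cand_dist = [], []
    for key in query_keys:
        bits = np.unpackbits(key, axis=-1)
        dist = hamming_distances(bits, database_bits)
        # Stable sort keeps ties in database order, so results are reproducible.
        order = np.argsort(dist, kind="stable")[:topk_for_query]
        cand_idx.extend(order.tolist())
        cand_dist.extend(dist[order].tolist())

    # Rank the pooled candidates by distance and keep the best `topk`.
    pooled_order = np.argsort(cand_dist, kind="stable")[:topk]
    return [items[cand_idx[i]] for i in pooled_order]


if __name__ == "__main__":
    packed = np.array([[0b00000000], [0b00000001], [0b00000011], [0b11111111]], dtype=np.uint8)
    database_bits = np.unpackbits(packed, axis=-1)  # shape (4, 8)
    items = ["a", "b", "c", "d"]  # toy ids, not real DatasetItems
    queries = [np.array([0b00000000], dtype=np.uint8), np.array([0b11111111], dtype=np.uint8)]
    print(topk_for_queries(queries, database_bits, items, topk=2))
```

Sorting the pooled distances replaces the random draw, so repeated calls with the same queries return the same items in the same order.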
de801fe to c038a5e
LGTM.