Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added corpus_iter for Terrier index #426

Merged
merged 5 commits into from
Feb 27, 2024
Merged

added corpus_iter for Terrier index #426

merged 5 commits into from
Feb 27, 2024

Conversation

cmacdonald
Copy link
Contributor

Initial draft of #425

Unit tests are needed.

@seanmacavaney is the API useful?

@cmacdonald cmacdonald added the enhancement New feature or request label Feb 14, 2024
Copy link
Collaborator

@seanmacavaney seanmacavaney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just a few minor suggestions


for skipped in range(0, direct_inputstream.getEntriesSkipped()):
meta = meta_inputstream.next()
yield {k : meta[keys_offset[k]] for k in keys_offset}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super familar with the semantics of getEntriesSkipped. But should this include an empty toks to ensure that every dict from the iter has the same fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should, yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super familar with the semantics of getEntriesSkipped.

direct_inputstream only address documents that are non-empty. So we have to address non-empty documents using getEntriesSkipped()

@@ -271,14 +271,55 @@ def _index_add(self, other):
raise ValueError("Cannot document-wise merge indices with and without positions (%r vs %r)" % (blocks_1, blocks_2))
multiindex_cls = autoclass("org.terrier.realtime.multi.MultiIndex")
return multiindex_cls([self, other], blocks_1, fields_1 > 0)

def _index_corpusiter(self, direct=True):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

direct isn't the most intuitive name to me for this field. Perhaps return_toks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or toks? The corresponding indexing arg is pretokenised?

docs/terrier-index-api.rst Outdated Show resolved Hide resolved
@cmacdonald
Copy link
Contributor Author

direct isn't the most intuitive name to me for this field.

Final decision: toks or return_toks ?

@seanmacavaney
Copy link
Collaborator

I don't think I feel super strongly one way or the other, but I feel return_toks is a bit clearer (i.e., it sounds like a boolean).

@cmacdonald
Copy link
Contributor Author

Ok, ta. I write some unit tests then I merge.

@cmacdonald
Copy link
Contributor Author

cmacdonald commented Feb 26, 2024

My revised implementation requires Python 3.8.
Python 3.7 is EOL (July 2023). Should enforce an upgrade?

Colab is Py 3.10 now

@seanmacavaney
Copy link
Collaborator

I'm happy with min python version of 3.8

@cmacdonald cmacdonald merged commit 81f20bd into master Feb 27, 2024
13 checks passed
@cmacdonald cmacdonald deleted the terrier_corpus_iter branch February 27, 2024 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants