added corpus_iter for Terrier index #426

cmacdonald · 2024-02-14T18:53:33Z

Initial draft of #425

Unit tests are needed.

seanmacavaney

looks good, just a few minor suggestions

seanmacavaney · 2024-02-15T09:07:36Z

pyterrier/bootstrap.py

+
+                for skipped in range(0, direct_inputstream.getEntriesSkipped()):
+                    meta = meta_inputstream.next()
+                    yield {k : meta[keys_offset[k]] for k in keys_offset}


Not super familiar with the semantics of getEntriesSkipped, but should this include an empty toks so that every dict from the iterator has the same fields?

it should, yes

Not super familiar with the semantics of getEntriesSkipped.

direct_inputstream only addresses documents that are non-empty, so we have to account for the empty documents it skips using getEntriesSkipped().
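To make the skipped-entries point concrete, here is a minimal pure-Python sketch of the idea. It does not call Terrier: `corpus_iter_sketch`, the `meta` list and the `postings` list are hypothetical stand-ins for the meta index and the direct index input stream, and the gap between consecutive docids plays the role of getEntriesSkipped(). It only illustrates why empty documents should still be yielded with an empty toks.

```python
from typing import Dict, Iterator, List, Tuple


def corpus_iter_sketch(
    meta: List[Dict[str, str]],
    postings: List[Tuple[int, Dict[str, int]]],
) -> Iterator[dict]:
    """Yield one dict per document, in docid order.

    meta     -- one metadata dict per document (hypothetical stand-in)
    postings -- (docid, term frequencies) for NON-EMPTY documents only,
                sorted by docid (hypothetical stand-in)
    """
    next_docid = 0
    for docid, toks in postings:
        # Documents the postings stream skipped over are empty:
        # still yield them, with an empty toks, so every dict has the same fields.
        while next_docid < docid:
            yield {**meta[next_docid], "toks": {}}
            next_docid += 1
        yield {**meta[docid], "toks": dict(toks)}
        next_docid = docid + 1
    # Empty documents after the last non-empty one.
    while next_docid < len(meta):
        yield {**meta[next_docid], "toks": {}}
        next_docid += 1
```

With four documents of which only the second and third have any terms, the sketch still yields four dicts, each with a toks key.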

seanmacavaney · 2024-02-15T09:08:58Z

pyterrier/bootstrap.py

@@ -271,14 +271,55 @@ def _index_add(self, other):
            raise ValueError("Cannot document-wise merge indices with and without positions (%r vs %r)" % (blocks_1, blocks_2))
        multiindex_cls = autoclass("org.terrier.realtime.multi.MultiIndex")
        return multiindex_cls([self, other], blocks_1, fields_1 > 0)
+
+    def _index_corpusiter(self, direct=True):


direct isn't the most intuitive name to me for this field. Perhaps return_toks?

or toks? The corresponding indexing arg is pretokenised?

docs/terrier-index-api.rst

cmacdonald · 2024-02-17T12:12:56Z

direct isn't the most intuitive name to me for this field.

Final decision: toks or return_toks?

seanmacavaney · 2024-02-19T10:02:22Z

I don't think I feel super strongly one way or the other, but I feel return_toks is a bit clearer (i.e., it sounds like a boolean).
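As a rough illustration of the naming point, the sketch below shows how a boolean flag called return_toks reads at the call site. The function `iter_docs` and its record layout are hypothetical and are not the PyTerrier implementation; only the parameter name comes from this thread.

```python
from typing import Iterable, Iterator


def iter_docs(records: Iterable[dict], return_toks: bool = True) -> Iterator[dict]:
    """Yield one dict per record; include term frequencies only when asked.

    Hypothetical helper for illustration, not part of PyTerrier.
    """
    for rec in records:
        out = {"docno": rec["docno"]}
        if return_toks:
            # Copy so callers cannot mutate the underlying record.
            out["toks"] = dict(rec.get("toks", {}))
        yield out
```

A call such as `iter_docs(records, return_toks=False)` makes it fairly clear that only the metadata is returned, which is harder to infer from `direct=False`.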

cmacdonald · 2024-02-19T10:36:16Z

OK, ta. I'll write some unit tests, then merge.

cmacdonald · 2024-02-26T15:08:56Z

My revised implementation requires Python 3.8.
Python 3.7 has been EOL since June 2023. Should we enforce an upgrade?

Colab is Py 3.10 now

seanmacavaney · 2024-02-26T18:18:23Z

I'm happy with a minimum Python version of 3.8.
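One way to enforce a minimum version at runtime is a small guard like the sketch below. The helper name `check_python_version` and the error message are assumptions for illustration; the thread does not show how the check was actually implemented.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version agreed in this thread


def check_python_version(version_info=None, minimum=MIN_PYTHON):
    """Raise RuntimeError if the interpreter is older than `minimum`.

    Hypothetical helper; `version_info` defaults to the running interpreter
    and can be overridden to exercise the check in tests.
    """
    if version_info is None:
        version_info = sys.version_info
    current = tuple(version_info[:2])
    if current < tuple(minimum):
        raise RuntimeError(
            "Python %d.%d or newer is required, found %d.%d"
            % (minimum[0], minimum[1], current[0], current[1])
        )
    return current
```

Declaring the same bound in the package metadata (for example via python_requires in setup.py) would let pip refuse installation on older interpreters, with the runtime guard as a fallback.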

added corpus_iter example

eab3990

cmacdonald added the enhancement New feature or request label Feb 14, 2024

cmacdonald requested a review from seanmacavaney February 14, 2024 18:53

seanmacavaney reviewed Feb 15, 2024


addresses Sean's feedback

11dc54e

cmacdonald added 2 commits February 23, 2024 13:03

improved testing

e4108fb

added unit tests. control for empty docs, check python version

48f6ae6

Merge branch 'master' into terrier_corpus_iter

3485b16

cmacdonald merged commit 81f20bd into master Feb 27, 2024
13 checks passed

cmacdonald deleted the terrier_corpus_iter branch February 27, 2024 09:39
