improve error messages for invalid indexing configurations #349

cmacdonald · 2022-12-06T19:18:58Z

This addresses the experience reported in #348 by @maxhenze

seanmacavaney · 2022-12-06T19:34:37Z

Along this line, a few situations I've seen students encounter recently that could use better indexing error messages:

- Short circuit error message if a docno is encountered that's longer than the configured maximum. (Currently no error is thrown; you only find out once you get very poor performance when retrieving.)
- Clearer error message when the user passes non-dict objects to IterDictIndexer (it's caught in Java and is rather cryptic).

cmacdonald · 2022-12-07T12:35:32Z

> Clearer error message when the user passes non-dict objects to IterDictIndexer (it's caught in Java and is rather cryptic).

I have introduced a check for this on the first document. There is a related issue (tracked at terrier-org/terrier-core#215): FLATJsonDocument does not propagate the Java exception back into Python, which we can fix upstream for a future release.
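For readers following along, a first-document check of this kind can be sketched roughly as below. This is only an illustration, not the code in this PR: the helper name `check_first_is_dict` is hypothetical, and it assumes the input is an arbitrary iterable that must not lose its first item when peeked.

```python
from itertools import chain


def check_first_is_dict(it):
    """Peek at the first item and raise a clear error if it is not a dict.

    Hypothetical helper for illustration; returns an iterator that still
    yields every item, including the one that was peeked at.
    """
    it = iter(it)
    try:
        first = next(it)
    except StopIteration:
        return iter(())  # empty input: nothing to check
    if not isinstance(first, dict):
        raise TypeError(
            "Expected an iterable of dicts, but the first item is a "
            f"{type(first).__name__}"
        )
    # Put the peeked item back in front of the remaining items.
    return chain([first], it)
```

The check runs in Python before anything is handed to Java, so the user sees a Python `TypeError` instead of a Java-side parsing failure.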

> Short circuit error message if a docno is encountered that's longer than the configured maximum. (Currently no error is thrown; you only find out once you get very poor performance when retrieving.)

Actually, this is caused by 'metaindex.compressed.crop.long' : 'true', which you have set for IterDictIndexer (presumably to make text cropping automatic) - see https://github.com/terrier-org/pyterrier/blob/master/pyterrier/index.py#L803. We could pass the meta lengths to FlatJSONDocumentIterator and check the validity of the first document? This won't handle all possible error modes (e.g. the first document is fine, but a later document is not), but would help, right?

seanmacavaney · 2022-12-07T12:47:40Z

Awesome, thanks!

> This won't handle all possible error modes (e.g. the first document is fine, but a later document is not), but would help, right?

Yeah, this would probably cover most cases and would be a huge help.

But cases where things are numbered (e.g., %p1 ... %p10) could consistently fail. The error message in the passage case would be that a duplicate docno was found, and the root cause could be tricky to get to the bottom of without this check.
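To make the numbered-passage case concrete, here is a small worked example. It assumes that cropping is plain truncation to the configured length, and both the docnos and the limit `MAX_DOCNO_LEN` are made up for illustration.

```python
MAX_DOCNO_LEN = 9  # hypothetical configured meta length for 'docno'

# doc123%p1 ... doc123%p10: the first nine are 9 characters, the tenth is 10.
docnos = [f"doc123%p{i}" for i in range(1, 11)]

# Assumption: over-long values are cropped by simple truncation.
cropped = [d[:MAX_DOCNO_LEN] for d in docnos]

# A first-document check passes, because the first docno fits.
first_ok = len(docnos[0]) <= MAX_DOCNO_LEN

# The tenth docno is cropped onto the first, producing a duplicate.
duplicates = sorted({d for d in cropped if cropped.count(d) > 1})
```

Here `first_ok` is `True` while `duplicates` contains `doc123%p1`, which is why the visible symptom would be a duplicate-docno error rather than a length error.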

Is the reasoning for only checking the first one that checking every docno would get expensive? I imagine that it wouldn't add all that much overhead, especially compared to the JSON generation, crossing between Python/Java, doing the actual indexing, etc.
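A per-document check could be as small as the generator below. This is a sketch under assumptions, not the implementation in this PR: `check_docno_lengths` is a hypothetical name, each document is assumed to be a dict with a string `docno`, and length is counted in characters.

```python
def check_docno_lengths(it, max_len):
    """Yield documents unchanged, raising as soon as a docno exceeds max_len.

    Hypothetical helper for illustration. The work per document is a single
    len() call, and documents are still streamed lazily.
    """
    for i, doc in enumerate(it):
        docno = doc['docno']
        if len(docno) > max_len:  # assumption: length measured in characters
            raise ValueError(
                f"docno {docno!r} (document {i}) has length {len(docno)}, "
                f"longer than the configured maximum of {max_len}"
            )
        yield doc
```

Because it is a generator, it can wrap whatever iterable is being indexed without materialising it, and the error surfaces at the first offending document rather than only at the first one overall.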

cmacdonald · 2022-12-07T22:20:36Z

Ok, I have adjusted FlatJSONDocumentIterator to check the first doc. I note that FlatJSONDocumentIterator is only used by the non-FIFO setting, i.e. Windows, so these improvements would not work on Linux/macOS. The fifo impl seems to do more things natively in Java, and I can't easily see how to fix this.

Aside: the fifo impl is more complicated and possibly unnecessary if the default is threads=1.

seanmacavaney · 2022-12-07T22:40:14Z

> and I can't easily see how to fix this

Should be able to just peek at the first doc here, no? If you like, I could maybe take a look.

> Aside: the fifo impl is more complicated and possibly unnecessary if the default is threads=1.

I suspect it's far more efficient though. From what I've seen benchmarking jnius, it's costly to move between Python and Java (which the nofifo one needs to do for every document, while the fifo one does not).

cmacdonald · 2022-12-08T10:27:42Z

> If you like, I could maybe take a look.

Sure, go for it.

seanmacavaney · 2022-12-08T13:35:08Z

pyterrier/index.py

@@ -849,7 +883,7 @@ def index(self, it, fields=('text',), meta=None, meta_lengths=None, threads=None
            #     {'docno' : 'd1', 'toks' : {'a' : 1, 'aa' : 2}}
            # ]

-            iter_docs = DocListIterator(it)
+            iter_docs = DocListIterator(self._filter_iterable(it, fields))


I'm not sure why self._filter_iterable wasn't here before, as it is in the other invocation below.

I don't think I understood its purpose...

seanmacavaney · 2022-12-08T13:37:11Z

@cmacdonald -- updated! Mind taking a look when you have a chance?

I also changed the warning to an error in the case of docno. I feel that if your docnos are being truncated, it is always a problem, and it's best to force the user to deal with it before indexing starts. Other fields do not matter as much, so they remain just a warning.
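The error-versus-warning split described above might look roughly like this. It is a sketch only: `check_meta_lengths` is a hypothetical name, `meta_lengths` is assumed to be a dict mapping meta field names to maximum lengths, and field values are assumed to be strings.

```python
import warnings


def check_meta_lengths(doc, meta_lengths):
    """Raise if the docno is too long; only warn for other meta fields.

    Hypothetical helper for illustration. `meta_lengths` maps a meta field
    name to its configured maximum length.
    """
    for field, max_len in meta_lengths.items():
        value = doc.get(field)
        if value is None or len(value) <= max_len:
            continue
        msg = (
            f"Field {field!r} has length {len(value)}, longer than the "
            f"configured maximum of {max_len}"
        )
        if field == 'docno':
            # A truncated docno can no longer identify the document.
            raise ValueError(msg)
        # Other fields are merely cropped, so indexing can continue.
        warnings.warn(msg)
```

For example, a document with a short docno and an over-long `title` would produce one warning and continue, while an over-long docno stops indexing immediately.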

cmacdonald · 2022-12-08T17:34:26Z

lgtm, after discussion

improve error messages for invalid indexing configurations

b8e63f9

cmacdonald mentioned this pull request Dec 7, 2022

FLATJsonDocument consumes parsing exceptions terrier-org/terrier-core#215

Closed

type checking iterdict. as requested

c2a761f

cmacdonald changed the title ~~improve error messages for invalid DF indexing configurations~~ improve error messages for invalid indexing configurations Dec 7, 2022

addresses first-doc fix

ac90d3f

seanmacavaney added 2 commits December 8, 2022 13:32

refactored, error when docno too long (rather than warn), support for…

11d1a8f

… fifo

formatting

025048b

seanmacavaney reviewed Dec 8, 2022

cmacdonald merged commit 6947ee4 into master Dec 8, 2022

cmacdonald deleted the issue348 branch December 8, 2022 17:35

cmacdonald added this to the 0.9 milestone Dec 8, 2022
