Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve error messages for invalid indexing configurations #349

Merged
merged 5 commits into from
Dec 8, 2022

Conversation

cmacdonald
Copy link
Contributor

This addresses the experience reported in #348 by @maxhenze

@seanmacavaney
Copy link
Collaborator

Along this line, a few situations I've seen students encounter recently that could use better indexing error messages:

  • Short circuit error message if a docno is encountered that's longer than the configured maximum. (Currently no error thrown, only know once you get very poor performance when retrieving).
  • Clearer error message when user passes non-dict objects to IterDictIndexer (it's caught in Java and is rather cryptic)

@cmacdonald cmacdonald changed the title improve error messages for invalid DF indexing configurations improve error messages for invalid indexing configurations Dec 7, 2022
@cmacdonald
Copy link
Contributor Author

Clearer error message when user passes non-dict objects to IterDictIndexer (it's caught in Java and is rather cryptic)

I have introduced a check for this on the first document. There is a related issue (tracking at terrier-org/terrier-core#215) of not propagating the Java exception back into Python by FLATJsonDocument that we can fix upstream for a future release.

Short circuit error message if a docno is encountered that's longer than the configured maximum. (Currently no error thrown, only know once you get very poor performance when retrieving).

Actually, this is caused by 'metaindex.compressed.crop.long' : 'true' which you have for IterDictIndexer (presumably to make text cropping automatic) - see https://github.com/terrier-org/pyterrier/blob/master/pyterrier/index.py#L803. We could pass the meta lengths to FlatJSONDocumentIterator and check the validity of the first document? This wont handle all possible error modes (e.g. first document is fine, but latter document is not), but would help, right?

@seanmacavaney
Copy link
Collaborator

Awesome, thanks!

This wont handle all possible error modes (e.g. first document is fine, but latter document is not), but would help, right?

Yeah, this would probably cover most cases and would be a huge help.

But cases where things are numbered (e.g., %p1 ... %p10) could consistently fail. The error message in the passage case would be that a duplicate docno was found, and the root cause could be tricky to get to the bottom of without this check.

The reasoning for only checking the first one is that checking every docno would get expensive? I imagine that it wouldn't add all that much overhead, especially as compared to the JSON generation, crossing between Python/Java, doing the actual indexing, etc.

@cmacdonald
Copy link
Contributor Author

Ok, I have adjusted FlatJSONDocumentIterator to handle first doc. I note that FlatJSONDocumentIterator is only used by the non-FIFO setting, ie Windows, so these improvements would not work on Linux/macOS. The fifo impl seems to do more things natively in Java, and I cant see easily how to fix this.

Aside: the fifo impl is more complicated and possibly unnecessary if the default is threads=1.

@seanmacavaney
Copy link
Collaborator

and I cant see easily how to fix this

Should be able to just peek at the first doc here, no? If you like, I could maybe take a look.

Aside: the fifo impl is more complicated and possibly unnecessary if the default is threads=1.

I suspect it's far more efficient though. From what I've seen benchmarking jnius, it's costly to move between Python and Java (which the nofifo one needs to do for every document, while the fifo one does not).

@cmacdonald
Copy link
Contributor Author

If you like, I could maybe take a look.

Sure, go for it.

@@ -849,7 +883,7 @@ def index(self, it, fields=('text',), meta=None, meta_lengths=None, threads=None
# {'docno' : 'd1', 'toks' : {'a' : 1, 'aa' : 2}}
# ]

iter_docs = DocListIterator(it)
iter_docs = DocListIterator(self._filter_iterable(it, fields))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why self._filter_iterable wasn't here before, as it is in the other invocation below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont think I understood its purpose....

@seanmacavaney
Copy link
Collaborator

seanmacavaney commented Dec 8, 2022

@cmacdonald -- updated! Mind taking a look when you have a chance?

I also changed the warning to an error in the case of docno. I feel that if your docnos are being truncated, it is always a problem and it's best to force the user to deal with it before indexing will start. Other fields do not matter as much, so they remain just a warning.

@cmacdonald
Copy link
Contributor Author

lgtm, after discussion

@cmacdonald cmacdonald merged commit 6947ee4 into master Dec 8, 2022
@cmacdonald cmacdonald deleted the issue348 branch December 8, 2022 17:35
@cmacdonald cmacdonald added this to the 0.9 milestone Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants