-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve error messages for invalid indexing configurations #349
Conversation
Along this line, a few situations I've seen students encounter recently that could use better indexing error messages:
|
I have introduced a check for this on the first document. There is a related issue (tracking at terrier-org/terrier-core#215) of not propagating the Java exception back into Python by FLATJsonDocument that we can fix upstream for a future release.
Actually, this is caused by |
Awesome, thanks!
Yeah, this would probably cover most cases and would be a huge help. But cases where things are numbered (e.g., The reasoning for only checking the first one is that checking every docno would get expensive? I imagine that it wouldn't add all that much overhead, especially as compared to the JSON generation, crossing between Python/Java, doing the actual indexing, etc. |
Ok, I have adjusted FlatJSONDocumentIterator to handle first doc. I note that FlatJSONDocumentIterator is only used by the non-FIFO setting, ie Windows, so these improvements would not work on Linux/macOS. The fifo impl seems to do more things natively in Java, and I cant see easily how to fix this. Aside: the fifo impl is more complicated and possibly unnecessary if the default is threads=1. |
Should be able to just peek at the first doc here, no? If you like, I could maybe take a look.
I suspect it's far more efficient though. From what I've seen benchmarking jnius, it's costly to move between Python and Java (which the nofifo one needs to do for every document, while the fifo one does not). |
Sure, go for it. |
@@ -849,7 +883,7 @@ def index(self, it, fields=('text',), meta=None, meta_lengths=None, threads=None | |||
# {'docno' : 'd1', 'toks' : {'a' : 1, 'aa' : 2}} | |||
# ] | |||
|
|||
iter_docs = DocListIterator(it) | |||
iter_docs = DocListIterator(self._filter_iterable(it, fields)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why self._filter_iterable
wasn't here before, as it is in the other invocation below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont think I understood its purpose....
@cmacdonald -- updated! Mind taking a look when you have a chance? I also changed the warning to an error in the case of docno. I feel that if your docnos are being truncated, it is always a problem and it's best to force the user to deal with it before indexing will start. Other fields do not matter as much, so they remain just a warning. |
lgtm, after discussion |
This addresses the experience reported in #348 by @maxhenze