Odd bug in results sorting #74

Closed
scottlet opened this issue Mar 5, 2014 · 4 comments

@scottlet

scottlet commented Mar 5, 2014

Hi,
I've got about 9,000 food items in an array, and I want to use lunr to match results and order them. So far so good.

Having tried it in Node, I'm getting an odd result. I tried it on the front end as well and get the same thing: searching for "bread" brings back "seafood breader" first, then "breadfruit", and then finally "bread". I'd expect "bread" to be first...

I've uploaded my test case, including the data, here: http://03sq.net/lunr-test/ as I'm not sure whether I've done something obviously wrong, whether this is a bug, or how to debug it :)

@olivernn
Owner

olivernn commented Mar 5, 2014

Firstly, thanks for a great bug report; having a test case like the one you've put together makes it so much easier to try to diagnose the issue.

I'll try to describe what is happening here; hopefully it makes sense!

When you search for bread, lunr treats it as a term with an implicit wildcard at the end, i.e. bread*. This term is expanded into the following terms: ["bread", "breadcrumb", "breadstick", "breader", "breadfruit"]. You can see this for yourself by calling idx.tokenStore.expand('bread'). These expanded terms are then the ones used to find matching documents.
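As a quick sketch, assuming idx is the index built from the food data in your test case:

var expanded = idx.tokenStore.expand('bread');
console.log(expanded);
// => ["bread", "breadcrumb", "breadstick", "breader", "breadfruit"] for this corpus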

lunr uses TF-IDF to rank how similar a document and a search term are. The IDF part, inverse document frequency, penalises tokens that are common in the corpus (the total collection of documents). In the case of your index the token bread appears a total of 87 times, whereas the token breader appears only once. Again, you can check this using the following snippet: Object.keys(idx.tokenStore.get('bread')).length. You can see the effect this has on each term's IDF score: bread has a value of 6.4740077904950954 whilst breader does much better at 10.93991590914968, calculated using idx.idf(token).
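Again as a sketch against the same idx, the checks mentioned above look like this:

var breadDocs = Object.keys(idx.tokenStore.get('bread')).length;     // 87 in your corpus
var breaderDocs = Object.keys(idx.tokenStore.get('breader')).length; // just 1
var idfBread = idx.idf('bread');     // ~6.474
var idfBreader = idx.idf('breader'); // ~10.940, so documents matching "breader" score higher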

There are measures in place to try to ensure that exact matches get a score boost; however, that boost isn't significant enough in your use case.

This is an issue that has cropped up before; I think in your case it may be amplified by the small size of the documents.

As for a solution, at the moment I'm not sure. There are a couple of issues like this that have prompted me to take a closer look at how the scoring/ranking of search results is calculated. I don't have anything definitive yet, but these are problems I'd like to solve.

A potential work-around for you is to disable the IDF calculation; this can be achieved fairly simply (though via monkey freedom patching):

idx.idf = function () { return 1 }
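In context, you would apply the patch after building the index; something like this sketch (the ref and field names are just examples and may not match your data):

// foods is your array of ~9000 food items
var idx = lunr(function () {
  this.ref('id');
  this.field('name');
});

foods.forEach(function (food) { idx.add(food); });

// neutralise IDF so ranking is driven by term frequency alone
idx.idf = function () { return 1; };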

I'll definitely keep this in mind for upcoming releases, though; perhaps there should be a simpler way to disable the IDF calculation. I'll have a think.

@micahbolen

+1 for highly educational discourse

@scottlet
Author

scottlet commented Mar 5, 2014

Thanks for this! Perhaps there might be a way, after indexing the data, to intelligently guess the shape of the data and enable or disable the IDF calculation accordingly, along with some kind of option to turn it on and off.

I’ll have a try with your monkeyfreedompatch when I get back to the office tomorrow :)

Very impressed with Lunr so far!

@olivernn
Owner

The latest version of Lunr (v2) no longer automatically inserts wildcards at the end of queries. A search for "bread" will not return any results for "seafood breader" or "breadfruit". Wildcards are still supported, but must be explicit. To re-create the behaviour in this issue you would have to search for "bread*".
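As a rough sketch of the v2 behaviour (the documents here are just examples):

var idx = lunr(function () {
  this.ref('id');
  this.field('name');

  this.add({ id: 1, name: 'bread' });
  this.add({ id: 2, name: 'seafood breader' });
  this.add({ id: 3, name: 'breadfruit' });
});

idx.search('bread');  // matches only the "bread" document
idx.search('bread*'); // explicit wildcard, also matches "breader" and "breadfruit"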

So, it only took me 37 months to fix this issue, not bad!
