Odd bug in results sorting #74

Closed
scottlet opened this issue Mar 5, 2014 · 4 comments

@scottlet

scottlet commented Mar 5, 2014

Hi,
I've got about 9,000 food items in an array, and I want to use lunr to match results and order them. So far so good.

Having tried it in Node, I'm getting an odd result. I tried it on the front end as well and get the same thing: searching for "bread" brings back "seafood breader" first, then "breadfruit", and then finally "bread". I'd expect "bread" to be first...

I've uploaded my test case, including the data, here: http://03sq.net/lunr-test/ as I'm not sure whether I've done something obviously wrong, whether this is a bug, or how to debug it :)

@olivernn
Owner

olivernn commented Mar 5, 2014

Firstly, thanks for a great bug report; having a test case like the one you've put together makes it so much easier to try to diagnose the issue.

I'll try to describe what is happening here; hopefully it makes sense!

When you search for bread, lunr treats it as a term with an implicit wildcard at the end, i.e. bread*. This term is expanded into the following terms: ["bread", "breadcrumb", "breadstick", "breader", "breadfruit"]. You can see this for yourself by calling idx.tokenStore.expand('bread'). These expanded terms are then the ones used to find matching documents.
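As a quick sketch, assuming idx is the index built from the food data in your test case:

var expanded = idx.tokenStore.expand('bread');
console.log(expanded);
// => ["bread", "breadcrumb", "breadstick", "breader", "breadfruit"] for this corpus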

lunr uses TF-IDF to rank how similar a document and a search term are. The IDF part, inverse document frequency, penalises tokens that are common in the corpus (the total collection of documents). In the case of your index the token bread appears a total of 87 times, whereas the token breader appears only once. Again, you can check this using the following snippet: Object.keys(idx.tokenStore.get('bread')).length. You can see the effect this has on each term's IDF score: bread has a value of 6.4740077904950954 whilst breader does much better at 10.93991590914968, calculated using idx.idf(token).
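Again as a sketch against the same idx, the checks mentioned above look like this:

var breadDocs = Object.keys(idx.tokenStore.get('bread')).length;     // 87 in your corpus
var breaderDocs = Object.keys(idx.tokenStore.get('breader')).length; // just 1
var idfBread = idx.idf('bread');     // ~6.474
var idfBreader = idx.idf('breader'); // ~10.940, so documents matching "breader" score higher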

There are measures in place to try to ensure that exact matches get a score boost; however, that boost isn't significant enough in your use case.

This is an issue that has cropped up before; I think in your case it may be amplified by the small size of the documents.

As for a solution, at the moment I'm not sure. There are a couple of issues like this that have prompted me to take a closer look at how the scoring/ranking of search results is calculated. I don't have anything definitive yet, but these are problems I'd like to solve.

A potential work-around for you is to disable the IDF calculation; this can be achieved fairly simply (though via monkey freedom patching):

idx.idf = function () { return 1 }
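In context, you would apply the patch after building the index; something like this sketch (the ref and field names are just examples and may not match your data):

// foods is your array of ~9000 food items
var idx = lunr(function () {
  this.ref('id');
  this.field('name');
});

foods.forEach(function (food) { idx.add(food); });

// neutralise IDF so ranking is driven by term frequency alone
idx.idf = function () { return 1; };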

I'll definitely keep this in mind for upcoming releases, though; perhaps there should be a simpler way to disable the IDF calculation. I'll have a think.

@micahbolen

+1 for highly educational discourse

@scottlet
Author

scottlet commented Mar 5, 2014

Thanks for this! Perhaps there might be a way, after indexing the data, to intelligently guess the shape of the data and enable or disable the IDF calculation accordingly, along with some kind of option to turn it on and off.

I’ll have a try with your monkeyfreedompatch when I get back to the office tomorrow :)

Very impressed with Lunr so far!

@olivernn
Owner

The latest version of Lunr (v2) no longer automatically inserts wildcards at the end of queries. A search for "bread" will not return any results for "seafood breader" or "breadfruit". Wildcards are still supported, but must be explicit. To re-create the behaviour in this issue you would have to search for "bread*".
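As a rough sketch of the v2 behaviour (the documents here are just examples):

var idx = lunr(function () {
  this.ref('id');
  this.field('name');

  this.add({ id: 1, name: 'bread' });
  this.add({ id: 2, name: 'seafood breader' });
  this.add({ id: 3, name: 'breadfruit' });
});

idx.search('bread');  // matches only the "bread" document
idx.search('bread*'); // explicit wildcard, also matches "breader" and "breadfruit"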

So, it only took me 37 months to fix this issue, not bad!
