Fix invalid break iterator highlighting on keyword field #49566

jimczi · 2019-11-25T20:10:40Z

By default the unified highlighter splits the input into passages using
a sentence break iterator. However, we don't check whether the field is tokenized,
so keyword fields also apply the break iterator even though they can
only match on their entire content. This means that by default we split the
content of a keyword field on sentence breaks if the requested number of fragments
is set to a value other than 0 (the default is 5). This commit changes this behavior
to ignore the break iterator on non-tokenized fields (keyword) in order to always
highlight entire values. The number of requested fragments controls the number of
matched values that are returned, but the boundary_scanner_type is now ignored.
Note that this was already the behavior in 6.x, but some refactoring of Lucene's
highlighter exposed this bug in Elasticsearch 7.x.
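
To make the difference concrete, here is a minimal sketch using only the JDK's `java.text.BreakIterator`. It is not the code changed in this PR: the class `KeywordPassageSketch` and its `passages` method are hypothetical names invented for illustration, and the `tokenized` flag stands in for whatever check the highlighter performs on the field type.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Elasticsearch or Lucene.
class KeywordPassageSketch {

    // Returns the candidate passages for one field value.
    // Assumption: "tokenized" is false for keyword-like fields.
    static List<String> passages(String value, boolean tokenized) {
        List<String> result = new ArrayList<>();
        if (!tokenized) {
            // A non-tokenized value can only match as a whole,
            // so the whole value is the only passage.
            result.add(value);
            return result;
        }
        // Tokenized text: split on sentence boundaries.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(value);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String passage = value.substring(start, end).trim();
            if (!passage.isEmpty()) {
                result.add(passage);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String value = "Hello world. Second sentence here.";
        System.out.println(passages(value, true));  // split per sentence
        System.out.println(passages(value, false)); // entire value kept intact
    }
}
```

With the sentence iterator applied to a keyword value, a highlight could come back as only part of the value; skipping the iterator returns the value whole, which is the behavior described above.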


elasticmachine · 2019-11-25T20:10:42Z

Pinging @elastic/es-search (:Search/Highlighting)

mayya-sharipova

Thanks @jimczi, makes sense


jimczi added >bug :Search Relevance/Highlighting v8.0.0 v7.6.0 labels Nov 25, 2019

mayya-sharipova approved these changes Nov 26, 2019


jimczi merged commit 871408f into elastic:master Dec 4, 2019

jimczi added a commit that referenced this pull request Dec 4, 2019

add missing change after backport of #49566

1d522c6

jimczi added a commit that referenced this pull request Dec 4, 2019

#49566 Fix non-deterministic sort order in testHighlightingWithKeywor…

6e0342d

…dIgnoreBoundaryScanner

jimczi added a commit that referenced this pull request Dec 4, 2019

#49566 Fix non-deterministic sort order in testHighlightingWithKeywo…

53d801c

…rdIgnoreBoundaryScanner

SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020

elastic#49566 Fix non-deterministic sort order in testHighlightingWit…

b3a6f47

…hKeywordIgnoreBoundaryScanner

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
