Bug: When using graph synonym and stop token filter together #28838

aslamy · 2018-02-27T14:16:50Z

Elasticsearch 6.2.0

Description:
When the stop and synonym_graph token filters are used together, a document that should match does not match, and highlighting does not work as expected.

Steps to reproduce:

Mapping

{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "english_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            },
            "english_search_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "synonym_graph_tokenfilter",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            }
         },
         "filter":{  
            "english_stopwords_tokenfilter":{  
               "type":"stop",
               "stopwords":"_english_"
            },
            "synonym_graph_tokenfilter":{  
               "type":"synonym_graph",
               "synonyms":[  
                  "world of war, wow"
               ]
            }
         }
      }
   },
   "mappings":{  
      "doc":{  
         "properties":{  
            "title":{  
               "type":"text",
               "analyzer":"english_analyzer",
               "search_analyzer":"english_search_analyzer"
            }
         }
      }
   }
}

Indexing 3 documents

{  "title":"world of war"}
{  "title":"wow"}
{  "title":"world of war. wow"}

Search

{  
   "query":{  
      "match":{  
         "title":"world of war"
      }
   },
   "highlight":{  
      "fields":{  
         "title":{  
            "fragment_size":0,
            "type":"unified"
         }
      }
   }
}

Search Result:

{  
   "took":1,
   "timed_out":false,
   "_shards":{  
      "total":5,
      "successful":5,
      "skipped":0,
      "failed":0
   },
   "hits":{  
      "total":2,
      "max_score":0.2876821,
      "hits":[  
         {  
            "_index":"test",
            "_type":"doc",
            "_id":"2",
            "_score":0.2876821,
            "_source":{  
               "title":"world of war. wow"
            },
            "highlight":{  
               "title":[  
                  "world of war. <em>wow</em>"
               ]
            }
         },
         {  
            "_index":"test",
            "_type":"doc",
            "_id":"1",
            "_score":0.2876821,
            "_source":{  
               "title":"wow"
            },
            "highlight":{  
               "title":[  
                  "<em>wow</em>"
               ]
            }
         }
      ]
   }
}

Problems:
Bug 1. The document { "title":"world of war" } should match, but it does not.
Bug 2. The highlighter does not highlight "world of war".

I also tried putting synonym_graph_tokenfilter after english_stopwords_tokenfilter, but then I get:

{  
   "error":{  
      "root_cause":[  
         {  
            "type":"illegal_argument_exception",
            "reason":"failed to build synonyms"
         }
      ],
      "type":"illegal_argument_exception",
      "reason":"failed to build synonyms",
      "caused_by":{  
         "type":"parse_exception",
         "reason":"Invalid synonym rule at line 1",
         "caused_by":{  
            "type":"illegal_argument_exception",
            "reason":"term: world of war analyzed to a token (war) with position increment != 1 (got: 2)"
         }
      }
   },
   "status":400
}
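
The error above can be illustrated with a simplified model of the analysis chain. This is only a sketch, not Lucene's implementation: `analyze` and `check_synonym_term` are hypothetical helpers, and `STOPWORDS` is a small illustrative subset rather than the full `_english_` list. It assumes that a removed stop word leaves a position gap, so the next token that is kept carries a position increment greater than 1, which is what the synonym rule parser rejects.

```python
import re

# Illustrative subset only (assumption): the real _english_ list is longer.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def analyze(text, stopwords=STOPWORDS):
    """Return (term, position_increment) pairs after lowercasing and stop-word removal.

    Simplified model: each removed token leaves a gap, so the next kept
    token gets a position increment of 1 plus the number of removed tokens.
    """
    tokens = []
    skipped = 0
    for term in re.findall(r"\w+", text.lower()):
        if term in stopwords:
            skipped += 1
            continue
        tokens.append((term, 1 + skipped))
        skipped = 0
    return tokens


def check_synonym_term(text):
    """Raise ValueError if any analyzed token has a position increment other than 1."""
    for term, inc in analyze(text):
        if inc != 1:
            raise ValueError(
                f"term: {text} analyzed to a token ({term}) "
                f"with position increment != 1 (got: {inc})"
            )


print(analyze("world of war"))  # [('world', 1), ('war', 2)]
```

In this model, `check_synonym_term("world of war")` raises with a message of the same shape as the one in the response above, while `check_synonym_term("wow")` passes.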


javanna · 2018-03-01T15:27:51Z

cc @elastic/es-search-aggs

colings86 · 2018-03-01T17:40:00Z

@romseygeek Could you take a look at this?

jimczi · 2018-03-01T17:50:04Z

This is a known issue in Lucene and we're currently discussing different options for the fix:
https://issues.apache.org/jira/browse/LUCENE-8137
The only workaround for now is to not use the stop word filter when using the synonym_graph or to remove the stop words manually from the synonyms defined for the filter.

mayya-sharipova · 2018-03-20T19:48:36Z

I am closing this issue, as the problem is at the Lucene level (an issue has been opened there and is currently in progress), and there is nothing we can do at the Elasticsearch level.

kut · 2020-02-20T16:59:33Z

Hey @jimczi - just wanted to follow up on this. I'm getting a similar issue. The exact bug above (where only 2 out of 3 matches are found) no longer occurs (I'm using ES 7.6.0) - good news. And if you switch the order of the stopword and synonym_graph filters, you still get the illegal_argument_exception as expected (the Lucene bug has not been fixed). HOWEVER, with the filters in the new order, the workaround described above does not work:

This is a known issue in Lucene and we're currently discussing different options for the fix:
https://issues.apache.org/jira/browse/LUCENE-8137
The only workaround for now is to not use the stop word filter when using the synonym_graph or to remove the stop words manually from the synonyms defined for the filter.

If, in the example above, you put the synonym_graph filter AFTER the stopwords filter AND manually remove stop words from the synonyms (i.e. now synonyms=["world war, wow"]), then a query with "world of war" CANNOT match text with "world of war". Did I misunderstand the workaround? (That's very likely, because I imagine lots of people use synonym_graph with stopwords.)

Thanks in advance!

(PS: the reason I need to put synonym_graph AFTER stopwords is that the stopwords are case sensitive whereas the synonyms are not case sensitive)

If helpful, here are the requests I'm running:

PUT /test-xxx
{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "english_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            },
            "english_search_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter",
                  "synonym_graph_tokenfilter"
               ],
               "tokenizer":"standard"
            }
         },
         "filter":{  
            "english_stopwords_tokenfilter":{  
               "type":"stop",
               "stopwords":"_english_"
            },
            "synonym_graph_tokenfilter":{  
               "type":"synonym_graph",
               "synonyms":[  
                  "world war, wow"
               ]
            }
         }
      }
   },
   "mappings":{  
     "properties":{  
        "title":{  
           "type":"text",
           "analyzer":"english_analyzer",
           "search_analyzer":"english_search_analyzer"
        }
     }
   }
}

POST _bulk
{ "index" : { "_index" : "test-xxx" } }
{ "title":"world of war" }
{ "index" : { "_index" : "test-xxx" } }
{ "title":"wow" }
{ "index" : { "_index" : "test-xxx" } }
{ "title":"world of war. wow" }

GET /test-xxx/_search
{  
   "query":{  
      "match":{  
         "title":"world of war"
      }
   },
   "highlight":{  
      "fields":{  
         "title":{  
            "fragment_size":0,
            "type":"unified"
         }
      }
   }
}

DELETE /test-xxx

jimczi · 2020-05-15T08:50:47Z

I am reopening this issue since it's a long-standing bug and it's not resolved in Lucene.
The only workaround that works at the moment is to not use stop words, at index and query time.
You can define rules with and without stop words, for instance:
"world of war, world war, wow" should match all variations.
Removing terms in a filter before or after the synonym graph should be avoided until the bug is resolved.
We want to solve this situation but it is not likely to happen before a major release considering the changes that are required on the analysis chain.
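
The suggestion above to define rules with and without stop words could be automated when generating the synonym list. The following is a minimal sketch under stated assumptions: `expand_rule` is a hypothetical helper, `STOPWORDS` is an illustrative subset rather than the full `_english_` list, and only comma-separated equivalence rules are handled (not explicit `=>` mappings).

```python
# Illustrative subset only (assumption): the real _english_ list is longer.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def expand_rule(rule, stopwords=STOPWORDS):
    """Add a stop-word-free variant of each phrase in a comma-separated synonym rule.

    Order is preserved and duplicates are dropped, so phrases without
    stop words appear only once.
    """
    variants = []
    for phrase in (p.strip() for p in rule.split(",")):
        words = phrase.split()
        stripped = " ".join(w for w in words if w.lower() not in stopwords)
        for candidate in (phrase, stripped):
            if candidate and candidate not in variants:
                variants.append(candidate)
    return ", ".join(variants)


print(expand_rule("world of war, wow"))  # world of war, world war, wow
```

The expanded rule would then be used in the synonym_graph filter in an analyzer that has no stop filter, at either index or query time, as the comment above recommends.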

elasticsearchmachine · 2024-07-12T10:31:00Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

javanna added the :Search Relevance/Analysis How text is split into tokens label Mar 1, 2018

colings86 assigned romseygeek Mar 1, 2018

mayya-sharipova closed this as completed Mar 20, 2018

yiwei-sb mentioned this issue Mar 10, 2020

PositionIncrement issue (Elasticsearch 6.6.2 + jieba 6.4.1) sing1ee/elasticsearch-jieba-plugin#35

Closed

jimczi added the >bug label May 15, 2020

jimczi reopened this May 15, 2020

passerbythesun mentioned this issue May 16, 2020

Problem encountered when configuring synonyms sing1ee/elasticsearch-jieba-plugin#49

Open

ywelsch mentioned this issue Apr 28, 2022

match_phrase queries miss documents containing stop words in synonyms #86021

Closed

togatoga mentioned this issue Nov 12, 2023

The synonym filter is being influenced by other filters WorksApplications/elasticsearch-sudachi#110

Closed

javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024
