Bug: When using graph synonym and stop token filter together #28838

Open
aslamy opened this issue Feb 27, 2018 · 7 comments
Labels
>bug :Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@aslamy

aslamy commented Feb 27, 2018

Elasticsearch 6.2.0

Description:
When using stop and graph synonym filters together, the document that should match doesn't match and highlight doesn't work as it should.

Steps to reproduce:

Mapping

{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "english_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            },
            "english_search_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "synonym_graph_tokenfilter",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            }
         },
         "filter":{  
            "english_stopwords_tokenfilter":{  
               "type":"stop",
               "stopwords":"_english_"
            },
            "synonym_graph_tokenfilter":{  
               "type":"synonym_graph",
               "synonyms":[  
                  "world of war, wow"
               ]
            }
         }
      }
   },
   "mappings":{  
      "doc":{  
         "properties":{  
            "title":{  
               "type":"text",
               "analyzer":"english_analyzer",
               "search_analyzer":"english_search_analyzer"
            }
         }
      }
   }
}

Indexing 3 documents

{  "title":"world of war"}
{  "title":"wow"}
{  "title":"world of war. wow"}

Search

{  
   "query":{  
      "match":{  
         "title":"world of war"
      }
   },
   "highlight":{  
      "fields":{  
         "title":{  
            "fragment_size":0,
            "type":"unified"
         }
      }
   }
}

Search Result:

{  
   "took":1,
   "timed_out":false,
   "_shards":{  
      "total":5,
      "successful":5,
      "skipped":0,
      "failed":0
   },
   "hits":{  
      "total":2,
      "max_score":0.2876821,
      "hits":[  
         {  
            "_index":"test",
            "_type":"doc",
            "_id":"2",
            "_score":0.2876821,
            "_source":{  
               "title":"world of war. wow"
            },
            "highlight":{  
               "title":[  
                  "world of war. <em>wow</em>"
               ]
            }
         },
         {  
            "_index":"test",
            "_type":"doc",
            "_id":"1",
            "_score":0.2876821,
            "_source":{  
               "title":"wow"
            },
            "highlight":{  
               "title":[  
                  "<em>wow</em>"
               ]
            }
         }
      ]
   }
}

Problems:
Bug 1. The document { "title":"world of war" } should match, but it does not.
Bug 2. The highlighter does not highlight "world of war".

I have also tried putting synonym_graph_tokenfilter after the english_stopwords_tokenfilter filter, but then I get:

{  
   "error":{  
      "root_cause":[  
         {  
            "type":"illegal_argument_exception",
            "reason":"failed to build synonyms"
         }
      ],
      "type":"illegal_argument_exception",
      "reason":"failed to build synonyms",
      "caused_by":{  
         "type":"parse_exception",
         "reason":"Invalid synonym rule at line 1",
         "caused_by":{  
            "type":"illegal_argument_exception",
            "reason":"term: world of war analyzed to a token (war) with position increment != 1 (got: 2)"
         }
      }
   },
   "status":400
}
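
The error message suggests that the synonym rule itself was analyzed with the stop filter in front of it: once "of" is dropped, "war" ends up two positions after "world". As a rough illustration only, here is a toy Python model of position increments. It is not Lucene's code; `analyze`, `check_synonym_term`, and the small stop word set are made up for this sketch.

```python
# Toy model of token position increments. This is NOT how Lucene is implemented;
# it only mimics the bookkeeping that the error message refers to.

# Small illustrative subset, not the full _english_ stop word list.
STOPWORDS = frozenset({"of", "the", "a", "an", "and"})


def analyze(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, drop stop words.

    Returns a list of (term, position_increment) pairs. A removed stop word
    leaves a gap, so the next kept token gets an increment greater than 1.
    """
    tokens = []
    increment = 1
    for term in text.lower().split():
        if term in stopwords:
            increment += 1  # remember the hole left by the removed token
            continue
        tokens.append((term, increment))
        increment = 1
    return tokens


def check_synonym_term(text, stopwords=frozenset()):
    """Reject a synonym term whose analyzed tokens contain a position gap."""
    for term, increment in analyze(text, stopwords):
        if increment != 1:
            raise ValueError(
                f"term: {text} analyzed to a token ({term}) "
                f"with position increment != 1 (got: {increment})"
            )


if __name__ == "__main__":
    print(analyze("world of war", STOPWORDS))
    try:
        check_synonym_term("world of war", STOPWORDS)
    except ValueError as err:
        print(err)
```

In this toy model, the rule "world of war, wow" is rejected as soon as the stop words are removed before the synonym check, which is the same shape as the failure above.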
@javanna javanna added the :Search Relevance/Analysis How text is split into tokens label Mar 1, 2018
@javanna
Member

javanna commented Mar 1, 2018

cc @elastic/es-search-aggs

@colings86
Contributor

@romseygeek Could you take a look at this?

@jimczi
Contributor

jimczi commented Mar 1, 2018

This is a known issue in Lucene and we're currently discussing different options for the fix:
https://issues.apache.org/jira/browse/LUCENE-8137
The only workaround for now is to not use the stop word filter when using the synonym_graph or to remove the stop words manually from the synonyms defined for the filter.

@mayya-sharipova
Contributor

I will be closing this issue, as the problem is at the Lucene level (a Lucene issue has been opened and is currently in progress), and there is nothing we can do at the Elastic level.

@kut

kut commented Feb 20, 2020

Hey @jimczi - just wanted to follow up on this. I'm getting a similar issue. The exact bug above (where only 2 out of 3 matches are found) no longer occurs (I'm using ES 7.6.0) - good news. And if you switch the order of the stopword and synonym_graph filters, you still get the illegal_argument_exception as expected (the Lucene bug has not been fixed). HOWEVER, with the filters in the new order, the workaround described above does not work:

This is a known issue in Lucene and we're currently discussing different options for the fix:
https://issues.apache.org/jira/browse/LUCENE-8137
The only workaround for now is to not use the stop word filter when using the synonym_graph or to remove the stop words manually from the synonyms defined for the filter.

If, in the example above, you put the synonym graph filter AFTER the stopwords filter AND manually remove stopwords from the synonyms (i.e. now synonyms=["world war, wow"]), then a query with "world of war" CANNOT match text with "world of war". Did I misunderstand the workaround? (That's very likely, because I imagine lots of people use synonym_graph with stopwords.)

Thanks in advance!

(PS: the reason I need to put synonym_graph AFTER stopwords is that the stopwords are case sensitive whereas the synonyms are not case sensitive)

If helpful, here are the requests I'm running:

PUT /test-xxx
{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "english_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            },
            "english_search_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter",
                  "synonym_graph_tokenfilter"
               ],
               "tokenizer":"standard"
            }
         },
         "filter":{  
            "english_stopwords_tokenfilter":{  
               "type":"stop",
               "stopwords":"_english_"
            },
            "synonym_graph_tokenfilter":{  
               "type":"synonym_graph",
               "synonyms":[  
                  "world war, wow"
               ]
            }
         }
      }
   },
   "mappings":{  
     "properties":{  
        "title":{  
           "type":"text",
           "analyzer":"english_analyzer",
           "search_analyzer":"english_search_analyzer"
        }
     }
   }
}

POST _bulk
{ "index" : { "_index" : "test-xxx" } }
{ "title":"world of war" }
{ "index" : { "_index" : "test-xxx" } }
{ "title":"wow" }
{ "index" : { "_index" : "test-xxx" } }
{ "title":"world of war. wow" }

GET /test-xxx/_search
{  
   "query":{  
      "match":{  
         "title":"world of war"
      }
   },
   "highlight":{  
      "fields":{  
         "title":{  
            "fragment_size":0,
            "type":"unified"
         }
      }
   }
}

DELETE /test-xxx
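
One way to see which tokens and positions the search analyzer produces for the query text is the `_analyze` API, called before the index is deleted. The sketch below only builds such a request for the index and analyzer names used above and sends nothing; `build_analyze_request` is a hypothetical helper, not part of any client library.

```python
import json


def build_analyze_request(index, analyzer, text):
    """Return (method, path, body) for an _analyze call. Nothing is sent."""
    body = {"analyzer": analyzer, "text": text}
    return "GET", f"/{index}/_analyze", json.dumps(body)


if __name__ == "__main__":
    method, path, body = build_analyze_request(
        "test-xxx", "english_search_analyzer", "world of war"
    )
    print(method, path)
    print(body)
```

The printed request can be pasted into the same console as the requests above. Comparing its output with the same call for "english_analyzer" shows how the index-time and search-time token streams differ.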

@jimczi
Contributor

jimczi commented May 15, 2020

I am reopening this issue since it's a long-standing bug and it's not resolved in Lucene.
The only workaround that works at the moment is to not use stop words, at index and query time.
You can define rules with and without stop words, for instance:
"world of war, world war, wow" should match all variations.
Removing terms in a filter before or after the synonym graph should be avoided until the bug is resolved.
We want to solve this situation but it is not likely to happen before a major release considering the changes that are required on the analysis chain.
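
Writing both variants of every rule by hand gets tedious for a long synonym list. A minimal sketch of generating them, assuming simple comma-separated equivalence rules (no "=>" mappings); `expand_rule` and the stop word set are hypothetical and not part of Elasticsearch:

```python
def expand_rule(rule, stopwords):
    """Add a stop-word-free variant of each phrase in a comma-separated rule.

    Only handles plain equivalence rules such as "world of war, wow";
    explicit "a => b" mappings are not supported by this sketch.
    """
    variants = []
    for phrase in (part.strip() for part in rule.split(",")):
        if not phrase:
            continue
        # Same phrase with stop words dropped, e.g. "world of war" -> "world war".
        stripped = " ".join(
            word for word in phrase.split() if word.lower() not in stopwords
        )
        for candidate in (phrase, stripped):
            # Skip empty results (phrase made only of stop words) and duplicates.
            if candidate and candidate not in variants:
                variants.append(candidate)
    return ", ".join(variants)


if __name__ == "__main__":
    # Illustrative stop word subset, not the full _english_ list.
    print(expand_rule("world of war, wow", {"of", "the", "a"}))
```

The expanded rules would then go into the "synonyms" list of the synonym_graph filter, with the stop filter removed from both the index and search analyzers as described above.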

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

8 participants