Preconfigured edge_ngram tokenizer has incorrect defaults #43582

Closed
romseygeek opened this issue Jun 25, 2019 · 1 comment

@romseygeek
Contributor

The docs state:

With the default settings, the `edge_ngram` tokenizer treats the initial text as a
single token and produces N-grams with minimum length `1` and maximum length
`2`:

This is correct if you define a new tokenizer of type `edge_ngram`, like so:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "my_ngram"
        }
      },
      "tokenizer" : {
        "my_ngram" : {
          "type" : "edge_ngram"
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer" : "default",
  "text" : "test"
}
{
  "tokens" : [
    {
      "token" : "t",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "te",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    }
  ]
}

However, if you instead use the pre-configured `edge_ngram` tokenizer, you only get n-grams of size 1:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "edge_ngram"
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer" : "default",
  "text" : "test"
}
{
  "tokens" : [
    {
      "token" : "t",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    }
  ]
}

We should change the pre-configured tokenizer so that its defaults correspond to the documentation.
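
As a workaround until the defaults are aligned, a minimal sketch (the index name test and the tokenizer name my_edge_ngram are illustrative) is to define your own edge_ngram tokenizer and pin the documented defaults explicitly, rather than relying on the pre-configured one:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "my_edge_ngram"
        }
      },
      "tokenizer": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 2
        }
      }
    }
  }
}

With min_gram and max_gram set explicitly, analyzing "test" produces "t" and "te" as in the documented example, regardless of which defaults the pre-configured tokenizer ships with.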

@romseygeek added the :Search Relevance/Analysis (How text is split into tokens) label on Jun 25, 2019
@romseygeek self-assigned this on Jun 25, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

romseygeek added a commit to romseygeek/elasticsearch that referenced this issue Jun 25, 2019
romseygeek added a commit that referenced this issue Jun 27, 2019
When a named token filter or char filter is passed as part of an Analyze API
request with no index, we currently try to build the relevant filter using no
index settings. However, this can miss cases where there is a pre-configured
filter defined in the analysis registry. One example here is the elision filter, which
has a pre-configured version built with the French elision set; when used as part
of normal analysis, this pre-configured set is used, but when used as part of the
Analyze API we end up with NPEs because it tries to instantiate the filter with
no index settings.

This commit changes the Analyze API to check for pre-configured filters in the case
that the request has no index defined, and is using a name rather than a custom
definition for a filter.

It also changes the pre-configured `word_delimiter_graph` filter and `edge_ngram`
tokenizer to make their settings consistent with the defaults used when creating
them with no settings.

Closes #43002
Closes #43621
Closes #43582
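
For context, a minimal sketch of the kind of no-index Analyze request the commit message describes, using the elision filter it mentions (the sample text is illustrative):

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["elision"],
  "text": "l'avion"
}

Before the fix, a request like this could hit an NPE because the named filter is built with no index settings instead of resolving the pre-configured French variant; after the fix, pre-configured filters are checked when the request has no index and refers to a filter by name rather than by a custom definition.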
@javanna added the Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) label on Jul 16, 2024