
Use preconfigured filters correctly in Analyze API #43568

Merged
merged 9 commits into from
Jun 27, 2019

Conversation

romseygeek
Contributor

When a named token filter or char filter is passed as part of an Analyze API
request with no index, we currently try to build the relevant filter using no
index settings. However, this can miss cases where there is a pre-configured
filter defined in the analysis registry. One example is the elision filter, which
has a pre-configured version built with the French elision set; when used as part
of normal analysis, this pre-configured set is used, but when used as part of the
Analyze API we end up with NPEs because the API tries to instantiate the filter with
no index settings.

This commit changes the Analyze API to check for pre-configured filters in the case
that the request has no index defined, and is using a name rather than a custom
definition for a filter.

Relates to #43002
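For illustration, a no-index Analyze request of roughly this shape exercises the code path described above (a hypothetical example, not taken from this PR; `l'avion` is just sample French text the elision filter would strip the `l'` from):

```json
GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "elision" ],
  "text": "l'avion"
}
```

Because no index is named, the named `elision` filter was previously built with no index settings instead of resolving to the pre-configured (French) variant, leading to the NPE.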

@romseygeek romseygeek self-assigned this Jun 25, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@romseygeek
Contributor Author

This is an interesting failure: the pre-configured EdgeNGramTokenizer uses the Lucene default values for min and max, which are both 1; however, the Elasticsearch EdgeNGramTokenizerFactory instead uses the default values from NGramTokenizer, giving a min of 1 and a max of 2. This is a pretty clear bug: settings of tokenizer=edge_ngram and tokenizer.type=edge_ngram give you different configurations.
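To make the discrepancy concrete, here is a hypothetical pair of requests (not from the PR) that hit the two code paths. Given the defaults described above, the first (pre-configured, Lucene defaults min=1/max=1) would produce only `q` for the input, while the second (custom definition, factory defaults min=1/max=2) would produce `q` and `qu`:

```json
GET _analyze
{
  "tokenizer": "edge_ngram",
  "text": "quick"
}

GET _analyze
{
  "tokenizer": { "type": "edge_ngram" },
  "text": "quick"
}
```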

@cbuescher cbuescher self-requested a review June 25, 2019 13:19
Member

@cbuescher cbuescher left a comment


The main change in AnalysisRegistry makes sense to me looking at the underlying issue. When reading the test I was wondering if it would be better (and/or easier) to start testing AnalysisRegistry#buildCustomAnalyzer() more directly in AnalysisRegistryTests instead of going through TransportAnalyzeActionTests. Not a hard requirement for this PR, but maybe there are things that could be pulled out or added easily.

@romseygeek
Contributor Author

When reading the test I was wondering if it would be better (and/or easier) to start testing AnalysisRegistry#buildCustomAnalyzer() more directly in AnalysisRegistryTests instead of going through TransportAnalyzeActionTests

+1, that makes sense, I'll open a follow-up issue

Member

@cbuescher cbuescher left a comment


Left a minor comment around testing; feel free to disregard if you don't like the idea. The rest LGTM.

VersionUtils.randomVersionBetween(random(), Version.V_7_0_0, VersionUtils.getPreviousVersion(Version.V_7_3_0)))
.put("index.analysis.analyzer.my_analyzer.tokenizer", "edge_ngram")
.build();
IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", indexSettings);
Member


Maybe it would make sense to factor out the settings creation into a private helper that takes the version and the tokenizer name as arguments. It's more or less the same in all four cases and takes up a bit of space.
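One possible shape for such a helper (a sketch only; the method name, the `my_analyzer` key, and the use of `IndexMetaData.SETTING_VERSION_CREATED` are assumptions for illustration, not code from this PR):

```java
// Hypothetical helper: builds the per-test index settings for a given
// created-version and tokenizer name, as suggested in the review comment.
private Settings analyzerSettings(Version createdVersion, String tokenizer) {
    return Settings.builder()
        .put(IndexMetaData.SETTING_VERSION_CREATED, createdVersion)
        .put("index.analysis.analyzer.my_analyzer.tokenizer", tokenizer)
        .build();
}
```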

.put("index.analysis.analyzer.my_analyzer.tokenizer", "standard")
.putList("index.analysis.analyzer.my_analyzer.filter", "word_delimiter_graph")
.build();
IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", indexSettings);
Member


same here

@romseygeek romseygeek merged commit fbefb46 into elastic:master Jun 27, 2019
@romseygeek romseygeek deleted the analyze-preconfig-filters branch June 27, 2019 08:08
romseygeek added a commit that referenced this pull request Jun 27, 2019
When a named token filter or char filter is passed as part of an Analyze API
request with no index, we currently try to build the relevant filter using no
index settings. However, this can miss cases where there is a pre-configured
filter defined in the analysis registry. One example is the elision filter, which
has a pre-configured version built with the French elision set; when used as part
of normal analysis, this pre-configured set is used, but when used as part of the
Analyze API we end up with NPEs because the API tries to instantiate the filter with
no index settings.

This commit changes the Analyze API to check for pre-configured filters in the case
that the request has no index defined, and is using a name rather than a custom
definition for a filter.

It also changes the pre-configured `word_delimiter_graph` filter and `edge_ngram`
tokenizer to make their settings consistent with the defaults used when creating
them with no settings.

Closes #43002
Closes #43621
Closes #43582
romseygeek added a commit that referenced this pull request Jun 27, 2019
…r is used (#43684)

#26625 deprecated delimited_payload_filter and added tests to check
that warnings would be emitted when both a normal and pre-configured
filter were used. Unfortunately, due to a bug in the Analyze API, the
pre-configured filter check was never actually triggered, and it turns out that
the deprecation warning was not in fact being emitted in this case.

#43568 fixed the Analyze API bug, which then surfaced this on backport.

This commit ensures that the pre-configured filter also emits the warnings
and triggers an error if a new index tries to use a pre-configured
delimited_payload_filter.
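As an illustration (a hypothetical request, not from the commit), using the deprecated pre-configured name in a no-index Analyze call should now emit the deprecation warning pointing at the replacement name, `delimited_payload`:

```json
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "delimited_payload_filter" ],
  "text": "the|1 quick|2"
}
```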
javanna added a commit to javanna/elasticsearch that referenced this pull request Sep 17, 2024
edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created since 8.0,
so their support can be removed entirely from main.

The version-related logic around the min grams can also be removed, as it refers to 7.x, which we no longer need to support.

Relates to elastic#50376, elastic#50862, elastic#43568
javanna added a commit that referenced this pull request Sep 18, 2024
edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created since 8.0,
so their support can be removed entirely from main.

The version-related logic around the min grams can also be removed, as it refers to 7.x, which we no longer need to support.

Relates to #50376, #50862, #43568
javanna added a commit to javanna/elasticsearch that referenced this pull request Sep 18, 2024
…c#113009)

edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created since 8.0,
so their support can be removed entirely from main.

The version-related logic around the min grams can also be removed, as it refers to 7.x, which we no longer need to support.

Relates to elastic#50376, elastic#50862, elastic#43568