Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Prevent fetching multi-field from source #48770

Conversation

dimitris-athanasiou
Copy link
Contributor

Aggregatable mutli-fields are at the moment wrongly mapped
as normal doc_value fields and thus they support fetching from
source. However, they do not exist in the source. This results
to failure to extract such fields.

This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.

Relates #48756

Aggregatable mutli-fields are at the moment wrongly mapped
as normal doc_value fields and thus they support fetching from
source. However, they do not exist in the source. This results
to failure to extract such fields.

This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.

Relates elastic#48756
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@@ -283,10 +283,12 @@ public void testLookbackOnlyWithSourceDisabled() throws Exception {
new LookbackOnlyTestHelper("test-lookback-only-with-source-disabled", "airline-data-disabled-source").execute();
}

@AwaitsFix(bugUrl = "This test uses painless which is not available in the integTest phase")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that since this test was written the test's environment is different and we have access to painless now. I re-enabled the test.

@dimitris-athanasiou
Copy link
Contributor Author

run elasticsearch-ci/1 (unrelated failure in SnapshotIT)

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor nits around unnecessary hashset construction.

Overall, the interfaces seem cleaner.

private final T falseValue;

BooleanMapper(ExtractedField field, T trueValue, T falseValue) {
super(field.getName(), Collections.singleton(BooleanFieldMapper.CONTENT_TYPE));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like Collections.singleton(BooleanFieldMapper.CONTENT_TYPE) could be a private static final field.

static final String TYPE = "geo_point";

public GeoPointField(String name) {
super(name, Collections.singleton(TYPE));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#nit
this Collections.singleton(TYPE) and the others that are simply passed up to super could be private static finalso that a new HashSet isn't built on every ctor call

@dimitris-athanasiou
Copy link
Contributor Author

run elasticsearch-ci/2 (unrelated failure)

@dimitris-athanasiou
Copy link
Contributor Author

run elasticsearch-ci/default-distro

1 similar comment
@dimitris-athanasiou
Copy link
Contributor Author

run elasticsearch-ci/default-distro

@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

@dimitris-athanasiou dimitris-athanasiou merged commit e9367ef into elastic:master Nov 1, 2019
@dimitris-athanasiou dimitris-athanasiou deleted the prevent-fetching-multi-field-from-source branch November 1, 2019 10:34
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Nov 1, 2019
Aggregatable mutli-fields are at the moment wrongly mapped
as normal doc_value fields and thus they support fetching from
source. However, they do not exist in the source. This results
to failure to extract such fields.

This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.

Relates elastic#48756

Backport of elastic#48770
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Nov 1, 2019
Aggregatable mutli-fields are at the moment wrongly mapped
as normal doc_value fields and thus they support fetching from
source. However, they do not exist in the source. This results
to failure to extract such fields.

This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.

Relates elastic#48756

Backport of elastic#48770
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Nov 1, 2019
In the case multi-fields exist in the source index, we pick
all variants of them in our extracted fields detection for
data frame analytics. This means we may have multiple instances
of the same feature. The worse consequence of this is when the
dependent variable (for regression or classification) is also
duplicated which means we train a model on the dependent variable
itself.

Now that elastic#48770 is merged, this commit is adding logic to
only select one variant of multi-fields.

Closes elastic#48756
dimitris-athanasiou added a commit that referenced this pull request Nov 1, 2019
In the case multi-fields exist in the source index, we pick
all variants of them in our extracted fields detection for
data frame analytics. This means we may have multiple instances
of the same feature. The worse consequence of this is when the
dependent variable (for regression or classification) is also
duplicated which means we train a model on the dependent variable
itself.

Now that #48770 is merged, this commit is adding logic to
only select one variant of multi-fields.

Closes #48756
dimitris-athanasiou added a commit that referenced this pull request Nov 1, 2019
Aggregatable mutli-fields are at the moment wrongly mapped
as normal doc_value fields and thus they support fetching from
source. However, they do not exist in the source. This results
to failure to extract such fields.

This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.

Relates #48756

Backport of #48770
dimitris-athanasiou added a commit that referenced this pull request Nov 1, 2019
Aggregatable mutli-fields are at the moment wrongly mapped
as normal doc_value fields and thus they support fetching from
source. However, they do not exist in the source. This results
to failure to extract such fields.

This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.

Relates #48756

Backport of #48770
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Nov 1, 2019
…48799)

In the case multi-fields exist in the source index, we pick
all variants of them in our extracted fields detection for
data frame analytics. This means we may have multiple instances
of the same feature. The worse consequence of this is when the
dependent variable (for regression or classification) is also
duplicated which means we train a model on the dependent variable
itself.

Now that elastic#48770 is merged, this commit is adding logic to
only select one variant of multi-fields.

Closes elastic#48756

Backport of elastic#48799
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Nov 1, 2019
…48799)

In the case multi-fields exist in the source index, we pick
all variants of them in our extracted fields detection for
data frame analytics. This means we may have multiple instances
of the same feature. The worse consequence of this is when the
dependent variable (for regression or classification) is also
duplicated which means we train a model on the dependent variable
itself.

Now that elastic#48770 is merged, this commit is adding logic to
only select one variant of multi-fields.

Closes elastic#48756

Backport elastic#48799
dimitris-athanasiou added a commit that referenced this pull request Nov 1, 2019
…48806)

In the case multi-fields exist in the source index, we pick
all variants of them in our extracted fields detection for
data frame analytics. This means we may have multiple instances
of the same feature. The worse consequence of this is when the
dependent variable (for regression or classification) is also
duplicated which means we train a model on the dependent variable
itself.

Now that #48770 is merged, this commit is adding logic to
only select one variant of multi-fields.

Closes #48756

Backport of #48799
dimitris-athanasiou added a commit that referenced this pull request Nov 1, 2019
…48807)

In the case multi-fields exist in the source index, we pick
all variants of them in our extracted fields detection for
data frame analytics. This means we may have multiple instances
of the same feature. The worse consequence of this is when the
dependent variable (for regression or classification) is also
duplicated which means we train a model on the dependent variable
itself.

Now that #48770 is merged, this commit is adding logic to
only select one variant of multi-fields.

Closes #48756

Backport #48799
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants