Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML][Data Frame] adds new pipeline field to dest config #43124

Merged

Conversation

benwtrent
Copy link
Member

This PR adds a new field pipeline that references an existing ingest pipeline via its id. If set, the DestConfig#pipeline field will be set on each index request. Meaning, all documents being indexed will go through the defined pipeline.

closes #43061

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

Copy link
Member Author

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change may bring the need for adding BWC tests for data frames. If desired, will address that in a separate PR.

}

public DestConfig(final StreamInput in) throws IOException {
index = in.readString();
if (in.getVersion().onOrAfter(Version.CURRENT)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be adjusted after backport

public boolean isValid() {
return index.isEmpty() == false;
}

@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeString(index);
if (out.getVersion().onOrAfter(Version.CURRENT)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be adjusted after backport

@benwtrent
Copy link
Member Author

I have been thinking more about how to handle pipelines in _preview. It may be beneficial to _simulate the pipeline given the created docs.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's there so far looks good, but, as you said in a comment, we need to think how to include this in the preview too. Otherwise the preview becomes actively misleading.

* @return The {@link Builder} with index set
*/
public Builder setIndex(String index) {
this.index = index;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Objects.requireNonNull(index) here so the error that the constructor will throw is traced to the root cause.

@benwtrent
Copy link
Member Author

@droberts195 thinking more about _preview. If we do a pipeline simulation, I think it also makes sense to include doc metadata (_id, etc.) in the preview. Right now it is just the _source but if we allow pipelines, it makes sense for each preview entry to also include the pipeline metadata.

Consequently, even without a pipeline the preview results should probably look like {_id: id, _source: <the current preview objects>}

@benwtrent
Copy link
Member Author

benwtrent commented Jun 12, 2019

Current simulation returns formatted docs like:

"preview" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "adzAekwbm_Y7QlKddxbhMG4AAAAAAAAA",
        "_source" : {
          "static_field" : 42,
          "machine" : {
            "os" : "ios"
          },
          "max(time)" : "2019-08-01T21:45:26.749Z"
        },
        "_ingest" : {
          "timestamp" : "2019-06-12T19:28:22.342755Z"
        }
      }
    },
...
]

It seems to me that we want the two returns to be as similar as possible. So, we can:

1️⃣ Nest the normal preview doc objects into a doc field.
2️⃣ Bring the doc field value supplied by the _simulate response up a level
3️⃣ Only display the _source from the _simulate response and don't change the format of the preview response without a pipeline at all.

I am not sure which would be better. When it comes to consuming the API, either work.

@benwtrent
Copy link
Member Author

Ultimately, I have opted to use option 3️⃣ mentioned above. It keeps the preview response consistent with what is currently done. Additionally, we don't want to "advertise" the _id as for consistent data frame transforms, we need the _id field untouched by the user.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I agree that making the output of preview match the final document structure as closely as possible is the best option.

My main comment is about the authorization aspect of this.

String id = (String) doc.get(DataFrameField.DOCUMENT_ID_FIELD);
doc.keySet().removeIf(k -> k.startsWith("_"));
src.put("_source", doc);
src.put("_id", id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also pass in _index (obtained from config.getDestination().getIndex()) so that if the pipeline accesses _index the simulation works like the real thing would.

ClientHelper.executeWithHeadersAsync(threadPool.getThreadContext().getHeaders(),
ClientHelper.DATA_FRAME_ORIGIN,
client,
SimulatePipelineAction.INSTANCE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The action name for this is cluster:admin/ingest/pipeline/simulate, so previewing a data frame transform with a pipeline now requires that privilege. We should document this and also test how easy it is to understand the resulting error message if you have all the other privileges required to use the data frame transform except this one.

It's not ideal that using the data frame transform for real does not require any special privilege over and above being able to index data, but simulating does. The cluster privilege that includes cluster:admin/ingest/pipeline/simulate is manage_ingest_pipelines, which is the same one that lets you delete ingest pipelines! As a followup maybe we should enquire about the possibility of adding a simulate_ingest_pipelines cluster privilege that just lets you simulate, and not CRUD.

We also need to pass this information on to the UI team when the ability to add a pipeline gets added to the UI, as the UI might want to proactively prevent people who cannot simulate a pipeline from creating a data frame transform that uses one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droberts195 what do you think of not using the user headers for _simulate? I am not sure the reason behind requiring the special permission when they can already index documents referring to the pipeline itself....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a good idea. If we run this action as the internal _xpack user then it bypasses the need for the end user to have pipeline admin privileges. We're still requiring that they have permission to preview the data frame transform. If we consider simulating the pipeline an internal implementation detail of previewing the data frame transform then it's reasonable to run it as _xpack.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benwtrent benwtrent merged commit 9f29749 into elastic:master Jun 19, 2019
@benwtrent benwtrent deleted the feature/ml-df-add-pipeline-support branch June 19, 2019 17:58
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Jun 19, 2019
* [ML][Data Frame] adds new pipeline field to dest config

* Adding pipeline support to _preview

* removing unused import

* moving towards extracting _source from pipeline simulation

* fixing permission requirement, adding _index entry to doc
benwtrent added a commit that referenced this pull request Jun 19, 2019
#43388)

* [ML][Data Frame] adds new pipeline field to dest config (#43124)

* [ML][Data Frame] adds new pipeline field to dest config

* Adding pipeline support to _preview

* removing unused import

* moving towards extracting _source from pipeline simulation

* fixing permission requirement, adding _index entry to doc

* adjusting for java 8 compatibility

* adjusting bwc serialization version to 7.3.0
@jakelandis jakelandis removed the v8.0.0 label Jul 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Post-scripts for Continuous Dataframes
4 participants