Add execution context service #102039

Merged
merged 47 commits into from
Jul 7, 2021
@mshustov mshustov commented Jun 14, 2021

Summary

Part of #102626
This PR adds an initial implementation of the ExecutionContext service, which propagates runtime meta-information along the chain Kibana client app --> Kibana server --> Elasticsearch server.

Design

Client-side

Kibana plugins create a context and pass it through their application logic so that it can be injected into an http service call. Kibana Core serializes the context object and injects it as a custom header.

const context = coreStart.executionContext.create({
  type: 'tsvb',
  name: 'Time Series Visual Builder',
  id: '5b2de169-2785-441b-ae8c-186a1936b17d',
  description: '[Network throughput] vis',
});

// Pass the context as a request option; Core injects it as a header.
const result = await coreStart.http.get('/api/timeseries', {
  context,
});

// or, when the context has to travel inside the request body:
const result = await coreStart.http.post('/my/endpoint', {
  body: { ...., context: context.toJSON() },
});
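As a rough illustration of the "serialize the context and inject it as a custom header" step, here is a minimal sketch. The header name `x-example-execution-context`, the `ExecutionContextLike` interface, and the helper functions are hypothetical names chosen for this example and are not the Kibana Core API. The percent-encoding is an assumption about how non-ASCII characters in `description` could be kept header-safe.

```typescript
// Hypothetical shape of the context object used in the example above.
interface ExecutionContextLike {
  type: string;
  name: string;
  id: string;
  description: string;
}

// Assumed header name, for illustration only.
const CONTEXT_HEADER = 'x-example-execution-context';

// Serialize to JSON and percent-encode so that non-ASCII characters in
// `description` stay valid inside an HTTP header value.
function serializeContext(context: ExecutionContextLike): string {
  return encodeURIComponent(JSON.stringify(context));
}

// Return a copy of the headers with the serialized context attached.
function withContextHeader(
  headers: Record<string, string>,
  context: ExecutionContextLike
): Record<string, string> {
  return { ...headers, [CONTEXT_HEADER]: serializeContext(context) };
}

// Inverse operation, as a server could apply it to the incoming header.
function parseContextHeader(value: string): ExecutionContextLike {
  return JSON.parse(decodeURIComponent(value)) as ExecutionContextLike;
}
```

With this approach the plugin only hands the context to the HTTP layer; the encoding details stay in one place.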

Server-side

There are two cases:

  • An incoming request contains the header with a context object. In this case, the context object is parsed and stored in AsyncLocalStorage. Whenever a plugin or Kibana Core calls the Elasticsearch server, some meta-information from the context (type + id) is attached to the x-opaque-id header. If a search operation takes longer than expected, the parameters of the incoming request (including x-opaque-id) are logged to the search slowlogs file.
  • A plugin initiates an operation in the Kibana server (for example, a task manager schedules a task). The plugin calls executionContext.set(context) to attach the context object to the current async "thread". Unlike on the client, the plugin doesn't need to pass the context object through all the layers of the application: Node.js already provides an API (AsyncLocalStorage) to keep context across async operations.
schedule(async () => {
  executionContext.set(...);
  await doSomething();
});
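To make the "no need to pass the context through all the layers" point concrete, here is a minimal sketch built directly on Node's `AsyncLocalStorage`. The names `setContext`, `getContext`, `schedule`, and `ExecutionContextLike` are hypothetical, and using `enterWith` for the `set` operation is an assumption for illustration, not necessarily how the service is implemented.

```typescript
import { AsyncLocalStorage } from 'async_hooks';

// Hypothetical context shape; the real service's types may differ.
interface ExecutionContextLike {
  type: string;
  id: string;
}

const storage = new AsyncLocalStorage<ExecutionContextLike>();

// Attach a context to the current async execution and everything it spawns.
function setContext(context: ExecutionContextLike): void {
  storage.enterWith(context);
}

// Read the context from anywhere further down the async call chain.
function getContext(): ExecutionContextLike | undefined {
  return storage.getStore();
}

// A nested helper that never receives the context as an argument.
async function doSomething(): Promise<string | undefined> {
  await new Promise((resolve) => setTimeout(resolve, 1));
  return getContext()?.id;
}

// Stand-in for a scheduler: each task starts from its own async resource.
function schedule<T>(task: () => Promise<T>): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    setImmediate(() => task().then(resolve, reject));
  });
}
```

Two tasks scheduled concurrently each see only the context they set themselves, even though `doSomething` takes no parameters.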

Elasticsearch

Elasticsearch receives the x-opaque-id header, which starts with requestId for backward compatibility (BWC) with the logic introduced in #71019. It has one of the following formats:

  • x-opaque-id: 1234-5678-9000 contains only requestId if no execution context has been attached.
  • x-opaque-id: 1234-5678-9000;kibana:tsvb:5b2de169-2785-441b-ae8c-186a1936b17d contains requestId + kibana:executionContext.type:executionContext.id if the context has been attached.
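The two formats above can be sketched as a pair of small pure functions. `buildOpaqueId`, `extractRequestId`, and `ExecutionContextLike` are hypothetical names for this example. Only the string layout (`requestId`, then `;kibana:type:id`) comes from the description.

```typescript
// Hypothetical minimal context: only the fields that end up in the header.
interface ExecutionContextLike {
  type: string;
  id: string;
}

// Build the x-opaque-id value: requestId alone, or requestId followed by
// ";kibana:<type>:<id>" when an execution context is attached.
function buildOpaqueId(requestId: string, context?: ExecutionContextLike): string {
  if (!context) return requestId;
  return `${requestId};kibana:${context.type}:${context.id}`;
}

// Because the value always starts with requestId, consumers that only know
// the older format can still recover it by cutting at the first ";".
function extractRequestId(opaqueId: string): string {
  const separator = opaqueId.indexOf(';');
  return separator === -1 ? opaqueId : opaqueId.slice(0, separator);
}
```

Keeping requestId as the prefix is what gives the backward compatibility mentioned above.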

Next steps

  • For the first implementation, I started with a context that captures a single level.
    In the next iteration, I'm going to add support for nested execution contexts. They can be used to compose execution context relationships across different apps:
const context = {
  type: 'dashboard',
  name: 'Dashboard',
  id: '1234',
  description: '[eCommerce] Revenue Dashboard',
  child: {
    type: 'visualization',
    name: 'gauge',
    id: '5678',
    description: '[eCommerce] Sold Products per Day',
  }
}
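One way to model the proposed nesting is a recursive type plus a helper that walks from parent to child. This is only a sketch of the idea. `NestedContext` and `describeChain` are hypothetical names, and the PR does not define how nested contexts would be serialized.

```typescript
// Hypothetical recursive type matching the shape of the example above.
interface NestedContext {
  type: string;
  name: string;
  id: string;
  description: string;
  child?: NestedContext;
}

// Walk parent -> child and collect "type:id" for each level.
function describeChain(context: NestedContext): string[] {
  const chain: string[] = [];
  let current: NestedContext | undefined = context;
  while (current) {
    chain.push(`${current.type}:${current.id}`);
    current = current.child;
  }
  return chain;
}
```

For the dashboard example, joining the result with `' > '` would give a readable trail from the dashboard down to the visualization.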

Performance impact

Usage of AsyncLocalStorage and AsyncHooks is not free. Keeping track of async context does add some overhead.
I ran DemoJourney of https://github.com/elastic/kibana-load-testing with 100 concurrent users and saw the total 95th percentile of response time increase by a few percent. However, response time in a few scenarios increased by 5-30%.

See the detailed reports:

Before: [screenshot: before], before.tar.gz

After: [screenshot: after], after.tar.gz

Right now I plan to keep the logic enabled by default for all users. Before the v7.15 release we should measure the performance overhead of the final solution in #102706. Based on the final result, we might make the service opt-in.
Also, there is a Node.js PR for v14 that should improve async_hooks performance by 3-4 times.

@mshustov mshustov added chore v8.0.0 release_note:skip Skip the PR/issue when compiling release notes v7.14.0 labels Jun 14, 2021
@mshustov mshustov added v7.15.0 and removed v7.14.0 labels Jun 21, 2021
@botelastic botelastic bot added the Team:APM All issues that need APM UI Team support label Jul 5, 2021
@elasticmachine
Pinging @elastic/apm-ui (Team:apm)

@pgayvallet pgayvallet left a comment

Overall this is looking good.

I have to admit, async state holding is still a little magical to me, especially with hapi's event-based mechanism (I'm not even really sure I understand how node is able to retain the async trace in that situation, tbh).

A few concerns / questions:

  • How is AsyncLocalStorage working regarding garbage collection? My fear is that not being able to properly clear the storage may result in memory leaks, is that an actual concern? The PR is cleaning up the storage state during the server's response event but

    • are we sure this covers all responses EOL scenarios?
    • what about contexts created outside of the scope of a request handler. I'm thinking about task manager for example. Will the owners of such server-side services have to manually clear the context at the end of an operation?
  • Even if I do think we want to enable that by default, the perf impact makes me wonder if we shouldn't still add an option to disable the feature? OTOH that would force us to re-implement the possibility to read the x-opaque-id from the ES client, which was removed in this PR, so this would complicate the code a bit. Just want to be sure we're all (the team, Product and so on) understanding the perf implication of this feature.


// the trimmed value in the server logs is better than nothing.
function enforceMaxLength(header: string): string {
  return header.slice(0, MAX_BAGGAGE_LENGTH);

If the header value is a serialized json object, wouldn't truncation cause an invalid object in the end? I see we're try/catching on the server-side when parsing the header, but I wonder if this is good enough?
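The concern can be reproduced in a few lines. In this sketch `MAX_BAGGAGE_LENGTH` is set to a deliberately tiny, made-up value so the effect is visible, and `safeParse` is a hypothetical stand-in for the server-side try/catch mentioned above.

```typescript
// Made-up limit for the demo; the actual value is not shown in this thread.
const MAX_BAGGAGE_LENGTH = 16;

// Same idea as the snippet under review: cut the header at a fixed length.
function enforceMaxLength(header: string): string {
  return header.slice(0, MAX_BAGGAGE_LENGTH);
}

// Hypothetical server-side parser: returns undefined instead of throwing
// when the (possibly truncated) header is not valid JSON.
function safeParse(header: string): unknown {
  try {
    return JSON.parse(header);
  } catch {
    return undefined;
  }
}
```

A serialized object longer than the limit is cut mid-token, so `safeParse` yields `undefined` and the context is dropped for that request, while a value within the limit parses normally.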

    requestUuid: uuid.v4(),
  } as KibanaRequestState;
  return responseToolkit.continue;
});
}

private setupContextExecutionCleanup(executionContext?: InternalExecutionContextSetup) {
  if (!executionContext) return;
  this.server!.events.on('response', function () {

Is response covering all the request EOL scenarios? e.g is this handler called in case of internal handler error?

mshustov and others added 3 commits July 6, 2021 09:39
Co-authored-by: Josh Dover <1813008+joshdover@users.noreply.github.com>
mshustov commented Jul 6, 2021

How is AsyncLocalStorage working regarding garbage collection? My fear is that not being able to properly clear the storage may result in memory leaks, is that an actual concern? The PR is cleaning up the storage state during the server's response event but

@pgayvallet AsyncLocalStorage is built on top of async_hooks, which tracks all the async operations generated in the current context. Therefore, it knows when the async context is finished and the store can be collected by GC.
From the official nodejs docs:

While you can create your own implementation on top of the async_hooks module, AsyncLocalStorage should be preferred as it is a performant and memory safe implementation that involves significant optimizations that are non-obvious to implement.

From the nodejs test, we can see that the store object is removed when async stack trace is finished https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/test/async-hooks/test-async-local-storage-enter-with.js

It makes executionContext.reset redundant, so I decided to remove it from the HTTP server altogether to avoid confusion.

To make sure we don't introduce a memory leak, I added a long string to the execution context:

executionContext?.set({
  ...parentContext,
  requested,
  randomString: Math.random().toString().repeat(100_000), // ~1.8 MB per request!
});

and ran load-testing for 2*6 minutes

props.maxUsers = 100
//...
constantConcurrentUsers(20) during (6 * 60), // 1
rampConcurrentUsers(20) to props.maxUsers during (6 * 60) // 2

https://github.com/elastic/kibana-load-testing/blob/22114b6d9e6c8cdcbc61f817e6b7ddee6a96ca49/src/test/scala/org/kibanaLoadTest/simulation/branch/DemoJourney.scala#L31-L32

Memory consumption on the Monitoring page:
[screenshot: 2021-07-06_15-06-00]

But anyway, we should add a flag to disable the executionContext service if something goes wrong.

Even if I do think we want to enable that by default, the perf impact makes me wonder if we shouldn't still add an option to disable the feature? OTOH that would force us to re-implement the possibility to read the x-opaque-id from the ES client, which was removed in this PR, so this would complicate the code a bit.

fair point. I can put the logic for legacy x-opaque-id header propagation back to provide BWC.

Just want to be sure we're all (the team, Product and so on) understanding the perf implication of this feature.

yeah, as mentioned in the PR description, with nodejs/node#38577 landing in nodejs v14, performance gets about 3-4x better. With this change, we can consider enabling the executionContext service by default.

I am curious if your client-side example in the PR description intentionally included the context in the body.

@joshdover yes, data plugin.
https://github.com/mshustov/kibana/pull/8/files#diff-46fda133c91a951944bba002227ef3b3580bae089ccbafc27036966693f839a2R33-R37

From mshustov#8:
The data plugin's data fetching model is a bit different from the one most plugins use. It doesn't send an HTTP request for every search operation, but batches them according to some internal rules and sends them in bulk. In turn, the Kibana server parses the batch and issues a dedicated search request for every search operation. The Kibana server streams the Elasticsearch server response back to the browser as soon as each search operation is finished.
[diagram: data_plugin_flow]
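The batching model explains why the context travels in the request body there: one HTTP request carries several searches, each possibly with its own context, so a single header cannot describe them all. A minimal sketch of the server side follows, assuming each batch item carries its context. `BatchItem`, `search`, and `handleBatch` are hypothetical names, and `search` only simulates a call to Elasticsearch.

```typescript
import { AsyncLocalStorage } from 'async_hooks';

// Hypothetical shapes for this example.
interface ExecutionContextLike {
  type: string;
  id: string;
}
interface BatchItem {
  query: string;
  context: ExecutionContextLike;
}

const storage = new AsyncLocalStorage<ExecutionContextLike>();

// Stand-in for an Elasticsearch call: reports which context it ran under.
async function search(query: string): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 1));
  const context = storage.getStore();
  const label = context ? `${context.type}:${context.id}` : 'no context';
  return `${query} <- ${label}`;
}

// Run every item of the batch inside its own context, concurrently.
function handleBatch(batch: BatchItem[]): Promise<string[]> {
  return Promise.all(
    batch.map((item) => storage.run(item.context, () => search(item.query)))
  );
}
```

Each search in the batch is attributed to its own context even though all of them arrive in one request and run interleaved.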

@smith smith left a comment

APM changes look good.

  ...this.config.requestHeadersWhitelist,
]);
scopedHeaders = filterHeaders(
  { ...requestHeaders, ...requestIdHeaders, ...authHeaders },

We still pass the 'x-opaque-id' header if the executionContext service is disabled.


export const config: ServiceConfigDescriptor<ExecutionContextConfigType> = {
  path: 'execution_context',
  schema: configSchema,

Note: I didn't pass the config value on the client. I don't see a lot of benefit in making ExecutionContextContainer methods no-ops as they don't add a lot of overhead. Any objections?

I think it's fine to have the client always be 'enabled' regardless of the config value.

import { ServiceConfigDescriptor } from '../internal_types';

const configSchema = schema.object({
  enabled: schema.boolean({ defaultValue: true }),

We can disable it by default based on the outcome of #102706
In the long term, the service should be enabled by default.

@pgayvallet pgayvallet left a comment

Don't see anything else, LGTM.


// the trimmed value in the server logs is better than nothing.
function enforceMaxLength(header: string): string {
  return header.slice(0, MAX_BAGGAGE_LENGTH);

Feels quite complex, so I'd say it's fine keeping it as you did for now. Let's use this initial implementation and see with our usages if the limit is effectively reached for any real usage.


@kibanamachine
💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules lead to a faster build time

id    before  after  diff
core  365     368    +3

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id    before  after  diff
core  1056    1071   +15

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id    before  after  diff
core  30      31     +1

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id    before   after    diff
core  420.1KB  421.5KB  +1.4KB
Unknown metric groups

API count

id    before  after  diff
core  2296    2327   +31


@mshustov mshustov merged commit e01e682 into elastic:master Jul 7, 2021
mshustov added a commit to mshustov/kibana that referenced this pull request Jul 7, 2021
* add execution context service on the server-side

* integrate execution context service into http service

* add integration tests for execution context + http server

* update core code

* update integration tests

* update settings docs

* add execution context test plugin

* add a client-side test

* remove requestId from execution context

* add execution context service for the client side

* expose execution context service to plugins

* add execution context service for the server-side

* update http service

* update elasticsearch service

* move integration tests from http to execution_context service

* integrate in es client

* expose to plugins

* refactor functional tests

* remove x-opaque-id from create_cluster tests

* update test plugin package.json

* fix type errors in the test mocks

* fix elasticsearch service tests

* add escaping to support non-ascii symbols in description field

* improve test coverage

* update docs

* remove unnecessary import

* update docs

* Apply suggestions from code review

Co-authored-by: Josh Dover <1813008+joshdover@users.noreply.github.com>

* address comments

* remove execution context cleanup

* add option to disable execution_context service on the server side

* put x-opaque-id test back

* put tests back

* add header size limitation to the server side as well

* fix integration tests

* address comments

Co-authored-by: Josh Dover <1813008+joshdover@users.noreply.github.com>
mshustov added a commit that referenced this pull request Jul 7, 2021
Labels
chore release_note:skip Skip the PR/issue when compiling release notes Team:APM All issues that need APM UI Team support v7.15.0 v8.0.0
6 participants