Monitoring & Tracing capability for multiple Thread pools of Che Server #14727

Merged
merged 24 commits into from
Oct 30, 2019

skabashnyuk
Contributor

@skabashnyuk skabashnyuk commented Oct 1, 2019

What does this PR do?

Adds monitoring and tracing capability for multiple thread pools of Che Server.

ExecutorServiceWrapper

Several things made it tricky to add this monitoring earlier:

  • Monitoring and tracing can be enabled or disabled separately.
  • Quite a lot of classes construct their ExecutorService in the constructor, which is hard to override.
  • To measure the timing of submitted tasks, the executor has to be wrapped; just holding a reference to it is not enough.
  • The order of wrapping matters. For example, if we wrap with tracing first, we may lose the information that the ExecutorService was actually a ThreadPoolExecutor.
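
The last point can be illustrated with a minimal sketch. DelegatingExecutorService and WrapOrderSketch are hypothetical names used only for this illustration, not classes from this PR; the delegating class stands in for any wrapper (such as a tracing one) that forwards calls to the real pool.

```java
import java.util.List;
import java.util.concurrent.AbstractExecutorService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a tracing wrapper: forwards every call to the delegate. */
class DelegatingExecutorService extends AbstractExecutorService {
  private final ExecutorService delegate;

  DelegatingExecutorService(ExecutorService delegate) {
    this.delegate = delegate;
  }

  @Override
  public void execute(Runnable command) {
    delegate.execute(command);
  }

  @Override
  public void shutdown() {
    delegate.shutdown();
  }

  @Override
  public List<Runnable> shutdownNow() {
    return delegate.shutdownNow();
  }

  @Override
  public boolean isShutdown() {
    return delegate.isShutdown();
  }

  @Override
  public boolean isTerminated() {
    return delegate.isTerminated();
  }

  @Override
  public boolean awaitTermination(long timeout, TimeUnit unit) throws InterruptedException {
    return delegate.awaitTermination(timeout, unit);
  }
}

class WrapOrderSketch {
  /** Pool statistics can be read only while the executor is still visibly a ThreadPoolExecutor. */
  static boolean exposesPoolStatistics(ExecutorService executor) {
    return executor instanceof ThreadPoolExecutor;
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    ExecutorService wrapped = new DelegatingExecutorService(pool);
    System.out.println(exposesPoolStatistics(pool));    // true
    System.out.println(exposesPoolStatistics(wrapped)); // false
    pool.shutdown();
  }
}
```

Once the pool is hidden behind the delegating wrapper, a later monitoring step can no longer reach getPoolSize() or getQueue() without unwrapping, which is why the order of wrapping has to be controlled in one place.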

To handle this kind of situation, I've added the following interface:

public interface ExecutorServiceWrapper {

  ExecutorService wrap(ExecutorService executor, String name, String... tags);

  ScheduledExecutorService wrap(ScheduledExecutorService executor, String name, String... tags);

  CronExecutorService wrap(CronExecutorService executor, String name, String... tags);
}

An implementation of this interface has to be injected into every class that uses an ExecutorService, so that the class can wrap its executor instance.
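
As a rough sketch of that injection pattern: the interface below is a simplified, hypothetical version with only the plain ExecutorService overload (so the example needs nothing beyond the JDK), and NoopWrapper and WorkspaceJobRunner are invented names, not classes from this PR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Simplified, hypothetical version of the interface: only the plain ExecutorService overload. */
interface SimpleExecutorServiceWrapper {
  ExecutorService wrap(ExecutorService executor, String name, String... tags);
}

/** No-op implementation, e.g. for a setup where neither monitoring nor tracing is enabled. */
class NoopWrapper implements SimpleExecutorServiceWrapper {
  @Override
  public ExecutorService wrap(ExecutorService executor, String name, String... tags) {
    return executor;
  }
}

/** Hypothetical component that builds its executor in the constructor. */
class WorkspaceJobRunner {
  private final ExecutorService executor;

  // The wrapper is passed in (in a real setup, by the DI container), so the
  // component itself does not need to know whether monitoring or tracing is on.
  WorkspaceJobRunner(SimpleExecutorServiceWrapper wrapper) {
    this.executor =
        wrapper.wrap(Executors.newFixedThreadPool(2), "WorkspaceJobRunner", "purpose", "demo");
  }

  ExecutorService executor() {
    return executor;
  }

  void shutdown() {
    executor.shutdown();
  }
}
```

With this shape, switching monitoring or tracing on or off only changes which wrapper implementation is bound; the classes that own executors stay untouched.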

Minor enhancements

  • JsonRPC - reduced the number of classes used to configure a JsonRPC endpoint.
  • ExecutorServiceBuilder - helper class to build a new ExecutorService.
  • CountedRejectedExecutionHandler - class that helps to count task execution rejections.
  • CountedThreadFactory - class that helps to count thread creation, running, and termination.
  • Grafana dashboard with traces grouped by execution level. Traces that happen on the same level are grouped on the same panels.

Grafana

To make the values from the monitored executor services easier to consume, I've made some changes to the Grafana dashboards.

Executors

Screenshot at 13 33 03

  • Threads running - the number of threads that are not terminated, i.e. alive. May include threads that are in a waiting or blocked state.
  • Threads terminated - the number of threads that have finished execution.
  • Threads created - the number of threads created by the thread factory for the given executor service.
  • Created thread/minute - the rate of thread creation for the given executor service.
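
A minimal sketch of how such thread counts could be collected. It uses plain AtomicLong counters purely for illustration; CountingThreadFactory and ThreadCountDemo are hypothetical names, and the PR's CountedThreadFactory may be implemented differently.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative thread factory that counts created, running, and terminated threads. */
class CountingThreadFactory implements ThreadFactory {
  private final ThreadFactory delegate;
  private final AtomicLong created = new AtomicLong();
  private final AtomicLong running = new AtomicLong();
  private final AtomicLong terminated = new AtomicLong();

  CountingThreadFactory(ThreadFactory delegate) {
    this.delegate = delegate;
  }

  @Override
  public Thread newThread(Runnable body) {
    created.incrementAndGet();
    // Wrap the thread body so that start and end of the thread's life are observed.
    return delegate.newThread(
        () -> {
          running.incrementAndGet();
          try {
            body.run();
          } finally {
            running.decrementAndGet();
            terminated.incrementAndGet();
          }
        });
  }

  long created() {
    return created.get();
  }

  long running() {
    return running.get();
  }

  long terminated() {
    return terminated.get();
  }
}

class ThreadCountDemo {
  /** Runs one short-lived thread and returns {created, running, terminated}. */
  static long[] run() {
    CountingThreadFactory factory = new CountingThreadFactory(Executors.defaultThreadFactory());
    Thread thread = factory.newThread(() -> {});
    thread.start();
    try {
      thread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return new long[] {factory.created(), factory.running(), factory.terminated()};
  }
}
```

A per-minute creation rate like the one on the dashboard would then be derived from the created counter by the monitoring backend rather than in the factory itself.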

Screenshot at 16 24 42

  • Executor threads active - the number of threads that are actively executing tasks.
  • Executor pool size - the current number of threads in the pool.
  • Queued task - the approximate number of tasks that are queued for execution.
  • Queued occupancy - the percentage of the queue used by tasks that are waiting for execution.
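
These values map onto standard ThreadPoolExecutor accessors. The sketch below is only an assumption about how the occupancy could be computed (queued tasks divided by queued tasks plus remaining capacity); PoolGauges is a hypothetical helper, not a class from this PR.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;

/** Hypothetical helper that reads gauge values from a ThreadPoolExecutor. */
class PoolGauges {
  /** Approximate number of threads that are actively executing tasks. */
  static int activeThreads(ThreadPoolExecutor pool) {
    return pool.getActiveCount();
  }

  /** Current number of threads in the pool, busy or idle. */
  static int poolSize(ThreadPoolExecutor pool) {
    return pool.getPoolSize();
  }

  /** Number of tasks waiting in the work queue. */
  static int queuedTasks(ThreadPoolExecutor pool) {
    return pool.getQueue().size();
  }

  /** Share of the queue capacity that is in use, in percent. */
  static double queueOccupancyPercent(ThreadPoolExecutor pool) {
    BlockingQueue<Runnable> queue = pool.getQueue();
    int queued = queue.size();
    // long arithmetic: an unbounded queue reports a remaining capacity near Integer.MAX_VALUE
    long capacity = (long) queued + queue.remainingCapacity();
    return capacity == 0 ? 0.0 : 100.0 * queued / capacity;
  }
}
```

For an effectively unbounded queue the occupancy stays close to zero, so this gauge is mostly informative for pools with a bounded queue.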

Screenshot at 16 24 51

  • Rejected task - the number of tasks that were rejected from execution.
  • Rejected task/minute - the rate of task rejections.
  • Completed tasks - the number of completed tasks.
  • Completed tasks/minute - the rate of task execution.
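
A minimal sketch of counting rejections by decorating an existing RejectedExecutionHandler. CountingRejectionHandler and RejectionDemo are hypothetical names and the counter is a plain AtomicLong; the PR's CountedRejectedExecutionHandler may look different.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative handler that counts rejections and then defers to the original policy. */
class CountingRejectionHandler implements RejectedExecutionHandler {
  private final RejectedExecutionHandler delegate;
  private final AtomicLong rejected = new AtomicLong();

  CountingRejectionHandler(RejectedExecutionHandler delegate) {
    this.delegate = delegate;
  }

  @Override
  public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
    rejected.incrementAndGet();
    delegate.rejectedExecution(task, executor);
  }

  long rejectedCount() {
    return rejected.get();
  }
}

class RejectionDemo {
  /** One worker and one queue slot: the third task has nowhere to go and is rejected. */
  static long run() {
    CountingRejectionHandler handler =
        new CountingRejectionHandler(new ThreadPoolExecutor.DiscardPolicy());
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(1, 1, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1), handler);
    CountDownLatch release = new CountDownLatch(1);
    Runnable blocking =
        () -> {
          try {
            release.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        };
    pool.execute(blocking); // taken by the single worker thread
    pool.execute(blocking); // waits in the queue
    pool.execute(blocking); // rejected: worker busy and queue full
    release.countDown();
    pool.shutdown();
    return handler.rejectedCount();
  }
}
```

Because the handler delegates after counting, the pool keeps whatever rejection policy it had before (here DiscardPolicy, chosen only to keep the demo free of exceptions).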

Screenshot at 16 24 59

  • Task execution seconds max - 5 min moving maximum of task execution time.
  • Task execution seconds avg - 1 h moving average of task execution time.
  • Executor idle seconds max - 5 min moving maximum of the time the executor spends idle.
  • Executor idle seconds avg - 1 h moving average of the time the executor spends idle.
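
Measuring task execution time is the reason the executor has to be wrapped: each submitted task must be decorated before it runs. The sketch below shows only that decoration step with a hypothetical TaskTimer class; it keeps an overall maximum and average, whereas the moving windows shown on the dashboard (5 min, 1 h) are assumed to be computed by the monitoring stack.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAccumulator;

/** Hypothetical helper that records how long decorated tasks take to run. */
class TaskTimer {
  private final AtomicLong completed = new AtomicLong();
  private final AtomicLong totalNanos = new AtomicLong();
  private final LongAccumulator max = new LongAccumulator(Long::max, 0L);

  /** Returns a Runnable that runs the task and records its duration, even if the task throws. */
  Runnable timed(Runnable task) {
    return () -> {
      long start = System.nanoTime();
      try {
        task.run();
      } finally {
        long elapsed = System.nanoTime() - start;
        completed.incrementAndGet();
        totalNanos.addAndGet(elapsed);
        max.accumulate(elapsed);
      }
    };
  }

  long completedTasks() {
    return completed.get();
  }

  long maxNanos() {
    return max.get();
  }

  double averageNanos() {
    long count = completed.get();
    return count == 0 ? 0.0 : (double) totalNanos.get() / count;
  }
}
```

A wrapping executor would call timed(task) inside its execute and submit methods before handing the task to the real pool, for example `pool.execute(timer.timed(task))`.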

Screenshot at 13 33 37

  • Scheduled repetitively - the number of times scheduleAtFixedRate or scheduleWithFixedDelay is called.
  • Executor scheduled once - the number of times the schedule methods are called.
  • Executor scheduled cron - the number of times the schedule methods that take a CronExpression are called.

Traces

Screenshot at 13 33 45

  • Workspace start Max - maximum workspace start time
  • Workspace start Avg - 1h moving average of the workspace start time components
  • Workspace stop Max - maximum workspace stop time
  • Workspace stop Avg - 1h moving average of the workspace stop time components

Screenshot at 16 39 32

  • OpenShiftInternalRuntime#start Max - maximum time of the OpenShiftInternalRuntime#start operation
  • OpenShiftInternalRuntime#start Avg - 1h moving average time of the OpenShiftInternalRuntime#start operation
  • Plugin Brokering Execution Max - maximum time of the PluginBrokerManager#getTooling operation
  • Plugin Brokering Execution Avg - 1h moving average of the PluginBrokerManager#getTooling operation

Screenshot at 16 39 25

  • OpenShiftEnvironmentProvisioner#provision Max - maximum time of the OpenShiftEnvironmentProvisioner#provision operation
  • OpenShiftEnvironmentProvisioner#provision Avg - 1h moving average of the OpenShiftEnvironmentProvisioner#provision operation
  • Plugin Brokering Execution Max - maximum execution time of the PluginBrokerManager#getTooling components
  • Plugin Brokering Execution Avg - 1h moving average of the execution time of the PluginBrokerManager#getTooling components

Screenshot at 13 34 15

  • WaitMachinesStart Max - maximum time of WaitMachinesStart operations
  • WaitMachinesStart Avg - 1h moving average time of WaitMachinesStart operations
  • OpenShiftInternalRuntime#startMachines Max - maximum time of OpenShiftInternalRuntime#startMachines operations
  • OpenShiftInternalRuntime#startMachines Avg - 1h moving average of the time of OpenShiftInternalRuntime#startMachines operations

What issues does this PR fix or reference?

#14601

Release Notes

n/a

Docs PR

eclipse-che/che-docs#870

@che-bot che-bot added status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. kind/task Internal things, technical debt, and to-do tasks to be performed. labels Oct 1, 2019
@skabashnyuk skabashnyuk changed the title Monitoring & Tracing capability for multiple Thread pools of Che Server 🚧 Monitoring & Tracing capability for multiple Thread pools of Che Server Oct 2, 2019
@eclipse-che eclipse-che deleted a comment from che-bot Oct 9, 2019
@eclipse-che eclipse-che deleted a comment from che-bot Oct 13, 2019
@skabashnyuk skabashnyuk force-pushed the observability branch 2 times, most recently from a9d80b1 to 88b35f2 Compare October 18, 2019 08:16
@che-bot
Contributor

che-bot commented Oct 21, 2019

E2E tests of Eclipse Che Multiuser on OCP has been successful:

@eclipse-che eclipse-che deleted a comment from che-bot Oct 21, 2019
…y Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>

Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>
…vability Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>

Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>
@skabashnyuk
Contributor Author

> Wrapping/unwrapping looks fragile, but I couldn't think of a better solution :/ What if something fails (most probably at the reflection)? Does it kill the whole server, or does it just log a proper warning while the rest keeps working?

I prefer to crash the whole JVM during the Guice startup phase. Logs can be misconfigured, skipped, or missed, but a JVM crash is unlikely to be ignored.

> I'm also wondering whether we will be able to use Java Flight Recorder for this kind of monitoring of JVM internals (once we switch to Java 11+).

Java Flight Recorder requires a commercial license for use in production. To learn more about commercial features and how to enable them please visit http://www.oracle.com/technetwork/java/javaseproducts/.

Because of that, I think we are unlikely to be able to reuse it in Eclipse Che.

@skabashnyuk
Contributor Author

ci-build

@che-bot
Contributor

che-bot commented Oct 29, 2019

E2E Happy path tests of Eclipse Che Single User on K8S (minikube v1.1.1) has been successful:

@che-bot
Contributor

che-bot commented Oct 29, 2019

E2E Happy path tests of Eclipse Che Single User on K8S (minikube v1.1.1) has failed:

@che-bot
Contributor

che-bot commented Oct 29, 2019

E2E tests of Eclipse Che Multiuser on OCP has been successful:

@skabashnyuk
Contributor Author

ci-build

@sparkoo
Member

sparkoo commented Oct 29, 2019

> I prefer to crash the whole JVM during the Guice startup phase. Logs can be misconfigured, skipped, or missed, but a JVM crash is unlikely to be ignored.

Yes, failing fast and visibly is good. I'm just afraid of incompatibilities with different Java versions and with Java modules (if that ever becomes a concern for us :)

> Java Flight Recorder requires a commercial license for use in production. To learn more about commercial features and how to enable them please visit http://www.oracle.com/technetwork/java/javaseproducts/.
>
> Because of that, I think we are unlikely to be able to reuse it in Eclipse Che.

JFR was open-sourced in Java 11 (https://openjdk.java.net/jeps/328). We (Red Hat) even have people working on it.

Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>
@che-bot
Contributor

che-bot commented Oct 30, 2019

E2E Happy path tests of Eclipse Che Single User on K8S (minikube v1.1.1) has been successful:

@che-bot
Contributor

che-bot commented Oct 30, 2019

E2E tests of Eclipse Che Multiuser on OCP has been successful:

…lity Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>

Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>
Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>
@che-bot
Contributor

che-bot commented Oct 30, 2019

E2E Happy path tests of Eclipse Che Single User on K8S (minikube v1.1.1) has been successful:

…lity Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>

Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>
@che-bot
Contributor

che-bot commented Oct 30, 2019

E2E Happy path tests of Eclipse Che Single User on K8S (minikube v1.1.1) has been successful:

@che-bot
Contributor

che-bot commented Oct 30, 2019

E2E tests of Eclipse Che Multiuser on OCP has failed:

…servability Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>

Signed-off-by: Sergii Kabashniuk <skabashniuk@redhat.com>
@che-bot
Contributor

che-bot commented Oct 30, 2019

E2E Happy path tests of Eclipse Che Single User on K8S (minikube v1.1.1) has been successful:

@che-bot
Contributor

che-bot commented Oct 30, 2019

E2E Happy path tests of Eclipse Che Single User on K8S (minikube v1.1.1) has failed:

@skabashnyuk
Contributor Author

crw-ci-test

@che-bot
Contributor

che-bot commented Oct 30, 2019

E2E Happy path tests of Eclipse Che Single User on K8S (minikube v1.1.1) has been successful:

@che-bot
Contributor

che-bot commented Oct 30, 2019

E2E tests of Eclipse Che Multiuser on OCP has been successful:

@skabashnyuk skabashnyuk merged commit 3de4e7f into master Oct 30, 2019
@skabashnyuk skabashnyuk deleted the observability branch October 30, 2019 13:11
@che-bot che-bot removed the status/code-review This issue has a pull request posted for it and is awaiting code review completion by the community. label Oct 30, 2019
@che-bot che-bot added this to the 7.4.0 milestone Oct 30, 2019
Labels
kind/task Internal things, technical debt, and to-do tasks to be performed.

6 participants