
Report more metrics to monitor K8s task runner #14771

Merged: 10 commits merged into apache:master on Aug 16, 2023

Conversation

@YongGang (Contributor) commented Aug 7, 2023

Description

Report one new metric to monitor the K8s task runner:

  • k8s/peon/startup/time: reports how long the peon pod takes to start up

Also implement two existing metrics: taskSlot/used/count and taskSlot/idle/count.

Together with the earlier #14698 change, we can answer the following questions about the K8s task runner's status:

  • How many tasks/peons are running in K8s right now? Use taskSlot/used/count.
  • How long does K8s take to start a peon job? Use k8s/peon/startup/time.
  • How long does the pod actually run before finishing? Use task/run/time.
  • How long is a task pending before it starts running in K8s? Use task/pending/time.

Release note

Report the k8s/peon/startup/time metric for K8s-based ingestion.


Key changed/added classes in this PR
  • KubernetesPeonClient: report the k8s/peon/startup/time metric when the pod starts up (see the sketch below).
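
As a rough illustration of the emission pattern (a minimal sketch, not the actual patch: it assumes Druid's ServiceMetricEvent/ServiceEmitter API, and the class name, helper shape, and "taskId" dimension key are placeholders):

import org.apache.druid.java.util.emitter.service.ServiceEmitter;
import org.apache.druid.java.util.emitter.service.ServiceMetricEvent;

// Minimal sketch of emitting the peon startup time with the task id as a dimension.
public class PeonStartupMetricsSketch
{
  private final ServiceEmitter emitter;

  public PeonStartupMetricsSketch(ServiceEmitter emitter)
  {
    this.emitter = emitter;
  }

  public void emitPeonStartupTime(String taskId, long startupTimeMillis)
  {
    ServiceMetricEvent.Builder builder = ServiceMetricEvent.builder();
    builder.setDimension("taskId", taskId);
    emitter.emit(builder.build("k8s/peon/startup/time", startupTimeMillis));
  }
}

The reported value is roughly the wall-clock time from submitting the peon job until the pod reports a pod IP, as measured in launchPeonJobAndWaitForStart (see the diff further down).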

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@@ -370,13 +370,13 @@ public Optional<ScalingStats> getScalingStats()
   @Override
   public Map<String, Long> getIdleTaskSlotCount()
   {
-    return Collections.emptyMap();
+    return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(config.getCapacity() - tasks.size()));
Contributor Author

This value could be negative; e.g., -6 means 6 tasks are queued in the thread pool but haven't been scheduled to run in K8s yet.

Contributor

The other implementations of getIdleTaskSlotCount do not use negative numbers to indicate being over capacity. I think we should follow the same pattern here.

Suggested change
-return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(config.getCapacity() - tasks.size()));
+return ImmutableMap.of(WORKER_CATEGORY, Math.max(0L, Long.valueOf(config.getCapacity() - tasks.size())));

Contributor Author

In the K8s task runner the task lifecycle differs from the other runner types: a task only enters pending status once the K8s runner has already submitted the job, whereas in other runners a task is set to pending immediately after the runner picks it up.
So in other runners we can check the number of pending tasks to see whether the cluster is under-provisioned, but in the K8s task runner we have no visibility into that, which is why I reuse the taskSlot/idle/count metric to indicate it.

Or we could make the task lifecycle in the K8s task runner align more closely with the other runners, e.g. a task queued in the thread pool is in pending status, and once K8s submits the job it moves to running status.

Contributor

For the K8sTaskRunner, the metric should track the number of pods that are in the waitUntilCondition in launchPeonJobAndWaitForStart. That is what I believe is the equivalent of a pending task with the K8sTaskRunner.

.waitUntilCondition(pod -> {
  if (pod == null) {
    return false;
  }
  return pod.getStatus() != null && pod.getStatus().getPodIP() != null;
}, howLong, timeUnit);

We could also add a dimension to report if the pod timed out waiting to start.
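
If we did want to surface that, a hypothetical way to count pods currently stuck in that wait (not something this patch does; the Supplier below stands in for the waitUntilCondition call above) might look like:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: track how many peon pods are currently waiting for a pod IP,
// which is roughly the K8s-runner equivalent of a "pending" task.
public class PeonStartupWaitTracker
{
  private final AtomicInteger podsWaitingToStart = new AtomicInteger(0);

  public <T> T trackWait(Supplier<T> waitForPodStart)
  {
    podsWaitingToStart.incrementAndGet();
    try {
      return waitForPodStart.get();  // e.g. the waitUntilCondition(...) call shown above
    }
    finally {
      podsWaitingToStart.decrementAndGet();
    }
  }

  public int currentlyWaiting()
  {
    // Value a monitor could report periodically as a pending/startup gauge.
    return podsWaitingToStart.get();
  }
}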

@suneet-s (Contributor) left a comment

Request changes because:

  • The worker categories are inconsistent across the metrics reported from the k8sTaskRunner.
  • I do not think idle tasks should report a negative number, since none of the other task runners does anything like this. If we want to show the number of tasks that should be running but are not, I think we need a new metric: a task can be added to the tasks map in Druid but not yet started for many reasons, and it would be good to have visibility into that.
  • Please add docs for the new metrics in docs/development/extensions-contrib/k8s-jobs.md

@@ -370,13 +370,13 @@ public Optional<ScalingStats> getScalingStats()
   @Override
   public Map<String, Long> getIdleTaskSlotCount()
   {
-    return Collections.emptyMap();
+    return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(config.getCapacity() - tasks.size()));
   }

   @Override
   public Map<String, Long> getUsedTaskSlotCount()
   {
-    return Collections.emptyMap();
+    return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(Math.min(config.getCapacity(), tasks.size())));
Contributor

There can be a delay between tasks being added to the tasks map and when they are actually running. Can you explain how an operator should think about using this metric?

Contributor Author

As long as we're not over capacity, I think the delay is minimal (within a second or so; we consider the task slot used once the peon job is submitted/started). So an operator can use this metric to decide whether more resources are needed, e.g. if it is consistently equal to config.getCapacity().

@@ -69,12 +85,14 @@ public Pod launchPeonJobAndWaitForStart(Job job, long howLong, TimeUnit timeUnit
     }, howLong, timeUnit);
     long duration = System.currentTimeMillis() - start;
     log.info("Took task %s %d ms for pod to startup", jobName, duration);
+    emitK8sPodMetrics(job, "peon/startup/time", duration);
Contributor

I think it would be good to preface the metric with k8s so that it is clear that the metric only applies to peons started by the kubernetes task runner.

I also like adding the unit to the end of the metric name to make it easier for operators to understand what unit the metric is reported in.

Suggested change
-emitK8sPodMetrics(job, "peon/startup/time", duration);
+emitK8sPodMetrics(job, "k8s/peon/startup/timeMillis", duration);

Contributor Author

I like the idea of prefixing with k8s and will do that. But I see that other metrics don't end with a unit (e.g. task/run/time), so I wonder whether we should align with the existing ones.

@@ -87,6 +105,8 @@ public JobResponse waitForPeonJobCompletion(K8sTaskId taskId, long howLong, Time
         howLong,
         unit
     );
+    long duration = System.currentTimeMillis() - start;
+    emitK8sPodMetrics(job, "peon/running/time", duration);
Contributor

How is this different from the task/run/time metric?

Contributor

I think you would probably want to include the startup time as well; maybe it makes sense to emit this in KubernetesPeonLifecycle instead?

Contributor Author

Removed; the task/run/time metric will include both job startup and running time.

  {
    this.clientApi = clientApi;
    this.namespace = namespace;
    this.debugJobs = debugJobs;
    this.adapter = adapter;
    this.emitter = emitter;
  }

  public Pod launchPeonJobAndWaitForStart(Job job, long howLong, TimeUnit timeUnit)
Contributor

The job object that is being passed in here is calculated by using the adapter to convert a task to the Job in KubernetesTaskRunner#doTask. And then later we are using the adapter to convert it back to a task.

Would it be cleaner to just pass in the Task object here instead?

Contributor

I am also hoping to deprecate toTask(Job) in this PR: https://github.com/apache/druid/pull/14802/files, to allow us to pull tasks instead of storing them as an env variable.

@YongGang
Contributor Author

> Request changes because:
>
> • The worker categories are inconsistent across the metrics reported from the k8sTaskRunner.
> • I do not think idle tasks should report a negative number, since none of the other task runners does anything like this. If we want to show the number of tasks that should be running but are not, I think we need a new metric: a task can be added to the tasks map in Druid but not yet started for many reasons, and it would be good to have visibility into that.
> • Please add docs for the new metrics in docs/development/extensions-contrib/k8s-jobs.md

Thanks for the review; I think I've addressed all of them.

@@ -117,7 +117,7 @@ protected KubernetesPeonLifecycle(
    * @return
    * @throws IllegalStateException
    */
-  protected synchronized TaskStatus run(Job job, long launchTimeout, long timeout) throws IllegalStateException
+  protected synchronized TaskStatus run(Job job, Task task, long launchTimeout, long timeout) throws IllegalStateException
@suneet-s (Contributor) commented Aug 15, 2023

This is a bit of a strange function definition, as the PeonLifecycle is already scoped to a task. Passing in both a task and a job is confusing, because they should refer to the same task. How about a function definition like this, and have the constructor accept an adapter so that the adapter can be used in this method? (A rough sketch of that shape follows the suggestion below.)

Suggested change
-protected synchronized TaskStatus run(Job job, Task task, long launchTimeout, long timeout) throws IllegalStateException
+protected synchronized TaskStatus run(long launchTimeout, long timeout) throws IllegalStateException
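
For illustration, the suggested shape might look roughly like this (a sketch with placeholder types standing in for Druid's Task/TaskStatus, the K8s Job, and the task adapter; the real interfaces differ in their details):

// Placeholder types for the sketch only; the real Druid/fabric8 types are richer.
interface Task { String getId(); }
interface Job { }
interface TaskAdapter { Job fromTask(Task task); }
class TaskStatus
{
  static TaskStatus success(String taskId) { return new TaskStatus(); }
}

class KubernetesPeonLifecycleSketch
{
  private final Task task;
  private final TaskAdapter adapter;

  KubernetesPeonLifecycleSketch(Task task, TaskAdapter adapter)
  {
    this.task = task;        // the lifecycle is already scoped to exactly one task
    this.adapter = adapter;  // adapter injected via the constructor, as suggested
  }

  protected synchronized TaskStatus run(long launchTimeout, long timeout)
  {
    Job job = adapter.fromTask(task);  // derive the Job here instead of passing it in
    // ... launch the job, wait for the pod to start, and join it, as in the real class ...
    return TaskStatus.success(task.getId());
  }
}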

Contributor Author

Updated, though not exactly as you suggested, per my previous comment. I basically reverted my change to the method definition.

  }

-  public Pod launchPeonJobAndWaitForStart(Job job, long howLong, TimeUnit timeUnit)
+  public Pod launchPeonJobAndWaitForStart(Job job, Task task, long howLong, TimeUnit timeUnit)
Contributor

Suggested change
-public Pod launchPeonJobAndWaitForStart(Job job, Task task, long howLong, TimeUnit timeUnit)
+public Pod launchPeonJobAndWaitForStart(Task task, long howLong, TimeUnit timeUnit)

Contributor Author

Agreed, this is better, but I found that a change like this makes the integration test DruidPeonClientIntegrationTest hard to write: once job creation is pushed into the underlying class, there is no easy way to construct a Job to suit our testing purposes. So I decided to keep the interface as it is.

@suneet-s (Contributor) left a comment

Looks nice! I have one suggestion on changing the interface to KubernetesPeonLifecycle and KubernetesPeonClient to try and make it a little easier to follow.

Can you update the PR title and description to reflect the latest state of the patch?

@YongGang changed the title from "Report pod running metrics to monitor K8s task runner" to "Report more metrics to monitor K8s task runner" on Aug 15, 2023
@YongGang
Contributor Author

> Looks nice! I have one suggestion on changing the interface to KubernetesPeonLifecycle and KubernetesPeonClient to try and make it a little easier to follow.
>
> Can you update the PR title and description to reflect the latest state of the patch?

Done; I've also replied above explaining why some of the interfaces were not updated as you suggested.

@suneet-s merged commit 3954685 into apache:master on Aug 16, 2023
74 checks passed
@YongGang deleted the more-metrics branch on August 27, 2023 20:42
@LakshSingla added this to the 28.0 milestone on Oct 12, 2023