Report more metrics to monitor K8s task runner #14771
Conversation
@@ -370,13 +370,13 @@ public Optional<ScalingStats> getScalingStats()
  @Override
  public Map<String, Long> getIdleTaskSlotCount()
  {
-   return Collections.emptyMap();
+   return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(config.getCapacity() - tasks.size()));
This value could be negative; e.g. -6 means 6 tasks are queued in the thread pool that haven't been scheduled to run in K8s yet.
The other implementations of getIdleTaskSlotCount do not use negative numbers to indicate being over capacity. I think we should follow the same pattern here.
-return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(config.getCapacity() - tasks.size()));
+return ImmutableMap.of(WORKER_CATEGORY, Math.max(0L, Long.valueOf(config.getCapacity() - tasks.size())));
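The suggested change clamps the idle count at zero. A minimal sketch of the clamped computation (class and method names here are illustrative, not Druid's actual API):

```java
public class TaskSlotMath {
  // Idle slots: capacity minus tracked tasks, clamped at zero so the
  // metric never reports a negative value when tasks queue beyond capacity.
  public static long idleTaskSlots(int capacity, int trackedTasks) {
    return Math.max(0L, (long) capacity - trackedTasks);
  }
}
```

With a capacity of 10 and 16 tracked tasks this reports 0 rather than -6, matching the behavior of the other task runner implementations.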
In K8sTaskRunner the task lifecycle is different from other types of runners. For example, a task is only in pending status once the K8s runner has already submitted the job, while in other runners a task is set to pending immediately after it is handled by the runner. So in other runners we can check the number of pending tasks to see if the cluster is under-provisioned, but in K8sTaskRunner we have no visibility into this, which is why I reused the taskSlot/idle/count metric to indicate it.
Alternatively, we could make the task lifecycle in K8sTaskRunner align more closely with the other runners: e.g. a task is in pending status while queued in the thread pool, and moves to running status once K8s submits the job.
For the K8sTaskRunner, the metric should track the number of pods that are in the waitUntilCondition in launchPeonJobAndWaitForStart. That is what I believe is the equivalent of a pending task with the K8sTaskRunner.
Lines 64 to 69 in 8fa7859
.waitUntilCondition(pod -> {
  if (pod == null) {
    return false;
  }
  return pod.getStatus() != null && pod.getStatus().getPodIP() != null;
}, howLong, timeUnit);
We could also add a dimension to report if the pod timed out waiting to start.
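The timed-out dimension could hang off the result of the wait itself. A rough, self-contained sketch of that idea (this is a plain polling loop for illustration; fabric8's real waitUntilCondition uses informers, and the names here are hypothetical):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class StartupWait {
  // Polls a readiness check until it holds or the deadline passes.
  // Returns true if the pod became ready, false on timeout; the caller
  // could emit the boolean as a metric dimension (e.g. "timedOut").
  public static boolean waitUntil(BooleanSupplier ready, long howLong, TimeUnit unit) {
    long deadline = System.nanoTime() + unit.toNanos(howLong);
    while (System.nanoTime() < deadline) {
      if (ready.getAsBoolean()) {
        return true;
      }
      try {
        Thread.sleep(5); // crude poll interval, for the sketch only
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return ready.getAsBoolean(); // one last check at the deadline
  }
}
```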
Request changes because:
- The worker categories are inconsistent across the metrics reported from the k8sTaskRunner
- I do not think idle tasks should report a negative number, as none of the other task runners do something like this. If we want to show the number of tasks that are not running but should be, I think we need a new metric, because a task can be added to the tasks map in Druid but not yet started for many reasons, and it would be good to have visibility into that.
- Please add docs for the new metrics in docs/development/extensions-contrib/k8s-jobs.md
...es-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java
@@ -370,13 +370,13 @@ public Optional<ScalingStats> getScalingStats()
  @Override
  public Map<String, Long> getIdleTaskSlotCount()
  {
-   return Collections.emptyMap();
+   return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(config.getCapacity() - tasks.size()));
  }

  @Override
  public Map<String, Long> getUsedTaskSlotCount()
  {
-   return Collections.emptyMap();
+   return ImmutableMap.of(WORKER_CATEGORY, Long.valueOf(Math.min(config.getCapacity(), tasks.size())));
There can be a delay between tasks being added to the tasks map and when they are actually running. Can you explain how an operator should think about using this metric?
As long as it's not over capacity, I think the delay is minimal (within a second or so; we consider the peon job being submitted/started as the task slot being used). So an operator can use this metric to decide whether to add more resources if it is consistently equal to config.getCapacity().
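The used-slot computation and the saturation heuristic described above can be sketched as follows (names are illustrative, not Druid's actual API):

```java
public class UsedSlots {
  // Used slots are capped at capacity: tasks beyond capacity sit queued
  // in the runner's thread pool and do not occupy a slot yet.
  public static long usedTaskSlots(int capacity, int trackedTasks) {
    return Math.min((long) capacity, (long) trackedTasks);
  }

  // Operator heuristic: the runner is saturated (and may need more
  // resources) when the used count stays pinned at capacity.
  public static boolean isSaturated(int capacity, int trackedTasks) {
    return usedTaskSlots(capacity, trackedTasks) >= capacity;
  }
}
```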
@@ -69,12 +85,14 @@ public Pod launchPeonJobAndWaitForStart(Job job, long howLong, TimeUnit timeUnit
  }, howLong, timeUnit);
  long duration = System.currentTimeMillis() - start;
  log.info("Took task %s %d ms for pod to startup", jobName, duration);
+ emitK8sPodMetrics(job, "peon/startup/time", duration);
I think it would be good to preface the metric with k8s so that it is clear that the metric only applies to peons started by the Kubernetes task runner. I also like adding the unit to the end of the metric name to make it easier for operators to understand what unit the metric is reported in.
-emitK8sPodMetrics(job, "peon/startup/time", duration);
+emitK8sPodMetrics(job, "k8s/peon/startup/timeMillis", duration);
I like the idea of prefixing with k8s and will do that. But I see that other metrics don't end with a unit, such as task/run/time, so I wonder whether we should align with the existing ones.
@@ -87,6 +105,8 @@ public JobResponse waitForPeonJobCompletion(K8sTaskId taskId, long howLong, Time
    howLong,
    unit
  );
+ long duration = System.currentTimeMillis() - start;
+ emitK8sPodMetrics(job, "peon/running/time", duration);
How is this different than the task/run/time metric?
I think you would probably want to include the startup time as well; maybe it makes sense to emit this in KubernetesPeonLifecycle instead?
Removed; the task/run/time metric will include job startup and running time.
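The relationship between the two intervals can be sketched with a small timer, showing why a separate running-time metric would overlap with task/run/time (class and method names here are hypothetical, for illustration only):

```java
public class TaskRunTimer {
  private final long submittedAtMillis; // job submitted to K8s
  private long startedAtMillis;         // pod reported ready
  private long completedAtMillis;       // job finished

  public TaskRunTimer(long submittedAtMillis) {
    this.submittedAtMillis = submittedAtMillis;
  }

  public void markStarted(long nowMillis) { this.startedAtMillis = nowMillis; }
  public void markCompleted(long nowMillis) { this.completedAtMillis = nowMillis; }

  // Analogous to k8s/peon/startup/time: submission until the pod is ready.
  public long startupMillis() {
    return startedAtMillis - submittedAtMillis;
  }

  // Analogous to task/run/time: spans startup plus running, so a separate
  // running-time metric would largely double-count this interval.
  public long totalRunMillis() {
    return completedAtMillis - submittedAtMillis;
  }
}
```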
  {
    this.clientApi = clientApi;
    this.namespace = namespace;
    this.debugJobs = debugJobs;
    this.adapter = adapter;
    this.emitter = emitter;
  }

  public Pod launchPeonJobAndWaitForStart(Job job, long howLong, TimeUnit timeUnit)
The Job object being passed in here is calculated by using the adapter to convert a task to the Job in KubernetesTaskRunner#doTask. And then later we are using the adapter to convert it back to a task. Would it be cleaner to just pass in the Task object here instead?
I am also hoping to deprecate toTask(Job) in this PR: https://github.com/apache/druid/pull/14802/files to allow us to pull tasks instead of storing them as an env variable.
...lord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunnerFactory.java
.../src/test/java/org/apache/druid/k8s/overlord/taskadapter/DruidPeonClientIntegrationTest.java
Thanks for the review, I think I addressed all of them.
@@ -117,7 +117,7 @@ protected KubernetesPeonLifecycle(
  * @return
  * @throws IllegalStateException
  */
-protected synchronized TaskStatus run(Job job, long launchTimeout, long timeout) throws IllegalStateException
+protected synchronized TaskStatus run(Job job, Task task, long launchTimeout, long timeout) throws IllegalStateException
This is a bit of a strange function definition, as the PeonLifecycle is already scoped to a task. Passing in both a task and a job is confusing, because they should be for the same task. How about a function definition like this, and have the constructor accept an adapter so that the adapter can be used in this method:
-protected synchronized TaskStatus run(Job job, Task task, long launchTimeout, long timeout) throws IllegalStateException
+protected synchronized TaskStatus run(long launchTimeout, long timeout) throws IllegalStateException
Updated, though not exactly as you suggested, per my previous comment. Basically I reverted my change to the method definition.
  }

- public Pod launchPeonJobAndWaitForStart(Job job, long howLong, TimeUnit timeUnit)
+ public Pod launchPeonJobAndWaitForStart(Job job, Task task, long howLong, TimeUnit timeUnit)
-public Pod launchPeonJobAndWaitForStart(Job job, Task task, long howLong, TimeUnit timeUnit)
+public Pod launchPeonJobAndWaitForStart(Task task, long howLong, TimeUnit timeUnit)
Agreed, this is better, but I found a change like this makes the integration test DruidPeonClientIntegrationTest hard to write: as we push the job creation into the underlying class, there is no easy way to construct a job to suit our testing purposes. So I decided to keep the interface as it is.
Looks nice! I have one suggestion on changing the interface to KubernetesPeonLifecycle and KubernetesPeonClient to try and make it a little easier to follow.
Can you update the PR title and description to reflect the latest state of the patch?
Done; I also replied about why some interfaces were not updated as suggested.
Description
Report one new metric to monitor the K8s task runner:
- k8s/peon/startup/time: reports the time a peon pod takes to start up.
And implement the two existing metrics: taskSlot/used/count and taskSlot/idle/count.
Along with the previous #14698 change, we can answer questions about K8s task runner status using these metrics:
- taskSlot/used/count
- k8s/peon/startup/time
- task/run/time
- task/pending/time
Release note
Report the k8s/peon/startup/time metric for k8s-based ingestion.
Key changed/added classes in this PR
- KubernetesPeonClient: reports the k8s/peon/startup/time metric when a pod starts up.
This PR has: