
Account for terminating pods when doing preemption #510

Closed
2 of 3 tasks
Tracked by #636
alculquicondor opened this issue Jan 11, 2023 · 9 comments · Fixed by #692
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@alculquicondor
Contributor

What would you like to be added:

Account for the quota in use by pods that are still terminating when doing preemptions.

Why is this needed:

We issue preemptions by setting Workload.spec.admission=nil and immediately consider these resources freed. But in reality, pods take time to terminate.

We will need to keep the old admission somewhere for the calculations, and bubble up information about running pods from the Job. Maybe this will require improvements to the job controller.
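To illustrate the accounting problem, here is a minimal sketch (with hypothetical, simplified types; the real Kueue Admission struct records per-resource flavor assignments) of quota accounting that keeps counting a preempted workload's admission until its pods are gone:

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type Admission struct {
	ClusterQueue string
	Usage        int64 // resources granted at admission time
}

type Workload struct {
	Name       string
	Admission  *Admission // nil once the workload holds no resources
	ActivePods int32      // pods still running or terminating
}

// usedQuota sums the usage of every workload that may still hold
// resources. A preempted workload keeps its Admission until its pods
// finish terminating, so its quota is not handed out twice.
func usedQuota(workloads []Workload) int64 {
	var total int64
	for _, w := range workloads {
		if w.Admission != nil {
			total += w.Admission.Usage
		}
	}
	return total
}

func main() {
	wls := []Workload{
		// Preempted, but its 2 pods are still terminating.
		{Name: "preempted", Admission: &Admission{ClusterQueue: "cq", Usage: 4}, ActivePods: 2},
		{Name: "running", Admission: &Admission{ClusterQueue: "cq", Usage: 2}, ActivePods: 2},
	}
	fmt.Println(usedQuota(wls)) // 6: the preempted workload still counts
}
```

If the preempted workload's admission were cleared immediately, the scheduler would see 4 units as free while the terminating pods still occupy them.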

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@trasc
Contributor

trasc commented Apr 6, 2023

/assign

@kannon92
Contributor

kannon92 commented Apr 6, 2023

Hey @trasc,

@alculquicondor suggested I look into this issue as part of kubernetes/enhancements#3940.

I think it’s good that you are looking into it, because I’m a little confused about what Kueue wants when accounting for terminating pods.

@alculquicondor
Contributor Author

There are 2 parts to this problem:

  1. In Kueue, we need to preserve the information about how the job was admitted (in which queue and using which flavors).
  2. In Job status, we need the information about how many pods are still terminating.

@trasc
Contributor

trasc commented Apr 6, 2023

As I see this:

  1. In Kueue, we need to preserve the information about how the job was admitted (in which queue and using which flavors).

#599 will add a new "Evicted" condition that is set upon preemption. We can set this condition without clearing the admission (keeping both the Admitted condition and the struct), and only reset the admission when job.IsActive() starts returning false. That way the workload is considered "active" until we are "sure" that no resources are blocked by it, which should help avoid over-provisioning.

2. In Job status, we need the information about how many pods are still terminating.

In my opinion this is the job of job.IsActive() and we can think of ways of making the implementation of it more accurate (maybe walk some kind of object ownership tree).

@alculquicondor
Contributor Author

In my opinion this is the job of job.IsActive() and we can think of ways of making the implementation of it more accurate (maybe walk some kind of object ownership tree).

Right, we can always implement the logic in Kueue by watching pods. But ideally Kueue doesn't need to know about pods, for separation of concerns.

For the purpose of making progress, we can start with 1. and we can wait to see if we can leverage kubernetes/enhancements#3940

@trasc
Contributor

trasc commented Apr 6, 2023

@kannon92 in the implementation of IsActive() for batch/job

func (j *Job) IsActive() bool {
	return j.Status.Active != 0
}

we are relying on j.Status.Active to determine if a job is blocking any resources.

@kannon92
Contributor

kannon92 commented Apr 6, 2023

Yeah, so the issue is that Active doesn't include terminating pods. That KEP is about including terminating pods in Active, but I realize that there were some design decisions around terminating pods being considered failed.

If PodFailurePolicy is on, the job marks a terminating pod as failed only once it is fully terminated. If PodFailurePolicy is off, the job immediately counts the pod as failed even though it still holds resources.

I think that we will go with what @alculquicondor has suggested and probably have a status field for terminating so we can catch this intermittent state without changing the behavior.

https://github.com/kubernetes/enhancements/blob/1a9513382b0338026a2524baa4159951f66924b0/keps/sig-apps/3939-include-terminating-pods-as-active/README.md#open-questions-on-job-controller

Does this mean that if you want this feature for other workloads (MPI, etc) then they should include terminating pods in their status?

@alculquicondor
Contributor Author

Let's leave job conversations to k/k :)

@trasc, let's start by keeping the .status.admission field when evicting/preempting and just updating the Admitted condition.
We can add the Evicted condition in a follow up.

@alculquicondor
Contributor Author

Synced with @trasc offline to understand his suggestion better.

So the idea is to add the Evicted condition without changing the Admitted condition or clearing the .status.admission field. Once we are "certain" that the job doesn't have running pods, we actually clear .status.admission and update the Admitted condition.

This works better and is backwards-compatible.
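The agreed flow can be sketched as follows (a simplified model with boolean conditions and a placeholder admission value; real Kueue uses metav1.Condition and a structured Admission):

```go
package main

import "fmt"

// Illustrative workload state for the agreed flow.
type Workload struct {
	Admission *string // non-nil while quota is still reserved
	Evicted   bool
	Admitted  bool
}

// evict marks the workload Evicted but keeps the Admission and the
// Admitted condition, so its quota stays accounted for.
func evict(w *Workload) {
	w.Evicted = true
}

// reconcile clears the Admission and the Admitted condition only once
// no pods of the job are running or terminating (jobActive == false).
func reconcile(w *Workload, jobActive bool) {
	if w.Evicted && !jobActive {
		w.Admission = nil
		w.Admitted = false
	}
}

func main() {
	adm := "cluster-queue/flavor"
	w := Workload{Admission: &adm, Admitted: true}
	evict(&w)
	reconcile(&w, true) // pods still terminating: nothing released yet
	fmt.Println(w.Admission != nil, w.Admitted)
	reconcile(&w, false) // all pods gone: quota released
	fmt.Println(w.Admission != nil, w.Admitted)
}
```

Because the admission is only ever cleared later, never earlier, than before, existing consumers of the field see no new states, which is what makes the change backwards-compatible.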
