KEP-3939: Include Terminating Pods As Active

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
"Implementation History" section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Currently, Jobs and Deployments create replacement Pods as soon as existing Pods are marked for termination.
This KEP proposes a new opt-in field for the Job, Deployment and ReplicaSet APIs that makes the controllers count terminating pods as active.

Motivation

Existing Issues:

Job Creates Replacement Pods as soon as Pod is marked for deletion
Option for acknowledging terminating Pods in Deployment rolling update
Kueue: Account for terminating pods when doing preemption

Many common machine learning frameworks, such as Tensorflow, require unique pods. If terminating pods are not counted as active, a replacement pod can start while the pod it replaces is still running, which can cause errors. This is a rare case, but it can cause problems if a job needs to guarantee that the existing pods terminate before new pods start.

In Option for acknowledging terminating Pods in Deployment rolling update, there is a request for the Deployment API to guarantee that the number of replicas includes terminating pods. Terminating pods still consume resources because resources remain allocated to them, and a user may be charged for those resources.

In scarce compute environments, these resources can be difficult to obtain, so new pods can take a long time to schedule and may only find nodes once the existing pods have terminated.

Goals

Job Controller should only create new pods once the existing ones are marked as Failed/Succeeded
Deployment controller should allow for flexibility in waiting for pods to be fully terminated before creating new ones

Non-Goals

DaemonSets and StatefulSets are not included in this proposal
- They were designed to enforce uniqueness from the start so we will not include them in this design.

Proposal

Both the Job and ReplicaSet controllers get a list of active pods. Active pods usually means pods that have not been marked for deletion. In this KEP, we want to include terminating pods as active pods.

We will propose two new API fields in Jobs and Deployments/ReplicaSets in this KEP.

User Stories (Optional)

Story 1

As a machine learning user, I run ML frameworks that schedule multiple pods.
The Job controller does not typically wait for terminating pods to be marked as failed. Tensorflow and other ML frameworks may require that new Pods start only once the other pods are fully terminated. Setting the proposed terminatingAsActive field to true on the Job can fit these needs.

This case was added due to a bug discovered with running IndexedJobs with Tensorflow. See Jobs create replacement Pods as soon as a Pod is marked for deletion for more details.

Story 2

As a cloud user, I want to guarantee that the number of running pods is exactly the number I specify. Terminating pods do not relinquish resources, so scarce compute resources are still allocated to those pods. See Kueue: Account for terminating pods when doing preemption for an example of this.

Story 3

As a cloud user, I want to guarantee that the number of running pods includes terminating pods. In scarce compute environments, users may only have a limited number of nodes and do not want to try to schedule pods onto new resources. Counting terminating pods as active makes pod scheduling wait until the existing pods are terminated.

See Option for acknowledging terminating Pods in Deployment rolling update for more examples.

Notes/Constraints/Caveats (Optional)

Open Questions on Deployment Controller

The Deployment API is open for discussion. We put the field in Deployment/ReplicaSet because it is related to RolloutStrategy. It is not clear whether both rollout options, Recreate and RollingUpdate, need this API.

Another open question is whether we want to include Deployments in the initial release of this feature. There is some discussion about releasing the Job API first and then following up with Deployment.

We decided to define the APIs in this KEP as they can utilize the same implementation.

Open Questions on Job Controller

With 3329-retriable-and-non-retriable-failures and PodFailurePolicy enabled, terminating pods are only counted as failed once they have transitioned to the Failed phase. If PodFailurePolicy is disabled, then we count a terminating pod as failed as soon as its deletion is registered.

Should we add a new field to the status that reflects terminating pods?

The issue "Job controller should wait for Pods to be in a terminal phase before considering them failed or succeeded" is relevant to this case.
I am not sure how to handle these two different cases if we want to count terminating pods as active.

Should we use this feature to help solve 116858? When this feature toggle is on, we would mark terminating pods as failed only once they are complete, regardless of PodFailurePolicy.

Risks and Mitigations

Design Details

API Name Choices

TerminatingAsActive
ActiveUntilTerminal
DelayPodRecreationUntilTerminal
?

Job API Definition

At the JobSpec level, we are adding a new BoolPtr field:

type JobSpec struct {
	...
	// terminatingAsActive specifies whether the Job controller should count
	// terminating pods as active. If the field is true, the Job controller
	// treats active pods as pods that are either running or terminating.
	// +optional
	TerminatingAsActive *bool
}

Deployment/ReplicaSet API

// DeploymentStrategy stores information about the strategy and rolling-update
// behavior of a deployment.
type DeploymentStrategy struct {
	...
	// TerminatingAsActive specifies whether the Deployment should count
	// terminating pods as active. If the field is true, the Deployment controller
	// treats active pods as pods that are either running or terminating.
	// +optional
	TerminatingAsActive *bool
}

In Option for acknowledging terminating Pods in Deployment rolling update, there was a request to add this as part of the DeploymentStrategy field. Generally, handling terminating pods as active can be useful in both RollingUpdate and Recreate rollouts. Having this field for both strategies allows for handling of terminating pods in both cases.

Deployments create ReplicaSets so there is a need to add a field in the ReplicaSet as well. Since ReplicaSets are not typically set by users, we should add a field to the ReplicaSet that is set from the DeploymentSpec.

// ReplicaSetSpec is the specification of a ReplicaSet.
// As the internal representation of a ReplicaSet, it must have
// a Template set.
type ReplicaSetSpec struct {
	...
	// TerminatingAsActive specifies whether the ReplicaSet should count
	// terminating pods as active. If the field is true, the ReplicaSet controller
	// treats active pods as pods that are either running or terminating.
	// +optional
	TerminatingAsActive *bool
}
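The copy from the Deployment to the ReplicaSet could be sketched roughly as follows. This is only an illustration under stated assumptions: the struct types are trimmed stand-ins for the real API types, and newReplicaSetSpec is a hypothetical helper name, not an existing controller function.

```go
package main

import "fmt"

// Trimmed, hypothetical stand-ins for the real API types. Only the field
// proposed in this KEP is kept.
type DeploymentStrategy struct {
	TerminatingAsActive *bool
}

type DeploymentSpec struct {
	Strategy DeploymentStrategy
}

type ReplicaSetSpec struct {
	TerminatingAsActive *bool
}

// newReplicaSetSpec (hypothetical helper) builds the ReplicaSet spec for a
// Deployment and carries the TerminatingAsActive setting over. The value is
// copied so the ReplicaSet does not share a pointer with the Deployment.
func newReplicaSetSpec(d DeploymentSpec) ReplicaSetSpec {
	rs := ReplicaSetSpec{}
	if v := d.Strategy.TerminatingAsActive; v != nil {
		copied := *v
		rs.TerminatingAsActive = &copied
	}
	return rs
}

func main() {
	on := true
	rs := newReplicaSetSpec(DeploymentSpec{
		Strategy: DeploymentStrategy{TerminatingAsActive: &on},
	})
	fmt.Println(*rs.TerminatingAsActive) // true
}
```

An unset field on the Deployment stays unset on the ReplicaSet in this sketch, so existing Deployments would keep their current behavior.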

Implementation

Generally, both the Job and ReplicaSet controllers use FilterActivePods in their reconciliation loops. FilterActivePods returns the list of pods that are not terminating. This KEP will optionally include terminating pods in this list.

// Assumes the imports v1 "k8s.io/api/core/v1" and "k8s.io/klog/v2".

// FilterActivePods returns pods that have not terminated. When terminatingPods
// is true, pods that are marked for deletion but have not yet reached a
// terminal phase are returned as well.
func FilterActivePods(pods []*v1.Pod, terminatingPods bool) []*v1.Pod {
	var result []*v1.Pod
	for _, p := range pods {
		if IsPodActive(p) || (terminatingPods && IsPodTerminating(p)) {
			result = append(result, p)
		} else {
			klog.V(4).Infof("Ignoring inactive pod %v/%v in state %v, deletion time %v",
				p.Namespace, p.Name, p.Status.Phase, p.DeletionTimestamp)
		}
	}
	return result
}

// IsPodTerminating reports whether the pod is marked for deletion but has not
// yet reached a terminal phase (Succeeded or Failed).
func IsPodTerminating(p *v1.Pod) bool {
	return v1.PodSucceeded != p.Status.Phase &&
		v1.PodFailed != p.Status.Phase &&
		p.DeletionTimestamp != nil
}
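To make the three pod classes concrete, here is a self-contained sketch of the same filter that runs without the Kubernetes libraries. The Pod and PodPhase types are trimmed stand-ins for the k8s.io/api/core/v1 types, and the lower-case function names and samplePods are hypothetical, chosen only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// PodPhase and Pod are trimmed, hypothetical stand-ins for the core/v1 types.
// Only the fields the filter reads are kept.
type PodPhase string

const (
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

type Pod struct {
	Name              string
	Phase             PodPhase
	DeletionTimestamp *time.Time
}

// isPodActive: not in a terminal phase and not marked for deletion.
func isPodActive(p *Pod) bool {
	return p.Phase != PodSucceeded && p.Phase != PodFailed && p.DeletionTimestamp == nil
}

// isPodTerminating: marked for deletion but not yet in a terminal phase.
func isPodTerminating(p *Pod) bool {
	return p.Phase != PodSucceeded && p.Phase != PodFailed && p.DeletionTimestamp != nil
}

// filterActivePods keeps active pods and, when includeTerminating is true,
// terminating pods as well.
func filterActivePods(pods []*Pod, includeTerminating bool) []*Pod {
	var result []*Pod
	for _, p := range pods {
		if isPodActive(p) || (includeTerminating && isPodTerminating(p)) {
			result = append(result, p)
		}
	}
	return result
}

// samplePods returns one running, one terminating and one failed pod.
func samplePods() []*Pod {
	now := time.Now()
	return []*Pod{
		{Name: "running", Phase: PodRunning},
		{Name: "terminating", Phase: PodRunning, DeletionTimestamp: &now},
		{Name: "failed", Phase: PodFailed, DeletionTimestamp: &now},
	}
}

func main() {
	pods := samplePods()
	fmt.Println(len(filterActivePods(pods, false))) // 1
	fmt.Println(len(filterActivePods(pods, true)))  // 2
}
```

A pod that has already reached Failed or Succeeded is excluded in both modes, even if its deletion timestamp is set.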

The Job controller uses this list to determine whether the number of active pods differs from the expected value in the JobSpec.
Including terminating pods in this list allows the Job controller to wait until these pods have terminated before creating replacements.
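The effect on the replacement count can be illustrated with a deliberately simplified sketch. It assumes the controller only compares a wanted pod count with the active count; the real Job controller also considers completions, backoff and other state. The name podsToCreate is hypothetical.

```go
package main

import "fmt"

// podsToCreate (hypothetical helper) returns how many new pods a controller
// would create to reach the wanted count. Terminating pods only reduce the
// shortfall when terminatingAsActive is true.
func podsToCreate(wanted, running, terminating int, terminatingAsActive bool) int {
	active := running
	if terminatingAsActive {
		active += terminating
	}
	if missing := wanted - active; missing > 0 {
		return missing
	}
	return 0
}

func main() {
	// Three pods wanted, one running, two still terminating.
	fmt.Println(podsToCreate(3, 1, 2, false)) // 2
	fmt.Println(podsToCreate(3, 1, 2, true))  // 0
}
```

With the field enabled, the two replacements are created only after the terminating pods leave the active list.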

See Filter Active Pods Usage in Job Controller for where the Job controller filters the active pods.

For Deployment/ReplicaSet, the ReplicaSet controller filters for active pods. The implementation should read the field from the Deployment and set the same field on the ReplicaSets it creates.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

controller_utils: April 3rd 2023 - 56.6%
replicaset: April 3rd 2023 - 78.5%
deployment: April 3rd 2023 - 66.4%
job: April 3rd 2023 - 90.4%

Integration tests

We will add the following integration test for the Job controller:

TerminatingAsActive Feature Toggle On:

NonIndexedJob starts pods that take a while to terminate
Delete pods
Verify that pod creation only occurs once terminating pods are removed

We should also test the above with the feature toggle off.
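The expectation behind this scenario can be sketched as a small simulation. It is not the integration test itself: it assumes a Job that wants a single pod, whose only pod is deleted at step 0 and needs a fixed number of reconcile steps to finish terminating. The name firstCreationStep is hypothetical.

```go
package main

import "fmt"

// firstCreationStep (hypothetical helper) returns the reconcile step at which
// a replacement pod is created. The deleted pod counts as terminating for
// steps 0..terminationSteps-1 and is gone afterwards.
func firstCreationStep(terminationSteps int, terminatingAsActive bool) int {
	for step := 0; ; step++ {
		terminating := 0
		if step < terminationSteps {
			terminating = 1
		}
		active := 0 // the only pod was deleted, so nothing is running
		if terminatingAsActive {
			active += terminating
		}
		if active < 1 {
			return step
		}
	}
}

func main() {
	fmt.Println(firstCreationStep(3, false)) // 0
	fmt.Println(firstCreationStep(3, true))  // 3
}
```

With the toggle off the replacement appears immediately, and with it on the replacement appears only once the terminating pod is gone, which is the behavior the test is meant to verify.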

We will add a similar integration test for Deployment.

e2e tests

Graduation Criteria

Alpha

Job controller includes terminating pods as active
Deployment strategy optionally includes terminating pods as active
Unit Tests
Initial e2e tests

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: TerminatingAsActive
- Components depending on the feature gate: kube-controller-manager

Does enabling the feature change any default behavior?

Yes, terminating pods are included in the active pod count for FilterActivePods.

This means that, when the field is enabled, Deployments and Jobs will only create new pods once the existing pods have terminated.

This could potentially make deployments slower.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes.

What happens if we reenable the feature if it was previously rolled back?

Terminating pods will be dropped from the active list and we will revert to the old behavior. This means that terminating pods will be considered deleted and new pods will be created.

Are there any tests for feature enablement/disablement?

Yes. Unit tests will include the fields off/on and verify behavior.
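A table-driven check of the on/off combinations might look like the sketch below. It rests on two assumptions that this KEP does not spell out: the field is ignored while the feature gate is off, and an unset field means the current behavior. The name countTerminatingAsActive is hypothetical.

```go
package main

import "fmt"

// countTerminatingAsActive (hypothetical helper) reports whether terminating
// pods should count as active. Assumptions: the field has no effect while the
// feature gate is off, and a nil field means false.
func countTerminatingAsActive(gateEnabled bool, field *bool) bool {
	return gateEnabled && field != nil && *field
}

func main() {
	on, off := true, false
	cases := []struct {
		name  string
		gate  bool
		field *bool
		want  bool
	}{
		{"gate off, field unset", false, nil, false},
		{"gate off, field true", false, &on, false},
		{"gate on, field unset", true, nil, false},
		{"gate on, field false", true, &off, false},
		{"gate on, field true", true, &on, true},
	}
	for _, c := range cases {
		got := countTerminatingAsActive(c.gate, c.field)
		fmt.Printf("%s: got %v, want %v\n", c.name, got, c.want)
	}
}
```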

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

How can someone using this feature know that it is working for their instance?

If a user terminates pods that are controlled by a deployment/job, then we should wait until the existing pods are terminated before starting new ones.

We will add e2e tests that verify this.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

NA

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

This feature is closely related to 3329-retriable-and-non-retriable-failures, but it is not clear whether that is considered a dependency.

Does this feature depend on any specific services running in the cluster?

No

Scalability

Generally, enabling this will slow down rollouts if pods take a long time to terminate, because we would wait to create new pods until the existing ones are terminated.

Will enabling / using this feature result in any new API calls?

No

Will enabling / using this feature result in introducing new API types?

We add TerminatingAsActive to JobSpec, DeploymentStrategy and ReplicaSetSpec. This is a boolPtr.

Will enabling / using this feature result in any new calls to the cloud provider?

No

Will enabling / using this feature result in increasing size or count of the existing API objects?

For the Job API, we are adding a boolPtr field named TerminatingAsActive, which takes 8 bytes.

API type(s): boolPtr
Estimated increase in size: 8B

ReplicaSet and Deployment have two additions:

API type(s): boolPtr, in DeploymentStrategy and ReplicaSetSpec
Estimated increase in size: 16B (2 x 8B)

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Enabling this feature may make rollouts slower.

Alternatives

We discussed having this under the PodFailurePolicy but this is a more general idea than the PodFailurePolicy.

Infrastructure Needed (Optional)

NA
