KEP-1287: InPlacePodVerticalScaling BETA update #4704

Merged: 23 commits into kubernetes:master on Oct 10, 2024

Conversation

tallclair (Member)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 7, 2024
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 7, 2024
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:

Compatible container runtime (see [CRI changes](#cri-changes)).

Perhaps also addition of a RuntimeHandlerFeature for resize? as commented here

Contributor

That's more for the underlying OCI runtimes to report the features they have available. We could maybe add features to the runtime status object, though I think the runtimes have been compatible for a while, unless the design is changing.

Contributor

There's now RuntimeFeature, which we could use if we wanted to.

@tallclair (Member Author)

On further discussion, we've decided to push this back to v1.32. There are currently too many unresolved edge cases. I'll update this PR soon to highlight those unresolved conditions.

/milestone v1.32

@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Jun 7, 2024
@tallclair tallclair marked this pull request as draft September 4, 2024 16:57
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 4, 2024
@tallclair (Member Author)

tallclair commented Sep 4, 2024

Still WIP. I'm investigating, and plan to update / add sections on:

  • Resource quota, specifically how downscaling is handled
  • How limit resizes work (might need an extension to the checkpoint)
  • Resizing multiple resources / limits at the same time
  • Lifecycle nuances, how termination & restart impact various stages of resize
  • Resizing terminated containers
  • Rapid overlapping resizes
  • Resize subresource (do we need it?)

@tallclair tallclair marked this pull request as ready for review September 7, 2024 00:03
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 7, 2024
@tallclair (Member Author)

/assign @thockin

@tallclair (Member Author)

This is ready for review. I'll bring up some of these points for discussion in the next SIG-Node meeting.

Comment on lines 1111 to 1114
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?** Yes

- The feature should not be disabled on a running node (create a new node instead).
Contributor

PRR reviewer here: I think I'm following the discussion but want to be sure... Can this be made to work without creating new nodes? If so I think we're all set from a PRR perspective. I don't think we can answer this with "no" and consider the feature production ready.

keps/sig-node/1287-in-place-update-pod-resources/README.md (outdated review thread; resolved)
@jpbetz (Contributor) left a comment

One minor change request then LGTM for PRR

keps/sig-node/1287-in-place-update-pod-resources/README.md (outdated review thread; resolved)
@jpbetz (Contributor)

jpbetz commented Oct 5, 2024

/approve
For PRR
Thanks @tallclair Looking forward to seeing this KEP progress!

@@ -51,6 +63,7 @@
- [Implementation History](#implementation-history)
Member

One thing we should probably add to the PRR or elsewhere is the possible need to adjust APF (API Priority and Fairness) settings, from recent experience with misconfigured priority in prod 😮‍💨

Member Author

Why? I don't think IPPVS increases API calls, does it? Why is APF specifically relevant here?

Member Author

I guess there's the resize calls themselves, is that the case you're thinking of?

Member

Yeah... I don't expect this to be a big issue. But if kubelet is lumped together with other node-infra daemonsets in the same FlowSchema, then its impact would be amplified and hard to trace. It's more of a CYA move 😉

@@ -632,7 +653,7 @@ Pod Status in response to user changing the desired resources in Pod Spec.
for a Node with 'static' CPU Manager policy, that resize is rejected, and
an error message is logged to the event stream.
* To avoid races and possible gamification, all components will use Pod's
-   Status.ContainerStatuses[i].AllocatedResources when computing resources used
+   Status.ContainerStatuses[i].Resources when computing resources used
Member

This could be a problem if the runtime does not support reporting status.Resources (e.g. users on containerd < 1.6.9), or if the runtime is reporting the wrong resources.

Member Author

If the runtime doesn't report the resources, we fall back to the allocated resources (e.g. memory requests). If the runtime reports the wrong resources, I don't think there's anything we can do about it, but I'm not sure we need to design for that?
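
For illustration, a minimal sketch of that fallback logic, assuming a hypothetical helper name and wiring (this is not the actual kubelet code):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// effectiveRequests is a hypothetical helper illustrating the fallback above:
// use the runtime-reported status.Resources when present, otherwise fall back
// to the kubelet-allocated requests (e.g. when the container isn't running or
// the runtime is too old to report resources).
func effectiveRequests(status v1.ContainerStatus, allocated v1.ResourceList) v1.ResourceList {
	if status.Resources != nil && status.Resources.Requests != nil {
		return status.Resources.Requests
	}
	return allocated
}
```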

Member

We have a fallback solution here, and I agree with @tallclair on not over-engineering for the case where the runtime reports the wrong resources. All we can do is perform some basic sanity checks on the reported values to see if they are within the allowable range based on node capacity and pod requests. I am OK with doing nothing for beta, at least. WDYT? @vinaykul @tallclair
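
A rough sketch of the kind of basic sanity check being suggested here (illustrative only; the function name and wiring are hypothetical):

```go
package sketch

import (
	"log"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// warnIfOutOfRange logs when a runtime-reported quantity falls outside the
// range bounded by the container's allocated request and the node's
// allocatable capacity. It only logs; it does not reject or correct anything.
func warnIfOutOfRange(name v1.ResourceName, reported, request, nodeAllocatable resource.Quantity) {
	if reported.Cmp(request) < 0 || reported.Cmp(nodeAllocatable) > 0 {
		log.Printf("runtime reported %s=%s, outside expected range [%s, %s]",
			name, reported.String(), request.String(), nodeAllocatable.String())
	}
}
```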

Member

This might be OK... if you remember, we were initially setting status...Resources from AllocatedResources upon a successful UpdateContainerResources call, before Lantao found a race in the implementation. We could continue to do that by tracking the last successful UpdateContainerResources values, as that brings us closer to the last known good configured values.

However, my concern is that by not reporting true values we might hide potential problems that throw off o11y tools and mask alerts where they should be firing off pagers. I don't know enough; let me check with our o11y folks and get back. Also, it is weird for a container which is not running to say that it has configured resource requests & limits.

Member

Good point! Let's ensure we do some basic sanity checks on the reported values and log something if a value is out of range.

Member

I had a good discussion with our o11y lead @vjsamuel a short while ago and gave him a quick overview of this feature and the proposed changes. He has some interesting insights on how we currently use Requests, and he can speak to how the suggested changes would affect handling in o11y much better than I can. I know just enough about this to be dangerous. Thanks @vjsamuel for your help!

* Multiple containers

These resource requests & limits can have interdependencies that Kubernetes may not be aware of. For
example, two containers coordinating work may need to be scaled in tandem. It probably doesn't make
Member

Is there a clear, documented use case for "gang resize" or is this speculation?

I know I leveraged the notion of atomic NodeResourcesFit predicate when computing resize fit to make the case for pod-level ResizeStatus vs. container-level ResizeStatus with @thockin . But we may be making things worse by building it into the design.

Consider this:

  1. We have a node (4 CPUs, 4Gi memory) with two pods: P1(req: {cpu: 2000m, mem: 3000Mi}) and P2(req: {cpu: 1500m, mem: 1000Mi}).
  2. VPA triggers an in-place pod resize for P1, updating it to P1(req: {cpu: 3000m, mem: 1000Mi}). But due to the atomic resize notion, this request will remain in a Deferred state, keeping P1 at its original resource allocation levels for all resource types.
  3. Some time later, a new pod P3(req: {mem: 2000Mi}) comes along. P3 could have run on the node if the above resize request had been deconstructed into P1(req: {cpu: 2000m, mem: 1000Mi}) + P1(req: {cpu: 3000m, mem: 1000Mi}). P1 is still Deferred but is now closer to its desired state.

In the above scenario, P3 either remains pending or CA fires up another node just to run P3 - either case increases cost to the customer due to resource waste via over-provisioning.
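
To make the arithmetic in that scenario concrete, a small self-contained sketch (the numbers are copied from the example above; the deconstructed case assumes only the memory decrease is admitted while the cpu increase stays Deferred):

```go
package main

import "fmt"

func main() {
	const nodeMemMi = 4096 // node memory from the example: 4Gi

	// Atomic resize: P1's Deferred resize keeps its allocation at 3000Mi,
	// and P2 holds 1000Mi.
	freeAtomic := nodeMemMi - (3000 + 1000)
	fmt.Printf("atomic resize held: %dMi free, P3 (2000Mi) does not fit\n", freeAtomic)

	// Deconstructed resize: the memory decrease to 1000Mi is admitted on its
	// own; only the cpu increase remains Deferred.
	freeSplit := nodeMemMi - (1000 + 1000)
	fmt.Printf("deconstructed resize: %dMi free, P3 (2000Mi) fits\n", freeSplit)
}
```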

Member Author

I added clarification that this is about containers in the same pod, not resizing multiple pods together. There probably is a "gang resize" use case that goes along with gang scheduling, but that's out of scope for now.

WRT the scenario you listed here, I think scaling up CPU and scaling down memory at the same time is going to be extremely unusual. I would expect if CPU is being scaled up, memory will be scaled up or stay the same. Maybe with VPA if it's constantly scaling to match usage this could happen? I'm not sure it's a use case we should optimize for though.

Member

In this case, I am also talking about resizing multiple containers in a single pod with something like the tandem resize constraint (to clarify my "gang resize" terminology above; and yes, let's absolutely leave resizing multiple pods all-or-none ("syndicate resize"? 😓 ) out of scope as a non-goal). Also, this is not some corner-case thing: it is common for app owners to over-provision their requests, e.g. ask for way too much memory when the app is CPU-intensive. (We have an infra-team resource partly tasked with tuning these and negotiating with app teams to lower our DC hardware costs and power footprint.) Another (contrived) example is a pod's memory request & limit increase being withheld despite plenty of memory being available, just because the CPU-increase part of the same resize request is Deferred (leading to OOM kills).

If those resizes were to come in as multiple discrete requests, we could handle them with the same outcome, so we should be able to handle them coming in as a single request. A complex resize request is eventually broken down into simple UpdateContainerResources CRI calls to actuate it. I think the same can be done at admit time, to get the pod as close to its desired state as possible. I feel the atomic-resize notion at admit can become a footgun that leaves clusters over-provisioned… which kinda defeats the purpose of this project.

Thinking out loud: what if we deconstructed a complex resize and admitted what we can (decrease requests will always be admitted), and set the status to Deferred or Infeasible if any request cannot be admitted? I see now why @thockin's per-container ResizeStatus makes sense (at least internally, while keeping pod-level ResizeStatus for VPA to consume). The pod-level resize status would be Infeasible if one or more container requests are Infeasible, and Deferred if one or more requests are Deferred but none are Infeasible.

I need to think through this very carefully, but it's been a long day and I'm beat right now… however, modeling it this way feels like it will work. WDYT?
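
A minimal sketch of the aggregation rule described above, using hypothetical status values (not the actual API types):

```go
package sketch

// resizeStatus is a hypothetical per-container resize state used only to
// illustrate the aggregation rule proposed above.
type resizeStatus string

const (
	resizeAdmitted   resizeStatus = "Admitted"
	resizeDeferred   resizeStatus = "Deferred"
	resizeInfeasible resizeStatus = "Infeasible"
)

// podResizeStatus rolls container-level statuses up to a pod-level status:
// Infeasible if any container resize is Infeasible, otherwise Deferred if any
// container resize is Deferred, otherwise Admitted.
func podResizeStatus(containers []resizeStatus) resizeStatus {
	result := resizeAdmitted
	for _, s := range containers {
		switch s {
		case resizeInfeasible:
			return resizeInfeasible
		case resizeDeferred:
			result = resizeDeferred
		}
	}
	return result
}
```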

Member

I'm skeptical that trying to do partial resize is useful or correct, and it seems more likely to fail the "least surprise" test. Kube is level-actuated, kubelet should be driving at the most recent full update, I think.

Member

IIUC, the level-actuated principle would apply if two resizes R1 (cpu: 1->2) and R2 (cpu: 2->3) both arrive before the kubelet has a chance to admit R1. In that case we do R (cpu: 1->3), if it can be admitted. And all-or-none pod atomicity makes sense when scheduling and admitting a new pod at the kubelet.

In this case, we have R1 (mem: 1->2), which is admissible, and R2 (cpu: 1->2), which is Deferred. If we get [R1, R2], we would admit R1 and defer R2. But if it came in as [R1+R2] or [R2, R1], then we would head-of-line block R1. The scheduler already earmarks resources for both R1 & R2 via max(Requests, AllocatedResources). So why withhold R1 by introducing artificial barriers that are not really necessary and waste those resources?

By admitting what we can when those resources are already reserved, we would bring the pod's actual state closer to its desired state than with pod-level atomic resize. We wouldn't hold up a container image update because a Deferred resize prevents a full update, would we?

Member

+1 on @thockin's points above. I'd go with simplicity for now. Getting the in-place pod resizing feature to beta is a significant milestone for us. Focusing on the core functionality and addressing the most common use cases with the atomic approach first allows for a faster beta release. Once the feature is released as beta, we can collect feedback from users and evolve the feature with partial resizes based on that feedback. @vinaykul WDYT?

Member

I'm not in favor of this; it defeats the purpose of the project. Make it a GA blocker?

Member

> defeats the purpose of the project

That's a huge statement. What we have here is a relatively minor case (and if not, something unforeseen is happening). IMO, having a clearly defined and safe semantic for such a corner case is more important than being exactly right in every case.

Member

Thanks Vinay!

@tallclair

  1. Let's clearly document the limitations of the atomic approach, indicating that partial resizes are not yet supported and that there is potential for resource waste in certain scenarios. The plan is to call for input from the users of the upcoming beta feature.
  2. Let's also plan to re-evaluate this for GA. Based on the data and feedback we have, we can re-evaluate the need for partial resizes before GA. If the data shows significant resource waste or user frustration, we should prioritize implementing partial resizes as a prerequisite for GA.

Member Author

+1 to Dawn's suggestions. I'll update the KEP to include this.

I think we also need to reevaluate this in the context of pod-level resources, so I'd rather wait on a solution.

@tallclair (Member Author)

I realized there's a problem with the proposal that the scheduler uses the max of the desired resources & actual resources: if the container is not running, the actual resources aren't reported. Note that this issue exists with the current implementation as well, since allocated resources are only included for running containers.

One way around this is we could set status...resources to the allocated resources if the container isn't running.

@dchen1107 (Member)

> I realized there's a problem with the proposal that the scheduler uses the max of the desired resources & actual resources: if the container is not running, the actual resources aren't reported. Note that this issue exists with the current implementation as well, since allocated resources are only included for running containers.
>
> One way around this is we could set status...resources to the allocated resources if the container isn't running.

+1 on setting status.containerStatuses[*].resources to the allocated resources even when the container isn't running:

  1. It prevents node over-commit, which is important.
  2. It provides accurate information to the scheduler.

I don't see any potential issues caused by this. @vinaykul ?
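
As a rough illustration of the "max of desired and actual" rule discussed above (a sketch only, not the actual scheduler code), with the status side populated from the allocated values for non-running containers:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// scheduledRequests is a hypothetical sketch: for each resource, count the
// larger of the desired (spec) request and the status-reported request. If
// status.containerStatuses[*].resources is populated from the allocated
// values even for non-running containers, the status side is never missing.
func scheduledRequests(desired, reported v1.ResourceList) v1.ResourceList {
	out := v1.ResourceList{}
	for name, want := range desired {
		out[name] = want
		if have, ok := reported[name]; ok && have.Cmp(want) > 0 {
			out[name] = have
		}
	}
	return out
}
```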

@thockin (Member)

thockin commented Oct 9, 2024

FTR: I am lgtm on this, even though there are some smaller details open

@dchen1107 (Member)

/approve from node perspective

@tallclair once you've addressed our last comments, please ping us for lgtm.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, jpbetz, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 9, 2024
@tallclair (Member Author)

Added additional GA constraints, and relaxed ResizePolicy to be fully mutable. I think this addresses the remaining open comment threads; I hope I didn't miss anything!

@dchen1107 (Member)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit 1e118b1 into kubernetes:master Oct 10, 2024
4 checks passed
Labels
- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
- kind/kep: Categorizes KEP tracking issues and PRs modifying the KEP directory.
- lgtm: "Looks good to me", indicates that a PR is ready to be merged.
- sig/node: Categorizes an issue or PR as relevant to SIG Node.
- size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.