
KEP-3063: dynamic resource allocation #3064

Merged

Conversation

@pohly (Contributor) commented Nov 30, 2021:

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 30, 2021
@pohly pohly changed the title 3063: dynamic resource allocation: initial KEP draft KEP-3063: dynamic resource allocation: initial KEP draft Nov 30, 2021
@pohly pohly force-pushed the dynamic-resource-allocation-upstream branch from 9e2e0c3 to d0d0e21 Compare December 1, 2021 09:25
@swatisehgal (Contributor) commented:

/cc

@zvonkok commented Dec 13, 2021:

@sseetharaman6 FYI

@zvonkok commented Dec 13, 2021:

If we follow the CSI methodology, were there any thoughts about keeping the PV and also utilizing the StorageClass `provisioner` spec entry? Say one could create a DeviceClass that has a specific `provisioner: nvidia.com/gpu-operator`, which would provision the devices (aka PVs), which could then be allocated as DeviceClaims.

If a Pod requests a DeviceClaim which cannot be bound to a Device (aka PV), the provisioner in the DeviceClass (GPU) would try to do that. The analogy in the CSI world is a StorageClass with the aws-ebs provisioner, which takes care of everything needed (create, allocate, format) to make the PV available for consumption.

@pohly (Contributor, author) replied Dec 13, 2021:

> were there any thoughts about keeping the PV

The split between PV and PVC has been a big source of pain for storage. It leads to a lot of race conditions, of which some still haven't been addressed (kubernetes-csi/external-provisioner#486). I'd prefer to keep it simpler than that unless a use case gets identified that depends on that split.

For storage, one advantage is that a PV can be created beforehand and then can be bound by Kubernetes to a new PVC that gets created later. We were discussing whether something like that would be useful, but then decided against it because the matching still would have to be done by a third-party add-on. If such pre-allocation is useful, it could be handled by the add-on internally.

> and also utilizing the StorageClass provisioner spec entry, let's say one could create a DeviceClass that has a specific provisioner: nvidia.com/gpu-operator which would provision the devices aka PVs, which then can be allocated as DeviceClaims.

There is a ResourceClass (= StorageClass) with a PluginName (= Provisioner) in the current proposal to provide that additional level of indirection. This also adds complexity (how can a ResourceClaim be deallocated when the original ResourceClass is gone or was modified? The latter cannot be prevented reliably because delete/recreate can be used as a workaround for immutable fields), but the proposal tries to cover that by requiring that the plugin name gets copied and that the parameters in the ResourceClaim must be sufficient for deallocation.
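For illustration, here's a minimal Go sketch of that indirection; only `ResourceClass` and `PluginName` come from this thread, everything else (package, field shapes, tags) is an assumption for the example:

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical sketch, not the actual KEP types. ResourceClass plays
// the role of StorageClass; PluginName plays the role of Provisioner.
type ResourceClass struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// PluginName identifies the third-party plugin that allocates
	// and deallocates claims referencing this class. Because the
	// class may be deleted or modified later, the name and any
	// parameters needed for deallocation get copied into the claim
	// at allocation time.
	PluginName string `json:"pluginName"`

	// Parameters holds opaque, plugin-specific configuration set
	// by the cluster admin.
	Parameters map[string]string `json:"parameters,omitempty"`
}
```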

@swatisehgal (Contributor) left a comment:

Thanks for working on this and for the detailed write-up. This is a complex topic with a lot of moving parts, so I appreciate the effort put into this. Left a few comments.

I noticed that the roles and responsibilities are only implicitly covered in the proposal. It would really help to have user stories from the point of view of a cluster admin and a device vendor, like we have here, and to state explicitly what one has to do as a cluster admin or as a device vendor to enable this feature.


### Communication between kubelet and resource node plugin
A contributor commented:

A suggestion: it would be nice to have a diagram that clearly shows the endpoint and the full view of the gRPC communication between kubelet and the plugin.

@pohly (author) replied:

I agree, this low-level part needs more work. It's not fully spec-ed out.

The contributor replied:

Okay, looking forward to seeing more details on this!

@pohly (author) replied:

I just pushed a major update with a more complete implementation section.

@bart0sh is going to add more details to the kubelet section.


#### `NodePrepareResource`
@swatisehgal (Contributor) commented Dec 15, 2021:

Why not call it NodeAllocateResource or AllocateResources? I think that better represents the functionality of this RPC. NodePrepareResource gives the impression that we are only preparing the node and that the actual allocation happens at a later stage (in another RPC), which is not the case here.

@pohly (author) replied:

No, the resource is already allocated before kubelet (and thus this call) gets involved.

@swatisehgal (Contributor) replied Dec 16, 2021:

Are you referring to the case where devices are shared between various pods, i.e. a device is already allocated to one pod but still has bandwidth to accommodate the resource request of another pod, and we then need to "prepare" the device for the other pod?
If so, it boils down to when we consider a resource to have been allocated. IMO it still makes sense to call the API AllocateResources, as for that particular container/pod the kubelet needs to perform some action to make sure that the device can be "allocated" to the new pod.

@pohly (author) replied Dec 16, 2021:

Our goal was to encourage plugins to do as much of the "allocation" as possible before a pod gets scheduled. Then NodePrepareResource should be so lightweight that it cannot fail, or at least that failure is extremely unlikely. The motivation for that approach is that there is no way a pod can be moved to a different node if NodePrepareResource fails because, by definition, it only gets called after a pod has been permanently scheduled onto a node.

Plugins can deviate from that. They can do nothing when setting "phase: allocated" and do all of the actual allocation during NodePrepareResource. But that's not how it should be done.
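To make that division of labor concrete, here is a rough Go sketch of the node-side interface as described in this thread; the KEP itself defines these RPCs in protobuf, and all message fields below are assumptions, not the spec:

```go
package sketch

import "context"

// Hypothetical request/response placeholders; the real messages are
// defined by the KEP's .proto files.
type (
	NodePrepareResourceRequest    struct{ ResourceID string }
	NodePrepareResourceResponse   struct{ CDIDevices []string }
	NodeUnprepareResourceRequest  struct{ ResourceID string }
	NodeUnprepareResourceResponse struct{}
)

// NodeServer is what a resource plugin exposes to kubelet. Allocation
// already happened before scheduling completed, so NodePrepareResource
// should be lightweight and very unlikely to fail: at this point the
// pod can no longer be moved to a different node.
type NodeServer interface {
	// NodePrepareResource makes an already-allocated resource
	// available on the node, e.g. by returning CDI device names
	// for the container runtime to inject.
	NodePrepareResource(ctx context.Context, req *NodePrepareResourceRequest) (*NodePrepareResourceResponse, error)

	// NodeUnprepareResource reverses NodePrepareResource once the
	// pod no longer needs the resource on this node.
	NodeUnprepareResource(ctx context.Context, req *NodeUnprepareResourceRequest) (*NodeUnprepareResourceResponse, error)
}
```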

| Condition | gRPC code | Description | Recovery behavior |
|---|---|---|---|
| Resource does not exist | 5 NOT_FOUND | Indicates that a resource corresponding to the specified `resource_id` does not exist. | Caller MUST verify that the `resource_id` is correct and that the resource is accessible and has not been deleted before retrying with exponential back off. |


#### `NodeUnprepareResource`
A contributor commented:

Similar to my comment above, I think we should rename NodeUnprepareResource to NodeDeallocateResource or simply DeallocateResource.

@pohly (author) replied:

See #3064 (comment): because the "real" allocation happens earlier, these node-local calls are really just about "preparing" the resource for usage by a pod.


*Limitation*: GPU driver requirements are currently stored as labels
on the container image, and device plugins do not have access to the
container image.

A reviewer commented:

How does "device plugins do not have access to the container image" relate to "gracefully fail to start"?

@pohly (author) replied:

@klueska, @kad: can you clarify?

These limitations are often written so that people familiar with the current device plugin interface understand them, but others can be puzzled by them. We might be better off removing those items from the KEP to avoid questions about them.

runtime-specific operations that may need to occur as a result of
injecting a device into a container (device plugins are runtime
agnostic). A good example is supporting GPU passthrough
virtualization on kata vs. runc.

A reviewer commented:

The limitation seems to complain about there being no way to perform runtime-specific operations, but the requirement states "irrespective of the underlying container runtime", which is runtime agnostic. Could you clarify the use case and the problem this KEP can help solve?

@pohly (author) replied:

@klueska, @kad: can you clarify?

A member commented:

I didn't get this one either.

@pohly (author) replied:

I've removed all limitations after "problems that are handled by CDI". The motivation section now just ends with:

> Several other limitations are addressed by CDI.

The other limitations directly affect design decisions in the KEP and are therefore relevant; the things that can be done with CDI are less so, beyond providing some additional justification for doing this work - hopefully that isn't questionable 😅

- *Access to the plugin container*: When deploying a device plugin, I
would like to ensure that all of the operations that need to occur
as part of “injecting” a device into a container, also occur for the
“plugin container” itself.

A reviewer commented:

By "plugin container", do you mean the "third-party device plugin" that runs as a container?

@pohly (author) replied:

I'm not sure either what the "plugin container" here is. @kad?

A member commented:

An example would go a long way.

- ResourceClass, not namespaced, with privileged parameters for
multiple resource instances of a certain kind. All these instances
are provided by the same resource plugin, which is identified by a
field in the class.

A reviewer commented:

Do you mean that the resource class holds parameters of the resource exposed/provided by a specific resource plugin? Does this resource plugin refer to the third-party plugins on lines 191-192?

@pohly (author) replied:

The ResourceClass holds parameters for a resource driver. They are set by the cluster admin and thus can be different in different clusters although the resource driver is the same. These parameters will be used for all ResourceClaim instances that use this ResourceClass.
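Building on the earlier type sketch, a hypothetical class an admin might create could look like this; the driver name, class name, and parameters are all invented for illustration:

```go
// Hypothetical example reusing the ResourceClass sketch above. An
// admin might create a handful of classes per driver, e.g. one per
// sharing policy or hardware tier; every ResourceClaim referencing
// the class inherits its parameters.
var gpuShared = ResourceClass{
	ObjectMeta: metav1.ObjectMeta{Name: "gpu-shared"},
	PluginName: "gpu.example.com", // illustrative driver name
	Parameters: map[string]string{
		// Admin-chosen, driver-specific settings; opaque to
		// Kubernetes itself.
		"sharing":       "enabled",
		"maxPodsPerGPU": "4",
	},
}
```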

A member commented:

Can you give an example of a specific device and describe how many classes an admin would typically create? Just looking to get a sense of the type of parameters that can be set for groups of pods.

A reviewer commented:

Could you add this clarification to the KEP?

@pohly (author) replied:

I've rewritten the ResourceClass line item, including an example.

containers in a pod. Depending on the capabilities defined in the
ResourceClaim by the plugin, a ResourceClaim can be used exclusively
by one pod at a time, by a certain maximum number of pods, or an
unlimited number of pods.

A reviewer commented:

Does a ResourceClaim have capacity? If they are shared, then the design needs to include an accounting of the available capacity, right?

@pohly (author) replied:

The proposal only covers counting users of a ResourceClaim. We can't do more than that because a concept of "capacity" would be resource specific and thus out of scope of this KEP.

In practice, the common cases will probably be "a single user" (= exclusive access) and "unlimited". The API also supports a fixed number because that is trivial to support and perhaps might be useful.
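A minimal sketch of how such user counting could look, assuming hypothetical field names; the KEP only fixes the concept of exclusive / fixed-count / unlimited sharing, not these fields:

```go
// Hypothetical sketch of claim sharing semantics, not the KEP's API.
type ResourceClaimStatus struct {
	// MaxUsers is set by the driver at allocation time:
	// 1 = exclusive access, N > 1 = limited sharing,
	// 0 = unlimited (conventions assumed for this sketch).
	MaxUsers int

	// UsedBy tracks the pods currently consuming the claim; a new
	// pod may only be added while MaxUsers == 0 or
	// len(UsedBy) < MaxUsers. Note this is pure user counting;
	// any notion of "capacity" stays resource-specific and out of
	// scope, as stated above.
	UsedBy []string
}
```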

A reviewer commented:

Could you add this clarification to the KEP?

@pohly (author) replied:

Done.

@pohly (Contributor, author) commented Jan 17, 2022:

Cathy pointed out that in several places "plugin" is ambiguous. Sometimes the context determines whether it is a plugin for the scheduler, the kubelet, or the entire cluster. We cannot do much about scheduler and kubelet plugins; that's simply what they are called there.

But we can avoid "plugin" as name for the third-party cluster add-on. Let's do the same as for storage and call it "driver". Then we have:

  • ResourcePlugin -> ResourceDriver (as in the existing CSIDriver type)
  • ResourceClass.PluginName -> ResourceClass.DriverName

"controller" is also overloaded. We have a "resource claim controller" (the in-tree component that we add to kube-controller-manager" and "resource controller" (the control plane part of the third-party cluster add-on). Can someone think of a better name for the latter?

I decided against "Allocator" earlier because that seemed too specific, but right now allocation and deallocation is exactly what that component handles. Because we don't need to bake that name into any API (i.e. it's only used in the KEP and presentations), we can use that name now and if the purpose ever changes, pick something more generic when that happens.

Finally, do we need a more specific term instead of just "resource"? There are already "native resources" (the ones handled by kube-scheduler itself), "extended resources" (current device interface, handled by webhooks), and now "custom or generic resources"? If we pick such an adjective, should we also use it in the API (ResourceClaim -> CustomResourceClaim, PodSpec.Resources -> PodSpec.CustomResources, ContainerSpec.PodResources -> ContainerSpec.CustomResources)?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 23, 2022
@pohly (Contributor, author) commented Jun 23, 2022:

As this now seems to be heading towards "implementable" (🤞), let me include the additional commits that I had pending in the other PR. All of those changes are in proper additional commits (thus easy to review) and just tweak the text. The main change is the status change to "implementable".

devices must be carefully tuned to ensure that GPU resources do not go unused
because some of the pre-partitioned devices are in low demand. It also puts
the burden on the user to pick a particular MIG device type, rather than
declaring the resource constraints more abstractly.
@TBBle commented Jun 23, 2022:

Nit: picking a MIG device type (a name for a pre-configured set of resources) would be the more abstract option; I think what we want (and get) from this KEP is being able to declare our resource constraints more "explicitly".

@klueska (Contributor) replied Jun 23, 2022:

I guess it depends on how you look at it. By “more abstractly” I meant you might only care about making sure you get at least 5GB of memory on your GPU and then the plugin can either create a MIG device that satisfies that for you, or give you a full GPU if that’s all that’s available and it’s OK with that. With pre-defined MIG devices a user has to explicitly request that device type rather than providing a more abstract set of resource constraints that the plugin can work with to find the best GPU for the job.
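As a sketch of the difference, such abstract constraints could be expressed as claim parameters like these; all names are hypothetical, none come from the KEP:

```go
// Hypothetical claim parameters for the scenario described above: the
// user states constraints, and the driver picks a MIG slice or a full
// GPU that satisfies them, instead of the user naming a MIG profile.
type GPUClaimParameters struct {
	// MinMemoryGB is the smallest acceptable GPU memory; a driver
	// could satisfy it with a 5GB MIG device or a whole GPU.
	MinMemoryGB int

	// AllowFullGPU permits falling back to an entire GPU when no
	// matching MIG profile is available.
	AllowFullGPU bool
}
```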

@TBBle replied Jun 23, 2022:

That's fair. I read the case we're trying to enable not as "choosing the best GPU", but as subdividing a GPU to match your explicit resource constraints, e.g. "at least 5GB", on demand, versus having to specify a pre-defined GPU division (I thought that's what "MIG device type" meant here) which probably matches the container's requirements but is abstracted behind a name.

@klueska (Contributor) replied:

True. Looking at just the "resource sharing" / subdivision aspect, what you say makes sense. I ended up conflating two different advantages of the new proposal into one, i.e. better support for "resource sharing" and better support for matching workloads to the "right-sized" resources based on an abstract set of constraints.

@gjtempleton (Member) left a comment:

Two minor points, both more informational than blocking.

Thanks for all the hard work on this.

For SIG Autoscaling:
/lgtm

drivers can be added to the configuration file defined by the `--cloud-config`
parameter, with a common field that gets added in all cloud provider
configs, or a new `--config` parameter can be added. Later, dynamically
discovering deployed webhooks can be added through an autoscaler CRD.
A member commented:

Minor heads-up that we're beginning discussions about improving the config mechanism for the CA to stop the continuing explosion of command-line flags, with current discussion focusing on the use of ConfigMaps.

Comment on lines +1753 to +1758
`NodeReady` is needed to solve one particular problem: when a new node first
starts up, it may be ready to run pods, but the pod from a resource driver's
DaemonSet may still be starting up. If the resource driver controller needs
information from such a pod, then it will not be able to filter
correctly. Similar to how extended resources are handled, the autoscaler then
first needs to wait until the extender also considers the node to be ready.
A member commented:

This will likely add some complexity to advice around the unready flags in the CA for slow-to-start resource driver controllers, especially -ok-total-unready-count, though that is easily addressed.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 23, 2022
@pohly (Contributor, author) commented Jun 23, 2022:

/hold

Let's make sure that @alculquicondor and @thockin confirm merging of this PR as "implementable".

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2022
```go
// The resource driver checks this field and resets it to false
// together with clearing the Allocation field. It also sets it
// to false when the resource is not allocated.
Deallocate bool
```
A member commented:

This field is going to be something to watch out for during implementation. But good enough as a high level idea.
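A minimal sketch of the handshake this field implies, with hypothetical stand-in types; the real logic is driver-specific:

```go
package sketch

// Hypothetical stand-ins for the KEP's types; only the Deallocate
// field and its comment are quoted above.
type AllocationResult struct{}

type ResourceClaimStatus struct {
	Allocation *AllocationResult
	Deallocate bool
}

type ResourceClaim struct{ Status ResourceClaimStatus }

// Driver abstracts the third-party driver's deallocate call.
type Driver interface{ Deallocate(*ResourceClaim) error }

// reconcileDeallocate frees the resource first, then clears Allocation
// and resets the Deallocate flag together, per the quoted comment.
func reconcileDeallocate(claim *ResourceClaim, driver Driver) error {
	if !claim.Status.Deallocate || claim.Status.Allocation == nil {
		// Covers "not allocated": just reset the flag.
		claim.Status.Deallocate = false
		return nil
	}
	if err := driver.Deallocate(claim); err != nil {
		return err // keep the flag set and retry later
	}
	claim.Status.Allocation = nil
	claim.Status.Deallocate = false
	return nil
}
```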

@alculquicondor (Member) left a comment:

/lgtm
from sig-scheduling perspective.

Just for the record, I still have my reservations on the cluster-autoscaler story.

@thockin (Member) commented Jun 23, 2022:

/lgtm

This whole proposal is VERY daunting, but I don't see why we should block it further - the bar was set and cleared.

CDI is a new animal to me, so we should watch it carefully.

Important: we should not merge any part of this until we resolve the open issues ESPECIALLY around scheduling. We don't want to get to 1.25 code freeze and have a partially built mess to deal with :)

@thockin thockin removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2022
@dchen1107 (Member) commented:
/lgtm from SIG Node perspective.

Added some comments; some concerns remain about the complexities and security impacts for K8s vendors, but it's OK to carry on those discussions during the implementation phase or via pohly#13.

@klueska (Contributor) commented Jun 23, 2022:

@dchen1107 since this is a sig-node KEP it needs an /approve from you as well in order to move forward.

@dchen1107 (Member) commented:
/approve

@pohly (Contributor, author) commented Jun 23, 2022:

/assign @johnbelamaric

Can you do the final approval?

@johnbelamaric (Member) commented:
/approve

@k8s-ci-robot commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, johnbelamaric, klueska, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.