feat: support node selectors & affinities for Kubernetes resource pools (#9428)
carolinaecalderon committed Jun 17, 2024
1 parent 6cd7d06 commit 63a4163
Showing 9 changed files with 520 additions and 43 deletions.
14 changes: 14 additions & 0 deletions docs/release-notes/feature-node-selectors.rst
@@ -0,0 +1,14 @@
:orphan:

**New Features**

- Kubernetes Configuration: Allow cluster administrators to define Determined resource pools on
Kubernetes using node selectors and/or affinities. Configure these settings at the default pod
spec level under ``task_container_defaults.cpu_pod_spec`` or
``task_container_defaults.gpu_pod_spec``. This allows a single cluster to be divided into
multiple resource pools using node labels.

- WebUI: Allow resource pool slot counts to reflect the state of the entire cluster. Allow slot
counts and scheduling to respect node selectors and affinities. This impacts Determined clusters
deployed on Kubernetes with multiple resource pools defined in terms of node selectors and/or
affinities.
7 changes: 6 additions & 1 deletion docs/setup-cluster/k8s/custom-pod-specs.rst
@@ -149,7 +149,8 @@ Example of configuring default pod specs in ``values.yaml``:
hostPath:
path: /data
The default pod specs can also be configured on a resource pool level. GPU jobs submitted in the
The default pod specs can also be configured on a resource pool level. Cluster administrators can
define pools in terms of node selectors and/or node affinities here. GPU jobs submitted in the
resource pool will have the task spec applied. If a job is submitted in a resource pool with a
matching CPU / GPU pod spec then the top level ``taskContainerDefaults.gpuPodSpec`` or
``taskContainerDefaults.cpuPodSpec`` will not be applied.
@@ -167,6 +168,10 @@ Example of configuring resource pool default pod spec in ``values.yaml``.
kind: Pod
spec:
  # Define an example node selector label.
  nodeSelector:
    kubernetes.io/hostname: foo
  affinity:
    # Define an example node affinity.
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
33 changes: 27 additions & 6 deletions docs/setup-cluster/k8s/install-on-kubernetes.rst
@@ -419,12 +419,33 @@ To set up multiple resource pools for Determined on your Kubernetes cluster:
operator: "Equal"
value: "prod"
effect: "NoSchedule"
#. Label/taint the appropriate nodes you want to include as part of each resource pool. For instance
you may add a taint like ``kubectl taint nodes prod_node_name pool_taint=prod:NoSchedule`` and
the appropriate toleration to the PodTolerationRestriction admissions controller or in
``resourcePools.pool_name.task_container_defaults.gpu_pod_spec`` as above so it is automatically
added to the pod spec based on which namespace (and hence resource pool) a task runs in.
# Define an example node selector label.
nodeSelector:
  kubernetes.io/hostname: "foo"
affinity:
  # Define an example node affinity.
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - antarctica-west1
                - antarctica-east1
#. Label/taint the appropriate nodes you want to include as part of each resource pool.

#. For instance, you may add a taint like ``kubectl taint nodes prod_node_name
pool_taint=prod:NoSchedule`` and then add the matching toleration either to the
PodTolerationRestriction admission controller or to
``resourcePools.pool_name.task_container_defaults.gpu_pod_spec`` as above, so it is automatically
added to the pod spec based on which namespace (and hence resource pool) a task runs in.

#. Adding node selector or node affinity logic to your resource pool ensures that only nodes
matching that logic are selected. For example, you may add a node selector like
``kubernetes.io/hostname = foo``, or match your resource pool to any node whose
``topology.kubernetes.io/zone`` label is in the set ``{antarctica-west1, antarctica-east1}``.

#. Add the appropriate resource pool name to namespace mappings in the ``resourcePools`` section of
the ``values.yaml`` file in the Helm chart.
1 change: 1 addition & 0 deletions go.mod
@@ -218,6 +218,7 @@ require (
cloud.google.com/go/compute/metadata v0.2.3
golang.org/x/sync v0.5.0
google.golang.org/genproto/googleapis/api v0.0.0-20230711160842-782d3b101e98
k8s.io/component-helpers v0.20.14
)

// Determined AI's CircleCI doesn't have access to "github.hpe.com/hpe/hpc-ard-launcher-go",
2 changes: 2 additions & 0 deletions go.sum
@@ -1578,6 +1578,8 @@ k8s.io/apimachinery v0.20.14 h1:LG7YY3R3ZRO5UxaIsInDk8adAb9J744CP2EfckAIM7w=
k8s.io/apimachinery v0.20.14/go.mod h1:4KFiDSxCoGviCiRk9kTXIROsIf4VSGkVYjVJjJln3pg=
k8s.io/client-go v0.20.14 h1:DAtFSq905IE49N/WOzI1PvwnifI6Vduti5v8A2xJEt8=
k8s.io/client-go v0.20.14/go.mod h1:NP3va0ehKLBNmXBUIQD6ddTvK7Pu/wioGuitv++pYow=
k8s.io/component-helpers v0.20.14 h1:MoO3BxPMvlJlgzyLxc85zYZnRIf+f7OPc29MfKq/TNI=
k8s.io/component-helpers v0.20.14/go.mod h1:IiAdi9DFhI9PLuPJJOz4zMIVQe+2+S28I8uHM4DHYAI=
k8s.io/gengo v0.0.0-20200413195148-3a45101e95ac/go.mod h1:ezvh/TsK7cY6rbqRK0oQQ8IAqLxYwwyPxAX1Pzy0ii0=
k8s.io/gengo v0.0.0-20210813121822-485abfe95c7c/go.mod h1:FiNAH4ZV3gBg2Kwh89tzAEV2be7d5xI0vBa/VySYy3E=
k8s.io/klog/v2 v2.0.0/go.mod h1:PBfzABfn139FHAV07az/IF9Wp1bkk3vpT2XSJ76fSDE=
59 changes: 55 additions & 4 deletions master/internal/rm/kubernetesrm/jobs.go
@@ -29,6 +29,7 @@ import (
"k8s.io/client-go/tools/cache"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/client-go/util/homedir"
"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"

"github.com/determined-ai/determined/master/internal/config"
"github.com/determined-ai/determined/master/internal/db"
@@ -1383,8 +1384,8 @@ func (j *jobsService) getAgent(agentID string) *apiv1.GetAgentResponse {
const summarizeCacheDuration = 5 * time.Second

// summarize describes pods' available resources. When there's exactly one resource pool, it uses
// the whole cluster's info. Otherwise, it matches nodes to resource pools using taints and
// tolerations to derive that info. This may be cached, so don't use this for decisions
// the whole cluster's info. Otherwise, it matches nodes to resource pools using node selectors,
// affinities, taints, and tolerations to derive that info. This may be cached, so don't use this for decisions
// that require up-to-date information.
func (j *jobsService) summarize() (map[string]model.AgentSummary, error) {
j.summarizeCacheLock.Lock()
@@ -1424,8 +1425,11 @@ func (j *jobsService) getNodeResourcePoolMapping(nodeSummaries map[string]model.

for poolName, tcd := range poolTaskContainerDefaults {
var poolTolerations []k8sV1.Toleration
var selectors, affinities *k8sV1.NodeSelector

// If they're using the default RP config, use the default tolerations.
// Don't check for node selectors or affinities here because the pod spec
// isn't defined.
if len(j.resourcePoolConfigs) <= 1 &&
(tcd == nil || (tcd.CPUPodSpec == nil && tcd.GPUPodSpec == nil)) {
if slotType == device.CUDA {
@@ -1440,9 +1444,11 @@ func (j *jobsService) getNodeResourcePoolMapping(nodeSummaries map[string]model.
if slotType == device.CUDA && tcd.GPUPodSpec != nil {
//nolint:gocritic
poolTolerations = append(tcd.GPUPodSpec.Spec.Tolerations, gpuTolerations...)
selectors, affinities = extractNodeSelectors(tcd.GPUPodSpec)
} else if tcd.CPUPodSpec != nil {
//nolint:gocritic
poolTolerations = append(tcd.CPUPodSpec.Spec.Tolerations, cpuTolerations...)
selectors, affinities = extractNodeSelectors(tcd.CPUPodSpec)
}
}

@@ -1453,8 +1459,11 @@ func (j *jobsService) getNodeResourcePoolMapping(nodeSummaries map[string]model.
Effect: "PreferNoSchedule",
TolerationSeconds: nil,
})
// If all of a node's taints are tolerated by a pool, that node belongs to the pool.
if allTaintsTolerated(node.Spec.Taints, poolTolerations) {

// If all of a node's taints are tolerated by a pool and the node matches the pool's
// node selectors and node affinities, then the node belongs to the pool.
if allTaintsTolerated(node.Spec.Taints, poolTolerations) &&
j.podsCanBeScheduledOnNode(selectors, node) && j.podsCanBeScheduledOnNode(affinities, node) {
poolsToNodes[poolName] = append(poolsToNodes[poolName], node)
nodesToPools[node.Name] = append(nodesToPools[node.Name], poolName)
}
@@ -1811,6 +1820,48 @@ func allTaintsTolerated(taints []k8sV1.Taint, tolerations []k8sV1.Toleration) bo
return true
}

// Check that pods belonging to this resource pool (via its task container defaults) can be scheduled on a given node.
func (j *jobsService) podsCanBeScheduledOnNode(selector *k8sV1.NodeSelector, node *k8sV1.Node) bool {
// If no node selectors or affinities are defined, pods can be scheduled on the given node by default.
if selector == nil || len(selector.NodeSelectorTerms) == 0 {
return true
}

ns, err := nodeaffinity.NewNodeSelector(selector)
if err != nil {
j.syslog.WithError(err).Error("failed to parse node selector or affinity")
return false
}
return ns.Match(node)
}
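
As an editorial aside (not part of this commit), here is a minimal, self-contained Go sketch of the matching primitive this function delegates to: the nodeaffinity helper from k8s.io/component-helpers compiles a NodeSelector and tests it against a labeled node. The node name and label values below are hypothetical, illustrative examples only.

// Illustrative sketch only: exercises the nodeaffinity helper used by
// podsCanBeScheduledOnNode. Node name and labels are hypothetical.
package main

import (
	"fmt"

	k8sV1 "k8s.io/api/core/v1"
	metaV1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
)

func main() {
	// A selector equivalent to the docs example: kubernetes.io/hostname in {foo}.
	selector := &k8sV1.NodeSelector{
		NodeSelectorTerms: []k8sV1.NodeSelectorTerm{{
			MatchExpressions: []k8sV1.NodeSelectorRequirement{{
				Key:      "kubernetes.io/hostname",
				Operator: k8sV1.NodeSelectorOpIn,
				Values:   []string{"foo"},
			}},
		}},
	}

	// A node carrying the label the selector requires.
	node := &k8sV1.Node{ObjectMeta: metaV1.ObjectMeta{
		Name:   "foo",
		Labels: map[string]string{"kubernetes.io/hostname": "foo"},
	}}

	ns, err := nodeaffinity.NewNodeSelector(selector)
	if err != nil {
		fmt.Println("invalid selector:", err)
		return
	}
	// Prints true: the node matches the required label.
	fmt.Println(ns.Match(node))
}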

// extractNodeSelectors gets the node affinities and node selectors from a resource pool's default pod spec.
func extractNodeSelectors(pod *k8sV1.Pod) (selectors, affinities *k8sV1.NodeSelector) {
if pod == nil {
return nil, nil
}

// First check for node affinities, guarding against a nil NodeAffinity block.
if affinity := pod.Spec.Affinity; affinity != nil && affinity.NodeAffinity != nil {
affinities = affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution
}

// Then add any node labels.
if nodeSelector := pod.Spec.NodeSelector; nodeSelector != nil {
selectors = &k8sV1.NodeSelector{}
expr := []k8sV1.NodeSelectorRequirement{}
for k, v := range nodeSelector {
expr = append(expr, k8sV1.NodeSelectorRequirement{
Key: k,
Operator: k8sV1.NodeSelectorOpIn,
Values: strings.Split(v, ","),
})
}
selectors.NodeSelectorTerms = []k8sV1.NodeSelectorTerm{{MatchExpressions: expr}}
}
return selectors, affinities
}
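
A hedged, test-style sketch of what this conversion yields, assuming it sits alongside extractNodeSelectors in the same package with the usual test imports (testing, testify's require, and the k8sV1 alias); the label values are hypothetical and this test is not part of the commit.

// Hypothetical sketch: a plain spec.nodeSelector map becomes a single
// NodeSelectorTerm with "In" requirements, and affinities stay nil when
// no affinity block is set on the pod spec.
func TestExtractNodeSelectorsSketch(t *testing.T) {
	pod := &k8sV1.Pod{Spec: k8sV1.PodSpec{
		NodeSelector: map[string]string{"kubernetes.io/hostname": "foo"},
	}}

	selectors, affinities := extractNodeSelectors(pod)

	require.Nil(t, affinities)
	require.Len(t, selectors.NodeSelectorTerms, 1)
	require.Equal(t,
		[]k8sV1.NodeSelectorRequirement{{
			Key:      "kubernetes.io/hostname",
			Operator: k8sV1.NodeSelectorOpIn,
			Values:   []string{"foo"},
		}},
		selectors.NodeSelectorTerms[0].MatchExpressions,
	)
}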

func extractSlotInfo(node model.AgentSummary) (numSlots int, devType device.Type) {
var gpuSlots, cpuSlots int

Expand Down