
KEP-3022: Min domains in PodTopologySpread #3030

Merged
merged 1 commit on Jan 14, 2022

Conversation

sanposhiho
Member

@sanposhiho sanposhiho commented Nov 1, 2021

Hello team.

  • One-line PR description: add KEP-3022: Tuning the number of domains in Pod Topology Spread

Note

This is my first KEP. Please let me know if I'm missing something.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 1, 2021
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Nov 1, 2021
@alculquicondor
Member

/assign

@Huang-Wei
Member

@sanposhiho thanks for driving this. Given that the consensus of #3022 so far is to start with minDomains, could you wait for one or two days (in case others have different opinions), and then update this KEP? I will review once it's updated.

@sanposhiho
Member Author

@Huang-Wei Sure. Thanks.

@sanposhiho sanposhiho changed the title KEP-3022: Tuning the number of domains on Pod Topology Spread KEP-3022: Min domains in PodTopologySpread Nov 8, 2021
@sanposhiho
Member Author

Updated KEP to focus on MinDomains.

owning-sig: sig-scheduling
status: provisional
creation-date: 2021-10-28
reviewers:
Member

You don't need that many reviewers/approvers :)

Also, damemi is not a KEP approver (yet :))

Member

you can leave me out of this one.

## Motivation

Pod Topology Spread has [`maxSkew` parameter](https://github.com/kubernetes/enhancements/tree/11a976c74e1358efccf251d4c7611d05ce27feb3/keps/sig-scheduling/895-pod-topology-spread#maxskew), which controls the degree to which Pods may be unevenly distributed.
But, there isn't a way to control the number of domains over which we should spread.
Member

Explain why this is a problem.

The main reason is that, if a domain has 0 nodes, the scheduler would still schedule on existing domains, without giving the cluster autoscaler a chance to provide a node for it.

Member Author

Okay, I'll try to be more specific.


Users can define a minimum number of domains with `minDomains` parameter.

When the number of domains that have matching Pods is less than `minDomains`,
Member

This is not part of the Goal, this is part of the Design.

// - when `whenUnsatisfiable` equals `ScheduleAnyway`, scheduler prefers Nodes on the domains that don't have matching Pods.
// i.e. scheduler gives a high score to Nodes on the domains that don't have matching Pods in the `Score` phase.
// +optional
MinDomains int32
Member

*int32

......
// MinDomains describes the minimum number of domains.
// When the number of domains that have matching Pods is less than `minDomains`,
// - when `whenUnsatisfiable` equals `DoNotSchedule`, scheduler doesn't schedule a matching Pod to Nodes on the domains that have matching Pods.
Member

Uhm... rethinking a bit more about this, I think the semantics are off.

Let's say maxSkew=2 and there is only one node. I should be able to schedule 2 pods in that one node. The third pod should fail to force a scale up.

So we have to consider how many domains have pods. If it's less than minDomains, the global minimum is 0: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#spread-constraints-for-pods

Member

Aldo is right. Currently (without minDomains), the semantics of "global minimum" literally indicates the minimum number of matching pods among all topology domains.

While with minDomains, the semantics is sort of dynamic:

  • if the number of domains >= minDomains, "global minimum" stays the same as before
  • if the number of domains < minDomains, "global minimum" is treated as 0
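A minimal sketch of this dynamic "global minimum" rule (illustrative only; the function and variable names are assumptions, not the scheduler's actual code):

```go
package main

import "fmt"

// effectiveGlobalMin returns the "global minimum" used for the skew check.
// While fewer domains exist than minDomains, it is treated as 0, so with
// DoNotSchedule a pod is rejected once every existing domain already holds
// maxSkew matching pods.
func effectiveGlobalMin(minMatchingPods, numDomains, minDomains int) int {
	if numDomains < minDomains {
		return 0
	}
	return minMatchingPods
}

func main() {
	// One node (one domain), maxSkew=2, minDomains=2: the global minimum is 0,
	// so at most 2 pods fit on that node and the third pod stays pending,
	// giving the cluster autoscaler a reason to add a node.
	fmt.Println(effectiveGlobalMin(2, 1, 2)) // 0
	// Enough domains: the usual global minimum applies unchanged.
	fmt.Println(effectiveGlobalMin(2, 3, 2)) // 2
}
```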

Member Author

@sanposhiho sanposhiho Nov 13, 2021

Thanks for the clear explanation.

Okay, what I thought was

  • maxSkew does not affect the behavior of minDomains.
  • minDomains tries to increase the number of domains as much as possible, if the number of domains < minDomains. So, it always tries to bind matching Pods to the domains that don't have any matching Pods in that case.

But I think your suggestion is better. I'll fix the description. 🙏

```yaml
spec:
topologySpreadConstraints:
- minDomains: 10
Member

MaxSkew is mandatory


#### Alpha (v1.24):

- [ ] Add new parameter `NinDomains` to `TopologySpreadConstraint` and future gating.
Member

typos: MinDomain and feature

#### Beta (v1.25):

- [ ] This feature will be enabled by default as a Beta feature in v1.25.
- [ ] Add necessary integration/end-to-end tests.
Member

We usually require integration tests for Alpha. It would be great to have E2E tests that include a cluster autoscaler, but I'm not sure if that's possible.

kep-number: 3022
authors:
- "@sanposhiho"
owning-sig: sig-scheduling
Member

Add sig-autoscaling as participating sig and @MaciekPytel as reviewer.

Member Author

Sure.

### Feature Enablement and Rollback

<!--
This section must be completed when targeting alpha to a release.
Member

I guess we can leave this for the time when the KEP moves to implementable, but you can work on it already if you want.

Comment on lines 56 to 57
With [Pod Topology Spread](/keps/sig-scheduling/895-pod-topology-spread), users can define the rule to spread pods across your cluster among failure-domains.
And, we propose to add the parameter `minDomains` to limit the minimum number of domains in Pod Topology Spread.
Member

The summary section basically describes a fact that XYZ is introduced to achieve a particular goal. So don't need to mention what PodTopologySpread is:

A new field `MinDomains` is introduced to `PodSpec.TopologySpreadConstraint[*]` to limit
the minimum number of topology domains. It functions in a mandatory or best-efforts manner,
depending on the type of `WhenUnsatisfiable`.

Member Author

Thanks.


### Goals

Users can define a minimum number of domains with `minDomains` parameter.
Member

The goals usually list a few bullets, each with a neat sentence without too many technical details.

I am using cluster autoscaler and I want to force spreading a deployment over at least 10 Nodes.

## Design Details

Member

Please complete this section.

Member Author

Sure, I'll move the current Goals (which contain technical details) to this part.


### API

New parameter called `MinDomains` is introduced.
Member

is introduced to ...

i.e. scheduler filters Nodes on the domains that have matching Pods in the `Filter` phase.
- when `whenUnsatisfiable` equals `ScheduleAnyway`, scheduler prefers Nodes on the domains that don't have matching Pods.
i.e. scheduler gives a high score to Nodes on the domains that don't have matching Pods in the `Score` phase.

Member

Add a Non-Goals section here and mention maxDomains is not considered.

Member Author

Sure.

Users can set `MinDomains` and `whenUnsatisfiable: DoNotSchedule` to achieve it.

```yaml
spec:
Member

Can you give a full Deployment example to show the replicas?

Member Author

Sure.

spec:
topologySpreadConstraints:
- minDomains: 10
topologyKey: node
Member

let's use standard label: kubernetes.io/hostname

@sanposhiho
Member Author

@Huang-Wei @alculquicondor
Thanks for your detailed reviews 🙇
I updated the PR based on your reviews. Please take a look.

owning-sig: sig-scheduling
participating-sigs:
- sig-autoscaling
status: provisional
Member

I think you can aim for implementable already so that we don't have to ping the reviewers again.


Pod Topology Spread has [`maxSkew` parameter](https://github.com/kubernetes/enhancements/tree/11a976c74e1358efccf251d4c7611d05ce27feb3/keps/sig-scheduling/895-pod-topology-spread#maxskew), which controls the degree to which Pods may be unevenly distributed.
But, there isn't a way to control the number of domains over which we should spread.
In some cases, users want to limit the number of domains for the cluster autoscaler and force spreading Pods over a minimum number of domains.
Member

In some cases, users want to force spreading Pods over a minimum number of domains and, if there aren't enough already present, make the cluster-autoscaler provision them.

When the number of domains that have matching Pods is less than `minDomains`,
Pod Topology Spread treats "global minimum" as 0.

As the result,
Member

As a result

foo: bar
```

Until 10 Pods have been created, this flow is expected to repeat itself.
Member

This is assuming that the cluster initially only has 0 or 1 node.

Member Author

True. I changed the explanation a bit. How about this?

MinDomains *int32
}
```

Member

Can you include some implementation details?

Please don't add code to the KEP, but you can have some thoughts on which files/functions need to change and if we expect to have any performance degradation.

Member Author

Sure. I added an implementation detail section that explains how to implement this feature in a bit more detail.

###### How can this feature be enabled / disabled in a live cluster?

- [ ] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name:
Member

please fill this up

// - when `whenUnsatisfiable` equals `ScheduleAnyway`,
// scheduler prefers Nodes on the domains that don't have the same number of matching Pods as `maxSkew`.
// +optional
MinDomains *int32
Member

What is the default value? Probably zero. Please specify.

Member Author

Yes, it's 0.
Sure, I'll add the description.


###### What happens if we reenable the feature if it was previously rolled back?

N/A.
Member

Please answer this question.


###### Are there any tests for feature enablement/disablement?

No.
Member

There should be unit and integration tests.


### Scalability

<!--
Member

You can leave this for later, but consider working on it.

Member Author

Sure, I added some answers.

@sanposhiho
Member Author

@alculquicondor
Thanks for the reviews. I updated the PR per your comments.
Please take a look again when you're available. 🙇

- `global min matching num` denotes the minimum number of matching Pods.

For `whenUnsatisfiable: DoNotSchedule`, Pod Topology Spread treats `global min matching num` as 0
when the number of domains that have matching Pods is less than `minDomains`.
Member

how do you calculate this? probably in PreFilter? Can it be done without increasing the complexity of the algorithm?

Member

Ditto. Don't add the "that have matching Pods" clause.

Suggested change
when the number of domains that have matching Pods is less than `minDomains`.
when the number of domains is less than `minDomains`.

Member Author

@sanposhiho sanposhiho Dec 1, 2021

@alculquicondor

how do you calculate this? probably in PreFilter? Can it be done without increasing the complexity of the algorithm?

The current implementation calculates global min matching num in PreFilter.

And we use this global min matching num in the filtering calculation here.
https://github.com/kubernetes/kubernetes/blob/108c284a330a82ce1a1f80238e4f54bf5e8b045a/pkg/scheduler/framework/plugins/podtopologyspread/filtering.go#L322

So, I think the implementation will be like this. It's simple and can be implemented without increasing the complexity.

		// Global minimum of matching Pods across domains:
		// https://github.com/kubernetes/kubernetes/blob/108c284a330a82ce1a1f80238e4f54bf5e8b045a/pkg/scheduler/framework/plugins/podtopologyspread/filtering.go#L317
		minMatchNum := paths[0].MatchNum
		// Treat the global minimum as 0 while fewer domains exist than minDomains
		// (domainsNum, the number of eligible domains for this constraint, is an illustrative name).
		if c.MinDomains != nil && domainsNum < *c.MinDomains {
			minMatchNum = 0
		}

Member

Sorry if I was not clear. If I ask a question, it means that it's not clear in the KEP, and thus it needs to be updated.

Member Author

@sanposhiho sanposhiho Dec 2, 2021

Okay, I've updated it 🙇


And, in `preScoreState`, there is the `IgnoredNodes` field and Nodes in `preScoreState.IgnoredNodes` will literally be ignored and the score for those Nodes will be 0.

For `whenUnsatisfiable: ScheduleAnyway`, Pod Topology Spread adds Nodes on the domains which have the same or more number of matching Pods as `maxSkew` to `preScoreState.IgnoredNodes`
Member

How do you calculate which nodes to ignore?

Member Author

I found that I hadn't thought about it enough, so I revised the whole implementation detail of score phase. 🙏

@alculquicondor
Member

Here are some examples of implementation details:

Although, I think those changes are trivial in comparison to what you have to do for this KEP.


Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain.

When the number of domains that have matching Pods is less than `minDomains`,
Member

It's not mandatory to have matching pods on a domain (you can have 0 pods on a domain), so

Suggested change
When the number of domains that have matching Pods is less than `minDomains`,
When the number of domains is less than `minDomains`,

Member

+1 you are right. Although would "the number of domains with matching nodes" be better?

Member

I guess you meant "the number of domains with matching node topology keys"?

Member

well, ultimately the domains are defined by node labels. So I'm happy with any of those wordings.

Member Author

Understood. Thanks.
I'll change the wording to "the number of domains with matching topology keys".

Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain.

When the number of domains that have matching Pods is less than `minDomains`,
Pod Topology Spread treats "global minimum" as 0.
Member

Suggested change
Pod Topology Spread treats "global minimum" as 0.
Pod Topology Spread treats "global minimum" as 0; otherwise, "global minimum"
is equal to the number of matching pods on a domain.

Member

the minimum number of matching pods on a domain.

foo: bar
```

Until 10 Pods have been created, scheduler tries to schedule maximum two Pods per Node.
Member

You'd better not mention how many (10 in this case) pods get scheduled b/c it quite depends on how many domains are present, and you didn't mention that.

I'd like to revise this deployment example to be 10 replicas, and assume there are 3 nodes in the cluster. So the first 6 pods will be scheduled, and the remaining 4 pods can only be scheduled when 2 more nodes join the cluster (provisioned by CA for instance).
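A quick arithmetic check of that revised example (the constraint values maxSkew=2 and minDomains=5 are inferred here, not stated in the comment):

```go
package main

import "fmt"

func main() {
	const (
		replicas   = 10
		numDomains = 3 // nodes currently in the cluster
		maxSkew    = 2
		minDomains = 5
	)
	// While numDomains < minDomains, the global minimum is treated as 0,
	// so each existing node can hold at most maxSkew matching pods.
	schedulable := numDomains * maxSkew
	fmt.Println(schedulable)            // 6 pods are scheduled immediately
	fmt.Println(replicas - schedulable) // 4 pods stay pending until 2 more nodes join
}
```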

Member Author

@sanposhiho sanposhiho Dec 2, 2021

Okay. fixed it. 🙇


- [ ] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: `MinDomainsInPodTopologySpread`
- Components depending on the feature gate: `kube-scheduler`
Member

API server should also be involved, like: if the feature gate is disabled in API server, the field won't be persisted.
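A rough, self-contained sketch of that API-server behavior (the helper and types below are simplified stand-ins, not the actual kube-apiserver strategy code):

```go
package main

import "fmt"

// Simplified stand-ins for the real API types.
type TopologySpreadConstraint struct {
	MaxSkew    int32
	MinDomains *int32
}

type PodSpec struct {
	TopologySpreadConstraints []TopologySpreadConstraint
}

// dropDisabledMinDomains clears MinDomains when the feature gate is off in the
// API server, so the field is never persisted while the feature is disabled.
func dropDisabledMinDomains(spec *PodSpec, gateEnabled bool) {
	if gateEnabled {
		return
	}
	for i := range spec.TopologySpreadConstraints {
		spec.TopologySpreadConstraints[i].MinDomains = nil
	}
}

func main() {
	min := int32(3)
	spec := &PodSpec{TopologySpreadConstraints: []TopologySpreadConstraint{{MaxSkew: 1, MinDomains: &min}}}
	dropDisabledMinDomains(spec, false)                        // gate disabled in the API server
	fmt.Println(spec.TopologySpreadConstraints[0].MinDomains)  // <nil>: the field is dropped
}
```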

Pod Topology Spread changes the evaluation way to this.

```
('existing matching num' * 'topology weight' + 'maxSkew' - 1) * ('if existing matching num >= maxSkew (1 or 0)' + 1)
Member Author

@sanposhiho sanposhiho Dec 2, 2021

Maybe we should have a discussion about the multiplier which is used if existing matching num >= maxSkew.
I'm doubling it now, but there is no big reason; I just want the score to be big if existing matching num >= maxSkew.

If we make it too big, it will affect all final scores calculated in NormalizeScore.

Member

An alternative is to make all the nodes that have existing matching num >= maxSkew (when number of domains is less than minDomains) have the max Score, so that the normalized score is potentially zero. Although, given normalization, getting a zero normalized score also depends on the minimum score.

To cap, you would do:

min(maxSkew, existingMatchingNum) * topologyWeight + maxSkew - 1

Circling back on what I said earlier, you should prefer not to add code to the KEP. But a not-so-simple formula that has conditionals can be more readable as a code snippet. As long as you are not providing a significant diff, some code is ok in a KEP.
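A small sketch contrasting the two score formulas discussed above (variable and function names are illustrative, not the plugin's actual code; the scoring change was ultimately dropped later in this thread):

```go
package main

import "fmt"

// draftScore follows the formula quoted from the KEP draft: when the domain
// already holds at least maxSkew matching pods (and fewer than minDomains
// domains exist), the per-constraint score is doubled so the node ends up
// with a low normalized score.
func draftScore(existingMatchingNum, topologyWeight, maxSkew int) int {
	score := existingMatchingNum*topologyWeight + maxSkew - 1
	if existingMatchingNum >= maxSkew {
		score *= 2
	}
	return score
}

// cappedScore is the reviewer's alternative: cap the contribution of existing
// matching pods at maxSkew instead of applying a multiplier.
func cappedScore(existingMatchingNum, topologyWeight, maxSkew int) int {
	if existingMatchingNum > maxSkew {
		existingMatchingNum = maxSkew
	}
	return existingMatchingNum*topologyWeight + maxSkew - 1
}

func main() {
	fmt.Println(draftScore(3, 1, 2))  // (3*1 + 2 - 1) * 2 = 8
	fmt.Println(cappedScore(3, 1, 2)) // min(2, 3)*1 + 2 - 1 = 3
}
```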

@ahg-g
Member

ahg-g commented Dec 14, 2021

/cc @x13n


The feature can be disabled in Alpha and Beta versions.
In terms of Stable versions, users can choose to opt-out by not setting the
`pod.spec.topologySpreadConstraints.maxDomains` field.
Member

nit: minDomains

Member Author

Thanks.

@wojtek-t
Member

wojtek-t commented Jan 4, 2022

/assign


Users can define a minimum number of domains with `minDomains` parameter.

Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain.
Member

Add:
However, the global minimum is only calculated for the nodes that exist and match the node affinity. In other words, if a topology domain was scaled down to zero (for example, because of low utilization), this topology domain is unknown to the scheduler, thus it's not considered in the global minimum calculations.

The new minDomains field can help with this problem: [continues the existing text]

```

- `existing matching num` denotes the number of current existing matching Pods on the domain.
- `if self-match` denotes if the labels of Pod is match with selector of the constraint.
Member

s/is match/matches


In Score, the score for each constraint is evaluated, and the sum of those scores will be a Node score.

Basically, the score for each constraint is evaluated this way:
Member

Remove the word Basically

```

When the number of domains with matching topology keys is less than `minDomains`,
Pod Topology Spread doubles that score (so that final score will be a low score) if this criteria is met:
Member

s/final/normalized

Like preFilter, we need to calculate the number of domains with matching topology keys and the minimum number of matching Pods in preScore,
so that Pod Topology Spread can determine the evaluation way with them.

This extra calculation may affect the performance of preScore, because the current preScore only sees Nodes which have passed the Filter.
Member

Uhm.... we need to make sure to do it only if a constraint has minDomains set.

Is it really worth it, though? Since the original formula already gives a similar semantic, and because scoring is ignored by the cluster-autoscaler, I now think we should leave scores unmodified.

@Huang-Wei, thoughts?

Member Author

@sanposhiho sanposhiho Jan 8, 2022

Well.. I totally agree with "the original formula already gives a similar semantic". But, isn't it strange for users that there is no score change even after setting minDomains?
Or should we specify in the documentation that minDomains has an effect only when using DoNotSchedule and no actual effect with ScheduleAnyway?

we need to make sure to do it only if a constraint has minDomains set.

Agree, we should do that if we go the way of changing scores.

Member

Yes, we can specify in the documentation that it only applies to DoNotSchedule.

But, isn't it strange for users that there is no score change even after setting minDomains?

The scores are not user visible regardless. The difference is so subtle that I don't think they would notice a difference in most scenarios even if there is a change.

WDYT @Huang-Wei ?

Member

Yes, restricting it to DoNotSchedule adheres to our initial motivation, while investing effort in tuning subtle scoring differences doesn't make a lot of sense.

Member Author

Okay. Understood.
I changed the doc to set the goal to DoNotSchedule only.

Member

If it's not too much effort, add a summary of the discussion about DoNotSchedule in the "Alternatives" section. Explain what the proposal was, and why ultimately we decided it was not worth pursuing.

Member

BTW, while it might feel disappointing that we are discarding the change, it was absolutely crucial that we explored the options. So thank you a lot for your effort.

Member Author

add a summary of the discussion about DoNotSchedule in the "Alternatives" section. Explain what the proposal was, and why ultimately we decided it was not worth pursuing.

Sure. I added it to Alternatives. Could you please take a look 🙏

BTW..

happy to contribute to our best choice :)

reviewers:
- "@alculquicondor"
- "@Huang-Wei"
- "@MaciekPytel"
Member

swap for @x13n

@sanposhiho
Member Author

@alculquicondor
I've updated it per your suggestions (except #3030 (comment)).

Comment on lines 71 to 72
- Users can specify `minDomains` to limit the number of domains.
- Users can use it as a mandatory requirement with `WhenUnsatisfiable=DoNotSchedule`.
Member

Suggested change
- Users can specify `minDomains` to limit the number of domains.
- Users can use it as a mandatory requirement with `WhenUnsatisfiable=DoNotSchedule`.
- Users can specify `minDomains` to limit the number of domains when using `WhenUnsatisfiable=DoNotSchedule`.

......
// When the number of domains with matching topology keys is less than `minDomains`,
// Pod Topology Spread treats "global minimum" as 0.
// As a result,
Member

remove this line

// As a result,
// As a result, when the number of domains is less than `minDomains`,
// scheduler doesn't schedule a matching Pod to Nodes on the domains that have the same or more number of matching Pods as `maxSkew`.
// Default value is 0.
Member

Add: When value is different than 0, WhenUnsatisfiable must be DoNotSchedule
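A minimal sketch of that validation rule (illustrative; the function and type names are assumptions, not the actual API validation code):

```go
package main

import (
	"errors"
	"fmt"
)

type UnsatisfiableConstraintAction string

const (
	DoNotSchedule  UnsatisfiableConstraintAction = "DoNotSchedule"
	ScheduleAnyway UnsatisfiableConstraintAction = "ScheduleAnyway"
)

type TopologySpreadConstraint struct {
	MaxSkew           int32
	MinDomains        *int32
	WhenUnsatisfiable UnsatisfiableConstraintAction
}

// validateMinDomains enforces the rule above: a non-zero minDomains may only
// be combined with whenUnsatisfiable=DoNotSchedule, and negative values are rejected.
func validateMinDomains(c TopologySpreadConstraint) error {
	if c.MinDomains == nil || *c.MinDomains == 0 {
		return nil // unset or default value: nothing to check
	}
	if *c.MinDomains < 0 {
		return errors.New("minDomains must not be negative")
	}
	if c.WhenUnsatisfiable != DoNotSchedule {
		return errors.New("minDomains is only allowed with whenUnsatisfiable=DoNotSchedule")
	}
	return nil
}

func main() {
	min := int32(3)
	err := validateMinDomains(TopologySpreadConstraint{MaxSkew: 1, MinDomains: &min, WhenUnsatisfiable: ScheduleAnyway})
	fmt.Println(err) // minDomains is only allowed with whenUnsatisfiable=DoNotSchedule
}
```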

#### Alpha (v1.24):

- [ ] Add new parameter `MinDomains` to `TopologySpreadConstraint` and feature gating.
- [ ] Score extension point implementation.
Member

remove this line

@sanposhiho
Member Author

@alculquicondor Thanks. Updated per your suggestions.

@alculquicondor
Member

/approve

Please squash commits :)

ping @wojtek-t for PRR

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 14, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2022
@sanposhiho
Member Author

sanposhiho commented Jan 14, 2022

Squashed 👍

Member

@wojtek-t wojtek-t left a comment

Just two nits - other than that it lgtm.


###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled in Alpha and Beta versions.
Member

Add sth like:

"The feature can be disabled by restarting kube-apiserver and kube-scheduler with feature-gate off".


###### Are there any tests for feature enablement/disablement?

There should be unit and integration tests.
Member

No - tests will be added.

@sanposhiho
Member Author

@wojtek-t
Thanks, updated per your suggestions.

@wojtek-t
Member

It will need more work for Beta, but it's good for alpha from PRR perspective.

/lgtm
/approve PRR

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 14, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, sanposhiho, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants