
KEP-3022: Min domains in PodTopologySpread #3030

Merged
merged 1 commit on Jan 14, 2022

Conversation

sanposhiho
Member

@sanposhiho sanposhiho commented Nov 1, 2021

Hello team.

  • One-line PR description: add KEP-3022: Tuning the number of domains in Pod Topology Spread

Note

This is my first KEP. Please let me know if I'm missing something.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 1, 2021
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Nov 1, 2021
@alculquicondor
Member

/assign

@Huang-Wei
Member

@sanposhiho thanks for driving this. Given that the consensus of #3022 so far is to start with minDomains, could you wait for one or two days (in case others have different opinions), and then update this KEP? I will review once it's updated.

@sanposhiho
Member Author

@Huang-Wei Sure. Thanks.

@sanposhiho sanposhiho changed the title KEP-3022: Tuning the number of domains on Pod Topology Spread KEP-3022: Min domains in PodTopologySpread Nov 8, 2021
@sanposhiho
Member Author

Updated KEP to focus on MinDomains.

owning-sig: sig-scheduling
status: provisional
creation-date: 2021-10-28
reviewers:
Member

You don't need that many reviewers/approvers :)

Also, damemi is not a KEP approver (yet :))

Member

you can leave me out of this one.

## Motivation

Pod Topology Spread has [`maxSkew` parameter](https://github.com/kubernetes/enhancements/tree/11a976c74e1358efccf251d4c7611d05ce27feb3/keps/sig-scheduling/895-pod-topology-spread#maxskew), which controls the degree to which Pods may be unevenly distributed.
But, there isn't a way to control the number of domains over which we should spread.
Member

Explain why this is a problem.

The main reason is that, if a domain has 0 nodes, the scheduler would still schedule on existing domains, without giving the cluster autoscaler a chance to provide a node for it.

Member Author

Okay, I'll try to be more specific.


Users can define a minimum number of domains with `minDomains` parameter.

When the number of domains that have matching Pods is less than `minDomains`,
Member

This is not part of the Goal, this is part of the Design.

// - when `whenUnsatisfiable` equals `ScheduleAnyway`, scheduler prefers Nodes on the domains that don't have matching Pods.
// i.e. scheduler gives a high score to Nodes on the domains that don't have matching Pods in the `Score` phase.
// +optional
MinDomains int32
Member

*int32

......
// MinDomains describes the minimum number of domains.
// When the number of domains that have matching Pods is less than `minDomains`,
// - when `whenUnsatisfiable` equals `DoNotSchedule`, scheduler doesn't schedule a matching Pod to Nodes on the domains that have matching Pods.
Member

Uhm... rethinking a bit more about this, I think the semantics are off.

Let's say maxSkew=2 and there is only one node. I should be able to schedule 2 pods in that one node. The third pod should fail to force a scale up.

So we have to consider how many domains have pods. If it's less than minDomains, the global minimum is 0: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#spread-constraints-for-pods

Member

Aldo is right. Currently (without minDomains), the semantics of "global minimum" literally indicates the minimum number of matching pods among all topology domains.

While with minDomains, the semantics is sort of dynamic:

  • if the number of domains >= minDomains, "global minimum" stays the same as before
  • if the number of domains < minDomains, "global minimum" is treated as 0
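A minimal sketch of this dynamic "global minimum" rule (illustrative only; the function and variable names are assumptions, not the scheduler's actual code):

```go
package main

import "fmt"

// effectiveGlobalMin returns the "global minimum" used for the skew check.
// While fewer domains exist than minDomains, it is treated as 0, so with
// DoNotSchedule a pod is rejected once every existing domain already holds
// maxSkew matching pods.
func effectiveGlobalMin(minMatchingPods, numDomains, minDomains int) int {
	if numDomains < minDomains {
		return 0
	}
	return minMatchingPods
}

func main() {
	// One node (one domain), maxSkew=2, minDomains=2: the global minimum is 0,
	// so at most 2 pods fit on that node and the third pod stays pending,
	// giving the cluster autoscaler a reason to add a node.
	fmt.Println(effectiveGlobalMin(2, 1, 2)) // 0
	// Enough domains: the usual global minimum applies unchanged.
	fmt.Println(effectiveGlobalMin(2, 3, 2)) // 2
}
```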

Member Author

@sanposhiho sanposhiho Nov 13, 2021

Thanks for the clear explanation.

Okay, what I thought was

  • maxSkew does not affect the behavior of minDomains.
  • minDomains tries to increase the number of domains as much as possible, if the number of domains < minDomains. So, it always tries to bind matching Pods to the domains that don't have any matching Pods in that case.

But I think your suggestion is better. I'll fix the description. 🙏

```yaml
spec:
topologySpreadConstraints:
- minDomains: 10
Member

MaxSkew is mandatory


#### Alpha (v1.24):

- [ ] Add new parameter `NinDomains` to `TopologySpreadConstraint` and future gating.
Member

typos: MinDomain and feature

#### Beta (v1.25):

- [ ] This feature will be enabled by default as a Beta feature in v1.25.
- [ ] Add necessary integration/end-to-end tests.
Member

We usually require integration tests for Alpha. It would be great to have E2E tests that include a cluster autoscaler, but I'm not sure if that's possible.

kep-number: 3022
authors:
- "@sanposhiho"
owning-sig: sig-scheduling
Member

Add sig-autoscaling as participating sig and @MaciekPytel as reviewer.

Member Author

Sure.

### Feature Enablement and Rollback

<!--
This section must be completed when targeting alpha to a release.
Member

I guess we can leave this for the time when the KEP moves to implementable, but you can work on it already if you want.

Comment on lines 56 to 57
With [Pod Topology Spread](/keps/sig-scheduling/895-pod-topology-spread), users can define the rule to spread pods across your cluster among failure-domains.
And, we propose to add the parameter `minDomains` to limit the minimum number of domains in Pod Topology Spread.
Member

The summary section basically describes a fact that XYZ is introduced to achieve a particular goal. So don't need to mention what PodTopologySpread is:

A new field `MinDomains` is introduced to `PodSpec.TopologySpreadConstraint[*]` to limit
the minimum number of topology domains. It functions in a mandatory or best-efforts manner,
depending on the type of `WhenUnsatisfiable`.

Member Author

Thanks.


### Goals

Users can define a minimum number of domains with `minDomains` parameter.
Member

The goals usually list a few bullets, each with a neat sentence without too many technical details.

I am using cluster autoscaler and I want to force spreading a deployment over at least 10 Nodes.

## Design Details

Member

Please complete this section.

Member Author

Sure, I'll move the current Goals (which contain technical details) to this part.


### API

New parameter called `MinDomains` is introduced.
Member

is introduced to ...

i.e. scheduler filters Nodes on the domains that have matching Pods in the `Filter` phase.
- when `whenUnsatisfiable` equals `ScheduleAnyway`, scheduler prefers Nodes on the domains that don't have matching Pods.
i.e. scheduler gives a high score to Nodes on the domains that don't have matching Pods in the `Score` phase.

Member

Add a Non-Goals section here and mention maxDomains is not considered.

Member Author

Sure.

Users can set `MinDomains` and `whenUnsatisfiable: DoNotSchedule` to achieve it.

```yaml
spec:
Member

Can you give a full Deployment example to show the replicas?

Member Author

Sure.

spec:
topologySpreadConstraints:
- minDomains: 10
topologyKey: node
Member

let's use standard label: kubernetes.io/hostname

@sanposhiho
Member Author

@Huang-Wei @alculquicondor
Thanks for your detailed reviews 🙇
I updated the PR based on your reviews. Please take a look.

owning-sig: sig-scheduling
participating-sigs:
- sig-autoscaling
status: provisional
Member

I think you can aim for implementable already so that we don't have to ping the reviewers again.


Pod Topology Spread has [`maxSkew` parameter](https://github.com/kubernetes/enhancements/tree/11a976c74e1358efccf251d4c7611d05ce27feb3/keps/sig-scheduling/895-pod-topology-spread#maxskew), which controls the degree to which Pods may be unevenly distributed.
But, there isn't a way to control the number of domains over which we should spread.
In some cases, users want to limit the number of domains for the cluster autoscaler and force spreading Pods over a minimum number of domains.
Member

In some cases, users want to force spreading Pods over a minimum number of domains and, if there aren't enough already present, make the cluster-autoscaler provision them.

When the number of domains that have matching Pods is less than `minDomains`,
Pod Topology Spread treats "global minimum" as 0.

As the result,
Member

As a result

foo: bar
```

Until 10 Pods have been created, this flow is expected to repeat itself.
Member

This is assuming that the cluster initially only has 0 or 1 node.

Member Author

True. I changed the explanation a bit. How about this?

MinDomains *int32
}
```

Member

Can you include some implementation details?

Please don't add code to the KEP, but you can have some thoughts on which files/functions need to change and if we expect to have any performance degradation.

Member Author

Sure. I added an implementation detail section that explains how to implement this feature in a bit more detail.

###### How can this feature be enabled / disabled in a live cluster?

- [ ] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name:
Member

please fill this up

// - when `whenUnsatisfiable` equals `ScheduleAnyway`,
// scheduler prefers Nodes on the domains that don't have the same number of matching Pods as `maxSkew`.
// +optional
MinDomains *int32
Member

What is the default value? Probably zero. Please specify.

Member Author

Yes, it's 0.
Sure, I'll add the description.


###### What happens if we reenable the feature if it was previously rolled back?

N/A.
Member

Please answer this question.


###### Are there any tests for feature enablement/disablement?

No.
Member

There should be unit and integration tests.


### Scalability

<!--
Member

You can leave this for later, but consider working on it.

Member Author

Sure, I added some answers.

@sanposhiho
Member Author

@alculquicondor
Thanks for the reviews. I updated the PR per your comments.
Please take a look again when you're available. 🙇

- `global min matching num` denotes the minimum number of matching Pods.

For `whenUnsatisfiable: DoNotSchedule`, Pod Topology Spread treats `global min matching num` as 0
when the number of domains that have matching Pods is less than `minDomains`.
Member

how do you calculate this? probably in PreFilter? Can it be done without increasing the complexity of the algorithm?

Member

Ditto. Don't add the "that have matching Pods" clause.

Suggested change
when the number of domains that have matching Pods is less than `minDomains`.
when the number of domains is less than `minDomains`.

Member Author

@sanposhiho sanposhiho Dec 1, 2021

@alculquicondor

how do you calculate this? probably in PreFilter? Can it be done without increasing the complexity of the algorithm?

The current implementation calculates global min matching num in PreFilter.

And we use this global min matching num in the filtering calculation here.
https://github.com/kubernetes/kubernetes/blob/108c284a330a82ce1a1f80238e4f54bf5e8b045a/pkg/scheduler/framework/plugins/podtopologyspread/filtering.go#L322

So, I think the implementation will be like this. It's simple and can be implemented without increasing the complexity.

		// Global minimum of matching Pods across domains:
		// https://github.com/kubernetes/kubernetes/blob/108c284a330a82ce1a1f80238e4f54bf5e8b045a/pkg/scheduler/framework/plugins/podtopologyspread/filtering.go#L317
		minMatchNum := paths[0].MatchNum
		// Treat the global minimum as 0 while fewer domains exist than minDomains
		// (domainsNum, the number of eligible domains for this constraint, is an illustrative name).
		if c.MinDomains != nil && domainsNum < *c.MinDomains {
			minMatchNum = 0
		}

Member

Sorry if I was not clear. If I ask a question, it means that it's not clear in the KEP, and thus it needs to be updated.

Member Author

@sanposhiho sanposhiho Dec 2, 2021

Okay, I've updated it 🙇


And, in `preScoreState`, there is the `IgnoredNodes` field and Nodes in `preScoreState.IgnoredNodes` will literally be ignored and the score for those Nodes will be 0.

For `whenUnsatisfiable: ScheduleAnyway`, Pod Topology Spread adds Nodes on the domains which have the same or more number of matching Pods as `maxSkew` to `preScoreState.IgnoredNodes`
Member

How do you calculate which nodes to ignore?

Member Author

I found that I hadn't thought about it enough, so I revised the whole implementation detail of score phase. 🙏

@alculquicondor
Member

Here are some examples of implementation details:

Although, I think those changes are trivial in comparison to what you have to do for this KEP.


Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain.

When the number of domains that have matching Pods is less than `minDomains`,
Member

It's not mandatory to have matching pods on a domain (you can have 0 pods on a domain), so

Suggested change
When the number of domains that have matching Pods is less than `minDomains`,
When the number of domains is less than `minDomains`,

Member

+1 you are right. Although would "the number of domains with matching nodes" be better?

Member

I guess you meant "the number of domains with matching node topology keys"?

Member

well, ultimately the domains are defined by node labels. So I'm happy with any of those wordings.

Member Author

Understood. Thanks.
I'll change the wording to "the number of domains with matching topology keys".

Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain.

When the number of domains that have matching Pods is less than `minDomains`,
Pod Topology Spread treats "global minimum" as 0.
Member

Suggested change
Pod Topology Spread treats "global minimum" as 0.
Pod Topology Spread treats "global minimum" as 0; otherwise, "global minimum"
is equal to the number of matching pods on a domain.

Member

the minimum number of matching pods on a domain.

foo: bar
```

Until 10 Pods have been created, scheduler tries to schedule maximum two Pods per Node.
Member

You'd better not mention how many (10 in this case) pods get scheduled b/c it quite depends on how many domains are present, and you didn't mention that.

I'd like to revise this deployment example to be 10 replicas, and assume there are 3 nodes in the cluster. So the first 6 pods will be scheduled, and the remaining 4 pods can only be scheduled when 2 more nodes join the cluster (provisioned by CA for instance).
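A quick arithmetic check of that revised example (the constraint values maxSkew=2 and minDomains=5 are inferred here, not stated in the comment):

```go
package main

import "fmt"

func main() {
	const (
		replicas   = 10
		numDomains = 3 // nodes currently in the cluster
		maxSkew    = 2
		minDomains = 5
	)
	// While numDomains < minDomains, the global minimum is treated as 0,
	// so each existing node can hold at most maxSkew matching pods.
	schedulable := numDomains * maxSkew
	fmt.Println(schedulable)            // 6 pods are scheduled immediately
	fmt.Println(replicas - schedulable) // 4 pods stay pending until 2 more nodes join
}
```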

Member Author

@sanposhiho sanposhiho Dec 2, 2021

Okay. fixed it. 🙇


- [ ] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: `MinDomainsInPodTopologySpread`
- Components depending on the feature gate: `kube-scheduler`
Member

API server should also be involved, like: if the feature gate is disabled in API server, the field won't be persisted.
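A rough, self-contained sketch of that API-server behavior (the helper and types below are simplified stand-ins, not the actual kube-apiserver strategy code):

```go
package main

import "fmt"

// Simplified stand-ins for the real API types.
type TopologySpreadConstraint struct {
	MaxSkew    int32
	MinDomains *int32
}

type PodSpec struct {
	TopologySpreadConstraints []TopologySpreadConstraint
}

// dropDisabledMinDomains clears MinDomains when the feature gate is off in the
// API server, so the field is never persisted while the feature is disabled.
func dropDisabledMinDomains(spec *PodSpec, gateEnabled bool) {
	if gateEnabled {
		return
	}
	for i := range spec.TopologySpreadConstraints {
		spec.TopologySpreadConstraints[i].MinDomains = nil
	}
}

func main() {
	min := int32(3)
	spec := &PodSpec{TopologySpreadConstraints: []TopologySpreadConstraint{{MaxSkew: 1, MinDomains: &min}}}
	dropDisabledMinDomains(spec, false)                        // gate disabled in the API server
	fmt.Println(spec.TopologySpreadConstraints[0].MinDomains)  // <nil>: the field is dropped
}
```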

Pod Topology Spread changes the evaluation way to this.

```
('existing matching num' * 'topology weight' + 'maxSkew' - 1) * ('if existing matching num >= maxSkew (1 or 0)' + 1)
Member Author

@sanposhiho sanposhiho Dec 2, 2021

Maybe we should have a discussion about the multiplier which is used if existing matching num >= maxSkew.
I'm doubling it now, but there is no big reason; I just want the score to be big if existing matching num >= maxSkew.

If we make it too big, it will affect all final scores calculated in NormalizeScore.

Member

An alternative is to make all the nodes that have existing matching num >= maxSkew (when number of domains is less than minDomains) have the max Score, so that the normalized score is potentially zero. Although, given normalization, getting a zero normalized score also depends on the minimum score.

To cap, you would do:

min(maxSkew, existingMatchingNum) * topologyWeight + maxSkew - 1

Circling back on what I said earlier, you should prefer not to add code to the KEP. But a not-so-simple formula that has conditionals can be more readable as a code snippet. As long as you are not providing a significant diff, some code is ok in a KEP.
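A small sketch contrasting the two score formulas discussed above (variable and function names are illustrative, not the plugin's actual code; the scoring change was ultimately dropped later in this thread):

```go
package main

import "fmt"

// draftScore follows the formula quoted from the KEP draft: when the domain
// already holds at least maxSkew matching pods (and fewer than minDomains
// domains exist), the per-constraint score is doubled so the node ends up
// with a low normalized score.
func draftScore(existingMatchingNum, topologyWeight, maxSkew int) int {
	score := existingMatchingNum*topologyWeight + maxSkew - 1
	if existingMatchingNum >= maxSkew {
		score *= 2
	}
	return score
}

// cappedScore is the reviewer's alternative: cap the contribution of existing
// matching pods at maxSkew instead of applying a multiplier.
func cappedScore(existingMatchingNum, topologyWeight, maxSkew int) int {
	if existingMatchingNum > maxSkew {
		existingMatchingNum = maxSkew
	}
	return existingMatchingNum*topologyWeight + maxSkew - 1
}

func main() {
	fmt.Println(draftScore(3, 1, 2))  // (3*1 + 2 - 1) * 2 = 8
	fmt.Println(cappedScore(3, 1, 2)) // min(2, 3)*1 + 2 - 1 = 3
}
```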

@ahg-g
Member

ahg-g commented Dec 14, 2021

/cc @x13n


The feature can be disabled in Alpha and Beta versions.
In terms of Stable versions, users can choose to opt-out by not setting the
`pod.spec.topologySpreadConstraints.maxDomains` field.
Member

nit: minDomains

Member Author

Thanks.

@wojtek-t
Member

wojtek-t commented Jan 4, 2022

/assign


Users can define a minimum number of domains with `minDomains` parameter.

Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain.
Member

Add:
However, the global minimum is only calculated for the nodes that exist and match the node affinity. In other words, if a topology domain was scaled down to zero (for example, because of low utilization), this topology domain is unknown to the scheduler, thus it's not considered in the global minimum calculations.

The new minDomains field can help with this problem: [continues the existing text]

```

- `existing matching num` denotes the number of current existing matching Pods on the domain.
- `if self-match` denotes if the labels of Pod is match with selector of the constraint.
Member

s/is match/matches


In Score, the score for each constraint is evaluated, and the sum of those scores will be a Node score.

Basically, the score for each constraint is evaluated this way:
Member

Remove the word Basically

```

When the number of domains with matching topology keys is less than `minDomains`,
Pod Topology Spread doubles that score (so that final score will be a low score) if this criteria is met:
Member

s/final/normalized

Like preFilter, we need to calculate the number of domains with matching topology keys and the minimum number of matching Pods in preScore,
so that Pod Topology Spread can determine the evaluation way with them.

This extra calculation may affect the performance of preScore, because the current preScore only sees Nodes which have passed the Filter.
Member

Uhm.... we need to make sure to do it only if a constraint has minDomains set.

Is it really worth it, though? Since the original formula already gives a similar semantic, and because scoring is ignored by the cluster-autoscaler, I now think we should leave scores unmodified.

@Huang-Wei, thoughts?

Member Author

@sanposhiho sanposhiho Jan 8, 2022

Well.. I totally agree with "the original formula already gives a similar semantic". But, isn't it strange for users that there is no score change even after setting minDomains?
Or should we specify in the documentation that minDomains has an effect only when using DoNotSchedule and no actual effect with ScheduleAnyway?

we need to make sure to do it only if a constraint has minDomains set.

Agree, we should do that if we go the way of changing scores.

Member

Yes, we can specify in the documentation that it only applies to DoNotSchedule.

But, isn't it strange for users that there is no score change even after setting minDomains?

The scores are not user visible regardless. The difference is so subtle that I don't think they would notice a difference in most scenarios even if there is a change.

WDYT @Huang-Wei ?

Member

Yes, restricting it to DoNotSchedule adheres to our initial motivation, while investing effort in tuning subtle scoring differences doesn't make a lot of sense.

Member Author

Okay. Understood.
I changed the doc to set the goal to DoNotSchedule only.

Member

If it's not too much effort, add a summary of the discussion about DoNotSchedule in the "Alternatives" section. Explain what the proposal was, and why ultimately we decided it was not worth pursuing.

Member

BTW, while it might feel disappointing that we are discarding the change, it was absolutely crucial that we explored the options. So thank you a lot for your effort.

Member Author

add a summary of the discussion about DoNotSchedule in the "Alternatives" section. Explain what the proposal was, and why ultimately we decided it was not worth pursuing.

Sure. I added it to Alternatives. Could you please take a look 🙏

BTW..

happy to contribute to our best choice :)

reviewers:
- "@alculquicondor"
- "@Huang-Wei"
- "@MaciekPytel"
Member

swap for @x13n

@sanposhiho
Member Author

@alculquicondor
I've updated it per your suggestions (except #3030 (comment)).

Comment on lines 71 to 72
- Users can specify `minDomains` to limit the number of domains.
- Users can use it as a mandatory requirement with `WhenUnsatisfiable=DoNotSchedule`.
Member

Suggested change
- Users can specify `minDomains` to limit the number of domains.
- Users can use it as a mandatory requirement with `WhenUnsatisfiable=DoNotSchedule`.
- Users can specify `minDomains` to limit the number of domains when using `WhenUnsatisfiable=DoNotSchedule`.

......
// When the number of domains with matching topology keys is less than `minDomains`,
// Pod Topology Spread treats "global minimum" as 0.
// As a result,
Member

remove this line

// As a result,
// As a result, when the number of domains is less than `minDomains`,
// scheduler doesn't schedule a matching Pod to Nodes on the domains that have the same or more number of matching Pods as `maxSkew`.
// Default value is 0.
Member

Add: When value is different than 0, WhenUnsatisfiable must be DoNotSchedule
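A minimal sketch of that validation rule (illustrative; the function and type names are assumptions, not the actual API validation code):

```go
package main

import (
	"errors"
	"fmt"
)

type UnsatisfiableConstraintAction string

const (
	DoNotSchedule  UnsatisfiableConstraintAction = "DoNotSchedule"
	ScheduleAnyway UnsatisfiableConstraintAction = "ScheduleAnyway"
)

type TopologySpreadConstraint struct {
	MaxSkew           int32
	MinDomains        *int32
	WhenUnsatisfiable UnsatisfiableConstraintAction
}

// validateMinDomains enforces the rule above: a non-zero minDomains may only
// be combined with whenUnsatisfiable=DoNotSchedule, and negative values are rejected.
func validateMinDomains(c TopologySpreadConstraint) error {
	if c.MinDomains == nil || *c.MinDomains == 0 {
		return nil // unset or default value: nothing to check
	}
	if *c.MinDomains < 0 {
		return errors.New("minDomains must not be negative")
	}
	if c.WhenUnsatisfiable != DoNotSchedule {
		return errors.New("minDomains is only allowed with whenUnsatisfiable=DoNotSchedule")
	}
	return nil
}

func main() {
	min := int32(3)
	err := validateMinDomains(TopologySpreadConstraint{MaxSkew: 1, MinDomains: &min, WhenUnsatisfiable: ScheduleAnyway})
	fmt.Println(err) // minDomains is only allowed with whenUnsatisfiable=DoNotSchedule
}
```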

#### Alpha (v1.24):

- [ ] Add new parameter `MinDomains` to `TopologySpreadConstraint` and feature gating.
- [ ] Score extension point implementation.
Member

remove this line

@sanposhiho
Member Author

@alculquicondor Thanks. Updated per your suggestions.

@alculquicondor
Member

/approve

Please squash commits :)

ping @wojtek-t for PRR

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 14, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2022
@sanposhiho
Member Author

sanposhiho commented Jan 14, 2022

Squashed 👍

Member

@wojtek-t wojtek-t left a comment

Just two nits - other than that it lgtm.


###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled in Alpha and Beta versions.
Member

Add sth like:

"The feature can be disabled by restarting kube-apiserver and kube-scheduler with feature-gate off".


###### Are there any tests for feature enablement/disablement?

There should be unit and integration tests.
Member

No - tests will be added.

@sanposhiho
Member Author

@wojtek-t
Thanks, updated per your suggestions.

@wojtek-t
Member

It will need more work for Beta, but it's good for alpha from PRR perspective.

/lgtm
/approve PRR

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 14, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, sanposhiho, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants