This repository has been archived by the owner on Sep 18, 2020. It is now read-only.

Add annotation to exclude nodes from being schedulable #122

Closed
wants to merge 1 commit

Conversation

@rajiteh

@rajiteh rajiteh commented Jul 25, 2017

This PR introduces a new annotation that cluster admins can add to specific nodes to prevent them from being marked schedulable by the agent. This is especially useful when you have certain nodes that only run master components (i.e. labeled node-role.kubernetes.io/master=true) and still want them to participate in the update orchestration.

@coreosbot

Can one of the admins verify this patch?

@dghubble
Member

We already have a plan in place that allows users to control the update-agent DaemonSet and where it runs (e.g. a node selector on container-linux-update.v1.coreos.com/agent=true is one option).

Adding a knob so an admin can add a label to have the agent still run, but not mark a node schedulable again, doesn't seem to buy much you couldn't get by just editing the agent=true/false toggle. In both cases, the admin marking the node down for maintenance (or Unschedulable for other reasons) has to add a special label; the difference here is that the Container Linux version can still upgrade. I worry slightly about how this will play out with plans to health-check that workloads can run post-reboot, and about adding complexity to the pre-reboot / post-reboot hooks design for custom health-checker pods (#121). There's also the case that if this is set on nodes erroneously, no issues occur immediately, but the next upgrade leaves all nodes unschedulable.

I'd like to know more about the use case to be convinced it's worth the complexity cost of having another knob. Would having fine-grained control over where update-agent runs be sufficient?
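For illustration, here is one way the node-selector option mentioned above could look, sketched with k8s.io/api types rather than the actual CLUO manifest; the surrounding DaemonSet wiring is assumed, not shown.

```go
// Illustrative sketch only.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// With this nodeSelector on the update-agent DaemonSet's pod template, the
	// agent is only scheduled onto nodes the admin has labeled
	// container-linux-update.v1.coreos.com/agent=true; unlabeled nodes never
	// run the agent and are never cordoned or un-cordoned by it.
	podSpec := corev1.PodSpec{
		NodeSelector: map[string]string{
			"container-linux-update.v1.coreos.com/agent": "true",
		},
	}
	fmt.Println(podSpec.NodeSelector)
}
```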

@euank
Contributor

euank commented Jul 25, 2017

@dghubble I think you misunderstand the issue.

Right now, the update-agent, after rebooting a node, "un-cordons" it (marks it schedulable).

That's not desirable for nodes that previously were marked unschedulable by the administrator.

The simple goal here is "I want to update my master nodes with CLUO, but not have them become schedulable".

I think the right solution isn't adding a new knob, but rather "remembering" whether a node was unschedulable previously and restoring it to that previous state after the update (whether that be uncordoning it in the default case, or leaving it unschedulable if it already was before the drain).

@euank
Contributor

euank commented Jul 25, 2017

@rajiteh What do you think of adding a "node-was-unschedulable" annotation that is set to true/false just before the "drain", and then restoring the unschedulable state based on that annotation afterwards?

If that covers your use-case, I'd prefer that change to adding a new config option.
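For concreteness, a minimal client-go sketch of the restore side of that idea, run after the reboot: un-cordon only if the annotation recorded before the drain says the node was schedulable at that point. The annotation key, package name, and the current client-go signatures are assumptions for illustration, not CLUO's actual code.

```go
// Package agent: illustrative sketch only.
package agent

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// wasUnschedulableAnnotation is a hypothetical key, written just before the drain.
const wasUnschedulableAnnotation = "container-linux-update.v1.coreos.com/was-unschedulable"

// restoreSchedulable runs after the reboot. It un-cordons the node only if the
// annotation says the node was schedulable before the agent cordoned it;
// otherwise it leaves the admin's cordon in place.
func restoreSchedulable(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = node.Annotations[wasUnschedulableAnnotation] == "true"
	delete(node.Annotations, wasUnschedulableAnnotation) // clean up the bookkeeping
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```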

@rajiteh
Author

rajiteh commented Jul 25, 2017

> I think the right solution isn't adding a new knob, but rather "remembering" whether a node was unschedulable previously.

I thought about that approach too, but decided it's better to be explicit than implicit. It also brings in more complexity, e.g. when a previously CLUO-labelled node gets cordoned by the admin.

@rajiteh
Author

rajiteh commented Jul 25, 2017

@euank wouldn't a node be unschedulable regardless after an update reboot? Then we'd need a first-run check. Either way, it still makes it impossible to manually cordon a node after it was "claimed" by the agent.

@dghubble
Member

dghubble commented Jul 25, 2017

@euank Yes, that's my understanding as well and aligns with the comments here. I have the same concerns and desire to hear the usage scenario.

If a node has been marked unschedulable by an administrator, that's typically to address a problem. When I'm in this scenario, it makes sense to remove update-agent from the node entirely while debugging the problem (or, likely, mask update-engine). In that scenario, there isn't a need for a special mode that allows the node to keep being part of coordinated upgrades (let alone do the "right" thing with schedulability). Other mechanisms handle this scenario well enough.

Now, that's just one usage scenario. What's the usage scenario justifying this knob? How would it be described in docs to the admin? It has downsides, as I mentioned. This issue gets at the heart of whether a cluster upgrade encompasses ensuring nodes are in a healthy, schedulable state (#121 allows users to define "health"), or whether a complete upgrade can occur even though, in theory, no nodes are successfully doing work.

IMO, if we do decide this is a concern, having no label or knob but instead always restoring the prior Schedulable state (like @euank said) seems the most "right".

@euank
Contributor

euank commented Jul 26, 2017

@rajiteh
What I'm suggesting is that the agent stores, as a new annotation (meant as an internal implementation detail, not ever to be touched), whether the agent moved the node from schedulable -> unschedulable during a drain operation.

This happens late enough that it will, in general, pick up admin changes: if the agent runs, then the admin cordons the node, and a reboot operation occurs afterwards, the agent would still leave the node cordoned. It need not be all that racy, either; the agent can update both unschedulable and the new annotation in a single atomic update of the node object, which would fail if the administrator had cordoned/uncordoned in the meantime.
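A hedged sketch of that single atomic write, again using current client-go signatures and the same hypothetical annotation key as the earlier sketch; the conflict check stands in for whatever retry/re-evaluate policy the agent would actually adopt.

```go
// Illustrative sketch only; annotation key and signatures are assumptions.
package agent

import (
	"context"
	"fmt"
	"strconv"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const wasUnschedulableAnnotation = "container-linux-update.v1.coreos.com/was-unschedulable"

// recordAndCordon writes the "was unschedulable" annotation and sets
// Spec.Unschedulable in one Update of the node object. Because the Update
// carries the resourceVersion from the Get, it fails with a Conflict error if
// an admin cordoned or un-cordoned the node in between, instead of silently
// overwriting that change.
func recordAndCordon(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[wasUnschedulableAnnotation] = strconv.FormatBool(node.Spec.Unschedulable)
	node.Spec.Unschedulable = true

	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		if apierrors.IsConflict(err) {
			return fmt.Errorf("node %s changed while preparing to drain, re-evaluate: %w", nodeName, err)
		}
		return err
	}
	return nil
}
```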

I don't think it adds all that much complexity and I think that it ends up being significantly less surprising behavior than having to learn of a new unschedulable configuration when using the agent.

@dghubble Apologies for my hasty and incorrect reading of your comment. Whoops!

That being said, I think the historical use of "unschedulable" on master nodes by most Kubernetes deployment tools has removed much of the meaning of that taint. At this point we basically have to treat that taint as not indicating a real problem, but rather as (master node || drain in progress).

Perhaps the proper solution would even be for us to acknowledge that the taint is overused and simply use our own taint, a "container-linux-updating" taint, which the agent could then tolerate, as could any other important components... However, since AFAIK there isn't a way to tolerate all taints, the fact that master components currently tolerate the unschedulable taint means we're kinda stuck with it.

Apologies for that last bit being kinda thinking-out-loud.

@dghubble
Member

For the controllers / masters, CoreOS clusters run the kubelet with --register-with-taints=node-role.kubernetes.io/master=:NoSchedule btw.
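That flag registers the controller kubelets with a NoSchedule taint, so only pods that explicitly tolerate it can land there. A hedged sketch of the toleration such a pod spec (for instance, the update-agent DaemonSet's pod template) would carry, expressed with k8s.io/api types for illustration:

```go
// Illustrative sketch only.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Pods without this toleration are not scheduled onto controllers that the
	// kubelet registered with node-role.kubernetes.io/master=:NoSchedule.
	podSpec := corev1.PodSpec{
		Tolerations: []corev1.Toleration{{
			Key:      "node-role.kubernetes.io/master",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoSchedule,
		}},
	}
	fmt.Printf("%+v\n", podSpec.Tolerations)
}
```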

@rajiteh
Author

rajiteh commented Jul 26, 2017

@dghubble My use case specifically involves the fact that we have dedicated "master" nodes, which only run the api-server, scheduler, kube-proxy, and controller-manager. Our provisioning scripts create Pod manifests for these services at /etc/kubernetes/manifests on each "master" node and start kubelets with the arguments --enforce-node-allocatable='' and --register-schedulable=false. This is a pattern described in the docs here: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

When the CLUO agent is deployed to the cluster as a DaemonSet, it still creates a pod on these nodes and ends up un-cordoning them. While we want our "master" nodes to be managed by CLUO, we don't want to make them available for scheduling to the rest of the cluster.

EDIT: added a bit more info on kubelet bootstrapping.

@rajiteh
Author

rajiteh commented Jul 26, 2017

@euank Ah, sorry, I didn't understand your comment at first. I think your approach makes sense. It satisfies my requirements, so I'm good with that.

@dghubble I see. --register-with-taints=node-role.kubernetes.io/master=:NoSchedule is something we haven't looked into. Perhaps that will resolve this problem (at least for us) without any changes?

Again, thank you for your input on this.

@dghubble
Member

dghubble commented Jul 26, 2017

CoreOS example clusters and Tectonic clusters use that --register-with-taints flag on controller kubelets and aren't seeing behavior where workloads start running on controllers after an upgrade. That'd be a bug, and we'd want to solve it outright rather than requiring annotations/labels.

@rajiteh yes, I'd be curious to know if this is a viable solution for you. It's more of a whitelisting style of isolation, where each control plane component that is permitted to run on the controller "tolerates" the taint. However, I don't have a sense of how many folks outside CoreOS are isolating their controllers and worker workloads in this way.

It sounds like we've made an undocumented assumption about how users are doing workload isolation; we should mention it in the requirements, and maybe add a few notes about the differences between tainting controllers and marking them unschedulable.

@rajiteh
Author

rajiteh commented Jul 26, 2017

@dghubble @euank I do think it's still a valid use case that CLUO shouldn't un-cordon nodes that it didn't cordon. I'm closing this PR because, as suggested, there are better ways of achieving that than using annotations. Also, my specific problem is solved by using taints, so no changes were necessary.

Thank you for your time and input! Cheers!

@rajiteh rajiteh closed this Jul 26, 2017