update-operator: Require before and after reboot annotations #121
Conversation
pkg/operator/operator.go
Outdated
err = k8sutil.UpdateNodeRetry(k.nc, n.Name, func(node *v1api.Node) {
if !wantsRebootSelector.Matches(fields.Set(node.Annotations)) {
glog.Warningf("Node %v no longer wanted to reboot while we were trying to mark it so: %v", node.Name, node.Annotations)
return
Should this `return` be a `continue`?
This line is part of an anonymous function that starts on line 317. It's a little weird, but this node annotation update needs to make sure the node still wants to reboot before it gets marked to do so.
pkg/action/action.go
Outdated
const (
// ActionsSucceeded is returned when all actions have set their configured
// annotation to true
ActionsSucceeded ActionsResult = iota
I would suggest using something which is not the success status code as the default value for this custom type, otherwise any dumb programming bug may end up with false-positive results:
var a ActionsResult
fmt.Printf("%t", a == ActionsSucceeded)
What about swapping the order of the two values or introducing an `ActionUnknownStatus` as the default value?
I'm going to add an ActionsUnknown value. That way the zero value is not something which is used to represent an actual state, and instead represents the absence of that state. Right now it does the same thing, since everywhere that uses this value just checks result == ActionsSucceeded
pkg/action/action.go
Outdated
// true - the action has completed successfully
// anything else - the action has not yet completed
// if all of the annotations exist and are true, return ActionsSucceeded
// if any action dne or is not set to true, return ActionsPending
Please spell out "dne".
@@ -103,6 +97,12 @@ func (k *Klocksmith) process(stop <-chan struct{}) error {
return err
}

// we are schedulable now.
glog.Info("Marking node as schedulable")
if err := k8sutil.Unschedulable(k.nc, k.node, false); err != nil {
Related to #122, do (and should) we have some way to ensure we are only marking schedulable nodes that we previously marked as unschedulable (i.e. we are not overriding some unrelated user marking)?
That is an interesting question. It seems like we just indiscriminately mark nodes schedulable and not schedulable for our own uses right now, and that is something that is worth thinking about. I do think it's out of scope for this particular change though.
Derp, sorry. I actually did forget my glasses at home today.
@sdemos ack, let's move this to its own ticket/PR. Perhaps there is already one, I didn't check (my bad). Can you take care of that?
I've left some initial comments inline but I'll do a deeper pass at a later point. Regarding your two WIP points: "yes please" to both, and we use cobra extensively in all other projects.
Here is a specific use-case for this, for your own inspiration: When CL is upgraded on a node, the torcx portion of the agent will need to make sure the appropriate docker addon is fetched. This has the following requirements:
It would be helpful if the new CL version was an argument, but it's not absolutely necessary
Force-pushed cf13ee1 to ae329cb (compare)
Found out that flagutil in coreos/pkg has a slice argument flags extension already in it. I'm just using that instead of moving to a new library for now.
Force-pushed ce759ea to 84226c7 (compare)
Just an initial pass, I want us to improve the nomenclature here.
cmd/update-operator/main.go
Outdated
@@ -15,6 +15,8 @@ import (
)

var (
preRebootActions flagutil.StringSliceFlag
these are annotations, not "actions" (let's avoid this term in the real implementation, it doesn't mean much)
cmd/update-operator/main.go
Outdated
@@ -25,12 +27,16 @@ var (
)

func main() {
flag.Var(&preRebootActions, "require-pre-reboot", "list of comma-separated kubernetes annotations required to be set to 'true' before the operator will authorize a reboot")
Drop mention of authorization.
before-reboot-annotations - "List of comma separated Kubernetes node annotations that must be set to true before a reboot is allowed."
cmd/update-operator/main.go
Outdated
@@ -25,12 +27,16 @@ var (
)

func main() {
flag.Var(&preRebootActions, "require-pre-reboot", "list of comma-separated kubernetes annotations required to be set to 'true' before the operator will authorize a reboot")
flag.Var(&postRebootActions, "require-post-reboot", "list of comma-separated kubernetes annotations required to be set to 'true' before the operator will consider a reboot successful")
after-reboot-annotations - "List of comma separated Kubernetes node annotations that must be set to true before a node is marked schedulable and the operator lock is released."
examples/post-reboot-action.yaml
Outdated
@@ -0,0 +1,29 @@
apiVersion: extensions/v1beta1
Examples in a separate PR which has a tutorial and docs?
pkg/action/action.go
Outdated
}

// New creates a new action with the given annotation
func New(annotation string) *Action {
I thought we were using Action as a placeholder phrase in design docs, not as the actual implementation name. It's rather vague, especially because CLUO is not actuating the actions; it's simply waiting for annotation requirements to be fulfilled.
pkg/constants/constants.go
Outdated
@@ -51,6 +51,14 @@ const (
// It is an opaque string, but might be semver.
AnnotationNewVersion = Prefix + "new-version"

// Key set by the update-operator when pre- and post-reboot actions should run
LabelRebootAction = Prefix + "reboot-action"
Again, I didn't think we were actually exposing "action" in the label names. Why not use something the user would expect like:
nodeSelector:
container-linux-update.v1.coreos.com/before-reboot: "true"
or
nodeSelector:
container-linux-update.v1.coreos.com/after-reboot: "true"
pkg/operator/operator.go
Outdated
return fmt.Errorf("Failed listing nodes %v", err)
}

preRebootReq, err := labels.NewRequirement(constants.LabelRebootAction, selection.In, []string{constants.RebootActionPre})
I added a little helper to construct Requirements at compilation time
pkg/operator/operator.go
Outdated
// take some number of the rebootable nodes.
// remove any old action annotations and add the reboot-action=pre label.
// this starts the pre-reboot actions.
err = k.markPreRebootActionable()
I can see how method names like these developed, but looking at it with fresh eyes, it seems nonsensical. If we went with the nomenclature I mentioned, this becomes `markBeforeReboot()`, which adds a before-reboot=true annotation.
pkg/operator/operator.go
Outdated
}

// Verify no nodes are still in the process of rebooting to avoid rebooting N > maxRebootingNodes
postRebootReq, err := labels.NewRequirement(constants.LabelRebootAction, selection.In, []string{constants.RebootActionPost})
If all parts of the requirement are constants or known upfront, check out `MustRequirement` and a var declaration.
pkg/operator/operator.go
Outdated
// find nodes which just rebooted
justRebootedNodes := k8sutil.FilterNodesByAnnotation(nodelist.Items, justRebootedSelector)
// also filter out any nodes that are already labeled with reboot-action=post
notPostRebootReq, err := labels.NewRequirement(constants.LabelRebootAction, selection.NotIn, []string{constants.RebootActionPost})
same
pkg/action/action.go
Outdated
// anything else - the action has not yet completed
// if all of the annotations exist and are true, return ActionsSucceeded
// if any action does not exist or is not set to true, return ActionsPending
func ActionsCompleted(node v1api.Node, actions []*Action) ActionsResult {
In offline conversation we talked about removing this package, removing actions, and just having something kinda like `HasAnnotations(node, annotations []Annotations) bool`, or maybe just with strings.
Force-pushed c44afc9 to 54f155b (compare)
v1api "k8s.io/client-go/pkg/api/v1"
)

// NewRequirementOrDie wraps a call to NewRequirement and panics if the Requirement
// cannot be created. It is intended for use in variable initializations only.
func NewRequirementOrDie(key string, op selection.Operator, vals []string) *labels.Requirement {
In the standard library, they name these functions MustThing like MustTemplate, MustParse, etc. They also wrap the return (Object, error), not the constructor as constructor funcs can change over time.
I prefer you keep the previous MustRequirement function which had this in mind.
There is a method in the Kubernetes fields package called `ParseSelectorOrDie` that I was emulating with this call, since we use that one to construct selectors as well.
As for the explicit arguments vs the return, I felt like it was cleaner to just have one function call vs two, and the type system will trivially catch any issues with modifications to the underlying constructor that we would have to care about in this function.
sure
pkg/operator/operator.go
Outdated
return nil
}

func hasAnnotations(node v1api.Node, annotations []string) bool {
add method comment about this being an AND or maybe call it hasAllAnnotations
pkg/operator/operator.go
Outdated
// annotations are set. if all annotations are set to true then remove the
// before-reboot=true label and set reboot-ok=true, telling the agent it's
// time to reboot.
err = k.checkBeforeReboot()
Move the comment down to be the formal method comment. In the process, let's more clearly define the error - the current handling seems unexpected
Discussed. It's clear that these 4 state transition functions can run in different orders; the current ordering is coincidental, and errors are divided between serious errors worth waiting for the process loop and those that are more benign.
I'd favor these methods returning any error that occurs, though. It's easier to explain and to write tests if the errors are returned rather than being log lines. Waiting for the process loop shouldn't be too common, since the places we can error now already have retries baked in, and if we switch to Watches in the future, the fixed period goes away anyway. If we instead plan for them to be 4 independent loops, it'd be OK to spawn 4 goroutines for them, that way nobody starts depending on their ordering.
I like the idea of putting them in goroutines to enforce the fact that the ordering doesn't matter, but I feel like that is asking for weird race condition bugs. The ordering of the state checking and modification doesn't matter, but it would take a lot more work to make us confident that the functions are safe to run concurrently.
I agree with the error returning though. I'll change that right now.
Force-pushed 54f155b to 8594274 (compare)
A few changes pushed just now - all functions called by the
pkg/operator/operator.go
Outdated
delete(node.Labels, constants.LabelBeforeReboot)

// make sure that the node still wants to reboot
if !wantsRebootSelector.Matches(fields.Set(node.Annotations)) {
Maybe we just want to handle this during ordinary reconciliation. If we ever see a node that has reboot-ok=true, but no longer wants a reboot, we "unpick" it - seems like this might happen independent of this block of code
We'd also want to clean up before-reboot annotations that may have been applied. Putting before-reboot and after-reboot labels on a node kinda starts the window of time where custom pods add annotations - seems like we should promise to run a cleanup method at the end of that window (i.e. when removing the before-reboot and after-reboot labels).
The advantage of checking if it no longer wants to reboot here is that it checks at the last possible second, so there isn't a possibility of race conditions. We can also check it in the main loop, but we should still continue to check it here I think.
As far as cleaning up before-reboot and after-reboot annotations, it's currently doing it right before it applies the respective label to the node, to ensure an entirely clean environment, and that there aren't any old annotations lying around from the last run. It seems to me like if we only clean them up when we remove the label after completion, there is a possibility that a stray annotation might make it into the mix. Maybe I'm being paranoid.
My main worry is that if this inconsistency can occur here, it can occur at other times too so we should patch it up in general reconciliation
> isn't a possibility of race conditions

I mean, it's still best effort here. Checking at the last second still means the live state can change before you put the label on. We're reducing the likelihood.
yeah, I suppose that's true. We can add a function that makes sure everything is still in a valid state, ie we shouldn't have the `before-reboot` label if `reboot-needed=false`. That would bring `checkBeforeReboot` more in line with `checkAfterReboot`, which would be nice.
pkg/operator/operator.go
Outdated
// annotations are set. if all annotations are set to true then remove the
// before-reboot=true label and set reboot-ok=true, telling the agent it's
// time to reboot.
glog.V(4).Info("Checking for existence of configured before-reboot annotations")
nit: we check they're true, not that they exist
pkg/operator/operator.go
Outdated
chosenNodes := make([]*v1api.Node, 0, remainingRebootableCount)
for i := 0; i < remainingRebootableCount && i < len(rebootableNodes); i++ {
chosenNodes = append(chosenNodes, &rebootableNodes[i])
}

// set before=reboot=true for the chosen nodes
nit: `=` -> `-`
pkg/operator/operator.go
Outdated
// If there is an error getting the list of nodes or updating any of them, an
// error is immediately returned.
func (k *Kontroller) markBeforeReboot() error {
nodelist, err := k.nc.List(v1meta.ListOptions{})
It'd be better if these accepted the node list instead of each function doing its own fetch each time. Seems required right now since you're re-fetching 4 times to get the changes at the 4 stages
I have some ideas around that, but I think that for now we should just keep the list call in this function and maybe change it later.
pkg/operator/operator.go
Outdated
for _, n := range chosenNodes {
err = k.mark(n.Name, constants.LabelBeforeReboot, k.beforeRebootAnnotations)
if err != nil {
return fmt.Errorf("Failed to mark node for before reboot checks: %v", err)
Something more like the others: "Failed to annotate node %q with %s", node.Name, "before-reboot=true"
This is just to give context of whether it's doing the before-reboot or after-reboot checks. That information is provided in the error returned by `mark`. I'll make sure the error in `mark` includes the label name though.
Can we just be consistent in the logs and say "annotating node" or "labeling node". I'm mainly objecting to "mark" since it doesn't mean anything to a Kubernetes user
pkg/operator/operator.go
Outdated
if !wantsRebootSelector.Matches(fields.Set(node.Annotations)) {
glog.Warningf("Node %v no longer wanted to reboot while we were trying to mark it so: %v", node.Name, node.Annotations)
return
func (k *Kontroller) mark(n string, label string, annotations []string) error {
Can this take the node `v1api.Node` instead of `n string` please, or change `n` to `nodeName`.
pkg/operator/operator.go
Outdated
return
}

glog.V(6).Infof("Found nodes: %+v", nodelist.Items)
// find nodes which just rebooted but haven't run after-reboot checks.
// remove any old annotations and add the after-reboot=true label.
"any old annotations" -> before-reboot annotations
Replace throughout
pkg/operator/operator.go
Outdated
glog.V(6).Infof("Found nodes: %+v", nodelist.Items)
// find nodes which just rebooted but haven't run after-reboot checks.
// remove any old annotations and add the after-reboot=true label.
glog.V(4).Info("Marking rebooted nodes with after-reboot label")
nit: Can we consistently say "Labeling ..." or "Annotating ..." in log messages?
pkg/operator/operator.go
Outdated
}

// for all the nodes which just rebooted, remove any old annotations and add the after-reboot=true label
by any old annotations, you mean after reboot annotations from a prior post reboot window?
yes
what about the idea we call one cleanupAnnotations method any place in code we remove the before-reboot or after-reboot label? That way the user's nodes don't have annotations lying around once the node is happily in steady state
and sure, you could still call cleanup here to be defensive and not assume they were cleanup on the prior run
glog.V(4).Infof("Deleting label %q for %q", constants.LabelAfterReboot, n.Name)
glog.V(4).Infof("Setting annotation %q to false for %q", constants.AnnotationOkToReboot, n.Name)
err = k8sutil.UpdateNodeRetry(k.nc, n.Name, func(node *v1api.Node) {
delete(node.Labels, constants.LabelAfterReboot)
Similar to above, where do we wanna cleanup the after-reboot annotations? I'm missing that part
Force-pushed 8594274 to 88c4cce (compare)
pkg/operator/operator.go
Outdated
return fmt.Errorf("Failed to mark node for before reboot checks: %v", err)
}
glog.Infof("Waiting for configured before-reboot annotations on node %q", n.Name)
glog.V(4).Infof("Waiting for the following annotations: %v", k.beforeRebootAnnotations)
This log line doesn't make sense when people aren't using this feature:
Waiting for configured before-reboot annotations on node 10.132.109.209
beforeRebootNodes := k8sutil.FilterNodesByRequirement(nodelist.Items, beforeRebootReq)
rebootingNodes = append(rebootingNodes, beforeRebootNodes...)
afterRebootNodes := k8sutil.FilterNodesByRequirement(nodelist.Items, afterRebootReq)
rebootingNodes = append(rebootingNodes, afterRebootNodes...)
Changes to the logic here seem to cause a node to get stuck being considered to be rebooting. Investigating.
I0803 22:04:43.194547 1 operator.go:515] Found 1 rebooted nodes, setting label "container-linux-update.v1.coreos.com/after-reboot" to "true"
I0803 22:04:43.301646 1 operator.go:524] Waiting for configured after-reboot annotations on node "10.132.109.209"
I0803 22:04:43.696275 1 operator.go:457] Found node "10.132.109.209" still rebooting, waiting
I0803 22:04:43.696338 1 operator.go:459] Found 1 (of max 1) rebooting nodes; waiting for completion
Actually, I think this is just the oddity of the logging.
I don't think this is a new issue, but we should indicate the lack of rebooting nodes if we're going to info log the count of rebooting nodes. It gives the appearance of being stuck at 1.
else {
glog.Infof("Found no rebooting nodes")
}
or perhaps log to indicate the next step has begun instead
glog.Infof("Finding nodes indicating a reboot is needed")
I'm just going to remove the if statement entirely so that it always logs how many rebooted nodes it found. That's the same thing that mark before reboot does too.
pkg/operator/operator.go
Outdated
if err != nil {
return fmt.Errorf("Failed to mark node for before reboot checks: %v", err)
}
glog.Infof("Waiting for configured before-reboot annotations on node %q", n.Name)
Let's just merge these two log lines and list the annotations in the normal case. Also, you can drop "configured", its assumed
pkg/operator/operator.go
Outdated
delete(node.Labels, constants.LabelAfterReboot)
// cleanup the after-reboot annotations
for _, annotation := range k.afterRebootAnnotations {
delete(node.Annotations, annotation)
Verbose log these deletions
pkg/operator/operator.go
Outdated
delete(node.Labels, constants.LabelBeforeReboot)
// cleanup the before-reboot annotations
for _, annotation := range k.beforeRebootAnnotations {
delete(node.Annotations, annotation)
Verbose log these deletions
I've tested this out manually annotating pods and tested the default behaviors without this feature. Logic looks good, I think it's just usability nits now.
Force-pushed 350e022 to 3015570 (compare)
This lets us more easily verify each individual step, and the ordering of the steps as a whole unit. It also makes it easier to add steps to the process loop
Also renamed it from MustRequirement and changed the signature to encapsulate the additional call to NewRequirement.
This commit adds pre- and post-reboot actions. With these changes, you will be able to configure actions which get run before a reboot is initiated and after one is completed, effectively gating it on either side. The actions taken are entirely arbitrary, and software that implements them only needs to be able to manipulate kubernetes annotations and read labels. The changes ended up being a little more of an overhaul than I originally intended, but I'm happy with the outcome.
Force-pushed 3015570 to f140cea (compare)
fixed usability nits and rebased onto master. @dghubble
This PR adds pre- and post-reboot actions.
With these changes, you will be able to configure actions which get run before a reboot is initiated and after one is completed, effectively gating it on either side. The actions taken are entirely arbitrary, and software that implements them only needs to be able to manipulate kubernetes annotations and read labels.
The changes to the `update-operator` ended up being a little more of an overhaul than I originally intended. With the addition of the new behavior, the `update-operator` has, broadly speaking, four state changes which it is responsible for. I broke the process function out into four functions, one for each of those state changes, to make it easier to understand, validate, and modify in the future.
The `update-agent` has been modified not to mark a node schedulable until the post-reboot actions are complete. Currently, the node continues to be marked as schedulable until it starts draining, which is after the pre-reboot actions are run. It is possible to modify it to mark it as unschedulable while running pre-reboot actions, but it would require more work and I don't think it's necessary to do that. If someone can think of a good reason to mark it unschedulable please disagree with me.

I will include documentation around the usage in a separate PR, as well as examples.