WIP: Add Change Management and Maintenance Schedules #1571

Open. jupierce wants to merge 12 commits into master from maintenance_schedules_enhancement.
Conversation

jupierce
Contributor

No description provided.

Contributor

openshift-ci bot commented Feb 23, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jupierce. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jupierce jupierce changed the title Add Change Management and Maintenance Schedules WIP: Add Change Management and Maintenance Schedules Feb 23, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 23, 2024
Contributor

@dhellmann left a comment

I'm concerned about the complexity of the first 2 workflows. I think everything here will work as described, but having 2 separate operators in the system evaluating schedules and rules about what updates are allowed when seems more complicated than having 1 thing toggling the on/off flag for those components. Centralizing the schedule management means we only need 1 thing to track time, and everything else can just watch its input API for changes, including to that on/off flag, to know what it needs to be doing at any given time. It also gives us the option of delivering that centralized thing separately (in an operator, ACM, OCM, etc.) or the owner of self-managed clusters providing their own implementation applying some rules we wouldn't want to have to support.

When this expectation is violated because of problematic user workloads, the update process is
often called into question. For example, if an update appears stalled after 12 hours, a
user is likely to have a poor perception of the platform and open a support ticket before
successfully diagnosing an underlying undrainable workload.
Contributor

This feels like a bug in our existing upgrade status reporting.

Contributor Author

We're working to simplify the experience for operations teams on several fronts:

  1. oc adm update status, which will provide a single place to find detailed status information and SOPs.
  2. Another enhancement I will be writing, where node drain timeouts are an easily enabled part of a fully managed update.
  3. This enhancement, which:
     a. Creates the foundation for control-plane and worker-node separation, which will be exploited by other oc commands to support diverse personas.
     b. Makes this type of issue less disruptive. Even if a drain takes 12 hours, you don't have to sit there and watch it. When it finishes, if it is outside your maintenance schedule, your cluster will not be subject to any more disruption.


By acknowledging the need for scheduled maintenance in the platform, we reduce the need for Service
Delivery to develop and maintain custom tooling to manage the platform while
simultaneously simplifying management for all customers facing similar challenges.
Contributor

If those tools already exist, could we just take over maintenance of those tools instead of moving the implementation of the feature?

Contributor Author

I'm not an authority on its implementation, but I believe it has aspects spread across OCM, Hive, and MUO. I think the implementation would also differ for HCP.
Assuming we could take it over, it would not facilitate the separation of control-plane and worker-node updates since it only creates a window for an entire cluster update.

Member

Some of this also lives in managed-cluster-config (MCC), which is plumbed through Hive, and it's not a great experience to manage. We have to pick a set of versions to which a configuration can apply and then react when the cluster version changes. We're in the process of reworking MCC so this is less painful, but it is ultimately still not a great experience. Having the ability to just apply a change to a MachineConfig etc. would be far better in my opinion.
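To make the kind of change being discussed concrete, here is a hedged sketch of a MachineConfig dropping in a crio PID-limit override; the name, file path, and contents are illustrative only, not something this PR defines:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-pids-limit                      # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/crio/crio.conf.d/01-pids-limit.conf   # hypothetical drop-in
          mode: 420
          contents:
            # URL-encoded "[crio.runtime]\npids_limit = 4096\n"
            source: data:,%5Bcrio.runtime%5D%0Apids_limit%20%3D%204096%0A
```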

1. Because not all worker-nodes have been updated, changes are still reported as pending via metrics for NodePool. **TODO: Review with HyperShift. Pausing progress should be possible, but a metric indicating changes still pending may not since they interact only through CAPI.**
1. The HCP cluster runs with worker-nodes at mixed versions throughout the month. The N-1 skew between the old kubelet versions and control-plane is supported.
1. **TODO: Review with Service Delivery. If the user requested another minor bump to their control-plane, how does OCM prevent unsupported version skew today?**
1. On the next first Saturday, the worker-nodes updates are completed.
Contributor

This detailed workflow narrative is amazing!


### Implement maintenance schedules via an external control system (e.g. ACM)
We do not have an offering in this space. ACM is presently devoted to cluster monitoring and does
not participate in cluster lifecycle.
Contributor

We could add it there, though, right? Or as an OLM-installed operator?

Contributor Author

Yes, it could. The conclusion presented here was with guidance from Derek.

cluster roll out a bad configuration change without knowing exactly how to stop the damage
without causing further disruption. This is not a moment when you want to be figuring out how to format
a date string, calculating timezones, or copying around cluster configuration so that you can restore
it after you stop the bleeding.
Contributor

What is driving the request for this feature, managed services or users with self-managed clusters?

Contributor Author

It unfolded as follows (see motivation section for more depth on each point):

  1. There is a negative perception of upgrades due to worker nodes being entangled with the end-to-end process. It also forces unnecessary node maintenance and risk onto customers.
  2. We can eliminate those problems and empower users by separating the control-plane and worker-node updates.
  3. @derekwaynecarr raised the concern that doing so would double the operational burden. He also gave guidance that whatever improvements we make in standalone should also land in HCP.
  4. Maintenance windows were identified as a tool which could prevent new operational burden for customers who may not benefit from the separation. Maintenance windows are also a common configuration in other managed Kubernetes solutions.
  5. SRE, as part of all these conversations, indicated that product-level support for maintenance windows would help simplify their existing implementation.
  6. Exploration of all the other benefits of change management & scheduling which simplify operations.

So, item (1) was the genesis (particularly in standalone since SRE manages to insulate customers from gaps today), but we found additional value along the path.

Comment on lines 183 to 184
resources, but material change to the cluster will not be initiated until a time permitted
by the Maintenance Schedule. Users do not require special insight into the implications of
Member

To be clear on expectations, does this include that the material change will not be applied if a node reboots for other reasons or a node is created (e.g. MHC replacement, scale up, a new MachineSet / Machine)?

Contributor Author

Good catch. I need to make this explicit -- there is no guarantee. The schedule assures that no material change is initiated, but the state of rebooted nodes / new nodes is dependent on the profile and state of rollout. For example, in standalone, if a new release payload is targeted by the CVO, but cluster change management is paused, the CVO will not roll out machine configs which ultimately inform new nodes joining the cluster. Thus, new nodes joining the cluster will receive the old configuration.
If, however, the MCP is halfway through updating old nodes when change management is paused, new nodes joining the cluster will receive the new configuration.

Contributor Author

Adding something like the following to the summary in the next push:

Change management enforcement _does not_ attempt to define or control the detailed state of the
system. It only pertains to whether controllers which support change management 
will attempt to initiate material change themselves. For example, if changes are paused in the middle
of a cluster update and a node is manually rebooted, change management does not define
whether the node will rejoin the cluster with the new or old version.

Contributor

Can we make that more deterministic somehow? How hard would it be to apply a simple rule like "new nodes will always be deployed with the latest config" or "new nodes will always be deployed with a consistent configuration"?

Contributor Author

"new nodes will always be deployed with a consistent configuration"

That isn't hard. There will be a distinct moment where the MCS will start serving the new vs the old.

"new nodes will always be deployed with the latest config"

Since we need to implement this for multiple controllers, the safe bet is to leave that choice up to the implementation. Intuitively, any control loop can be paused, so, as written, the enhancement should be applicable across any controller. However, to provide this guarantee, we would need to say something like "no external sign of inconsistency can be visible until reconciliation is complete".

There is inherent friction with eventual consistency in that statement.

Consider the questions it raises in standalone:

  • In a paused state with MachineConfig changes pending, there is a clear difference between "new" vs "old" config. Since the "new" is not incorporated into an MCP yet, new machines booted into the cluster must necessarily receive the old.
  • Once the CVO unpauses and introduces a MachineConfig update, the MCP will be updated. If we pause at this point, what should a new machine receive? If you say "new", we've violated the proposed contract.
  • Once the MCO has updated 75% of the machines, should it still be serving the old config? Is consistency more important than the inevitable prolonging of the update when those new nodes running the old config need to be drained & rebooted? Do we start serving the new config while the newish nodes are being rebooted (if we do, it's possible that we will serve the old config indefinitely if new machines constantly scale into and out of the cluster)?

There are detailed answers we could give here that would make the standalone update process subtly more predictable. But is CAPI going to implement this contract? Would ?

In short, I think all we can reasonably insist upon is the ability to pause your progress towards consistency. If we try to define what the behavior of the system must be when we pause you, things get complex.

Contributor

It sounds like the rule needs to be more complex than the simple examples I gave. It still seems like we want to be able to explain clearly to users what they are going to get, though. How about "the most recent config that was being applied to nodes before the MCO was paused"? That is at least predictable, even if it depends on the timing of the pause step. And to cope with the timing, we could say that new nodes don't get a config until at least 1 existing node from the pool is updated to use it.

Contributor Author

I can't find any documented contract for the behavior in the existing pausing primitives provided by MCP or CAPI, so I'll need to leave this for SMEs to weigh in on what is possible here.
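For reference, the existing pausing primitive on the MCO side that this thread refers to is, as I understand it, `spec.paused` on the MachineConfigPool; a minimal sketch (what newly booting nodes receive while paused is exactly the undocumented part):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  paused: true   # existing primitive: the MCO stops rolling new rendered configs out to nodes in this pool
```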


2. Requiring the use of maintenance schedules for OpenShift upgrades (the changes should be compatible with various upgrade methodologies – including being manually triggered).
3. Allowing Standalone worker-nodes to upgrade to a different payload version than the control-plane (this is supported in HCP, but is not a goal for standalone).
4. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement.
5. Providing strict promises around the exact timing of upgrade processes. Maintenance schedules will be honored to a reasonable extent (e.g. upgrade actions will only be initiated during a window), but long-running operations may exceed the configured end of a maintenance schedule.
Member

How far into a maintenance schedule's window will an update event trigger? E.g. if something stops the upgrade of workers from starting (maybe the operator doing this work is degraded), but it then recovers.

Contributor Author

How far into a maintenance schedule's window will an update event trigger?

Change can be initiated up to the last instant of the permissive window.

if something stops the upgrade of workers to start, maybe the operator doing this work is degraded, but it then recovers

I think of the implementation as pausing and resuming the reconciliation control loop(s). So relevant operators might go degraded, but only for the reasons they could go degraded prior to the introduction of change management.

Member

Behavior for upgrades today is that we will cancel if it takes too long to start. Would need to find out what "too long" actually is. We'll need to be careful about how this is communicated when we switch over to using this for managed. More thinking is needed about the duration of control-plane upgrades. I know we don't guarantee how long they will take, but perhaps something should consider how long those upgrades take and estimate whether one would roughly fit into the "window" of the schedule.

Contributor Author

Control plane updates average about an hour. Are you thinking that OCM might want to prevent users from scheduling less than, say, 2 hours to prevent their expectations from being violated?

Member

For control plane I wouldn't set an expectation on an end time, just allow scheduling the start. Having an end implies an ability to stop the upgrade. This is not the intention of this EP.

3. Allowing Standalone worker-nodes to upgrade to a different payload version than the control-plane (this is supported in HCP, but is not a goal for standalone).
4. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement.
5. Providing strict promises around the exact timing of upgrade processes. Maintenance schedules will be honored to a reasonable extent (e.g. upgrade actions will only be initiated during a window), but long-running operations may exceed the configured end of a maintenance schedule.
6. Implementing logic to defend against impractical maintenance schedules (e.g. if a customer configures a 1-second maintenance schedule every year). Service Delivery may want to implement such logic to ensure upgrade progress can be made.
Member

If we (SD) need to put something in place to manage this, it's going to mean keeping some code somewhere. Today scheduling is via OCM. If this enhancement is accepted, this becomes a platform capability. We (SD) have not talked about the implications of such a feature on how we enable customers to manage upgrades. Can we see a world where SD gets out of the business of managing schedules completely at some point in the future? If so, having the right controls in place to ensure sane configurations will generally be a good idea.

That being said, we only care about the version of the cluster in a few cases.

  1. if the cluster version is end of life (EOL)
  2. if the cluster's control plane is running on RH SRE infra

In the #1 scenario, the cluster goes into Limited Support.
In the #2 scenario, it only applies for HCP topologies where the pods of the control plane are running on RH infrastructure. SRE will force a control-plane upgrade if there are significant bugs in the control-plane version that need to be remediated, there is a vulnerability we need to remediate, or the control-plane version is now EOL.

Contributor Author

Can we see a world where SD gets out of the business of managing schedules completely at some point in the future?

Yes. SD could let customers fully define their own maintenance schedules and simply override them when necessary (e.g. using disabledUntil). The reason SD might want to play a larger role in constraining the setting is to make sure, when necessary, that forced updates are performed within a customer's desired window. Your #2 scenario, for example: you have to do the update and ideally want users to be OK with when it happens. If you force them to choose a non-zero change management window each month, hopefully they are less surprised when forced changes happen during that window.
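A rough sketch of how such an override might look using the `disabledUntil` field discussed in this enhancement; the surrounding stanza shape and the value format are assumptions, only the field name comes from the proposal:

```yaml
# Hypothetical changeManagement stanza on a managed resource (e.g. ClusterVersion);
# the exact schema is still under review in this enhancement.
spec:
  changeManagement:
    # SRE override: the configured maintenance schedule is ignored until this time,
    # so a forced update can proceed even outside the customer's window.
    disabledUntil: "2024-04-02T00:00:00Z"
```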

### Change Management Metrics
Cluster wide change management information will be made available through cluster metrics. Each resource
containing the stanza should expose the following metrics:
- The number of seconds until the next known permitted change window. 0 if changes can currently be initiated. -1 if changes are paused indefinitely. -2 if no permitted window can be computed.
Member

Why encode meaning into negative values vs having explicit state captured in another field? This can work of course, but it is mixing uses: time until permitted change, and status (paused).

Contributor Author

My motivation was to simplify the promql necessary for useful alerting by an ops team. A simple rule might be warn "Invalid update schedule" when seconds_until_update < 0 or seconds_until_update_permitted > 60 days vs enumerating booleans for why things might be busted. SD would certainly know best here, so I'm open to alternatives. I just want any rule to survive the introduction of new error conditions.

Member

From the observability perspective it does reduce cardinality to use the value to encode the state. As noted, it will work. It gets a bit brittle when you have a mapping through various layers and have to know the encoded meaning in the metric value.
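To make the trade-off concrete, here is a sketch of the kind of alerting rule described above, assuming a placeholder metric name and the negative-value encoding from the proposal:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: change-management-alerts
  namespace: openshift-monitoring
spec:
  groups:
    - name: change-management
      rules:
        - alert: ChangeWindowInvalidOrTooFarAway
          # Placeholder metric name. Per the proposal: 0 = changes permitted now,
          # -1 = paused indefinitely, -2 = no permitted window can be computed.
          expr: change_management_seconds_until_permitted < 0 or change_management_seconds_until_permitted > (60 * 24 * 3600)
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Change management window is invalid or more than 60 days away"
```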

1. The update of this worker-node continues, but at 00:00 UTC Sunday, no further material changes are permitted by the change management policy and the worker-node update process is effectively paused.
1. Because not all worker-nodes have been updated, changes are still reported as pending via metrics for NodePool. **TODO: Review with HyperShift. Pausing progress should be possible, but a metric indicating changes still pending may not since they interact only through CAPI.**
1. The HCP cluster runs with worker-nodes at mixed versions throughout the month. The N-1 skew between the old kubelet versions and control-plane is supported.
1. **TODO: Review with Service Delivery. If the user requested another minor bump to their control-plane, how does OCM prevent unsupported version skew today?**
Member

We should have skew protection (whatever it's called) disabled because we'll force a minor version upgrade of control planes in ROSA HCP when the control plane version is EOL. Validation of allowed skew is possible at other layers like OCM/CS, not sure what we permit at this time.

1. SRE can address the issue with a system configuration file applied in a MachineConfig.
1. SRE creates the MachineConfig for the customer and provides the customer the option to either (a) wait until their
configured maintenance schedule permits the material change to be initiated by the MachineConfigOperator
or (b) have SRE override the maintenance schedule, permitting its immediate application.
Member

I don't expect SRE will override the schedule in this scenario. The change would be staged and customer notified of options. Customer can either wait for their next update window or the customer can permit the change to progress.

#### Service Delivery Emergency Patch
1. SRE determines that a significant new CVE threatens the fleet.
1. A new OpenShift release in each z-stream fixes the problem.
1. SRE plans to override customer maintenance schedules in order to rapidly remediate the problem across the fleet.
Member

SRE have a couple of options and it depends on if it's HCP or not.

If HCP and the issue is significant enough, yes we would stage the update and force the change to land regardless of the customer maintenance schedule.

For Classic, where the control plane is in the customer's account, there is an important question to answer: does this impact SRE's ability to provide support and/or impact the SLA with the customer?

If yes, there is a support and/or SLA impact, then SRE would do the same as with HCP and push the upgrade through.

If no, the change would be staged and the customer's schedule will pick it up on the next update. The customer will be notified and has an option to update earlier. The customer can defer the update as usual.

@jupierce jupierce force-pushed the maintenance_schedules_enhancement branch from 4415b72 to 037fd22 Compare March 8, 2024 20:31
- All times are now UTC.
- Clarity on not defining system state -- only initiation.
- Change from "endTime" to "duration" to easily create windows that span
  days.
- Clarity on how/why CVO shows pending worker-node updates before
  machineconfig is updated.
- Change 1000 day search metric behavior.
@jupierce jupierce force-pushed the maintenance_schedules_enhancement branch from 037fd22 to 6e4f207 Compare March 8, 2024 21:39
who wants to achieve the simple goal of ensuring no material changes take place outside a well-defined
maintenance schedule, do you want to take on the challenge of keeping every MachineConfigPool's
`changeManagement` stanza in perfect synchronization with the ClusterVersion's? What if a new MCP is created
without your knowledge?
Member

MachineConfigPools consume inputs from many places, and ClusterVersion updates are only one aspect. https://amd64.ocp.releases.ci.openshift.org/ -> 4.16.0-ec.4 -> aws-ovn-serial -> Artifacts -> ... -> gather-extra artifacts:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-serial/1767648317749071872/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/machineconfigs.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name' | grep rendered-worker- | sort
2024-03-12T20:45:43Z rendered-worker-9724b3c3c79d2e58b093eb46d0c2c3ff
2024-03-12T20:51:58Z rendered-worker-33a6a11e659aca048b955f168eb770f3
2024-03-12T20:52:04Z rendered-worker-3e2eab9d46a1db0948347f34d268ea61
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-serial/1767648317749071872/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/machineconfigpools.json | jq -c '.items[] | select(.metadata.name == "worker").status.configuration.source[]'
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"00-worker"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"01-worker-container-runtime"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"01-worker-kubelet"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"97-worker-generated-kubelet"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"98-worker-generated-kubelet"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"99-worker-generated-registries"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"99-worker-ssh"}

Instead of having it implicitly assumed that ClusterVersion is going to be the controlling resource taking precedence over local MachineConfigPool configuration, maybe changeManagement could explicitly declare that with references to higher-precedence managers?

Or, because that would make the system even more complicated, how likely is it that folks are creating new MachineConfigPools without looping in cluster lifecycle administrators? If cluster lifecycle administrators cannot agree amongst themselves, that is not a problem we can solve with tooling. If they do agree, but want unified control over the policy, I'm not as down on UpdatePolicy CRD as the current alternatives-section is, and possibly there's a way to mitigate the concerns around that while still avoiding implicit or explicit hierarchy complications.

Contributor Author

There were two motivations to the current proposal:

  1. It is intended to be implementable with existing CAPI primitives. CAPI exposes "paused" on the two main resources Cluster and MachineDeployment. My understanding is that pausing the Cluster pauses the MachineDeployment but not vice versa. This should allow the HyperShift operator to fulfill the simple changeManagement hierarchy by pausing and unpausing these resources. If the enhancement calls for independence of changeManagement stanzas, we would need to be confident CAPI would accept a significant paradigm shift. I would only feel compelled to do that if we thought there was a significant advantage.
  2. The guiding principle of "simple things should be simple." Setting something globally, once, and not having to keep it in mind for an unrelated activity reduces the mental overhead associated with the product. A diligent enough team can always solve for this with policy, clever automation, etc. But the more we ask this theoretical team to solve, the less pleasant the product is to use. "Cluster lifecycle administrator" is often going to be a single person trying to do lifecycle management along with everything else. Ideally, the enhancement will improve their QOL while also being flexible enough for plausible complicated use cases.
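For context, the existing CAPI primitives referenced in point 1 look roughly like this (fragments only; required MachineDeployment fields such as `clusterName`, `selector`, and `template` are omitted):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: example-hosted-cluster        # hypothetical name
spec:
  paused: true                        # pauses reconciliation of the Cluster and the resources that belong to it
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: example-nodepool              # hypothetical name
spec:
  paused: true                        # pauses reconciliation of just this MachineDeployment
  # clusterName, selector, template, ... omitted for brevity
```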

Comment on lines +108 to +110
This approach has proven extraordinarily successful in providing a fast and reliable
control-plane update, but, in rare cases, the highly opinionated update process leads
to less than ideal outcomes.
Contributor

Should we reframe this section to combine more of what is in "Node Update Separation to Address Problems in User Perception", as the case here is more about giving customers control over when to update worker nodes due to possible interruptions to their workloads.

Comment on lines +249 to +251
> "As an SRE, tasked with performing non-emergency corrective action, I want
> to be able to apply a desired configuration (e.g. PID limit change) and have that change roll out
> in a minimally disruptive way subject to the customer's configured maintenance schedule."
Contributor

Typically for managed OpenShift we will provide the customer the ability to schedule this.

Member

That action-specific scheduling would go away once this is implemented, or just get layered on top?

  1. pid limit MC change goes at HH:MM mm/dd/yyyy
  2. change window opens after above time

Contributor

We would remove our implementation and replace it with a maintenance schedule.


1. A [Cluster Service Consumer](https://hypershift-docs.netlify.app/reference/concepts-and-personas/#personas) requests an HCP cluster via OCM.
1. To comply with their company policy, the service consumer configures a maintenance schedule through OCM.
1. Their first preference, no updates at all, is rejected by OCM policy, and they are referred to service
Contributor

For worker nodes this would be a valid strategy for a customer. At some point they will be notified that they must upgrade if the control-plane version skew is too far misaligned and the running control-plane version is going end of life.

Comment on lines 594 to 597
1. SRE applies a change to the relevant control-plane AND worker-node resource's `changeManagement` stanza
(both must be changed because of the change management hierarchy), setting `disabledUntil` to
a time 48 hours in the future. The configured change management schedule is ignored for 48 hours as the system
initiates all necessary node changes.
Contributor

This is interesting as SRE would have to ensure that the disabledUntil window is long enough to update all nodes. Typically we don't apply fleet wide changes like this immediately, we configure the rollout so they are applied after the next z-stream update.

Contributor Author

Right - this would be for emergency purposes only. machineconfiguration.openshift.io/bypass-change-management is another option if you didn't need to announce an end of maintenance guarantee to customers.
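A sketch of how the bypass route mentioned here might look on an SRE-supplied MachineConfig; the annotation name comes from this proposal, while its value format and placement are assumptions:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-emergency-fix                      # hypothetical
  labels:
    machineconfiguration.openshift.io/role: worker
  annotations:
    # Proposed in this enhancement: exempt this change from change management gating.
    machineconfiguration.openshift.io/bypass-change-management: "true"
spec:
  config:
    ignition:
      version: 3.2.0
```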

Contributor

@JoelSpeed left a comment

I think this is a contender for longest EP, congrats 🎉

I've left a bunch of comments through this and, from an API perspective at least, have a number of questions about the UX and some comments about it being a separate, referred resource. It's only as I got to the end that I saw this is an alternative already discussed. I've got more homework now on seeing what prior discussion there was about this; my current view, though, is that embedding it within the object is going to create some awkward duplication that could potentially be avoided. Feels like there's a line, though, on where to separate.

This definitely needs review from the MCO folks, there's significant overlap with a project they are currently working on, we need to make sure to de-duplicate the efforts.

There's also some assumptions about how the MCO works that I believe are incorrect, and some of the design may need to be revised to accommodate this

CC @sinnykumari @yuqi-zhang I think you both ought to review this one

an independent `UpdatePolicy` CRD where administrator preferences can be captured. This approach [was explored](https://github.com/jupierce/oc-update-design-ideas/commit/a6364ee2f2c1ebf84ed6d50bc277f9736bf793bd).
Ultimately, it felt less intuitive across profiles. Consider a CRD called `UpdatePolicy` that tries to be a central config.

1. Is it namespaced? If it is, that feels odd for an object that should control the cluster. If it is not namespaced, the policy feels misplaced for HCP HostedCluster resources which live in a namespace.
Contributor

How does this work for MachineConfig in HCP, since they make sense to be cluster scoped, but not in the case of HCP? This feels the same

Contributor Author

MachineConfigs are serialized as ConfigMaps in the hosted cluster namespace: https://hypershift-docs.netlify.app/how-to/automated-machine-management/configure-machines/ .

Contributor

Eww 😅

I was expecting there to be a XYZ and a ClusterXYZ where XYZ is for managing the cluster you're on and ClusterXYZ is for managing guest clusters and is namespaced, naming could be confusing though 🤔
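For readers unfamiliar with that serialization pattern, it looks roughly like the following; names are illustrative, and the linked HyperShift doc is the authoritative reference:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-machine-config          # hypothetical
  namespace: clusters                  # the hosted cluster's namespace on the management cluster
data:
  config: |
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      name: 99-worker-example
      labels:
        machineconfiguration.openshift.io/role: worker
    spec:
      config:
        ignition:
          version: 3.2.0
# Referenced from the NodePool, e.g.:
#   spec:
#     config:
#       - name: custom-machine-config
```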

| date | * | * | Honor permit and exclude values, but only after the specified date. For example, permit: `null` and exclude: `null` implies the strategy is indefinitely permissive after the specified date. |


#### MachineConfigPool Assisted Strategy Configuration
Member

Not seeing where a forced drain would be configured. In a managed context we need to have this as an option, if only for our own clusters. We can decide what default value to use (today it is 1 hour, I believe).

Contributor Author

Forced drain is a separate EP that I hope will also reduce MUO: #1616 .

Member

@sdodson left a comment

There are parts I need to re-read to ensure I understand the details, but given a lot of my comments here have been pending for a while I'm going to go ahead and post them then come back for a second pass.

authors:
- @jupierce
reviewers:
- TBD
Member

I think @sinnykumari @LalatenduMohanty @csrwng cover first-pass reviewers.
Probably want to run this by @rhamilto even if we potentially decide that we don't want a web console UX around this in cluster. I think it'd be good to at least expose node maintenance schedules.

Comment on lines 75 to 78
- "Material Change". A longer definition is provided in the Summary, but in short, any configuration
change a platform operator wishes to apply which would necessitate the reboot or replacement of one
or more nodes is considered a material change. For example, updating the control-plane version is
a material change as it requires rebooting master nodes.
Member

Sometimes there can be control plane updates without associated reboots, it's rare given we update the kernel every other calendar week, but it's possible that a z-stream update could update only pod scoped resources. We'd still consider that to be a material change, correct?

Contributor

As an added note, in practice I think I've only ever seen 1 z-stream update not have any OS changes. Changes to openshift-clients/hyperkube/crio/etc. happen pretty frequently so even if there were no kernel changes something in the OS gets rebuilt.

This does imply that we cannot use "did a reboot happen" as a signal for whether an update completed or not, but I don't think anything monitors for that.

therefore, reflect expected changes to the worker-node MCPs. Anticipating this change ahead of time is necessary for an operations
team to be able to set an alert with the semantics (worker-node-update changes are pending & time remaining until changes are permitted < 2d).
The MCO will expose its own metric for changes pending when manifests are updated. But this metric will only indicate when
there are machines in the pool that have not achieved the desired configuration. An operations team trying to implement the 2d
Member

Whenever we say telemetry (what gets sent back to Red Hat) we're constrained by cardinality limits, which would likely mean that we get one metric from the MCO indicating the number of pools which are not currently in an Updated state. However, metrics within the cluster could be more fine-grained, as proposed elsewhere in this doc, per pool; I'm fine with them including time since last fully reconciled and estimated time until next reconciliation whenever there are updates pending and it's not actively rolling out those changes.

15. As part of updating cluster manifests, MachineConfigs have been modified. The MachineConfigOperator (MCO) re-renders a
configuration for worker-nodes. However, because the MCP maintenance schedule precludes initiating material changes,
it will not begin to update Machines with that desired configuration.
16. The MCO will set a metric indicating that desired changes are pending.
Member

I don't believe we have those metrics today. We'd be designing them as part of this enhancement.

Comment on lines 654 to 655
1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster. As they come back online,
the MachineConfigServer offers them the desiredConfig requested by the manual policy.
Member

Yeah, this would just be the administrator annotating (labeling?) each node with the new desiredConfig. MCO and MCD handle the rest from there. The administrator is responsible for annotating nodes in the order they want and rate-limiting that labeling however they want.
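If the manual policy reused the MCO's existing node annotation (an assumption, not something this enhancement pins down), the administrator's step would look roughly like setting the target rendered config on each node:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-0                       # hypothetical node
  annotations:
    # Today this annotation is managed by the MCO's node controller; under a manual
    # policy the administrator (or their tooling) would set it to the new rendered config.
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-3e2eab9d46a1db0948347f34d268ea61
```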

Contributor

@yuqi-zhang left a comment

Sorry for the late review. Just wanted to add some points/questions from the MCO perspective

The design choice is informed by a thought experiment: As a cluster lifecycle administrator for a Standalone cluster,
who wants to achieve the simple goal of ensuring no material changes take place outside a well-defined
maintenance schedule, do you want to take on the challenge of keeping every MachineConfigPool's
Contributor

I have a somewhat similar question to Joel here. As I understand it, worker (or other custom pools) can set their independent changeManagements, and the global (cluster) one is AND'ed into it?

What if I wanted to have different cadences for my control plane and workers (e.g. control planes on Mon-Fri but not weekends, and workers on weekends but not Mon-Fri)?

And since the master pool is separate, this also goes back to my earlier question on whether that's independent (i.e. the control plane nodes themselves being upgraded is a separate profile from the control plane components rolling out).

Comment on lines 654 to 655
1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster. As they come back online,
the MachineConfigServer offers them the desiredConfig requested by the manual policy.
Contributor

The description here more matches with what hosted control planes are doing (except that HCP would spin up an entirely new node)


#### Hypershift / Hosted Control Planes

In the HCP topology, the HostedCluster and NodePool resources are enhanced to support the change management strategies
Contributor

I would like to call out that:

  1. HCP has both replace and inplace upgrade nodepools, and while that shouldn't matter for maint. schedule, they do upgrade differently, and
  2. HCP differs from standalone in cert management, where the certs there will trigger creation of new nodes, and no rotation will happen if the upgrades are paused (due to a maint. schedule), so you run more risk of certs expiring down the line (although I don't know if they actually have certificate rotation handling for all certs today)
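For reference, the replace/in-place distinction called out in point 1 is the existing `spec.management.upgradeType` field on NodePool; a fragment, with other required NodePool fields omitted and the proposed stanza placement marked as an assumption:

```yaml
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example                        # hypothetical
  namespace: clusters
spec:
  clusterName: example
  replicas: 3
  management:
    upgradeType: InPlace               # or Replace: nodes are re-provisioned rather than updated in place
  # changeManagement: ...              # where the stanza proposed in this EP would presumably live
  # release, platform, ... omitted for brevity
```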


#### Change Management on Master MachineConfigPool
In order to allow control-plane updates as a single material change, the MCO will only honor the change management configuration for the
master MachineConfigPool if user generated MachineConfigs are the cause for a pending change. To accomplish this,
Contributor

We don't really have a mechanism for this today, since all configuration change calculations are done via rendered ones, and we just do a merge of all base configs.

I'm not sure I fully understand this section. Basically any pool-level configurations on the master pool are allowed the same way as worker, but they aren't honored if there's an incoming upgrade? But any user changes turn us back into worker calculation? (where the CVO level maint. schedule and the master pool are union'ed?)


When `nodeDisruptionPolicy` indicates that a `MachineConfiguration` should not trigger
a node reboot, it becomes a non-material change from the perspective of a maintenance
schedule. In other words, it can be applied immediately, even outside a permissive window.
Contributor

I don't think that's safe, since disruption is not just reboots, but also node drains, etc. all of which you can specify via the policy. Besides, the MCO isn't that granular that it can break down a subset of MC changes to happen.

I think node disruption policy should only define how a change is acted on, but otherwise is gated by the maint. schedule like everyone else.
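For context, my understanding of the node disruption policy API being referenced is roughly the following; treat the exact field names as an assumption about the in-progress MCO feature:

```yaml
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  nodeDisruptionPolicy:
    files:
      - path: /etc/example.conf        # hypothetical file managed by a MachineConfig
        actions:
          - type: None                  # "None" means no action for this change; other types include Drain, Reboot, Restart
```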

a node reboot, it becomes a non-material change from the perspective of a maintenance
schedule. In other words, it can be applied immediately, even outside a permissive window.

### Risks and Mitigations
Contributor

I'd also like to double check how we manage slow updates/half updated pools vs blocked upgrades/failing nodes, etc.

It reads to me that we'll have to reconsider how we do status calculation with the new window, since half-upgraded pools will become the norm, which means we'll have to figure out how/when/what we consider errors, and add extra pool status/operator status to reflect some of this

@jupierce jupierce force-pushed the maintenance_schedules_enhancement branch from ed1f465 to 6a0c121 Compare July 7, 2024 17:07
metadata:
# ChangeManagementPolicies are namespaced resources. They will normally reside in the
# namespace associated with the controller initiating material changes.
# For example, in Standalone namespace/openshift-machine-config for the MCO and
Member

nit: there's no openshift-machine-config namespace; all the MCO components are in openshift-machine-config-operator. E.g. in a 4.17.0-ec.1 CI run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1806673110347288576/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata.name' | grep -1 machine-config
openshift-machine-api
openshift-machine-config-operator
openshift-marketplace

Contributor Author

Thanks @wking - update in next push.

# setting, it overrides the "policy" stanza.
# Date must be RFC3339.
# This field is modeled on HCP's existing HostedCluster.spec.pausedUntil which uses a string.
pausedUntil: <datetime|bool>str
Member

With disabledUntil as a "I want to touch this despite the lack of formal maintenance window" access hatch, I think it makes sense to have that one down on the covered resource in changeManagement. But I'm less clear on why pausedUntil's "despite the usual maintenance window, I have reasons to want to go slowly for a bit" would be down in changeManagement. Maybe pausedUntil belongs up on ChangeManagementPolicy? Or maybe exclude already covers that use case at the ChangeManagementPolicy level, and we don't need pausedUntil at all?

Contributor Author

I wanted pausedUntil to be a knob even without a ChangeManagementPolicy defined. It is in the resource itself because I believe:

  1. It offers a simple way for users to control roll-outs without having to get into the details of creating ChangeManagementPolicy definitions. I want them to be able to easily toggle pausing a rollout of an MCP without the rabbit hole of understanding ChangeManagementPolicy - that rabbit hole is only worth it for someone who wants a real MaintenanceSchedule.
  2. Moving pausedUntil into ChangeManagementPolicy makes the meaning even more challenging to convey / interpret. For example, how would you interpret "pausedUntil: true" for a Restrictive strategy vs a Permissive strategy? A paused restrictive strategy sounds like it would be evaluated as permissive.
  3. It permits CLI tools to pause and unpause rollouts regardless of whether an administrator has configured a ChangeManagementPolicy. I think this is critical in order to implement verbs like oc adm update worker-nodes start|stop|pause and ensure they have MCP granularity.
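To illustrate point 1, a sketch of toggling a pause directly on a pool without any ChangeManagementPolicy involved; the stanza follows the pseudo-definition quoted above, and the exact placement is still subject to review:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  changeManagement:
    # Modeled on HostedCluster.spec.pausedUntil: a string accepting "true" or an RFC3339 date.
    pausedUntil: "2024-09-01T00:00:00Z"
```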

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 8, 2024
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 15, 2024
@jupierce
Contributor Author

/remove-lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 20, 2024
@jupierce
Contributor Author

/remove-lifecycle stale

upgrade methodologies – including being manually triggered).
1. Allowing Standalone worker-nodes to upgrade to a different payload version than the control-plane
(this is supported in HCP, but is not a goal for standalone).
1. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement.
Contributor

With this, were you thinking of having special commands built into the CLI that would allow pausing/disabling, etc.?

Member

That's what I'd assume, oc adm commands that layer on top of the APIs.

Contributor Author

Yes, exactly. A series of subverbs under oc adm update have been proposed for OTA: https://docs.google.com/presentation/d/1BtVSGWaG6LD0qye9BHPU5ssnEYTGPOlhHeogFfECUnk/edit#slide=id.g25513f80e3d_0_572 .

## Proposal

### Change Management Overview
Establish a new namespaced Custom Resource Definition (CRD) called `ChangeManagementPolicy` which allows cluster lifecycle
Contributor

Can you add an explanation of why this is namespaced? For a standalone cluster it doesn't really make sense to be namespaced, does it? Depending on which API server this is on in HCP, it could also make sense to be cluster scoped; I'm guessing it's on the management side though, and that's why we want it to be namespaced.

Member

Also does RBAC get nasty here? The policy would have to be in the namespace of the controller that adheres to it or be granted permissions to read from other namespaces.

Contributor Author

Yes, they are on the management side for HCP and that drove the namespace decision. Adding a note to this effect in the next push.

changes on the cluster.

Add a `changeManagement` stanza to several existing resources in the OpenShift ecosystem which can reference
the new `ChangeManagementPolicy` resource to restrict when and how their associated controllers can initiate
Contributor

Did we decide that each resource should be able to select just a single policy, or did we want to be able to select multiple policies?

Contributor Author

Described a bit later: manageable resources can refer to only one policy. Permitting multiple would mean returning to the non-deterministic calculation of overlapping intervals, something I'm happy to avoid with a single policy.
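
To make the single-policy reference concrete, below is a minimal sketch of a `changeManagement` stanza on a MachineConfigPool. The field names under `changeManagement` are illustrative assumptions, not the final API defined by the enhancement.

```yaml
# Illustrative sketch only: the shape of the changeManagement stanza is an
# assumption for discussion purposes, not the schema proposed in the enhancement.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  changeManagement:
    # Each manageable resource references at most one ChangeManagementPolicy,
    # avoiding non-deterministic overlap calculations across multiple policies.
    policy:
      name: weekend-maintenance
      namespace: openshift-machine-config-operator
```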

```yaml
kind: ChangeManagementPolicy
metadata:
  # ChangeManagementPolicies are namespaced resources. They will normally reside in the
  # namespace associated with the controller initiating material changes.
```
Contributor

When we think of resources being namespaced, it is often because we want different actors to act upon them, and namespacing them gives us that control and allows isolation between different actors. Since most (all?) resources in a standalone cluster that will have changeManagement stanzas are cluster-scoped, it makes me think the actor on these resources should have cluster permissions, and therefore this resource probably should also be cluster-scoped?

Contributor Author

The actor would be namespaced for HCP. So unless we want to have a namespaced (HostedChangeManagementPolicy?) and non-namespaced version of the CRD, namespaced was my compromise.

Contributor

A separate, namespace-scoped version could be achievable. It may require some tooling improvements to make it easier to manage, but nothing is impossible, right?
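
For illustration, a namespaced policy living alongside the controller that enforces it might look like the sketch below. The API group/version and the namespace shown are assumptions, not part of the proposal text.

```yaml
# Sketch assuming the namespaced design discussed above; apiVersion and
# namespace are illustrative assumptions.
apiVersion: config.openshift.io/v1alpha1
kind: ChangeManagementPolicy
metadata:
  name: weekend-maintenance
  # On a standalone cluster this would typically be the namespace of the
  # enforcing controller (e.g. the MCO); on HCP it would sit in the hosted
  # cluster's namespace on the management cluster.
  namespace: openshift-machine-config-operator
```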

made for all nodes, the MCO will update nodes that have the oldest current configuration first. This ensures
that even if the desired configuration has changed multiple times while maintenance was not permitted,
no nodes are starved of updates. Consider the alternative where (a) worker-node updates required > 24h,
(b) updates to nodes are performed alphabetically, and (c) MachineConfigs are frequently being changed
Contributor

Assuming a new version comes along later, v3, will the MCO recognise that 10 are on v1, 10 are on v2, and update those on v1 first, or will it be random? Then you might end up with a mix of v1, v2 and v3?

What is our reporting like for understanding which version a Node is on? I could imagine a situation where we get a bug and the customer says yes, the cluster is on version X.Y.Z, and we can see a bug there that was fixed in, say, X.Y.Z-1, but their nodes are mixed across several versions prior to that. How do we make it clear that the node that broke in this instance is on an unpatched release and therefore the customer needs to upgrade to fix the bug?

And, that said, what if there's a change that affects the rendered config that also relies on controllers being updated, or vice versa? How do we handle skew there 🤔

```yaml
# for calendar management metadata. Only a subset of the RFC is supported.
# See "RRULE Constraints" section for details.
# If unset, all dates are permitted and only exclude constrains permissive windows.
recurrence: <rrule|null>
```
Contributor

Still think we want to make a structured API that we convert into RRULE, btw. It will be more native, and I think we can make it intuitive and have an easier time validating the user intent and restricting what we do and don't allow, rather than exposing the entirety of RRULE.
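
As a concrete illustration of what an RRULE-based `recurrence` could express, the snippet below permits material changes only on Saturdays and Sundays. The RRULE string follows RFC 5545 syntax; whether the final API takes a raw string or a structured equivalent (as suggested above) is still open.

```yaml
# Illustrative only: permit material changes during weekend windows.
# RRULE syntax per RFC 5545; the raw-string form is an assumption.
recurrence: "FREQ=WEEKLY;BYDAY=SA,SU"
```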

```yaml
# Excluded date ranges override RRULE selections.
exclude:
  # Dates should be specified in YYYY-MM-DD. Each date is excluded from 00:00 UTC for 24 hours.
```
Contributor

Is there any reason you'd want to exclude a window shorter than 24 hours? How could we model that?

Contributor Author

If it came in as an important RFE, the solution could be to change the time formats supported by fromDate and toDate to include HH:MM:SS-type information. I couldn't think of a compelling use case to warrant the increased complexity. "Black Friday" and "someone is on vacation" lend themselves to day granularity. The day granularity is also a nod to the fact that I wanted this to be hard to get wrong. If you are going to the trouble of specifying an exclude, making a mistake is probably a really bad outcome.
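
A sketch of what day-granularity excludes might look like, using the fromDate/toDate fields mentioned in the reply above (the exact field names and list shape are assumptions):

```yaml
# Illustrative only: excluded ranges override any permissive recurrence window.
exclude:
  - fromDate: "2024-11-29"   # e.g. a Black Friday change freeze
    toDate: "2024-12-02"
  - fromDate: "2024-12-24"   # e.g. reduced holiday coverage
    toDate: "2024-12-26"
```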

trigger assisted rollout scenarios for worker-nodes (i.e. where they control the timing of a rollout
completely but progress is made using MCO automation).

`oc adm update` will be enhanced with `worker-node` subverbs to initiate and pause MCO work.
Contributor

This would cover infra nodes too? What about other classes of nodes that we define arbitrarily? How does the user break down to various MCP levels?

Contributor Author

It would cover infra. The proposed verbs act on MCPs, so a user would be able to act separately on infra vs other worker MCPs. If a user did want further granularity, they would move into the Manual realm where they are free to script drains/reboots based on any criteria.


#### Change Management Bypass Annotation
In some situations, it may be necessary for a MachineConfig to be applied regardless of the active change
management policy for a MachineConfigPool. In such cases, `machineconfiguration.openshift.io/bypass-change-management`
Contributor

Does the API not already have ways to bypass the change management? Why not leverage those? Or are we now looking granularly at the MachineConfig level? I think we'll have to square that with the MCO team, as I don't think that's a feature they have today.

Contributor Author
@jupierce Sep 12, 2024

There are several ways to bypass change management. This mechanism was a way to make "control-plane as a single material change" less of a snowflake (i.e. it happens as a single material change because of a standard annotation rather than being hard-coded).
It is also meant to give SRE/ops some guarantees for critical machine config updates without having to guess at a "disabledUntil" date. I will add a note to that effect to the enhancement.

Contributor Author

Rolling out critical machine config changes for worker nodes is also made easier with this annotation.
Instead of, for example, trying to predict a disabledUntil date, an SRE/operations team can use this
annotation to specify their goal with precision. Compare this with disabledUntil, which
(a) an operations team would generally overestimate in order to ensure that updates complete, and
(b) may allow subsequent, non-critical machine configuration changes to trigger further, unwarranted
workload disruption (node reboots).
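
For example, a critical MachineConfig could carry the bypass annotation so that the MCO applies it even while a restrictive policy is in effect. The annotation key is taken from the excerpt above; the `"true"` value and the rest of the manifest are illustrative assumptions.

```yaml
# Sketch: the annotation key comes from the proposal text; the value and the
# MachineConfig contents are illustrative assumptions.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-critical-fix
  labels:
    machineconfiguration.openshift.io/role: worker
  annotations:
    machineconfiguration.openshift.io/bypass-change-management: "true"
spec:
  config:
    ignition:
      version: 3.4.0
```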

Contributor

openshift-ci bot commented Sep 13, 2024

@jupierce: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | e1092ab | link | true | /test markdownlint |



Comment on lines +71 to +74
- "Material Change". A longer definition is provided in the Summary, but, in short, any configuration
change a platform operator wishes to apply which would necessitate the reboot or replacement of one
or more nodes is considered a material change. For example, updating the control-plane version is
a material change as it requires rebooting master nodes.


Is there a differentiation between standalone and HCP needed here? If I understand correctly, a control-plane upgrade on an HCP would not require rebooting nodes specifically, but rather restarting the pods for that HCP. Or am I just splitting unnecessary hairs here?

Member

I'd consider control plane updates to be a material change whether or not it's standalone or HCP, not because pods or nodes are restarted, but because new code or config gets deployed and behavior will change. That's why we seek to be able to confine HostedControlPlane, ClusterVersion, MachineConfigPool, and NodePool updates to the permissive periods in a given schedule.

Probably splitting hairs too, but would a rolling reboot of nodes with no change be considered a material change? Maybe not for control plane nodes where things are well tuned for zero disruption, including CI tests that monitor for disruption, but user workloads are often less tolerant of rolling restarts. Either way, if initiated via oc adm reboot-machine-config-pool they'd both be confined to permissive windows because this actually applies a benign MachineConfig change and the MCO would observe any defined schedule.
