From bdb8ffdd274a91b4fed8c559e142a87c661e259d Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Tue, 20 Feb 2024 17:13:38 -0500 Subject: [PATCH 01/12] Add Change Management and Maintenance Schedules --- ...ge-management-and-maintenance-schedules.md | 1158 +++++++++++++++++ 1 file changed, 1158 insertions(+) create mode 100644 enhancements/update/change-management-and-maintenance-schedules.md diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md new file mode 100644 index 0000000000..243704a4a6 --- /dev/null +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -0,0 +1,1158 @@ +--- +title: change-management-and-maintenance-schedules +authors: + - @jupierce +reviewers: + - TBD +approvers: + - @sdodson + - @jharrington22 +api-approvers: + - TBD +creation-date: 2024-02-29 +last-updated: 2024-02-29 + +tracking-link: + - TBD + +--- + +# Change Management and Maintenance Schedules + +## Summary +Implement high level APIs for change management which allow +standalone and Hosted Control Plane (HCP) clusters a measure of configurable control +over when control-plane or worker-node configuration rollouts are initiated. +As a primary mode of configuring change management, implement an option +called Maintenance Schedules which define reoccurring windows of time (and specifically +excluded times) in which potentially disruptive changes in configuration can be initiated. + +Material changes not permitted by change management configuration are left in a +pending state until such time as they are permitted by the configuration. + +Change management enforcement _does not_ guarantee that all initiated +material changes are completed by the close of a permitted change window (e.g. a worker-node +may still be draining or rebooting) at the close of a maintenance schedule, +but it does prevent _additional_ material changes from being initiated. + +A "material change" may vary by cluster profile and subsystem. For example, a +control-plane update (all components and control-plane nodes updated) is implemented as +a single material change (e.g. the close of a scheduled permissive window +will not suspend its progress). In contrast, the rollout of worker-node updates is +more granular (you can consider it as many individual material changes) and +the end of a permitted change window will prevent additional worker-node updates +from being initiated. + +Changes vital to the continued operation of the cluster (e.g. certificate rotation) +are not considered material changes. Ignoring operational practicalities (e.g. +the need to fix critical bugs or update a cluster to supported software versions), +it should be possible to safely leave changes pending indefinitely. That said, +Service Delivery and/or higher level management systems may choose to prevent +such problematic change management settings from being applied by using +validating webhooks. + +## Motivation +This enhancement is designed to improve user experience during the OpenShift +upgrade process and other key operational moments when configuration updates +may result in material changes in cluster behavior and potential disruption +for non-HA workloads. 
+ +The enhancement offers a direct operational tool to users while indirectly +supporting a longer term separation of control-plane and worker-node updates +for **Standalone** cluster profiles into distinct concepts and phases of managing +an OpenShift cluster (HCP clusters already provide this distinction). The motivations +for both aspects will be covered, but a special focus will be made on the motivation +for separating Standalone control-plane and worker-node updates as, while not fully realized +by this enhancement alone, ultimately provides additional business value helping to +justify an investment in the new operational tool. + +### Supporting the Eventual Separation of Control-Plane and Worker-Node Updates +One of the key value propositions of this proposal pre-supposes a successful +decomposition of the existing, fully self-managed, Standalone update process into two +distinct phases as understood and controlled by the end-user: +(1) control-plane update and (2) worker-node updates. + +To some extent, Maintenance Schedules (a key supported option for change management) +are a solution to a problem that will be created by this separation: there is a perception that it would also +double the operational burden for users updating a cluster (i.e. they have +two phases to initiate and monitor instead of just one). In short, implementing the +Maintenance Schedules concept allows users to succinctly express if and how +they wish to differentiate these phases. + +Users well served by the fully self-managed update experience can disable +change management (i.e. not set an enforced maintenance schedule), specifying +that control-plane and worker node updates can take place at +any time. Users who need more control may choose to update their control-plane +regularly (e.g. to patch CVEs) with a permissive change management configuration +for the control-plane while using a tight maintenance schedule for worker-nodes +to only update during specific, low utilization, periods. + +Since separating the node update phases is such an important driver for +Maintenance Schedules, their motivations are heavily intertwined. The remainder of this +section, therefore, delves into the motivation for this separation. + +#### The Case for Control-Plane and Worker-Node Separation +From an overall platform perspective, we believe it is important to drive a distinction +between updates of the control-plane and worker-nodes. Currently, an update is initiated +and OpenShift's ostensibly fully self-managed update mechanics take over (CVO laying +out new manifests, cluster operators rolling out new operands, etc.) culminating with +worker-nodes being drained a rebooted by the machine-config-operator (MCO) to align +them with the version of OpenShift running on the control-plane. + +This approach has proven extraordinarily successful in providing a fast and reliable +control-plane update, but, in rare cases, the highly opinionated update process leads +to less than ideal outcomes. + +##### Node Update Separation to Address Problems in User Perception +Our success in making OpenShift control-plane updates reliable, exhaustive focus on quality aside, +is also made possible by the platform's exclusive ownership of the workloads that run on the control-plane +nodes. Worker-nodes, on the other hand, run an endless variety of non-platform, user defined workloads - many of +which are not necessarily perfectly crafted. 
For example, workloads with pod disruption budgets (PDBs) that +prevent node drains and workloads which are not fundamentally HA (i.e. where draining part of the workload creates +disruption in the service it provides). + +Ultimately, we cannot solve the issue of problematic user workload configurations because +they are intentionally representable with Kubernetes APIs (e.g. it may be the user's actual intention to prevent a pod +from being drained, or it may be too expensive to make a workload fully HA). When confronted with +problematic workloads, the current, fully self-managed, OpenShift update process can appear to the end-user +to be unreliable or slow. This is because the self-managed update process takes on the end-to-end responsibility +of updating the control-plane and worker-nodes. Given the automated and somewhat opaque nature of this +update, it is reasonable for users to expect that the process is hands-off and will complete in a timely +manner regardless of their workloads. + +When this expectation is violated because of problematic user workloads, the update process is +often called into question. For example, if an update appears stalled after 12 hours, a +user is likely to have a poor perception of the platform and open a support ticket before +successfully diagnosing an underlying undrainable workload. + +By separating control-plane and worker-node updates into two distinct phases for an operator to consider, +we can more clearly communicate (1) the reliability and speeed of OpenShift control-plane updates and +(2) the shared responsibility, along with the end user, of successfully updating worker-nodes. + +As an analogy, when you are late to work because of delays in a subway system, you blame the subway system. +They own the infrastructure and schedules and have every reason to provide reliable and predictable transport. +If, instead, you are late to work because you step into a fully automated car that gets stuck in traffic, you blame the +traffic. The fully self-managed update process suggests to the end user that it is a subway -- subtly insulating +them from the fact that they might well hit traffic (problematic user workloads). Separating the update journey into +two parts - a subway portion (the control-plane) and a self-driving car portion (worker-nodes), we can quickly build the +user's intuition about their responsibilities in the latter part of the journey. For example, leaving earlier to +avoid traffic or staying at a hotel after the subway to optimize their departure for the car ride. + +##### Node Update Separation to Improve Risk Mitigation Strategies +With any cluster update, there is risk -- software is changing and even subtle differences in behavior can cause +issues given an unlucky combination of factors. Teams responsible for cluster operations are familiar with these +risks and owe it to their stakeholders to minimize them where possible. + +The current, fully self-managed, update process makes one obvious risk mitigation strategy +a relatively advanced strategy to employ: only updating the control-plane and leaving worker-nodes as-is. +It is possible by pausing machine config pools, but this is certainly not an intuitive step for users. Farther back +in OpenShift 4's history, the strategy was not even safe to perform since it could lead to worker-node +certificates to expiring. 
+ +By separating the control-plane and worker-node updates into two separate steps, we provide a clear +and intuitive method of deferring worker-node updates: not initiating them. Leaving this to the user's +discretion, within safe skew-bounds, gives them the flexibility to make the right choices for their +unique circumstances. + +#### Enhancing Operational Control +The preceding section delved deeply into a motivation for Change Management / Maintenance Schedules based on our desire to +separate control-plane and worker-node updates without increasing operational burden on end-users. However, +Change Management, by providing control over exactly when updates & material changes to nodes in +the cluster can be initiated, provide value irrespective of this strategic direction. The benefit of +controlling exactly when changes are applied to critical systems is universally appreciated in enterprise +software. + +Since these are such well established principles, I will summarize the motivation as helping +OpenShift meet industry standard expectations with respect to limiting potentially disruptive change +outside well planned time windows. + +It could be argued that rigorous and time sensitive management of OpenShift cluster API resources could prevent +unplanned material changes, but Change Management / Maintenance Schedules introduce higher level, platform native, and more +intuitive guard rails. For example, consider the common pattern of a gitops configured OpenShift cluster. +If a user wants to introduce a change to a MachineConfig, it is simple to merge a change to the +resource without appreciating the fact that it will trigger a rolling reboot of nodes in the cluster. + +Trying to merge this change at a particular time of day and/or trying to pause and unpause a +MachineConfigPool to limit the impact of that merge to a particular time window requires +significant forethought by the user. Even with that forethought, if an enterprise wants +changes to only be applied during weekends, additional custom mechanics would need +to be employed to ensure the change merged during the weekend without needing someone present. + +Contrast this complexity with the user setting a Change Management / Maintenance Schedule on the cluster. The user +is then free to merge configuration changes and gitops can apply those changes to OpenShift +resources, but material change to the cluster will not be initiated until a time permitted +by the Maintenance Schedule. Users do not require special insight into the implications of +configuring platform resources as the top-level Maintenance Schedule control will help ensure +that potentially disruptive changes are limited to well known time windows. + +#### Reducing Service Delivery Operational Tooling +Service Delivery, as part of our OpenShift Dedicated, ROSA and other offerings is keenly aware of +the issues motivating the Change Management / Maintenance Schedule concept. This is evidenced by their design +and implementation of tooling to fill the gaps in the platform the preceding sections +suggest exist. + +Specifically, Service Delivery has developed UXs outside the platform which allow customers +to define a preferred maintenance window. For example, when requesting an update, the user +can specify the desired start time. This is honored by Service Delivery tooling (unless +there are reasons to supersede the customer's preference). 
+
+By acknowledging the need for scheduled maintenance in the platform, we reduce the need for Service
+Delivery to develop and maintain custom tooling to manage the platform while
+simultaneously simplifying management for all customers facing similar challenges.
+
+### User Stories
+For readability, "cluster lifecycle administrator" is used repeatedly in the user stories. This
+term can apply to different roles depending on the cluster environment and profile. In general,
+it is the person or team making most material changes to the cluster - including planning and
+choosing when to enact phases of the OpenShift platform update.
+
+For HCP, the role is called the [Cluster Service Consumer](https://hypershift-docs.netlify.app/reference/concepts-and-personas/#personas). For
+Standalone clusters, this role would normally be filled by one or more `system:admin` users. There
+may be several layers of abstraction between the cluster lifecycle administrator and changes being
+actuated on the cluster (e.g. gitops, OCM, Hive, etc.), but the role will still be concerned with limiting
+risks and disruption when rolling out changes to their environments.
+
+> "As a cluster lifecycle administrator, I want to ensure any material changes to my cluster
+> (control-plane or worker-nodes) are only initiated during well known windows of low service
+> utilization to reduce the impact of any service disruption."
+
+> "As a cluster lifecycle administrator, I want to ensure any material changes to my
+> control-plane are only initiated during well known windows of low service utilization to
+> reduce the impact of any service disruption."
+
+> "As a cluster lifecycle administrator, I want to ensure that no material changes to my
+> cluster occur during a known date range even if it falls within our
+> normal maintenance schedule due to an anticipated atypical usage (e.g. Black Friday)."
+
+> "As a cluster lifecycle administrator, I want to pause additional material changes from
+> taking place when it is no longer practical to monitor for service disruptions. For example,
+> if a worker-node update is proving to be problematic during a valid permissive window, I would
+> like to be able to pause that change manually so that the team will not have to work on the weekend."
+
+> "As a cluster lifecycle administrator, I need to stop all material changes on my cluster
+> quickly and indefinitely until I can understand a potential issue. I do not want to consider dates or
+> timezones in this delay, as they are either not known or irrelevant to my immediate concern."
+
+> "As a cluster lifecycle administrator, I want to ensure any material changes to my
+> control-plane are only initiated during well known windows of low service utilization to
+> reduce the impact of any service disruption. Furthermore, I want to ensure that material
+> changes to my worker-nodes occur on a less frequent cadence because I know my workloads
+> are not HA."
+
+> "As an SRE, tasked with performing non-emergency corrective action, I want
+> to be able to apply a desired configuration (e.g. PID limit change) and have that change roll out
+> in a minimally disruptive way subject to the customer's configured maintenance schedule."
+
+> "As an SRE, tasked with performing emergency corrective action, I want to be able to
+> quickly disable a configured maintenance schedule, apply necessary changes, have them roll out immediately,
+> and restore the maintenance schedule to its previous configuration."
+ +> "As a leader within the Service Delivery organization, tasked with performing emergency corrective action +> across our fleet, I want to be able to bypass and then restore customer maintenance schedules +> with minimal technical overhead." + +> "As a cluster lifecycle administrator who is well served by a fully managed update without change management, +> I want to be minimally inconvenienced by the introduction of change management / maintenance schedules." + +> "As a cluster lifecycle administrator who is not well served by a fully managed update and needs exacting +> control over when material changes occur on my cluster where opportunities do NOT arise at reoccurring intervals, +> I want to employ a change management strategy that defers material changes until I perform a manual action." + +> "As a cluster lifecycle administrator, I want to easily determine the next time at which maintenance operations +> will be permitted to be initiated, based on the configured maintenance schedule, by looking at the +> status of relevant API resources or metrics." + +> "As a cluster lifecycle administrator, I want to easily determine whether there are material changes pending for +> my cluster, awaiting a permitted window based on the configured maintenance schedule, by looking at the +> status of relevant API resources or metrics." + +> "As a cluster lifecycle administrator, I want to easily determine whether a maintenance schedule is currently being +> enforced on my cluster by looking at the status of relevant API resources or metrics." + +> "As a cluster lifecycle administrator, I want to be able to alert my operations team when changes are pending, +> when and the number of seconds to the next permitted window approaches, or when a maintenance schedule is not being +> enforced on my cluster." + +> "As a cluster lifecycle administrator, I want to be able to diagnose why pending changes have not been applied +> if I expected them to be." + +> "As a cluster administrator or privileged user familiar with OpenShift prior to the introduction of change management, +> I want it to be clear when I am looking at the desired versus actual state of the system. For example, if I can see +> the state of the clusterversion or a machineconfigpool, it should be straightforward to understand why I am +> observing differences in the state of those resources compared to the state of the system." + +### Goals + +1. Indirectly support the strategic separation of control-plane and worker-node update phases for Standalone clusters by supplying a change control mechanism that will allow both control-plane and worker-node updates to proceed at predictable times without doubling operational overhead. +2. Directly support the strategic separation of control-plane and worker-node update phases by implementing a "manual" change management strategy where users who value the full control of the separation can manually actuate changes to them independently. +3. Empower OpenShift cluster lifecycle administrators with tools that simplify implementing industry standard notions of maintenance windows. +4. Provide Service Delivery a platform native feature which will reduce the amount of custom tooling necessary to provide maintenance windows for customers. +5. Deliver a consistent change management experience across all platforms and profiles (e.g. Standalone, ROSA, HCP). +6. 
Enable SRE to, when appropriate, make configuration changes on a customer cluster and have that change actually take effect only when permitted by the customer's change management preferences. +7. Do not subvert expectations of customers well served by the existing fully self-managed cluster update. +8. Ensure the architectural space for enabling different change management strategies in the future. + +### Non-Goals + +1. Allowing control-plane upgrades to be paused midway through an update. Control-plane updates are relatively rapid and pausing will introduce unnecessary complexity and risk. +2. Requiring the use of maintenance schedules for OpenShift upgrades (the changes should be compatible with various upgrade methodologies – including being manually triggered). +3. Allowing Standalone worker-nodes to upgrade to a different payload version than the control-plane (this is supported in HCP, but is not a goal for standalone). +4. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement. +5. Providing strict promises around the exact timing of upgrade processes. Maintenance schedules will be honored to a reasonable extent (e.g. upgrade actions will only be initiated during a window), but long-running operations may exceed the configured end of a maintenance schedule. +6. Implementing logic to defend against impractical maintenance schedules (e.g. if a customer configures a 1-second maintenance schedule every year). Service Delivery may want to implement such logic to ensure upgrade progress can be made. + +## Proposal + +### Change Management Overview +Add a `changeManagement` stanza to several resources in the OpenShift ecosystem: +- HCP's `HostedCluster`. Honored by HyperShift Operator and supported by underlying CAPI primitives. +- HCP's `NodePool`. Honored by HyperShift Operator and supported by underlying CAPI primitives. +- Standalone's `ClusterVersion`. Honored by Cluster Version Operator. +- Standalone's `MachineConfigPool`. Honored by Machine Config Operator. + +The implementation of `changeManagement` will vary by profile +and resource, however, they will share a core schema and provide a reasonably consistent user +experience across profiles. + +The schema will provide options for controlling exactly when changes to API resources on the +cluster can initiate material changes to the cluster. Changes that are not allowed to be +initiated due to a change management control will be called "pending". Subsystems responsible +for initiating pending changes will await a permitted window according to the change's +relevant `changeManagement` configuration(s). + +### Change Management Strategies +Each resource supporting change management will add the `changeManagement` stanza and support a minimal set of change management strategies. +Each strategy may require an additional configuration element within the stanza. For example: +```yaml +spec: + changeManagement: + strategy: "MaintenanceSchedule" + pausedUntil: false + disabledUntil: false + config: + maintenanceSchedule: + ..options to configure a detailed policy for the maintenance schedule.. +``` + +All change management implementations must support `Disabled` and `MaintenanceSchedule`. Abstracting +change management into strategies allows for simplified future expansion or deprecation of strategies. +Tactically, `strategy: Disabled` provides a convenient syntax for bypassing any configured +change management policy without permanently deleting its configuration. 
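+
+A minimal sketch of this bypass, using the stanza shown above (values are illustrative only):
+```yaml
+spec:
+  changeManagement:
+    # Temporarily bypass change management without discarding the existing
+    # maintenance schedule configuration below. Restoring
+    # strategy: MaintenanceSchedule re-activates it unchanged.
+    strategy: "Disabled"
+    config:
+      maintenanceSchedule:
+        ..existing maintenance schedule policy, left untouched..
+```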
+ +For example, if SRE needs to apply emergency corrective action on a cluster with a `MaintenanceSchedule` change +management strategy configured, they can simply set `strategy: Disabled` without having to delete the existing +`maintenanceSchedule` stanza which configures the previous strategy. Once the correct action has been completed, +SRE simply restores `strategy: MaintenanceSchedule` and the previous configuration begins to be enforced. + +Configurations for multiple management strategies can be recorded in the `config` stanza, but +only one strategy can be active at a given time. + +Each strategy will support a policy for pausing or unpausing (permitting) material changes from being initiated. +This will be referred to as the strategy's enforcement state (or just "state"). The enforcement state for a +strategy can be either "paused" or "unpaused" (a.k.a. "permissive"). The `Disabled` strategy enforcement state +is always permissive -- allowing material changes to be initiated (see [Change Management +Hierarchy](#change-management-hierarchy) for caveats). + +All change management strategies, except `Disabled`, are subject to the following `changeManagement` fields: +- `changeManagement.disabledUntil: `: When `disabledUntil: true` or `disabledUntil: `, the interpreted strategy for + change management in the resource is `Disabled`. Setting a future date in `disabledUntil` offers a less invasive (i.e. no important configuration needs to be changed) method to + disable change management constraints (e.g. if it is critical to roll out a fix) and a method that + does not need to be reverted (i.e. it will naturally expire after the specified date and the configured + change management strategy will re-activate). +- `changeManagement.pausedUntil: `: Unless the effective active strategy is Disabled, `pausedUntil: true` or `pausedUntil: `, change management must + pause material changes. + +### Change Management Status +Change Management information will also be reflected in resource status. Each resource +which contains the stanza in its `spec` will expose its current impact in its `status`. +Common user interfaces for aggregating and displaying progress of these underlying resources +should be updated to proxy that status information to the end users. + +### Change Management Metrics +Cluster wide change management information will be made available through cluster metrics. Each resource +containing the stanza should expose the following metrics: +- The number of seconds until the next known permitted change window. 0 if changes can currently be initiated. -1 if changes are paused indefinitely. -2 if no permitted window can be computed. +- Whether any change management strategy is enabled. +- Which change management strategy is enabled. +- If changes are pending due to change management controls. + +### Change Management Hierarchy +Material changes to worker-nodes are constrained by change management policies in their associated resource AND +at the control-plane resource. For example, in a standalone profile, if a MachineConfigPool's change management +configuration apparently permits material changes from being initiated at a given moment, that is only the case +if ClusterVersion is **also** permitting changes from being initiated at that time. 
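+
+As an illustrative sketch of this hierarchy (schema as proposed in this document; values are examples
+only), a worker MachineConfigPool whose own strategy is `Disabled` remains constrained by the
+ClusterVersion policy:
+```yaml
+# ClusterVersion (control-plane) policy: material changes may only be
+# initiated during the windows selected by the maintenance schedule.
+spec:
+  changeManagement:
+    strategy: MaintenanceSchedule
+    config:
+      maintenanceSchedule:
+        ..reoccurring permissive windows..
+---
+# Worker MachineConfigPool: locally Disabled (always permissive), but its
+# effective state is still "paused" whenever the ClusterVersion policy
+# above is not permitting changes.
+spec:
+  changeManagement:
+    strategy: Disabled
+```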
+
+The design choice is informed by a thought experiment: As a cluster lifecycle administrator for a Standalone cluster,
+who wants to achieve the simple goal of ensuring no material changes take place outside a well-defined
+maintenance schedule, do you want to take on the challenge of keeping every MachineConfigPool's
+`changeManagement` stanza in perfect synchronization with the ClusterVersion's? What if a new MCP is created
+without your knowledge?
+
+The hierarchical approach allows a single master change management policy to be in place across
+both the control-plane and worker-nodes.
+
+Conversely, material changes CAN take place on the control-plane when permitted by its associated
+change management policy even while material changes are not being permitted by worker-node
+policies.
+
+It is thus occasionally necessary to distinguish a resource's **configured** vs **effective** change management
+state. There are two states: "paused" and "unpaused" (a.k.a. permissive; meaning that material changes can be initiated).
+For a control-plane resource, the configured and effective enforcement states are always the same. For worker-node
+resources, the configured strategy may be disabled, but the effective enforcement state can be "paused" due to
+an active strategy in the control-plane resource being in the "paused" state.
+
+| control-plane state    | worker-node state      | worker-node effective state | results |
+|------------------------|------------------------|-----------------------------|---------|
+| unpaused               | unpaused               | unpaused                    | Traditional, fully self-managed change rollouts. Material changes can be initiated immediately upon configuration change. |
+| paused (any strategy)  | **unpaused**           | **paused**                  | Changes to both the control-plane and worker-nodes are constrained by the control-plane strategy. |
+| unpaused               | paused (any strategy)  | paused                      | Material changes can be initiated immediately on the control-plane. Material changes on worker-nodes are subject to the worker-node policy. |
+| paused (any strategy)  | paused (any strategy)  | paused                      | Material changes to the control-plane are subject to the change control strategy for the control-plane. Material changes to the worker-nodes are subject to **both** the control-plane and worker-node strategies - if either precludes material change initiation, changes are left pending. |
+
+#### Maintenance Schedule Strategy
+The maintenance schedule strategy is supported by all resources which support change management. The strategy
+is configured by specifying an RRULE identifying permissive datetimes during which material changes can be
+initiated. The cluster lifecycle administrator can also exclude specific date ranges, during which
+material changes will be paused.
+
+#### Disabled Strategy
+This strategy indicates that no change management strategy is being enforced by the resource. It always implies that
+the enforcement state at the resource level is unpaused / permissive. This does not always
+mean that material changes are permitted due to change management hierarchies. For example, a MachineConfigPool
+with `strategy: Disabled` would still be subject to a `strategy: MaintenanceSchedule` in the ClusterVersion resource.
+
+#### Assisted Strategy - MachineConfigPool
+Minimally, this strategy will be supported by MachineConfigPool.
If and when the strategy is supported by other +change management capable resources, the configuration schema for the policy may differ as the details of +what constitutes and informs change varies between resources. + +This strategy is motivated by the desire to support the separation of control-plane and worker-node updates both +conceptually for users and in real technical terms. One way to do this for users who do not benefit from the +`MaintenanceSchedule` strategy is to ask them to initiate, pause, and resume the rollout of material +changes to their worker nodes. Contrast this with the fully self-managed state today, where worker-nodes +(normally) begin to be updated automatically and directly after the control-plane update. + +Clearly, if this was the only mode of updating worker-nodes, we could never successfully disentangle the +concepts of control-plane vs worker-node updates in Standalone environments since one implies the other. + +In short (details will follow in the implementation section), the assisted strategy allows users to specify the +exact rendered [`desiredConfig` the MachineConfigPool](https://github.com/openshift/machine-config-operator/blob/5112d4f8e562a2b072106f0336aeab451341d7dc/docs/MachineConfigDaemon.md#coordinating-updates) should be advertising to the MachineConfigDaemon on +nodes it is associated with. Like the `MaintenanceSchedule` strategy, it also respects the `pausedUntil` +field. + +#### Manual Strategy - MachineConfigPool +Minimally, this strategy will be supported by MachineConfigPool. If and when the strategy is supported by other +change management capable resources, the configuration schema for the policy may differ as the details of +what constitutes and informs change varies between resources. + +Like the Assisted strategy, this strategy is implemented to support the conceptual and technical separation +of control-plane and worker-nodes. The MachineConfigPool Manual strategy allows users to explicitly specify +their `desiredConfig` to be used for ignition of new and rebooting nodes. While the Manual strategy is enabled, +the MachineConfigOperator will not trigger the MachineConfigDaemon to drain or reboot nodes automatically. + +Because the Manual strategy initiates changes on its own behalf, `pausedUntil` has no effect. From a metrics +perspective, this strategy reports as paused indefinitely. + +### Workflow Description + +#### OCM HCP Standard Change Management Scenario + +1. A [Cluster Service Consumer](https://hypershift-docs.netlify.app/reference/concepts-and-personas/#personas) requests an HCP cluster via OCM. +1. To comply with their company policy, the service consumer configures a maintenance schedule through OCM. +1. Their first preference, no updates at all, is rejected by OCM policy, and they are referred to service + delivery documentation explaining minimum requirements. +1. The user specifies a policy which permits changes to be initiated any time Saturday UTC on the control-plane. +1. To limit perceived risk, they try to specify a separate policy permitting worker-nodes updates only on the **first** Sunday of each month. +1. OCM rejects the configuration because, due to change management hierarchy, worker-node maintenance schedules can only be a proper subset of control-plane maintenance schedules. +1. The user changes their preference to a policy permitting worker-nodes updates only on the **first** Saturday of each month. +1. OCM accepts the configuration. +1. 
OCM configures the HCP (HostedCluster/NodePool) resources via the Service Delivery Hive deployment to contain a `changeManagement` stanza + and an active/configured `MaintenanceSchedule` strategy. +1. Hive updates the associated HCP resources. +1. Company workloads are added to the new cluster and the cluster provides value. +1. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform. +1. Via OCM, the service consumer requests the minor version update. They can do this at any time with confidence that the maintenance + schedule will be honored. They do so on Wednesday. +1. OCM (through various layers) updates the target release payload in the HCP HostedCluster and NodePool. +1. The HyperShift Operator detects the desired changes but recognizes that the `changeManagement` stanza + precludes the updates from being initiated. +1. Curious, the service consumer checks the projects ClusterVersion within the HostedCluster and reviews its `status` stanza. It shows that changes are pending and the time of the next window in which changes can be initiated. +1. Separate metrics specific to change management indicate that changes are pending for both resources. +1. The non-Red Hat operations team has alerts setup to fire when changes are pending and the number of + seconds before the next permitted window is less than 2 days away. +1. These alerts fire after Thursday UTC 00:00 to inform the operations team that changes are about to be applied to the control-plane. +1. It is not the first week of the month, so there is no alert fired for the NodePool pending changes. +1. The operations team is comfortable with the changes being rolled out on the control-plane. +1. On Saturday 00:00 UTC, the HyperShift operator initiates changes the control-plane update. +1. The update completes without issue. +1. Changes remain pending for the NodePool resource. +1. As the first Saturday of the month approaches, the operations alerts fire to inform the team of forthcoming changes. +1. The operations team realizes that a corporate team needs to use the cluster heavily during the weekend for a business critical deliverable. +1. The service consumer logs into OCM and adds an exclusion for the upcoming Saturday. +1. Interpreting the new exclusion, the metric for time remaining until a permitted window increases to count down to the following month's first Saturday. +1. A month passes and the pending cause the configured alerts to fire again. +1. The operations team is comfortable with the forthcoming changes. +1. The first Saturday of the month 00:00 UTC arrives. The HyperShift operator initiates the worker-node updates based on the pending changes in the cluster NodePool. +1. The HCP cluster has a large number of worker nodes and draining and rebooting them is time-consuming. +1. At 23:59 UTC Saturday night, 80% of worker-nodes have been updated. Since the maintenance schedule still permits the initiation of material changes, another worker-node begins to be updated. +1. The update of this worker-node continues, but at 00:00 UTC Sunday, no further material changes are permitted by the change management policy and the worker-node update process is effectively paused. +1. Because not all worker-nodes have been updated, changes are still reported as pending via metrics for NodePool. **TODO: Review with HyperShift. Pausing progress should be possible, but a metric indicating changes still pending may not since they interact only through CAPI.** +1. 
The HCP cluster runs with worker-nodes at mixed versions throughout the month. The N-1 skew between the old kubelet versions and control-plane is supported. +1. **TODO: Review with Service Delivery. If the user requested another minor bump to their control-plane, how does OCM prevent unsupported version skew today?** +1. On the next first Saturday, the worker-nodes updates are completed. + +#### OCM Standalone Standard Change Management Scenario + +1. User interactions with OCM to configure a maintenance schedule are identical to [OCM HCP Standard Change Management Scenario](#ocm-hcp-standard-change-management-scenario). + This scenario differs after OCM accepts the maintenance schedule configuration. Control-plane updates are permitted to be initiated to any Saturday UTC. + Worker-nodes must wait until the first Saturday of the month. +1. OCM (through various layers) configures the ClusterVersion and worker MachineConfigPool(s) (MCP) for the cluster with appropriate `changeManagement` stanzas. +1. Company workloads are added to the new cluster and the cluster provides value. +1. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform. +1. Via OCM, the service consumer requests the minor version update. They can do this at any time with confidence that the maintenance + schedule will be honored. They do so on Wednesday. +1. OCM (through various layers) updates the ClusterVersion resource on the cluster indicating the new release payload in `desiredUpdate`. +1. The Cluster Version Operator (CVO) detects that its `changeManagement` stanza does not permit the initiation of the change. +1. The CVO sets a metric indicating that changes are pending for ClusterVersion. Irrespective of pending changes, the CVO also exposes a + metric indicating the number of seconds until the next window in which material changes can be initiated. +1. Since MachineConfigs do not match in the desired update and the current manifests, the CVO also sets a metric indicating that MachineConfig + changes are pending. This is done because the MachineConfigOperator (MCO) cannot anticipate the coming manifest changes and cannot, + therefore, reflect expected changes to the worker-node MCPs. Anticipating this change ahead of time is necessary for an operation + team to be able to set an alert with the semantics (worker-node-update changes are pending & time remaining until changes are permitted < 2d). + The MCO will expose its own metric for changes pending when manifests are updated. But this metric will only indicate when + there are machines in the pool that have not achieved the desired configuration. An operations team trying to implement the 2d + early warning for worker-nodes must use OR on these metrics to determine whether changes are actually pending. +1. The MCO, irrespective of pending changes, exposes a metric for each MCP to indicate the number of seconds remaining until it is + permitted to initiate changes to nodes in that MCP. +1. A privileged user on the cluster notices different options available for `changeManagement` in the ClusterVersion and MachineConfigPool + resources. They try to set them but are prevented by either RBAC or an admission webhook (details for Service Delivery). If they wish + to change the settings, they must update them through OCM. +1. The privileged user does an `oc describe ...` on the resources. They can see that material changes are pending in ClusterVersion for + the control-plane and for worker machine config. 
They can also see the date and time that the next material change will be permitted. + The MCP will not show a pending change at this time, but will show the next time at which material changes will be permitted. +1. The next Saturday is _not_ the first Saturday of the month. The CVO detects that material changes are permitted at 00:00 UTC and + begins to apply manifests. This effectively initiates the control-plane update process, which is considered a single + material change to the cluster. +1. The control-plane update succeeds. The CVO, having reconciled its state, unsets metrics suggesting changes are pending. +1. As part of updating cluster manifests, MachineConfigs have been modified. The MachineConfigOperator (MCO) re-renders a + configuration for worker-nodes. However, because the MCP maintenance schedule precludes initiating material changes, + it will not begin to update Machines with that desired configuration. +1. The MCO will set a metric indicating that desired changes are pending. +1. `oc get -o=yaml/describe` will both provide status information indicating that changes are pending for the MCP and + the time at which the next material changes can be initiated according to the maintenance schedule. +1. On the first Saturday of the next month, 00:00 UTC, the MCO determines that material changes are permitted. + Based on limits like maxUnavailable, the MCO begins to annotate nodes with the desiredConfiguration. The + MachineConfigDaemon takes over from there, draining, and rebooting nodes into the updated release. +1. There are a large number of nodes in the cluster and this process continues for more than 24 hours. On Saturday + 23:59, the MCO applies a round of desired configurations annotations to Nodes. At 00:00 on Sunday, it detects + that material changes can no longer be initiated, and pauses its activity. Node updates that have already + been initiated continue beyond the maintenance schedule window. +1. Since not all nodes have been updated, the MCO continues to expose a metric informing the system of + pending changes. +1. In the subsequent days, the cluster is scaled up to handle additional workload. The new nodes receive + the most recent, desired configuration. +1. On the first Saturday of the next month, the MCO resumes its work. In order to ensure that forward progress is + made for all nodes, the MCO will update nodes that have the oldest current configuration first. This ensures + that even if the desired configuration has changed multiple times while maintenance was not permitted, + no nodes are starved of updates. Consider the alternative where (a) worker-node updates required > 24h, + (b) updates to nodes are performed alphabetically, and (c) MachineConfigs are frequently being changed + during times when maintenance is not permitted. This strategy could leave nodes sorting last + lexicographically no opportunity to receive updates. This scenario would eventually leave those nodes + more prone to version skew issues. +1. During this window of time, all node updates are initiated, and they complete successfully. + +#### Service Delivery Emergency Patch +1. SRE determines that a significant new CVE threatens the fleet. +1. A new OpenShift release in each z-stream fixes the problem. +1. SRE plans to override customer maintenance schedules in order to rapidly remediate the problem across the fleet. +1. The new OpenShift release(s) are configured across the fleet. Clusters with permissive maintenance + schedules begin to apply the changes immediately. +1. 
Clusters with change management policies precluding updates are SRE's next focus.
+1. During each region's evening hours, to limit disruption, SRE changes the `changeManagement` strategy
+   field across relevant resources to `Disabled`. Changes that were previously pending are now
+   permitted to be initiated.
+1. Cluster operators who have alerts configured to fire when there is no change management policy in place
+   will do so.
+1. As clusters are successfully remediated, SRE restores the `MaintenanceSchedule` strategy for its resources.
+
+
+#### Service Delivery Immediate Remediation
+1. A customer raises a ticket for a problem that is eventually determined to be caused by a worker-node system configuration.
+1. SRE can address the issue with a system configuration file applied in a MachineConfig.
+1. SRE creates the MachineConfig for the customer and provides the customer the option to either (a) wait until their
+   configured maintenance schedule permits the material change to be initiated by the MachineConfigOperator
+   or (b) have SRE override the maintenance schedule and permit its immediate application.
+1. The customer chooses immediate application.
+1. SRE applies a change to the relevant control-plane AND worker-node resource's `changeManagement` stanza
+   (both must be changed because of the change management hierarchy), setting `disabledUntil` to
+   a time 48 hours in the future. The configured change management schedule is ignored for 48 hours as the system
+   initiates all necessary node changes.
+
+#### Service Delivery Deferred Remediation
+1. A customer raises a ticket for a problem that is eventually determined to be caused by a worker-node system configuration.
+1. SRE can address the issue with a system configuration file applied in a MachineConfig.
+1. SRE creates the MachineConfig for the customer and provides the customer the option to either (a) wait until their
+   configured maintenance schedule permits the material change to be initiated by the MachineConfigOperator
+   or (b) have SRE override the maintenance schedule and permit its immediate application.
+1. The problem is not pervasive, so the customer chooses the deferred remediation.
+1. The change is initiated and nodes are rebooted during the next permissive window.
+
+
+#### On-prem Standalone GitOps Change Management Scenario
+1. An on-prem cluster is fully managed by gitops. As changes are committed to git, those changes are applied to cluster resources.
+1. Configurable stanzas of the ClusterVersion and MachineConfigPool(s) resources are checked into git.
+1. The cluster lifecycle administrator configures `changeManagement` in both the ClusterVersion and worker MachineConfigPool
+   in git. The MaintenanceSchedule strategy is chosen. The policy permits control-plane and worker-node updates only after
+   19:00 Eastern US.
+1. During the working day, users may contribute and merge changes to MachineConfigs or even the `desiredUpdate` of the
+   ClusterVersion. These resources will be updated in a timely manner via GitOps.
+1. Despite the resource changes, neither the CVO nor MCO will begin to initiate the material changes on the cluster.
+1. Privileged users who may be curious as to the discrepancy between git and the cluster state can use `oc get -o=yaml/describe`
+   on the resources. They observe that changes are pending and the time at which changes will be initiated.
+1. At 19:00 Eastern, the pending changes begin to be initiated.
This rollout abides by documented OpenShift constraints
+   such as the MachineConfigPool `maxUnavailable` setting.
+
+#### On-prem Standalone Manual Strategy Scenario
+1. A small, business critical cluster is being run on-prem.
+1. There are no reoccurring windows of time when the cluster lifecycle administrator can tolerate downtime.
+   Instead, updates are negotiated and planned far in advance.
+1. The cluster workloads are not HA and unplanned drains are considered a business risk.
+1. To prevent surprises, the cluster lifecycle administrator sets the Manual strategy on the worker MCP.
+1. Given the sensitivity of the operation, the lifecycle administrator wants to manually drain and reboot
+   nodes to accomplish the update.
+1. The cluster lifecycle administrator sends a company-wide notice about the period during which service may be disrupted.
+1. The user determines the most recent rendered worker configuration. They configure the `manual` change
+   management policy to use that exact configuration as the `desiredConfig`.
+1. The MCO is thus being asked to ignite any new node or rebooted node with the desired configuration, but it
+   is **not** being permitted to apply that configuration to existing nodes because change management is, in effect,
+   paused indefinitely by the manual strategy.
+1. The MCO metric for the MCP indicating the number of seconds remaining until changes can be initiated is `-1` - indicating
+   that there is presently no time in the future where it will initiate material changes. The operations team
+   has an alert configured if this value `!= -1`.
+1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running
+   the most recently rendered configuration. This is irrespective of the `desiredConfig` in the `manual`
+   policy. Abstractly, it indicates whether changes would be initiated if change management were disabled.
+1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster. As they come back online,
+   the MachineConfigServer offers them the desiredConfig requested by the manual policy.
+1. After updating all nodes, the cluster lifecycle administrator does not need to make any additional
+   configuration changes. They can leave the `changeManagement` stanza in their MCP as-is.
+
+#### On-prem Standalone Assisted Strategy Scenario
+1. A large, business critical cluster is being run on-prem.
+1. There are no reoccurring windows of time when the cluster lifecycle administrator can tolerate downtime.
+   Instead, updates are negotiated and planned far in advance.
+1. The cluster workloads are not HA and unplanned drains are considered a business risk.
+1. To prevent surprises, the cluster lifecycle administrator sets the Assisted strategy on the worker MCP.
+1. In the `assisted` strategy change management policy, the lifecycle administrator configures `pausedUntil: true`
+   and the most recently rendered worker configuration in the policy's `renderedConfigsBefore` field.
+1. The MCO is being asked to ignite any new node or any rebooted node with the latest rendered configuration
+   before the present datetime. However, because of `pausedUntil: true`, it is also being asked not to
+   automatically initiate that material change for existing nodes.
+1. The MCO metric for the MCP indicating the number of seconds remaining until changes can be initiated is `-1` - indicating
+   that there is presently no time in the future where it will initiate material changes.
The operations team + has an alert configured if this value `!= -1`. +1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running + the most recent, rendered configuration. This is irrespective of the `renderedConfigsBefore` in the `assisted` + configuration. Abstractly, it means, if change management were disabled, whether changes be initiated. +1. When the lifecycle administrator is ready to permit disruption, they set `pausedUntil: false`. +1. The MCO sets the number of seconds until changes are permitted to `0`. +1. The MCO begins to initiate worker node updates. This rollout abides by documented OpenShift constraints + such as the MachineConfigPool `maxUnavailable` setting. +1. Though new rendered configurations may be created, the assisted strategy will not act until the assisted policy + is updated to permit a more recent creation date. + +### API Extensions + +API Extensions are CRDs, admission and conversion webhooks, aggregated API servers, +and finalizers, i.e. those mechanisms that change the OCP API surface and behaviour. + +- Name the API extensions this enhancement adds or modifies. +- Does this enhancement modify the behaviour of existing resources, especially those owned + by other parties than the authoring team (including upstream resources), and, if yes, how? + Please add those other parties as reviewers to the enhancement. + + Examples: + - Adds a finalizer to namespaces. Namespace cannot be deleted without our controller running. + - Restricts the label format for objects to X. + - Defaults field Y on object kind Z. + +Fill in the operational impact of these API Extensions in the "Operational Aspects +of API Extensions" section. + +### Topology Considerations + +#### Hypershift / Hosted Control Planes + +In the HCP topology, the HostedCluster and NodePool resources are enhanced to support the change management strategies +`MaintenanceSchedule` and `Disabled`. + +#### Standalone Clusters + +In the Standalone topology, the ClusterVersion and MachineConfigPool resources are enhanced to support the change management strategies +`MaintenanceSchedule` and `Disabled`. The MachineConfigPool also supports the `Manual` and `Assisted` strategies. + +#### Single-node Deployments or MicroShift + +The ClusterVersion operator will honor the change management field just as in a standalone profile. If those profiles +have a MachineConfigPool, material changes the node could be controlled with a change management policy +in that resource. + +#### OCM Managed Profiles +OpenShift Cluster Manager (OCM) should expose a user interface allowing users to manage their change management policy. +Standard Fleet clusters will expose the option to configure the MaintenanceSchedule strategy - including +only permit and exclude times. + +- Service Delivery will reserve the right to disable this strategy for emergency corrective actions. +- Service Delivery should constrain permit & exclude configurations based on their internal policies. For example, customers may be forced to enable permissive windows which amount to at least 6 hours a month. + +### Implementation Details/Notes/Constraints + +#### ChangeManagement Stanza +The change management stanza will be introduced into ClusterVersion and MachineConfigPool (for standalone profiles) +and HostedCluster and NodePool (for HCP profiles). The structure of the stanza is: + +```yaml +spec: + changeManagement: + # The active strategy for change management (unless disabled by disabledUntil). 
+ strategy: + + # If set to true or a future date, the effective change management strategy is Disabled. Date + # must be RFC3339. + disabledUntil: + + # If set to true or a future date, all strategies other than Disabled are paused. Date + # must be RFC3339. + pausedUntil: + + # If a strategy needs additional configuration information, it can read a + # key bearing its name in the config stanza. + config: + : + ...configuration policy for the strategy... +``` + +#### MaintenanceSchedule Strategy Configuration + +```yaml +spec: + changeManagement: + strategy: MaintenanceSchedule + config: + maintenanceSchedule: + # Specifies a reoccurring permissive window. + permit: + # RRULEs (https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) are commonly used + # for calendar management metadata. Only a subset of the RFC is supported. If + # unset, all dates are permitted and only exclude constrains permissive windows. + recurrence: + # Given the identification of a date by an RRULE, at what time (relative to timezoneOffset) can the + # permissive window begin. "00:00" if unset. + startTime: + # Given the identification of a date by an RRULE, at what time (relative to timezoneOffset) should the + # permissive window end. "23:59:59" if unset. + endTime: + + # Excluded date ranges override RRULE selections. + exclude: + # Dates should be specified in YYYY-MM-DD. Each date is excluded from 00:00 for 24 hours. + - fromDate: + # Non-inclusive until. If null, until defaults to the day after from (meaning a single day exclusion). + untilDate: + + # Specifies an RFC3339 style timezone offset to be applied across their datetime selections. + # "-07:00" indicates negative 7 hour offset from UTC. "+03:00" indicates positive 3 hour offset. If not set, defaults to "+00:00" (UTC). + timezoneOffset: + +``` + +Permitted times (i.e. times at which the strategy enforcement state can be permissive) are specified using a +subset of the [RRULE RFC5545](https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) and, optionally, a +starting and ending time of day. https://freetools.textmagic.com/rrule-generator is a helpful tool to +review the basic semantics RRULE is capable of expressing. https://exontrol.com/exicalendar.jsp?config=/js#calendar +offers more complex expressions. + +**RRULE Interpretation** +RRULE supports expressions that suggest recurrence without implying an exact date. For example: +- `RRULE:FREQ=YEARLY` - An event that occurs once a year on a specific date. +- `RRULE:FREQ=WEEKLY;INTERVAL=2` - An event that occurs every two weeks. + +All such expressions shall be evaluated with a starting date of Jan 1st, 1970 00:00. In other +words, `RRULE:FREQ=YEARLY` would be considered permissive, for one day, at the start of each new year. + +If no `startTime` or `endTime` is specified, any day selected by the RRULE will suggest a +permissive 24h window unless a date is in the `exclude` ranges. + +**RRULE Constraints** +A valid RRULE for change management: +- must identify a date, so, although RRULE supports `FREQ=HOURLY`, it will not be supported. +- cannot specify an end for the pattern. `RRULE:FREQ=DAILY;COUNT=3` suggests + an event that occurs every day for three days only. As such, neither `COUNT` nor `UNTIL` is + supported. +- cannot specify a permissive window more than 2 years away. 
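+
+As a concrete sketch of this strategy using the schema above (the RRULE, dates, and offset are
+illustrative only): permit changes every Saturday between 00:00 and 08:00 US Eastern, excluding a
+Black Friday weekend:
+```yaml
+spec:
+  changeManagement:
+    strategy: MaintenanceSchedule
+    config:
+      maintenanceSchedule:
+        permit:
+          # Every Saturday.
+          recurrence: "RRULE:FREQ=WEEKLY;BYDAY=SA"
+          startTime: "00:00"
+          endTime: "08:00"
+        exclude:
+          # Pause changes Friday through Sunday of the Black Friday weekend
+          # (untilDate is non-inclusive).
+          - fromDate: "2024-11-29"
+            untilDate: "2024-12-02"
+        # Interpret the times above as US Eastern (standard time).
+        timezoneOffset: "-05:00"
+```
+
+With such a configuration, a change merged on a Wednesday would remain pending until 00:00 Saturday,
+and a change pending over the excluded weekend would remain pending until the following Saturday.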
+ +**Overview of Interactions** +The MaintenanceSchedule strategy, along with `changeManagement.pausedUntil` allows a cluster lifecycle administrator to express +one of the following: + +| pausedUntil | permit | exclude | Enforcement State (Note that **effective** state must also take into account hierarchy) | +|----------------|--------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `null`/`false` | `null` | `null` | Permissive indefinitely | +| `true` | * | * | Paused indefinitely | +| `null`/`false` | set | `null` | Permissive during reoccurring windows time. Paused at all other times. | +| `null`/`false` | set | set | Permissive during reoccurring windows time modulo excluded date ranges during which it is paused. Paused at all other times. | +| `null`/`false` | `null` | set | Permissive except during excluded dates during which it is paused. | +| date | * | * | Honor permit and exclude values, but only after the specified date. For example, permit: `null` and exclude: `null` implies the strategy is indefinitely permissive after the specified date. | + + +#### MachineConfigPool Assisted Strategy Configuration + +```yaml +spec: + changeManagement: + strategy: Assisted + config: + assisted: + permit: + # The assisted strategy will allow the MCO to process any rendered configuration + # that was created before the specified datetime. + renderedConfigsBefore: + # When AllowSettings, rendered configurations after the preceding before date + # can be applied if and only if they do not contain changes to osImageURL. + policy: "AllowSettings|AllowNone" +``` + +The primary user of this strategy is `oc` with tentatively planned enhancements to include verbs +like: +```sh +$ oc adm update worker-nodes start ... +$ oc adm update worker-nodes pause ... +$ oc adm update worker-nodes rollback ... +``` + +These verbs can leverage the assisted strategy and `pausedUntil` to allow the manual initiation of worker-nodes +updates after a control-plane update. + +#### MachineConfigPool Manual Strategy Configuration + +```yaml +spec: + changeManagement: + strategy: Manual + config: + manual: + desiredConfig: +``` + +The manual strategy requests no automated initiation of updates. New and rebooting +nodes will only receive the desired configuration. From a metrics perspective, this strategy +is always paused state. + +#### Metrics + +`cm_change_pending` +Labels: +- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool +- object= +- system= +Value: +- `0`: no material changes are pending. +- `1`: changes are pending but being initiated. +- `2`: changes are pending and blocked based on this resource's change management policy. +- `3`: changes are pending and blocked based on another resource in the change management hierarchy. + +`cm_change_eta` +Labels: +- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool +- object= +- system= +Value: +- `-2`: Error determining the time at which changes can be initiated (e.g. cannot check with ClusterVersion / change management hierarchy). +- `-1`: Material changes are paused indefinitely OR no permissive window can be found within the next 1000 days (the latter ensures a brute force check of intersecting datetimes with hierarchy RRULEs is a valid method of calculating intersection). +- `0`: Any pending changes can be initiated now (e.g. change management is disabled or inside machine schedule window). 
+- `> 0`: The number seconds remaining until changes can be initiated. + +`cm_strategy_enabled` +Labels: +- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool +- object= +- system= +- strategy=MaintenanceSchedule|Manual|Assisted +Value: +- `0`: Change management for this resource is not subject to this enabled strategy (**does** consider hierarchy based disable). +- `1`: Change management for this resource is directly subject to this enabled strategy. +- `2`: Change management for this resource is indirectly subject to this enabled strategy (i.e. only via control-plane override hierarchy). +- `3`: Change management for this resource is directly and indirectly subject to this enabled strategy. + +#### Change Management Status +Each resource which exposes a `.spec.changeManagement` stanza should also expose `.status.changeManagement` . + +```yaml +status: + changeManagement: + # Always show control-plane level strategy. Disabled if disabledUntil is true. + clusterStrategy: + # If this a worker-node related resource (e.g. MCP), show local strategy. Disabled if disabledUntil is true. + workerNodeStrategy: + # Show effective state. + effectiveState: + description: "Human readable message explaining how strategies & configuration are resulting in the effective state." + # The start of the next permissive window, taking into account the hierarchy. "N/A" for indefinite pause or >1000 days. + permitChangesETA: + changesPending: +``` + +#### Change Management Bypass Annotation +In some situations, it may be necessary for a MachineConfig to be applied regardless of the active change +management policy for a MachineConfigPool. In such cases, `machineconfiguration.openshift.io/bypass-change-management` +can be set to any non-empty string. The MCO will progress until MCPs which select annotated +MachineConfigs have all machines running with a desiredConfig containing that MachineConfig's current state. + +This annotation will be present on `00-master` to ensure that, once the CVO updates the MachineConfig, +the remainder of the control-plane update will be treated as a single material change. + +### Special Handling +These cases are mentioned or implied elsewhere in the enhancement documentation, but they deserve special +attention. + +#### Change Management on Master MachineConfigPool +In order to allow control-plane updates as a single material change, the MCO will only honor change the management configuration for the +master MachineConfigPool if user generated MachineConfigs are the cause for a pending change. To accomplish this, +at least one MachineConfig updated by the CVO will have the `machineconfiguration.openshift.io/bypass-change-management` annotation +indicating that changes in the MachineConfig must be acted upon irrespective of the master MCP change management policy. + +#### Limiting Overlapping Window Search / Permissive Window Calculation +An operator implementing change management for a worker-node related resource must honor the change management hierarchy when +calculating when the next permissive window will occur (called elsewhere in the document, ETA). This is not +straightforward to compute when both the control-plane and worker-nodes have independent MaintenanceSchedule +configurations. + +We can, however, simplify the process by reducing the number of days in the future the algorithm must search for +coinciding permissive windows. 1000 days is a proposed cut-off. 
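+As an illustrative sketch of the hierarchy involved (values hypothetical), consider a ClusterVersion that
+permits changes every Saturday while a worker MachineConfigPool only permits changes on the first Saturday
+of each month:
+
+```yaml
+# ClusterVersion: permissive every Saturday.
+spec:
+  changeManagement:
+    strategy: MaintenanceSchedule
+    config:
+      maintenanceSchedule:
+        permit:
+          recurrence: "FREQ=WEEKLY;BYDAY=SA"
+---
+# Worker MachineConfigPool: permissive only on the first Saturday of each month.
+spec:
+  changeManagement:
+    strategy: MaintenanceSchedule
+    config:
+      maintenanceSchedule:
+        permit:
+          recurrence: "FREQ=MONTHLY;BYDAY=1SA"
+```
+
+Worker-node changes may only be initiated at times permitted by both levels of the hierarchy, so the MCP's
+next permissive window in this sketch is the next first Saturday of a month.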
+
+To calculate coinciding windows then, the implementation can use [rrule-go](https://github.com/teambition/rrule-go)
+to iteratively find permissive windows at the cluster / control-plane level. These can be added to an
+[interval-tree](https://github.com/rdleal/intervalst). As dates are added, rrule calculations for the worker-nodes
+can be performed. The interval-tree should be able to efficiently determine whether there is an
+intersection between the permissive intervals it has stored for the control-plane and the time range tested for the
+worker-nodes.
+
+Since it is possible there is no overlap, limits must be placed on this search. Once dates >1000 days from
+the present moment are being tested, the operator can behave as if an indefinite pause has been requested.
+
+This outcome does not need to be recomputed unless the operator restarts or one of the RRULEs involved
+is modified.
+
+If an overlap _is_ found, no additional intervals need to be added to the tree and it can be discarded.
+The operator can store the start & end datetimes for the overlap and count down the seconds remaining
+until it occurs. Obviously, this calculation must be repeated:
+1. If either MaintenanceSchedule configuration is updated.
+1. The operator is restarted.
+1. At the end of a permissive window, in order to determine the next permissive window.
+
+
+#### Service Delivery Option Sanitization
+The range of flexible options offered by change management configurations
+can create risks for inexperienced cluster lifecycle administrators. For example, setting a
+standalone cluster to use the Assisted strategy and failing to trigger worker-node updates will
+leave unpatched CVEs on worker-nodes much longer than necessary. It will also eventually lead to
+the need to resolve version skew (Upgradeable=False will be reported by the API cluster operator).
+
+Service Delivery understands that exposing the full range of options to cluster
+lifecycle administrators could dramatically increase the overhead of managing their fleet. To
+prevent this outcome, Service Delivery will only expose a subset of the change management
+strategies. They will also implement sanitization of the configuration options a user can
+supply to those strategies. For example, a simplified interface in OCM for building a
+limited range of RRULEs that are compliant with Service Delivery's update policies.
+
+### Risks and Mitigations
+
+- Given the range of operators which must implement support for change management, inconsistent behavior or reporting may make it difficult for users to navigate different profiles.
+  - Mitigation: A shared library should be created and vendored for RRULE/exclude/next window calculations/metrics.
+- Users familiar with the fully self-managed nature of OpenShift may be confused when material changes are not initiated because change management constraints are active.
+  - Mitigation: The introduction of change management will not change the behavior of existing clusters. Users must make a configuration change.
+- Users may put themselves at risk of CVEs by being too conservative with worker-node updates.
+- Users leveraging change management may be more likely to reach unsupported kubelet skew configurations vs fully self-managed cluster management.
+
+### Drawbacks
+
+The scope of the enhancement, cutting across several operators, requires multiple, careful implementations.
The enhancement
+also touches code paths that have been refined for years and that assume a fully self-managed cluster approach.
+Upsetting these code paths may prove challenging.
+
+## Open Questions [optional]
+
+1. Can the HyperShift Operator expose a metric for when changes are pending for a subset of worker nodes on the cluster if it can only interact via CAPI resources?
+2. Can the MCO interrogate the ClusterVersion change management configuration in order to calculate overlapping permissive intervals in the future?
+
+## Test Plan
+
+**Note:** *Section not required until targeted at a release.*
+
+Consider the following in developing a test plan for this enhancement:
+- Will there be e2e and integration tests, in addition to unit tests?
+- How will it be tested in isolation vs with other components?
+- What additional testing is necessary to support managed OpenShift service-based offerings?
+
+No need to outline all of the test cases, just the general strategy. Anything
+that would count as tricky in the implementation and anything particularly
+challenging to test should be called out.
+
+All code is expected to have adequate tests (eventually with coverage
+expectations).
+
+## Graduation Criteria
+
+**Note:** *Section not required until targeted at a release.*
+
+Define graduation milestones.
+
+These may be defined in terms of API maturity, or as something else. Initial proposal
+should keep this high-level with a focus on what signals will be looked at to
+determine graduation.
+
+Consider the following in developing the graduation criteria for this
+enhancement:
+
+- Maturity levels
+
+The API extensions will be made to existing, stable APIs. `changeManagement` is an optional
+field in the resources which bear it and so does not break backwards compatibility.
+
+The lack of a change management field implies the Disabled strategy - which ensures
+the existing, fully self-managed update behaviors are not constrained. That is,
+unless a change management strategy is configured, the behavior of existing clusters
+will not be affected.
+
+### Dev Preview -> Tech Preview
+
+- Ability to utilize the enhancement end to end
+- End user documentation, relative API stability
+- Sufficient test coverage
+- Gather feedback from users rather than just developers
+- Enumerate service level indicators (SLIs), expose SLIs as metrics
+- Write symptoms-based alerts for the component(s)
+
+### Tech Preview -> GA
+
+- More testing (upgrade, downgrade, scale)
+- Sufficient time for feedback
+- Available by default
+- Backhaul SLI telemetry
+- Document SLOs for the component
+- Conduct load testing
+- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+**For non-optional features moving to GA, the graduation criteria must include
+end to end tests.**
+
+### Removing a deprecated feature
+
+- The `MachineConfigPool.spec.paused` field can begin the deprecation process. Change Management strategies allow for a superset of its behaviors.
+- We may consider deprecating `HostedCluster.spec.pausedUntil`. HyperShift may consider retaining it with the semantics of pausing all reconciliation with CAPI resources vs just pausing material changes per the change management contract.
+
+## Upgrade / Downgrade Strategy
+
+Operators implementing support for change management will carry forward their
+existing upgrade and downgrade strategies.
+
+## Version Skew Strategy
+
+Operators implementing change management for their resources will not face any
+new _internal_ version skew complexities due to this enhancement, but change management
+does increase the odds of prolonged and larger differential kubelet version skew.
+
+For example, with the Manual or Assisted change management strategy in particular, it
+becomes easier for a cluster lifecycle administrator to forget to update worker-nodes
+along with updates to the control-plane.
+
+At some point, this will manifest as the kube-apiserver presenting as Upgradeable=False,
+preventing future control-plane updates. To reduce the prevalence of this outcome,
+the additional responsibilities of the cluster lifecycle administrator when
+employing change management strategies must be clearly documented along with SOPs
+for recovering from skew issues.
+
+HyperShift does not have any integrated skew mitigation strategy in place today. HostedCluster
+and NodePool support independent release payloads being configured and a cluster lifecycle
+administrator can trivially introduce problematic skew by editing these resources. HyperShift
+documentation warns against this, but we should expect a moderate increase in the condition
+being reported on non-managed clusters (OCM can prevent this situation from arising by
+assessing telemetry for a cluster and preventing additional upgrades while worker-node
+configurations are inconsistent with the API server).
+
+## Operational Aspects of API Extensions
+
+The API extensions proposed by this enhancement should not substantially increase
+the scope of work of operators implementing the change management support. The
+operators will interact with the same underlying resources/CRDs but with
+constraints around when changes can be initiated. As such, no significant _new_
+operational aspects are expected to be introduced.
+
+## Support Procedures
+
+Change management problems created by faulty implementations will need to be resolved by
+analyzing operator logs. The operator responsible for a given resource will vary. Existing
+support tooling like must-gather should capture the information necessary to understand
+and fix issues.
+
+Change management problems where user expectations are not being met should be diagnosable
+from the detailed `status` provided by the resources bearing the `changeManagement`
+stanza in their `spec`.
+
+## Alternatives
+
+### Implement maintenance schedules via an external control system (e.g. ACM)
+We do not have an offering in this space. ACM is presently devoted to cluster monitoring and does
+not participate in cluster lifecycle.
+
+### Do not separate control-plane and worker-node updates into separate phases
+As separating control-plane and worker-node updates into separate phases is an important motivation for this
+enhancement, we could abandon this strategic direction. Reasons motivating this separation are explained
+in depth in the motivation section.
+
+### Separate control-plane and worker-node updates into separate phases, but do not implement the change control concept
+As explained in the motivation section, there is a concern that implementing this separation without
+maintenance schedules will double the perceived operational overhead of OpenShift updates.
+
+This also creates work for our Service Delivery team without any platform support.
+ +### Separate control-plane and worker-node updates into separate phases, but implement a simpler MaintenanceSchedule strategy +We could implement change control without `disabledUntil`, `pausedUntil`, `exclude`, and perhaps more. However, +it is risky to impose a single opinionated workflow onto the wide variety of consumers of the platform. The workflows +described in this enhancement are not intended to be exotic or contrived but situations in which flexibility +in our configuration can achieve real world, reasonable goals. + +`disabledUntil` is designed to support our Service Delivery team who, on occasion, will need +to be able to bypass configured change controls. The feature is easy to use, does not require +deleting or restoring customer configuration (which may be error-prone), and can be safely +"forgotten" after being set to a date in the future. + +`pausedUntil`, among other interesting possibilities, offers a cluster lifecycle administrator the ability +to stop a problematic update from unfolding further. You may have watched a 100 node +cluster roll out a bad configuration change without knowing exactly how to stop the damage +without causing further disruption. This is not a moment when you want to be figuring out how to format +a date string, calculating timezones, or copying around cluster configuration so that you can restore +it after you stop the bleeding. + +### Implement change control, but do not implement the Manual and/or Assisted strategy for MachineConfigPool +Major enterprise users of our software do not update on a predictable, recurring window of time. Updates +require laborious certification processes and testing. Maintenance schedules will not serve these customers +well. However, these customers may still benefit significantly from the change management concept -- +unexpected / disruptive worker node drains and reboots have bitten even experienced OpenShift operators +(e.g. a new MachineConfig being contributed via gitops). + +These strategies inform decision-making through metrics and provide facilities for fine-grained control +over exactly when material change is rolled out to a cluster. + +The Assisted strategy is also specifically designed to provide a foundation for +the forthcoming `oc adm update worker-nodes` verbs. After separating the control-plane and +worker-node update phases, these verbs are intended to provide cluster lifecycle administrators the +ability to easily start, pause, cancel, and even rollback worker-node changes. + +Making accommodations for these strategies should be a subset of the overall implementation +of the MaintenanceSchedule strategy and they will enable a foundation for a range of +different persons not served by MaintenanceSchedule. + + +### Use CRON instead of RRULE +The CRON specification is typically used to describe when something should start and +does not imply when things should end. CRON also cannot, in a standard way, +express common semantics like "The first Saturday of every month." \ No newline at end of file From 6e4f20737eb6ab7a21a0a082722038510a2fe4e6 Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Fri, 8 Mar 2024 15:11:44 -0500 Subject: [PATCH 02/12] Incorporate review feedback - All times are now UTC. - Clarity on not defining system state -- only initiation. - Change from "endTime" to "duration" to easily create windows that span days. - Clarity on how/why CVO shows pending worker-node updates before machineconfig is updated. - Change 1000 day search metric behavior. 
--- ...ge-management-and-maintenance-schedules.md | 153 ++++++++++-------- 1 file changed, 84 insertions(+), 69 deletions(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index 243704a4a6..9260f291fc 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -35,6 +35,12 @@ material changes are completed by the close of a permitted change window (e.g. a may still be draining or rebooting) at the close of a maintenance schedule, but it does prevent _additional_ material changes from being initiated. +Change management enforcement _does not_ attempt to define or control the detailed state of the +system. It only pertains to whether controllers which support change management +will attempt to initiate material change themselves. For example, if changes are paused in the middle +of a cluster update and a node is manually rebooted, change management does not define +whether the node will rejoin the cluster with the new or old version. + A "material change" may vary by cluster profile and subsystem. For example, a control-plane update (all components and control-plane nodes updated) is implemented as a single material change (e.g. the close of a scheduled permissive window @@ -186,7 +192,8 @@ configuring platform resources as the top-level Maintenance Schedule control wil that potentially disruptive changes are limited to well known time windows. #### Reducing Service Delivery Operational Tooling -Service Delivery, as part of our OpenShift Dedicated, ROSA and other offerings is keenly aware of +Service Delivery, operating Red Hat's Managed OpenShift offerings (OpenShift Dedicated (OSD), +Red Hat OpenShift on AWS (ROSA) and Azure Red Hat OpenShift (ARO) ) is keenly aware of the issues motivating the Change Management / Maintenance Schedule concept. This is evidenced by their design and implementation of tooling to fill the gaps in the platform the preceding sections suggest exist. @@ -198,7 +205,7 @@ there are reasons to supersede the customer's preference). By acknowledging the need for scheduled maintenance in the platform, we reduce the need for Service Delivery to develop and maintain custom tooling to manage the platform while -simultaneously reducing simplifying management for all customer facing similar challenges. +simultaneously simplifying management for all customer facing similar challenges. ### User Stories For readability, "cluster lifecycle administrator" is used repeatedly in the user stories. This @@ -505,60 +512,62 @@ perspective, this strategy reports as paused indefinitely. 1. User interactions with OCM to configure a maintenance schedule are identical to [OCM HCP Standard Change Management Scenario](#ocm-hcp-standard-change-management-scenario). This scenario differs after OCM accepts the maintenance schedule configuration. Control-plane updates are permitted to be initiated to any Saturday UTC. Worker-nodes must wait until the first Saturday of the month. -1. OCM (through various layers) configures the ClusterVersion and worker MachineConfigPool(s) (MCP) for the cluster with appropriate `changeManagement` stanzas. -1. Company workloads are added to the new cluster and the cluster provides value. -1. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform. -1. Via OCM, the service consumer requests the minor version update. 
They can do this at any time with confidence that the maintenance +2. OCM (through various layers) configures the ClusterVersion and worker MachineConfigPool(s) (MCP) for the cluster with appropriate `changeManagement` stanzas. +3. Company workloads are added to the new cluster and the cluster provides value. +4. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform. +5. Via OCM, the service consumer requests the minor version update. They can do this at any time with confidence that the maintenance schedule will be honored. They do so on Wednesday. -1. OCM (through various layers) updates the ClusterVersion resource on the cluster indicating the new release payload in `desiredUpdate`. -1. The Cluster Version Operator (CVO) detects that its `changeManagement` stanza does not permit the initiation of the change. -1. The CVO sets a metric indicating that changes are pending for ClusterVersion. Irrespective of pending changes, the CVO also exposes a +6. OCM (through various layers) updates the ClusterVersion resource on the cluster indicating the new release payload in `desiredUpdate`. +7. The Cluster Version Operator (CVO) detects that its `changeManagement` stanza does not permit the initiation of the change. +8. The CVO sets a metric indicating that changes are pending for ClusterVersion. Irrespective of pending changes, the CVO also exposes a metric indicating the number of seconds until the next window in which material changes can be initiated. -1. Since MachineConfigs do not match in the desired update and the current manifests, the CVO also sets a metric indicating that MachineConfig - changes are pending. This is done because the MachineConfigOperator (MCO) cannot anticipate the coming manifest changes and cannot, - therefore, reflect expected changes to the worker-node MCPs. Anticipating this change ahead of time is necessary for an operation +9. Since MachineConfigs likely do not match in the desired update and the current manifests (RHCOS changes occur 100% of the time for non-hotfix updates), + the CVO also sets a metric indicating that MachineConfig changes are pending. This is an assumption, but the price of being wrong + on rare occasions is very low (pending changes will be reported, but disappear shortly after a permissive window begins). + This is done because the MachineConfigOperator (MCO) cannot anticipate the coming manifest changes and cannot, + therefore, reflect expected changes to the worker-node MCPs. Anticipating this change ahead of time is necessary for an operations team to be able to set an alert with the semantics (worker-node-update changes are pending & time remaining until changes are permitted < 2d). The MCO will expose its own metric for changes pending when manifests are updated. But this metric will only indicate when there are machines in the pool that have not achieved the desired configuration. An operations team trying to implement the 2d early warning for worker-nodes must use OR on these metrics to determine whether changes are actually pending. -1. The MCO, irrespective of pending changes, exposes a metric for each MCP to indicate the number of seconds remaining until it is - permitted to initiate changes to nodes in that MCP. -1. A privileged user on the cluster notices different options available for `changeManagement` in the ClusterVersion and MachineConfigPool - resources. They try to set them but are prevented by either RBAC or an admission webhook (details for Service Delivery). 
If they wish - to change the settings, they must update them through OCM. -1. The privileged user does an `oc describe ...` on the resources. They can see that material changes are pending in ClusterVersion for - the control-plane and for worker machine config. They can also see the date and time that the next material change will be permitted. - The MCP will not show a pending change at this time, but will show the next time at which material changes will be permitted. -1. The next Saturday is _not_ the first Saturday of the month. The CVO detects that material changes are permitted at 00:00 UTC and - begins to apply manifests. This effectively initiates the control-plane update process, which is considered a single - material change to the cluster. -1. The control-plane update succeeds. The CVO, having reconciled its state, unsets metrics suggesting changes are pending. -1. As part of updating cluster manifests, MachineConfigs have been modified. The MachineConfigOperator (MCO) re-renders a - configuration for worker-nodes. However, because the MCP maintenance schedule precludes initiating material changes, - it will not begin to update Machines with that desired configuration. -1. The MCO will set a metric indicating that desired changes are pending. -1. `oc get -o=yaml/describe` will both provide status information indicating that changes are pending for the MCP and - the time at which the next material changes can be initiated according to the maintenance schedule. -1. On the first Saturday of the next month, 00:00 UTC, the MCO determines that material changes are permitted. - Based on limits like maxUnavailable, the MCO begins to annotate nodes with the desiredConfiguration. The - MachineConfigDaemon takes over from there, draining, and rebooting nodes into the updated release. -1. There are a large number of nodes in the cluster and this process continues for more than 24 hours. On Saturday - 23:59, the MCO applies a round of desired configurations annotations to Nodes. At 00:00 on Sunday, it detects - that material changes can no longer be initiated, and pauses its activity. Node updates that have already - been initiated continue beyond the maintenance schedule window. -1. Since not all nodes have been updated, the MCO continues to expose a metric informing the system of - pending changes. -1. In the subsequent days, the cluster is scaled up to handle additional workload. The new nodes receive - the most recent, desired configuration. -1. On the first Saturday of the next month, the MCO resumes its work. In order to ensure that forward progress is - made for all nodes, the MCO will update nodes that have the oldest current configuration first. This ensures - that even if the desired configuration has changed multiple times while maintenance was not permitted, - no nodes are starved of updates. Consider the alternative where (a) worker-node updates required > 24h, - (b) updates to nodes are performed alphabetically, and (c) MachineConfigs are frequently being changed - during times when maintenance is not permitted. This strategy could leave nodes sorting last - lexicographically no opportunity to receive updates. This scenario would eventually leave those nodes - more prone to version skew issues. -1. During this window of time, all node updates are initiated, and they complete successfully. +10. The MCO, irrespective of pending changes, exposes a metric for each MCP to indicate the number of seconds remaining until it is + permitted to initiate changes to nodes in that MCP. +11. 
A privileged user on the cluster notices different options available for `changeManagement` in the ClusterVersion and MachineConfigPool + resources. They try to set them but are prevented by a validating admission controller. If they wish + to change the settings, they must update them through OCM. +12. The privileged user does an `oc describe ...` on the resources. They can see that material changes are pending in ClusterVersion for + the control-plane and for worker machine config. They can also see the date and time that the next material change will be permitted. + The MCP will not show a pending change at this time, but will show the next time at which material changes will be permitted. +13. The next Saturday is _not_ the first Saturday of the month. The CVO detects that material changes are permitted at 00:00 UTC and + begins to apply manifests. This effectively initiates the control-plane update process, which is considered a single + material change to the cluster. +14. The control-plane update succeeds. The CVO, having reconciled its state, unsets metrics suggesting changes are pending. +15. As part of updating cluster manifests, MachineConfigs have been modified. The MachineConfigOperator (MCO) re-renders a + configuration for worker-nodes. However, because the MCP maintenance schedule precludes initiating material changes, + it will not begin to update Machines with that desired configuration. +16. The MCO will set a metric indicating that desired changes are pending. +17. `oc get -o=yaml/describe` will both provide status information indicating that changes are pending for the MCP and + the time at which the next material changes can be initiated according to the maintenance schedule. +18. On the first Saturday of the next month, 00:00 UTC, the MCO determines that material changes are permitted. + Based on limits like maxUnavailable, the MCO begins to annotate nodes with the desiredConfiguration. The + MachineConfigDaemon takes over from there, draining, and rebooting nodes into the updated release. +19. There are a large number of nodes in the cluster and this process continues for more than 24 hours. On Saturday + 23:59, the MCO applies a round of desired configurations annotations to Nodes. At 00:00 on Sunday, it detects + that material changes can no longer be initiated, and pauses its activity. Node updates that have already + been initiated continue beyond the maintenance schedule window. +20. Since not all nodes have been updated, the MCO continues to expose a metric informing the system of + pending changes. +21. In the subsequent days, the cluster is scaled up to handle additional workload. The new nodes receive + the most recent, desired configuration. +22. On the first Saturday of the next month, the MCO resumes its work. In order to ensure that forward progress is + made for all nodes, the MCO will update nodes that have the oldest current configuration first. This ensures + that even if the desired configuration has changed multiple times while maintenance was not permitted, + no nodes are starved of updates. Consider the alternative where (a) worker-node updates required > 24h, + (b) updates to nodes are performed alphabetically, and (c) MachineConfigs are frequently being changed + during times when maintenance is not permitted. This strategy could leave nodes sorting last + lexicographically no opportunity to receive updates. This scenario would eventually leave those nodes + more prone to version skew issues. +23. 
During this window of time, all node updates are initiated, and they complete successfully. #### Service Delivery Emergency Patch 1. SRE determines that a significant new CVE threatens the fleet. @@ -592,7 +601,7 @@ perspective, this strategy reports as paused indefinitely. 1. SRE can address the issue with a system configuration file applied in a MachineConfig. 1. SRE creates the MachineConfig for the customer and provides the customer the option to either (a) wait until their configured maintenance schedule permits the material change from being initiated by the MachineConfigOperator - or (b) having SRE override the maintenance schedule and permitting its immediate application. + or (b) modify change management to permit immediate application (e.g. setting `disabledUntil`). 1. The problem is not pervasive, so the customer chooses the deferred remediation. 1. The change is initiated and nodes are rebooted during the next permissive window. @@ -742,27 +751,29 @@ spec: # Specifies a reoccurring permissive window. permit: # RRULEs (https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) are commonly used - # for calendar management metadata. Only a subset of the RFC is supported. If - # unset, all dates are permitted and only exclude constrains permissive windows. + # for calendar management metadata. Only a subset of the RFC is supported. + # See "RRULE Constraints" section for details. + # If unset, all dates are permitted and only exclude constrains permissive windows. recurrence: - # Given the identification of a date by an RRULE, at what time (relative to timezoneOffset) can the + # Given the identification of a date by an RRULE, at what time (always UTC) can the # permissive window begin. "00:00" if unset. startTime: - # Given the identification of a date by an RRULE, at what time (relative to timezoneOffset) should the - # permissive window end. "23:59:59" if unset. - endTime: + # Given the identification of a date by an RRULE, after what offset from the startTime should + # the permissive window close. This can create permissive windows within days that are not + # identified in the RRULE. For example, recurrence="FREQ=Weekly;BYDAY=Sa;", + # startTime="20:00", duration="8h" would permit material change initiation starting + # each Saturday at 8pm and continuing through Sunday 4am (all times are UTC). The default + # duration is 24:00-startTime (i.e. to the end of the day). + duration: + # Excluded date ranges override RRULE selections. exclude: - # Dates should be specified in YYYY-MM-DD. Each date is excluded from 00:00 for 24 hours. + # Dates should be specified in YYYY-MM-DD. Each date is excluded from 00:00 UTC for 24 hours. - fromDate: # Non-inclusive until. If null, until defaults to the day after from (meaning a single day exclusion). untilDate: - # Specifies an RFC3339 style timezone offset to be applied across their datetime selections. - # "-07:00" indicates negative 7 hour offset from UTC. "+03:00" indicates positive 3 hour offset. If not set, defaults to "+00:00" (UTC). - timezoneOffset: - ``` Permitted times (i.e. times at which the strategy enforcement state can be permissive) are specified using a @@ -776,10 +787,10 @@ RRULE supports expressions that suggest recurrence without implying an exact dat - `RRULE:FREQ=YEARLY` - An event that occurs once a year on a specific date. - `RRULE:FREQ=WEEKLY;INTERVAL=2` - An event that occurs every two weeks. -All such expressions shall be evaluated with a starting date of Jan 1st, 1970 00:00. 
In other +All such expressions shall be evaluated with a starting date of Jan 1st, 1970 00:00Z. In other words, `RRULE:FREQ=YEARLY` would be considered permissive, for one day, at the start of each new year. -If no `startTime` or `endTime` is specified, any day selected by the RRULE will suggest a +If no `startTime` or `duration` is specified, any day selected by the RRULE will suggest a permissive 24h window unless a date is in the `exclude` ranges. **RRULE Constraints** @@ -854,6 +865,7 @@ Labels: - kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool - object= - system= + Value: - `0`: no material changes are pending. - `1`: changes are pending but being initiated. @@ -865,11 +877,12 @@ Labels: - kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool - object= - system= + Value: - `-2`: Error determining the time at which changes can be initiated (e.g. cannot check with ClusterVersion / change management hierarchy). -- `-1`: Material changes are paused indefinitely OR no permissive window can be found within the next 1000 days (the latter ensures a brute force check of intersecting datetimes with hierarchy RRULEs is a valid method of calculating intersection). +- `-1`: Material changes are paused indefinitely. - `0`: Any pending changes can be initiated now (e.g. change management is disabled or inside machine schedule window). -- `> 0`: The number seconds remaining until changes can be initiated. +- `> 0`: The number seconds remaining until changes can be initiated OR 1000*24*60*60 (1000 days) if no permissive window can be found within the next 1000 days (this ensures a brute force check of intersecting datetimes with hierarchy RRULEs is a valid method of calculating intersection). `cm_strategy_enabled` Labels: @@ -877,6 +890,7 @@ Labels: - object= - system= - strategy=MaintenanceSchedule|Manual|Assisted + Value: - `0`: Change management for this resource is not subject to this enabled strategy (**does** consider hierarchy based disable). - `1`: Change management for this resource is directly subject to this enabled strategy. @@ -884,7 +898,7 @@ Value: - `3`: Change management for this resource is directly and indirectly subject to this enabled strategy. #### Change Management Status -Each resource which exposes a `.spec.changeManagement` stanza should also expose `.status.changeManagement` . +Each resource which exposes a `.spec.changeManagement` stanza must also expose `.status.changeManagement` . ```yaml status: @@ -896,7 +910,7 @@ status: # Show effective state. effectiveState: description: "Human readable message explaining how strategies & configuration are resulting in the effective state." - # The start of the next permissive window, taking into account the hierarchy. "N/A" for indefinite pause or >1000 days. + # The start of the next permissive window, taking into account the hierarchy. "N/A" for indefinite pause. permitChangesETA: changesPending: ``` @@ -937,7 +951,8 @@ intersection between the permissive intervals it has stored for the control-plan worker-nodes. Since it is possible there is no overlap, limits must be placed on this search. Once dates >1000 days from -the present moment are being tested, the operator can behave as if an indefinite pause has been requested. +the present moment are being tested, the operator can behave as if the next window will occur in +1000 days (prevents infinite search for overlap). This outcome does not need to be recomputed unless the operator restarts Or one of the RRULE involved is modified. 
From a888a12ce334fea535ae0f0a3aae5d6336b31a78 Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Tue, 2 Apr 2024 14:35:41 -0400 Subject: [PATCH 03/12] Add external CRD alterantive details --- ...ge-management-and-maintenance-schedules.md | 2359 +++++++++-------- 1 file changed, 1186 insertions(+), 1173 deletions(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index 9260f291fc..36db623303 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -1,1173 +1,1186 @@ ---- -title: change-management-and-maintenance-schedules -authors: - - @jupierce -reviewers: - - TBD -approvers: - - @sdodson - - @jharrington22 -api-approvers: - - TBD -creation-date: 2024-02-29 -last-updated: 2024-02-29 - -tracking-link: - - TBD - ---- - -# Change Management and Maintenance Schedules - -## Summary -Implement high level APIs for change management which allow -standalone and Hosted Control Plane (HCP) clusters a measure of configurable control -over when control-plane or worker-node configuration rollouts are initiated. -As a primary mode of configuring change management, implement an option -called Maintenance Schedules which define reoccurring windows of time (and specifically -excluded times) in which potentially disruptive changes in configuration can be initiated. - -Material changes not permitted by change management configuration are left in a -pending state until such time as they are permitted by the configuration. - -Change management enforcement _does not_ guarantee that all initiated -material changes are completed by the close of a permitted change window (e.g. a worker-node -may still be draining or rebooting) at the close of a maintenance schedule, -but it does prevent _additional_ material changes from being initiated. - -Change management enforcement _does not_ attempt to define or control the detailed state of the -system. It only pertains to whether controllers which support change management -will attempt to initiate material change themselves. For example, if changes are paused in the middle -of a cluster update and a node is manually rebooted, change management does not define -whether the node will rejoin the cluster with the new or old version. - -A "material change" may vary by cluster profile and subsystem. For example, a -control-plane update (all components and control-plane nodes updated) is implemented as -a single material change (e.g. the close of a scheduled permissive window -will not suspend its progress). In contrast, the rollout of worker-node updates is -more granular (you can consider it as many individual material changes) and -the end of a permitted change window will prevent additional worker-node updates -from being initiated. - -Changes vital to the continued operation of the cluster (e.g. certificate rotation) -are not considered material changes. Ignoring operational practicalities (e.g. -the need to fix critical bugs or update a cluster to supported software versions), -it should be possible to safely leave changes pending indefinitely. That said, -Service Delivery and/or higher level management systems may choose to prevent -such problematic change management settings from being applied by using -validating webhooks. 
- -## Motivation -This enhancement is designed to improve user experience during the OpenShift -upgrade process and other key operational moments when configuration updates -may result in material changes in cluster behavior and potential disruption -for non-HA workloads. - -The enhancement offers a direct operational tool to users while indirectly -supporting a longer term separation of control-plane and worker-node updates -for **Standalone** cluster profiles into distinct concepts and phases of managing -an OpenShift cluster (HCP clusters already provide this distinction). The motivations -for both aspects will be covered, but a special focus will be made on the motivation -for separating Standalone control-plane and worker-node updates as, while not fully realized -by this enhancement alone, ultimately provides additional business value helping to -justify an investment in the new operational tool. - -### Supporting the Eventual Separation of Control-Plane and Worker-Node Updates -One of the key value propositions of this proposal pre-supposes a successful -decomposition of the existing, fully self-managed, Standalone update process into two -distinct phases as understood and controlled by the end-user: -(1) control-plane update and (2) worker-node updates. - -To some extent, Maintenance Schedules (a key supported option for change management) -are a solution to a problem that will be created by this separation: there is a perception that it would also -double the operational burden for users updating a cluster (i.e. they have -two phases to initiate and monitor instead of just one). In short, implementing the -Maintenance Schedules concept allows users to succinctly express if and how -they wish to differentiate these phases. - -Users well served by the fully self-managed update experience can disable -change management (i.e. not set an enforced maintenance schedule), specifying -that control-plane and worker node updates can take place at -any time. Users who need more control may choose to update their control-plane -regularly (e.g. to patch CVEs) with a permissive change management configuration -for the control-plane while using a tight maintenance schedule for worker-nodes -to only update during specific, low utilization, periods. - -Since separating the node update phases is such an important driver for -Maintenance Schedules, their motivations are heavily intertwined. The remainder of this -section, therefore, delves into the motivation for this separation. - -#### The Case for Control-Plane and Worker-Node Separation -From an overall platform perspective, we believe it is important to drive a distinction -between updates of the control-plane and worker-nodes. Currently, an update is initiated -and OpenShift's ostensibly fully self-managed update mechanics take over (CVO laying -out new manifests, cluster operators rolling out new operands, etc.) culminating with -worker-nodes being drained a rebooted by the machine-config-operator (MCO) to align -them with the version of OpenShift running on the control-plane. - -This approach has proven extraordinarily successful in providing a fast and reliable -control-plane update, but, in rare cases, the highly opinionated update process leads -to less than ideal outcomes. 
- -##### Node Update Separation to Address Problems in User Perception -Our success in making OpenShift control-plane updates reliable, exhaustive focus on quality aside, -is also made possible by the platform's exclusive ownership of the workloads that run on the control-plane -nodes. Worker-nodes, on the other hand, run an endless variety of non-platform, user defined workloads - many of -which are not necessarily perfectly crafted. For example, workloads with pod disruption budgets (PDBs) that -prevent node drains and workloads which are not fundamentally HA (i.e. where draining part of the workload creates -disruption in the service it provides). - -Ultimately, we cannot solve the issue of problematic user workload configurations because -they are intentionally representable with Kubernetes APIs (e.g. it may be the user's actual intention to prevent a pod -from being drained, or it may be too expensive to make a workload fully HA). When confronted with -problematic workloads, the current, fully self-managed, OpenShift update process can appear to the end-user -to be unreliable or slow. This is because the self-managed update process takes on the end-to-end responsibility -of updating the control-plane and worker-nodes. Given the automated and somewhat opaque nature of this -update, it is reasonable for users to expect that the process is hands-off and will complete in a timely -manner regardless of their workloads. - -When this expectation is violated because of problematic user workloads, the update process is -often called into question. For example, if an update appears stalled after 12 hours, a -user is likely to have a poor perception of the platform and open a support ticket before -successfully diagnosing an underlying undrainable workload. - -By separating control-plane and worker-node updates into two distinct phases for an operator to consider, -we can more clearly communicate (1) the reliability and speeed of OpenShift control-plane updates and -(2) the shared responsibility, along with the end user, of successfully updating worker-nodes. - -As an analogy, when you are late to work because of delays in a subway system, you blame the subway system. -They own the infrastructure and schedules and have every reason to provide reliable and predictable transport. -If, instead, you are late to work because you step into a fully automated car that gets stuck in traffic, you blame the -traffic. The fully self-managed update process suggests to the end user that it is a subway -- subtly insulating -them from the fact that they might well hit traffic (problematic user workloads). Separating the update journey into -two parts - a subway portion (the control-plane) and a self-driving car portion (worker-nodes), we can quickly build the -user's intuition about their responsibilities in the latter part of the journey. For example, leaving earlier to -avoid traffic or staying at a hotel after the subway to optimize their departure for the car ride. - -##### Node Update Separation to Improve Risk Mitigation Strategies -With any cluster update, there is risk -- software is changing and even subtle differences in behavior can cause -issues given an unlucky combination of factors. Teams responsible for cluster operations are familiar with these -risks and owe it to their stakeholders to minimize them where possible. 
- -The current, fully self-managed, update process makes one obvious risk mitigation strategy -a relatively advanced strategy to employ: only updating the control-plane and leaving worker-nodes as-is. -It is possible by pausing machine config pools, but this is certainly not an intuitive step for users. Farther back -in OpenShift 4's history, the strategy was not even safe to perform since it could lead to worker-node -certificates to expiring. - -By separating the control-plane and worker-node updates into two separate steps, we provide a clear -and intuitive method of deferring worker-node updates: not initiating them. Leaving this to the user's -discretion, within safe skew-bounds, gives them the flexibility to make the right choices for their -unique circumstances. - -#### Enhancing Operational Control -The preceding section delved deeply into a motivation for Change Management / Maintenance Schedules based on our desire to -separate control-plane and worker-node updates without increasing operational burden on end-users. However, -Change Management, by providing control over exactly when updates & material changes to nodes in -the cluster can be initiated, provide value irrespective of this strategic direction. The benefit of -controlling exactly when changes are applied to critical systems is universally appreciated in enterprise -software. - -Since these are such well established principles, I will summarize the motivation as helping -OpenShift meet industry standard expectations with respect to limiting potentially disruptive change -outside well planned time windows. - -It could be argued that rigorous and time sensitive management of OpenShift cluster API resources could prevent -unplanned material changes, but Change Management / Maintenance Schedules introduce higher level, platform native, and more -intuitive guard rails. For example, consider the common pattern of a gitops configured OpenShift cluster. -If a user wants to introduce a change to a MachineConfig, it is simple to merge a change to the -resource without appreciating the fact that it will trigger a rolling reboot of nodes in the cluster. - -Trying to merge this change at a particular time of day and/or trying to pause and unpause a -MachineConfigPool to limit the impact of that merge to a particular time window requires -significant forethought by the user. Even with that forethought, if an enterprise wants -changes to only be applied during weekends, additional custom mechanics would need -to be employed to ensure the change merged during the weekend without needing someone present. - -Contrast this complexity with the user setting a Change Management / Maintenance Schedule on the cluster. The user -is then free to merge configuration changes and gitops can apply those changes to OpenShift -resources, but material change to the cluster will not be initiated until a time permitted -by the Maintenance Schedule. Users do not require special insight into the implications of -configuring platform resources as the top-level Maintenance Schedule control will help ensure -that potentially disruptive changes are limited to well known time windows. - -#### Reducing Service Delivery Operational Tooling -Service Delivery, operating Red Hat's Managed OpenShift offerings (OpenShift Dedicated (OSD), -Red Hat OpenShift on AWS (ROSA) and Azure Red Hat OpenShift (ARO) ) is keenly aware of -the issues motivating the Change Management / Maintenance Schedule concept. 
This is evidenced by their design -and implementation of tooling to fill the gaps in the platform the preceding sections -suggest exist. - -Specifically, Service Delivery has developed UXs outside the platform which allow customers -to define a preferred maintenance window. For example, when requesting an update, the user -can specify the desired start time. This is honored by Service Delivery tooling (unless -there are reasons to supersede the customer's preference). - -By acknowledging the need for scheduled maintenance in the platform, we reduce the need for Service -Delivery to develop and maintain custom tooling to manage the platform while -simultaneously simplifying management for all customer facing similar challenges. - -### User Stories -For readability, "cluster lifecycle administrator" is used repeatedly in the user stories. This -term can apply to different roles depending on the cluster environment and profile. In general, -it is the person or team making most material changes to the cluster - including planning and -choosing when to enact phases of the OpenShift platform update. - -For HCP, the role is called the [Cluster Service Consumer](https://hypershift-docs.netlify.app/reference/concepts-and-personas/#personas). For -Standalone clusters, this role would normally be filled by one or more `system:admin` users. There -may be several layers of abstraction between the cluster lifecycle administrator and changes being -actuated on the cluster (e.g. gitops, OCM, Hive, etc.), but the role will still be concerned with limiting -risks and disruption when rolling out changes to their environments. - -> "As a cluster lifecycle administrator, I want to ensure any material changes to my cluster -> (control-plane or worker-nodes) are only initiated during well known windows of low service -> utilization to reduce the impact of any service disruption." - -> "As a cluster lifecycle administrator, I want to ensure any material changes to my -> control-plane are only initiated during well known windows of low service utilization to -> reduce the impact of any service disruption." - -> "As a cluster lifecycle administrator, I want to ensure that no material changes to my -> cluster occur during a known date range even if it falls within our -> normal maintenance schedule due to an anticipated atypical usage (e.g. Black Friday)." - -> "As a cluster lifecycle administrator, I want to pause additional material changes from -> taking place when it is no longer practical to monitor for service disruptions. For example, -> if a worker-node update is proving to be problematic during a valid permissive window, I would -> like to be able to pause that change manually so that the team will not have to work on the weekend." - -> "As a cluster lifecycle administrator, I need to stop all material changes on my cluster -> quickly and indefinitely until I can understand a potential issue. I not want to consider dates or -> timezones in this delay as they are not known and irrelevant to my immediate concern." - -> "As a cluster lifecycle administrator, I want to ensure any material changes to my -> control-plane are only initiated during well known windows of low service utilization to -> reduce the impact of any service disruption. Furthermore, I want to ensure that material -> changes to my worker-nodes occur on a less frequent cadence because I know my workloads -> are not HA." 
- -> "As an SRE, tasked with performing non-emergency corrective action, I want -> to be able to apply a desired configuration (e.g. PID limit change) and have that change roll out -> in a minimally disruptive way subject to the customer's configured maintenance schedule." - -> "As an SRE, tasked with performing emergency corrective action, I want to be able to -> quickly disable a configured maintenance schedule, apply necessary changes, have them roll out immediately, -> and restore the maintenance schedule to its previous configuration." - -> "As a leader within the Service Delivery organization, tasked with performing emergency corrective action -> across our fleet, I want to be able to bypass and then restore customer maintenance schedules -> with minimal technical overhead." - -> "As a cluster lifecycle administrator who is well served by a fully managed update without change management, -> I want to be minimally inconvenienced by the introduction of change management / maintenance schedules." - -> "As a cluster lifecycle administrator who is not well served by a fully managed update and needs exacting -> control over when material changes occur on my cluster where opportunities do NOT arise at reoccurring intervals, -> I want to employ a change management strategy that defers material changes until I perform a manual action." - -> "As a cluster lifecycle administrator, I want to easily determine the next time at which maintenance operations -> will be permitted to be initiated, based on the configured maintenance schedule, by looking at the -> status of relevant API resources or metrics." - -> "As a cluster lifecycle administrator, I want to easily determine whether there are material changes pending for -> my cluster, awaiting a permitted window based on the configured maintenance schedule, by looking at the -> status of relevant API resources or metrics." - -> "As a cluster lifecycle administrator, I want to easily determine whether a maintenance schedule is currently being -> enforced on my cluster by looking at the status of relevant API resources or metrics." - -> "As a cluster lifecycle administrator, I want to be able to alert my operations team when changes are pending, -> when and the number of seconds to the next permitted window approaches, or when a maintenance schedule is not being -> enforced on my cluster." - -> "As a cluster lifecycle administrator, I want to be able to diagnose why pending changes have not been applied -> if I expected them to be." - -> "As a cluster administrator or privileged user familiar with OpenShift prior to the introduction of change management, -> I want it to be clear when I am looking at the desired versus actual state of the system. For example, if I can see -> the state of the clusterversion or a machineconfigpool, it should be straightforward to understand why I am -> observing differences in the state of those resources compared to the state of the system." - -### Goals - -1. Indirectly support the strategic separation of control-plane and worker-node update phases for Standalone clusters by supplying a change control mechanism that will allow both control-plane and worker-node updates to proceed at predictable times without doubling operational overhead. -2. Directly support the strategic separation of control-plane and worker-node update phases by implementing a "manual" change management strategy where users who value the full control of the separation can manually actuate changes to them independently. -3. 
Empower OpenShift cluster lifecycle administrators with tools that simplify implementing industry standard notions of maintenance windows. -4. Provide Service Delivery a platform native feature which will reduce the amount of custom tooling necessary to provide maintenance windows for customers. -5. Deliver a consistent change management experience across all platforms and profiles (e.g. Standalone, ROSA, HCP). -6. Enable SRE to, when appropriate, make configuration changes on a customer cluster and have that change actually take effect only when permitted by the customer's change management preferences. -7. Do not subvert expectations of customers well served by the existing fully self-managed cluster update. -8. Ensure the architectural space for enabling different change management strategies in the future. - -### Non-Goals - -1. Allowing control-plane upgrades to be paused midway through an update. Control-plane updates are relatively rapid and pausing will introduce unnecessary complexity and risk. -2. Requiring the use of maintenance schedules for OpenShift upgrades (the changes should be compatible with various upgrade methodologies – including being manually triggered). -3. Allowing Standalone worker-nodes to upgrade to a different payload version than the control-plane (this is supported in HCP, but is not a goal for standalone). -4. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement. -5. Providing strict promises around the exact timing of upgrade processes. Maintenance schedules will be honored to a reasonable extent (e.g. upgrade actions will only be initiated during a window), but long-running operations may exceed the configured end of a maintenance schedule. -6. Implementing logic to defend against impractical maintenance schedules (e.g. if a customer configures a 1-second maintenance schedule every year). Service Delivery may want to implement such logic to ensure upgrade progress can be made. - -## Proposal - -### Change Management Overview -Add a `changeManagement` stanza to several resources in the OpenShift ecosystem: -- HCP's `HostedCluster`. Honored by HyperShift Operator and supported by underlying CAPI primitives. -- HCP's `NodePool`. Honored by HyperShift Operator and supported by underlying CAPI primitives. -- Standalone's `ClusterVersion`. Honored by Cluster Version Operator. -- Standalone's `MachineConfigPool`. Honored by Machine Config Operator. - -The implementation of `changeManagement` will vary by profile -and resource, however, they will share a core schema and provide a reasonably consistent user -experience across profiles. - -The schema will provide options for controlling exactly when changes to API resources on the -cluster can initiate material changes to the cluster. Changes that are not allowed to be -initiated due to a change management control will be called "pending". Subsystems responsible -for initiating pending changes will await a permitted window according to the change's -relevant `changeManagement` configuration(s). - -### Change Management Strategies -Each resource supporting change management will add the `changeManagement` stanza and support a minimal set of change management strategies. -Each strategy may require an additional configuration element within the stanza. 
For example: -```yaml -spec: - changeManagement: - strategy: "MaintenanceSchedule" - pausedUntil: false - disabledUntil: false - config: - maintenanceSchedule: - ..options to configure a detailed policy for the maintenance schedule.. -``` - -All change management implementations must support `Disabled` and `MaintenanceSchedule`. Abstracting -change management into strategies allows for simplified future expansion or deprecation of strategies. -Tactically, `strategy: Disabled` provides a convenient syntax for bypassing any configured -change management policy without permanently deleting its configuration. - -For example, if SRE needs to apply emergency corrective action on a cluster with a `MaintenanceSchedule` change -management strategy configured, they can simply set `strategy: Disabled` without having to delete the existing -`maintenanceSchedule` stanza which configures the previous strategy. Once the correct action has been completed, -SRE simply restores `strategy: MaintenanceSchedule` and the previous configuration begins to be enforced. - -Configurations for multiple management strategies can be recorded in the `config` stanza, but -only one strategy can be active at a given time. - -Each strategy will support a policy for pausing or unpausing (permitting) material changes from being initiated. -This will be referred to as the strategy's enforcement state (or just "state"). The enforcement state for a -strategy can be either "paused" or "unpaused" (a.k.a. "permissive"). The `Disabled` strategy enforcement state -is always permissive -- allowing material changes to be initiated (see [Change Management -Hierarchy](#change-management-hierarchy) for caveats). - -All change management strategies, except `Disabled`, are subject to the following `changeManagement` fields: -- `changeManagement.disabledUntil: `: When `disabledUntil: true` or `disabledUntil: `, the interpreted strategy for - change management in the resource is `Disabled`. Setting a future date in `disabledUntil` offers a less invasive (i.e. no important configuration needs to be changed) method to - disable change management constraints (e.g. if it is critical to roll out a fix) and a method that - does not need to be reverted (i.e. it will naturally expire after the specified date and the configured - change management strategy will re-activate). -- `changeManagement.pausedUntil: `: Unless the effective active strategy is Disabled, `pausedUntil: true` or `pausedUntil: `, change management must - pause material changes. - -### Change Management Status -Change Management information will also be reflected in resource status. Each resource -which contains the stanza in its `spec` will expose its current impact in its `status`. -Common user interfaces for aggregating and displaying progress of these underlying resources -should be updated to proxy that status information to the end users. - -### Change Management Metrics -Cluster wide change management information will be made available through cluster metrics. Each resource -containing the stanza should expose the following metrics: -- The number of seconds until the next known permitted change window. 0 if changes can currently be initiated. -1 if changes are paused indefinitely. -2 if no permitted window can be computed. -- Whether any change management strategy is enabled. -- Which change management strategy is enabled. -- If changes are pending due to change management controls. 
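For illustration, a minimal sketch of the `disabledUntil` escape hatch described above, as it might be
applied when an administrator or SRE needs to roll out a fix immediately without discarding their configured
strategy. The timestamp is purely illustrative:

```yaml
spec:
  changeManagement:
    # The configured strategy and its config are left untouched.
    strategy: MaintenanceSchedule
    # Enforcement is suspended until this RFC3339 timestamp passes, after which
    # the MaintenanceSchedule strategy automatically resumes. Nothing needs to
    # be reverted once the emergency change has rolled out.
    disabledUntil: "2024-03-02T00:00:00Z"
    config:
      maintenanceSchedule:
        # ... existing policy, unchanged ...
```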
- -### Change Management Hierarchy -Material changes to worker-nodes are constrained by change management policies in their associated resource AND -at the control-plane resource. For example, in a standalone profile, if a MachineConfigPool's change management -configuration apparently permits material changes from being initiated at a given moment, that is only the case -if ClusterVersion is **also** permitting changes from being initiated at that time. - -The design choice is informed by a thought experiment: As a cluster lifecycle administrator for a Standalone cluster, -who wants to achieve the simple goal of ensuring no material changes take place outside a well-defined -maintenance schedule, do you want to have to the challenge of keeping every MachineConfigPool's -`changeManagement` stanza in perfect synchronization with the ClusterVersion's? What if a new MCP is created -without your knowledge? - -The hierarchical approach allows a single master change management policy to be in place across -both the control-plane and worker-nodes. - -Conversely, material changes CAN take place on the control-plane when permitted by its associated -change management policy even while material changes are not being permitted by worker-nodes -policies. - -It is thus occasionally necessary to distinguish a resource's **configured** vs **effective** change management -state. There are two states: "paused" and "unpaused" (a.k.a. permissive; meaning that material changes be initiated). -For a control-plane resource, the configured and effective enforcement states are always the same. For worker-node -resources, the configured strategy may be disabled, but the effective enforcement state can be "paused" due to -an active strategy in the control-plane resource being in the "paused" state. - -| control-plane state | worker-node state | worker-node effective state | results | -|-------------------------|---------------------------|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| unpaused | unpaused | unpaused | Traditional, fully self-managed change rollouts. Material changes can be initiated immediately upon configuration change. | -| paused (any strategy) | **unpaused** | **paused** | Changes to both the control-plane and worker-nodes are constrained by the control-plane strategy. | -| unpaused | paused (any strategy) | paused | Material changes can be initiated immediately on the control-plane. Material changes on worker-nodes are subject to the worker-node policy. | -| paused (any strategy) | paused (any strategy) | paused | Material changes to the control-plane are subject to change control strategy for the control-plane. Material changes to the worker-nodes are subject to **both** the control-plane and worker-node strategies - if either precludes material change initiation, changes are left pending. | - -#### Maintenance Schedule Strategy -The maintenance schedule strategy is supported by all resources which support change management. The strategy -is configured by specifying an RRULE identifying permissive datetimes during which material changes can be -initiated. The cluster lifecycle administrator can also exclude specific date ranges, during which -material changes will be paused. - -#### Disabled Strategy -This strategy indicates that no change management strategy is being enforced by the resource. 
It always implies that
the enforcement state at the resource level is unpaused / permissive. Due to the change management
hierarchy, this does not always mean that material changes are permitted. For example, a MachineConfigPool
with `strategy: Disabled` would still be subject to a `strategy: MaintenanceSchedule` in the ClusterVersion resource.

#### Assisted Strategy - MachineConfigPool
Minimally, this strategy will be supported by MachineConfigPool. If and when the strategy is supported by other
change management capable resources, the configuration schema for the policy may differ as the details of
what constitutes and informs change vary between resources.

This strategy is motivated by the desire to support the separation of control-plane and worker-node updates both
conceptually for users and in real technical terms. One way to do this for users who do not benefit from the
`MaintenanceSchedule` strategy is to ask them to initiate, pause, and resume the rollout of material
changes to their worker nodes. Contrast this with the fully self-managed state today, where worker-nodes
(normally) begin to be updated automatically and directly after the control-plane update.

Clearly, if this were the only mode of updating worker-nodes, we could never successfully disentangle the
concepts of control-plane vs worker-node updates in Standalone environments since one implies the other.

In short (details will follow in the implementation section), the assisted strategy allows users to specify the
exact rendered [`desiredConfig`](https://github.com/openshift/machine-config-operator/blob/5112d4f8e562a2b072106f0336aeab451341d7dc/docs/MachineConfigDaemon.md#coordinating-updates)
that the MachineConfigPool should be advertising to the MachineConfigDaemon on
nodes it is associated with. Like the `MaintenanceSchedule` strategy, it also respects the `pausedUntil`
field.

#### Manual Strategy - MachineConfigPool
Minimally, this strategy will be supported by MachineConfigPool. If and when the strategy is supported by other
change management capable resources, the configuration schema for the policy may differ as the details of
what constitutes and informs change vary between resources.

Like the Assisted strategy, this strategy is implemented to support the conceptual and technical separation
of control-plane and worker-nodes. The MachineConfigPool Manual strategy allows users to explicitly specify
their `desiredConfig` to be used for ignition of new and rebooting nodes. While the Manual strategy is enabled,
the MachineConfigOperator will not trigger the MachineConfigDaemon to drain or reboot nodes automatically.

Because the Manual strategy never initiates changes on its own behalf, `pausedUntil` has no effect. From a metrics
perspective, this strategy reports as paused indefinitely.

### Workflow Description

#### OCM HCP Standard Change Management Scenario

1. A [Cluster Service Consumer](https://hypershift-docs.netlify.app/reference/concepts-and-personas/#personas) requests an HCP cluster via OCM.
1. To comply with their company policy, the service consumer configures a maintenance schedule through OCM.
1. Their first preference, no updates at all, is rejected by OCM policy, and they are referred to service
   delivery documentation explaining minimum requirements.
1. The user specifies a policy which permits changes to be initiated any time Saturday UTC on the control-plane.
1. To limit perceived risk, they try to specify a separate policy permitting worker-node updates only on the **first** Sunday of each month.
1. OCM rejects the configuration because, due to the change management hierarchy, worker-node maintenance schedules can only be a proper subset of control-plane maintenance schedules.
1. The user changes their preference to a policy permitting worker-node updates only on the **first** Saturday of each month.
1. OCM accepts the configuration.
1. OCM configures the HCP (HostedCluster/NodePool) resources via the Service Delivery Hive deployment to contain a `changeManagement` stanza
   and an active/configured `MaintenanceSchedule` strategy.
1. Hive updates the associated HCP resources.
1. Company workloads are added to the new cluster and the cluster provides value.
1. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform.
1. Via OCM, the service consumer requests the minor version update. They can do this at any time with confidence that the maintenance
   schedule will be honored. They do so on Wednesday.
1. OCM (through various layers) updates the target release payload in the HCP HostedCluster and NodePool.
1. The HyperShift Operator detects the desired changes but recognizes that the `changeManagement` stanza
   precludes the updates from being initiated.
1. Curious, the service consumer checks the ClusterVersion within the HostedCluster and reviews its `status` stanza. It shows that changes are pending and the time of the next window in which changes can be initiated.
1. Separate metrics specific to change management indicate that changes are pending for both resources.
1. The non-Red Hat operations team has alerts set up to fire when changes are pending and the number of
   seconds before the next permitted window is less than 2 days away.
1. These alerts fire after Thursday UTC 00:00 to inform the operations team that changes are about to be applied to the control-plane.
1. It is not the first week of the month, so there is no alert fired for the NodePool pending changes.
1. The operations team is comfortable with the changes being rolled out on the control-plane.
1. On Saturday 00:00 UTC, the HyperShift operator initiates the control-plane update.
1. The update completes without issue.
1. Changes remain pending for the NodePool resource.
1. As the first Saturday of the month approaches, the operations alerts fire to inform the team of forthcoming changes.
1. The operations team realizes that a corporate team needs to use the cluster heavily during the weekend for a business critical deliverable.
1. The service consumer logs into OCM and adds an exclusion for the upcoming Saturday.
1. Interpreting the new exclusion, the metric for time remaining until a permitted window increases to count down to the following month's first Saturday.
1. A month passes and the pending changes cause the configured alerts to fire again.
1. The operations team is comfortable with the forthcoming changes.
1. The first Saturday of the month 00:00 UTC arrives. The HyperShift operator initiates the worker-node updates based on the pending changes in the cluster NodePool.
1. The HCP cluster has a large number of worker nodes and draining and rebooting them is time-consuming.
1. At 23:59 UTC Saturday night, 80% of worker-nodes have been updated. Since the maintenance schedule still permits the initiation of material changes, another worker-node begins to be updated.
1.
The update of this worker-node continues, but at 00:00 UTC Sunday, no further material changes are permitted by the change management policy and the worker-node update process is effectively paused. -1. Because not all worker-nodes have been updated, changes are still reported as pending via metrics for NodePool. **TODO: Review with HyperShift. Pausing progress should be possible, but a metric indicating changes still pending may not since they interact only through CAPI.** -1. The HCP cluster runs with worker-nodes at mixed versions throughout the month. The N-1 skew between the old kubelet versions and control-plane is supported. -1. **TODO: Review with Service Delivery. If the user requested another minor bump to their control-plane, how does OCM prevent unsupported version skew today?** -1. On the next first Saturday, the worker-nodes updates are completed. - -#### OCM Standalone Standard Change Management Scenario - -1. User interactions with OCM to configure a maintenance schedule are identical to [OCM HCP Standard Change Management Scenario](#ocm-hcp-standard-change-management-scenario). - This scenario differs after OCM accepts the maintenance schedule configuration. Control-plane updates are permitted to be initiated to any Saturday UTC. - Worker-nodes must wait until the first Saturday of the month. -2. OCM (through various layers) configures the ClusterVersion and worker MachineConfigPool(s) (MCP) for the cluster with appropriate `changeManagement` stanzas. -3. Company workloads are added to the new cluster and the cluster provides value. -4. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform. -5. Via OCM, the service consumer requests the minor version update. They can do this at any time with confidence that the maintenance - schedule will be honored. They do so on Wednesday. -6. OCM (through various layers) updates the ClusterVersion resource on the cluster indicating the new release payload in `desiredUpdate`. -7. The Cluster Version Operator (CVO) detects that its `changeManagement` stanza does not permit the initiation of the change. -8. The CVO sets a metric indicating that changes are pending for ClusterVersion. Irrespective of pending changes, the CVO also exposes a - metric indicating the number of seconds until the next window in which material changes can be initiated. -9. Since MachineConfigs likely do not match in the desired update and the current manifests (RHCOS changes occur 100% of the time for non-hotfix updates), - the CVO also sets a metric indicating that MachineConfig changes are pending. This is an assumption, but the price of being wrong - on rare occasions is very low (pending changes will be reported, but disappear shortly after a permissive window begins). - This is done because the MachineConfigOperator (MCO) cannot anticipate the coming manifest changes and cannot, - therefore, reflect expected changes to the worker-node MCPs. Anticipating this change ahead of time is necessary for an operations - team to be able to set an alert with the semantics (worker-node-update changes are pending & time remaining until changes are permitted < 2d). - The MCO will expose its own metric for changes pending when manifests are updated. But this metric will only indicate when - there are machines in the pool that have not achieved the desired configuration. 
An operations team trying to implement the 2d - early warning for worker-nodes must use OR on these metrics to determine whether changes are actually pending. -10. The MCO, irrespective of pending changes, exposes a metric for each MCP to indicate the number of seconds remaining until it is - permitted to initiate changes to nodes in that MCP. -11. A privileged user on the cluster notices different options available for `changeManagement` in the ClusterVersion and MachineConfigPool - resources. They try to set them but are prevented by a validating admission controller. If they wish - to change the settings, they must update them through OCM. -12. The privileged user does an `oc describe ...` on the resources. They can see that material changes are pending in ClusterVersion for - the control-plane and for worker machine config. They can also see the date and time that the next material change will be permitted. - The MCP will not show a pending change at this time, but will show the next time at which material changes will be permitted. -13. The next Saturday is _not_ the first Saturday of the month. The CVO detects that material changes are permitted at 00:00 UTC and - begins to apply manifests. This effectively initiates the control-plane update process, which is considered a single - material change to the cluster. -14. The control-plane update succeeds. The CVO, having reconciled its state, unsets metrics suggesting changes are pending. -15. As part of updating cluster manifests, MachineConfigs have been modified. The MachineConfigOperator (MCO) re-renders a - configuration for worker-nodes. However, because the MCP maintenance schedule precludes initiating material changes, - it will not begin to update Machines with that desired configuration. -16. The MCO will set a metric indicating that desired changes are pending. -17. `oc get -o=yaml/describe` will both provide status information indicating that changes are pending for the MCP and - the time at which the next material changes can be initiated according to the maintenance schedule. -18. On the first Saturday of the next month, 00:00 UTC, the MCO determines that material changes are permitted. - Based on limits like maxUnavailable, the MCO begins to annotate nodes with the desiredConfiguration. The - MachineConfigDaemon takes over from there, draining, and rebooting nodes into the updated release. -19. There are a large number of nodes in the cluster and this process continues for more than 24 hours. On Saturday - 23:59, the MCO applies a round of desired configurations annotations to Nodes. At 00:00 on Sunday, it detects - that material changes can no longer be initiated, and pauses its activity. Node updates that have already - been initiated continue beyond the maintenance schedule window. -20. Since not all nodes have been updated, the MCO continues to expose a metric informing the system of - pending changes. -21. In the subsequent days, the cluster is scaled up to handle additional workload. The new nodes receive - the most recent, desired configuration. -22. On the first Saturday of the next month, the MCO resumes its work. In order to ensure that forward progress is - made for all nodes, the MCO will update nodes that have the oldest current configuration first. This ensures - that even if the desired configuration has changed multiple times while maintenance was not permitted, - no nodes are starved of updates. 
Consider the alternative where (a) worker-node updates required > 24h, - (b) updates to nodes are performed alphabetically, and (c) MachineConfigs are frequently being changed - during times when maintenance is not permitted. This strategy could leave nodes sorting last - lexicographically no opportunity to receive updates. This scenario would eventually leave those nodes - more prone to version skew issues. -23. During this window of time, all node updates are initiated, and they complete successfully. - -#### Service Delivery Emergency Patch -1. SRE determines that a significant new CVE threatens the fleet. -1. A new OpenShift release in each z-stream fixes the problem. -1. SRE plans to override customer maintenance schedules in order to rapidly remediate the problem across the fleet. -1. The new OpenShift release(s) are configured across the fleet. Clusters with permissive maintenance - schedules begin to apply the changes immediately. -1. Clusters with change management policies precluding updates are SRE's next focus. -1. During each region's evening hours, to limit disruption, SRE changes the `changeManagement` strategy - field across relevant resources to `Disabled`. Changes that were previously pending are now - permitted to be initiated. -1. Cluster operators who have alerts configured to fire when there is no change management policy in place - will do so. -1. As clusters are successfully remediated, SRE restores the `MaintenanceSchedule` strategy for its resources. - - -#### Service Delivery Immediate Remediation -1. A customer raises a ticket for a problem that is eventually determined to be caused by a worker-node system configuration. -1. SRE can address the issue with a system configuration file applied in a MachineConfig. -1. SRE creates the MachineConfig for the customer and provides the customer the option to either (a) wait until their - configured maintenance schedule permits the material change from being initiated by the MachineConfigOperator - or (b) having SRE override the maintenance schedule and permitting its immediate application. -1. The customer chooses immediate application. -1. SRE applies a change to the relevant control-plane AND worker-node resource's `changeManagement` stanza - (both must be changed because of the change management hierarchy), setting `disabledUntil` to - a time 48 hours in the future. The configured change management schedule is ignored for 48 as the system - initiates all necessary node changes. - -#### Service Delivery Deferred Remediation -1. A customer raises a ticket for a problem that is eventually determined to be caused by a worker-node system configuration. -1. SRE can address the issue with a system configuration file applied in a MachineConfig. -1. SRE creates the MachineConfig for the customer and provides the customer the option to either (a) wait until their - configured maintenance schedule permits the material change from being initiated by the MachineConfigOperator - or (b) modify change management to permit immediate application (e.g. setting `disabledUntil`). -1. The problem is not pervasive, so the customer chooses the deferred remediation. -1. The change is initiated and nodes are rebooted during the next permissive window. - - -#### On-prem Standalone GitOps Change Management Scenario -1. An on-prem cluster is fully managed by gitops. As changes are committed to git, those changes are applied to cluster resources. -1. 
Configurable stanzas of the ClusterVersion and MachineConfigPool(s) resources are checked into git.
1. The cluster lifecycle administrator configures `changeManagement` in both the ClusterVersion and worker MachineConfigPool
   in git. The MaintenanceSchedule strategy is chosen. The policy permits control-plane and worker-node updates only after
   19:00 Eastern US.
1. During the working day, users may contribute and merge changes to MachineConfigs or even the `desiredUpdate` of the
   ClusterVersion. These resources will be updated in a timely manner via GitOps.
1. Despite the resource changes, neither the CVO nor MCO will begin to initiate the material changes on the cluster.
1. Privileged users who may be curious as to the discrepancy between git and the cluster state can use `oc get -o=yaml/describe`
   on the resources. They observe that changes are pending and the time at which changes will be initiated.
1. At 19:00 Eastern, the pending changes begin to be initiated. This rollout abides by documented OpenShift constraints
   such as the MachineConfigPool `maxUnavailable` setting.

#### On-prem Standalone Manual Strategy Scenario
1. A small, business critical cluster is being run on-prem.
1. There are no reoccurring windows of time when the cluster lifecycle administrator can tolerate downtime.
   Instead, updates are negotiated and planned far in advance.
1. The cluster workloads are not HA and unplanned drains are considered a business risk.
1. To prevent surprises, the cluster lifecycle administrator sets the Manual strategy on the worker MCP.
1. Given the sensitivity of the operation, the lifecycle administrator wants to manually drain and reboot
   nodes to accomplish the update.
1. The cluster lifecycle administrator sends a company-wide notice about the period during which service may be disrupted.
1. The user determines the most recent rendered worker configuration. They configure the `manual` change
   management policy to use that exact configuration as the `desiredConfig`.
1. The MCO is thus being asked to ignite any new node or rebooted node with the desired configuration, but it
   is **not** being permitted to apply that configuration to existing nodes because change management is, in effect,
   paused indefinitely by the Manual strategy.
1. The MCO metric for the MCP indicating the number of seconds remaining until changes can be initiated is `-1`, indicating
   that there is presently no time in the future at which it will initiate material changes. The operations team
   has an alert configured to fire if this value is `!= -1`.
1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running
   the most recently rendered configuration. This is irrespective of the `desiredConfig` in the `manual`
   policy. Abstractly, it indicates whether changes would be initiated if change management were disabled.
1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster. As they come back online,
   the MachineConfigServer offers them the desiredConfig requested by the manual policy.
1. After updating all nodes, the cluster lifecycle administrator does not need to make any additional
   configuration changes. They can leave the `changeManagement` stanza in their MCP as-is.

#### On-prem Standalone Assisted Strategy Scenario
1. A large, business critical cluster is being run on-prem.
1. There are no reoccurring windows of time when the cluster lifecycle administrator can tolerate downtime.
- Instead, updates are negotiated and planned far in advance. -1. The cluster workloads are not HA and unplanned drains are considered a business risk. -1. To prevent surprises, the cluster lifecycle administrator sets the Assisted strategy on the worker MCP. -1. In the `assisted` strategy change management policy, the lifecycle administrator configures `pausedUntil: true` - and the most recently rendered worker configuration in the policy's `renderedConfigsBefore: `. -1. The MCO is being asked to ignite any new node or any rebooted node with the latest rendered configuration - before the present datetime. However, because of `pausedUntil: true`, it is also being asked not to - automatically initiate that material change for existing nodes. -1. The MCO metric for the MCP indicating the number of seconds remaining until changes can be initiated is `-1` - indicating - that there is presently no time in the future where it will initiate material changes. The operations team - has an alert configured if this value `!= -1`. -1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running - the most recent, rendered configuration. This is irrespective of the `renderedConfigsBefore` in the `assisted` - configuration. Abstractly, it means, if change management were disabled, whether changes be initiated. -1. When the lifecycle administrator is ready to permit disruption, they set `pausedUntil: false`. -1. The MCO sets the number of seconds until changes are permitted to `0`. -1. The MCO begins to initiate worker node updates. This rollout abides by documented OpenShift constraints - such as the MachineConfigPool `maxUnavailable` setting. -1. Though new rendered configurations may be created, the assisted strategy will not act until the assisted policy - is updated to permit a more recent creation date. - -### API Extensions - -API Extensions are CRDs, admission and conversion webhooks, aggregated API servers, -and finalizers, i.e. those mechanisms that change the OCP API surface and behaviour. - -- Name the API extensions this enhancement adds or modifies. -- Does this enhancement modify the behaviour of existing resources, especially those owned - by other parties than the authoring team (including upstream resources), and, if yes, how? - Please add those other parties as reviewers to the enhancement. - - Examples: - - Adds a finalizer to namespaces. Namespace cannot be deleted without our controller running. - - Restricts the label format for objects to X. - - Defaults field Y on object kind Z. - -Fill in the operational impact of these API Extensions in the "Operational Aspects -of API Extensions" section. - -### Topology Considerations - -#### Hypershift / Hosted Control Planes - -In the HCP topology, the HostedCluster and NodePool resources are enhanced to support the change management strategies -`MaintenanceSchedule` and `Disabled`. - -#### Standalone Clusters - -In the Standalone topology, the ClusterVersion and MachineConfigPool resources are enhanced to support the change management strategies -`MaintenanceSchedule` and `Disabled`. The MachineConfigPool also supports the `Manual` and `Assisted` strategies. - -#### Single-node Deployments or MicroShift - -The ClusterVersion operator will honor the change management field just as in a standalone profile. If those profiles -have a MachineConfigPool, material changes the node could be controlled with a change management policy -in that resource. 
- -#### OCM Managed Profiles -OpenShift Cluster Manager (OCM) should expose a user interface allowing users to manage their change management policy. -Standard Fleet clusters will expose the option to configure the MaintenanceSchedule strategy - including -only permit and exclude times. - -- Service Delivery will reserve the right to disable this strategy for emergency corrective actions. -- Service Delivery should constrain permit & exclude configurations based on their internal policies. For example, customers may be forced to enable permissive windows which amount to at least 6 hours a month. - -### Implementation Details/Notes/Constraints - -#### ChangeManagement Stanza -The change management stanza will be introduced into ClusterVersion and MachineConfigPool (for standalone profiles) -and HostedCluster and NodePool (for HCP profiles). The structure of the stanza is: - -```yaml -spec: - changeManagement: - # The active strategy for change management (unless disabled by disabledUntil). - strategy: - - # If set to true or a future date, the effective change management strategy is Disabled. Date - # must be RFC3339. - disabledUntil: - - # If set to true or a future date, all strategies other than Disabled are paused. Date - # must be RFC3339. - pausedUntil: - - # If a strategy needs additional configuration information, it can read a - # key bearing its name in the config stanza. - config: - : - ...configuration policy for the strategy... -``` - -#### MaintenanceSchedule Strategy Configuration - -```yaml -spec: - changeManagement: - strategy: MaintenanceSchedule - config: - maintenanceSchedule: - # Specifies a reoccurring permissive window. - permit: - # RRULEs (https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) are commonly used - # for calendar management metadata. Only a subset of the RFC is supported. - # See "RRULE Constraints" section for details. - # If unset, all dates are permitted and only exclude constrains permissive windows. - recurrence: - # Given the identification of a date by an RRULE, at what time (always UTC) can the - # permissive window begin. "00:00" if unset. - startTime: - # Given the identification of a date by an RRULE, after what offset from the startTime should - # the permissive window close. This can create permissive windows within days that are not - # identified in the RRULE. For example, recurrence="FREQ=Weekly;BYDAY=Sa;", - # startTime="20:00", duration="8h" would permit material change initiation starting - # each Saturday at 8pm and continuing through Sunday 4am (all times are UTC). The default - # duration is 24:00-startTime (i.e. to the end of the day). - duration: - - - # Excluded date ranges override RRULE selections. - exclude: - # Dates should be specified in YYYY-MM-DD. Each date is excluded from 00:00 UTC for 24 hours. - - fromDate: - # Non-inclusive until. If null, until defaults to the day after from (meaning a single day exclusion). - untilDate: - -``` - -Permitted times (i.e. times at which the strategy enforcement state can be permissive) are specified using a -subset of the [RRULE RFC5545](https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) and, optionally, a -starting and ending time of day. https://freetools.textmagic.com/rrule-generator is a helpful tool to -review the basic semantics RRULE is capable of expressing. https://exontrol.com/exicalendar.jsp?config=/js#calendar -offers more complex expressions. - -**RRULE Interpretation** -RRULE supports expressions that suggest recurrence without implying an exact date. 
For example: -- `RRULE:FREQ=YEARLY` - An event that occurs once a year on a specific date. -- `RRULE:FREQ=WEEKLY;INTERVAL=2` - An event that occurs every two weeks. - -All such expressions shall be evaluated with a starting date of Jan 1st, 1970 00:00Z. In other -words, `RRULE:FREQ=YEARLY` would be considered permissive, for one day, at the start of each new year. - -If no `startTime` or `duration` is specified, any day selected by the RRULE will suggest a -permissive 24h window unless a date is in the `exclude` ranges. - -**RRULE Constraints** -A valid RRULE for change management: -- must identify a date, so, although RRULE supports `FREQ=HOURLY`, it will not be supported. -- cannot specify an end for the pattern. `RRULE:FREQ=DAILY;COUNT=3` suggests - an event that occurs every day for three days only. As such, neither `COUNT` nor `UNTIL` is - supported. -- cannot specify a permissive window more than 2 years away. - -**Overview of Interactions** -The MaintenanceSchedule strategy, along with `changeManagement.pausedUntil` allows a cluster lifecycle administrator to express -one of the following: - -| pausedUntil | permit | exclude | Enforcement State (Note that **effective** state must also take into account hierarchy) | -|----------------|--------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `null`/`false` | `null` | `null` | Permissive indefinitely | -| `true` | * | * | Paused indefinitely | -| `null`/`false` | set | `null` | Permissive during reoccurring windows time. Paused at all other times. | -| `null`/`false` | set | set | Permissive during reoccurring windows time modulo excluded date ranges during which it is paused. Paused at all other times. | -| `null`/`false` | `null` | set | Permissive except during excluded dates during which it is paused. | -| date | * | * | Honor permit and exclude values, but only after the specified date. For example, permit: `null` and exclude: `null` implies the strategy is indefinitely permissive after the specified date. | - - -#### MachineConfigPool Assisted Strategy Configuration - -```yaml -spec: - changeManagement: - strategy: Assisted - config: - assisted: - permit: - # The assisted strategy will allow the MCO to process any rendered configuration - # that was created before the specified datetime. - renderedConfigsBefore: - # When AllowSettings, rendered configurations after the preceding before date - # can be applied if and only if they do not contain changes to osImageURL. - policy: "AllowSettings|AllowNone" -``` - -The primary user of this strategy is `oc` with tentatively planned enhancements to include verbs -like: -```sh -$ oc adm update worker-nodes start ... -$ oc adm update worker-nodes pause ... -$ oc adm update worker-nodes rollback ... -``` - -These verbs can leverage the assisted strategy and `pausedUntil` to allow the manual initiation of worker-nodes -updates after a control-plane update. - -#### MachineConfigPool Manual Strategy Configuration - -```yaml -spec: - changeManagement: - strategy: Manual - config: - manual: - desiredConfig: -``` - -The manual strategy requests no automated initiation of updates. New and rebooting -nodes will only receive the desired configuration. From a metrics perspective, this strategy -is always paused state. 
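To make the MaintenanceSchedule fields above concrete, the following sketch shows what the worker-node policy
used in the workflow scenarios earlier in this document (material changes permitted only on the first Saturday
of each month, with a one-off exclusion for an anticipated period of atypical usage) might look like on a
MachineConfigPool. The RRULE string and dates are illustrative, and the ClusterVersion would carry its own,
broader schedule per the change management hierarchy:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  changeManagement:
    strategy: MaintenanceSchedule
    config:
      maintenanceSchedule:
        permit:
          # First Saturday of every month (RFC5545 RRULE).
          recurrence: "FREQ=MONTHLY;BYDAY=SA;BYSETPOS=1"
          # Window opens at 00:00 UTC; with no duration specified it runs to the
          # end of the day, per the defaults described above.
          startTime: "00:00"
        exclude:
          # One-off exclusion for a high-traffic holiday weekend.
          # untilDate is non-inclusive.
          - fromDate: "2024-11-29"
            untilDate: "2024-12-02"
```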
- -#### Metrics - -`cm_change_pending` -Labels: -- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool -- object= -- system= - -Value: -- `0`: no material changes are pending. -- `1`: changes are pending but being initiated. -- `2`: changes are pending and blocked based on this resource's change management policy. -- `3`: changes are pending and blocked based on another resource in the change management hierarchy. - -`cm_change_eta` -Labels: -- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool -- object= -- system= - -Value: -- `-2`: Error determining the time at which changes can be initiated (e.g. cannot check with ClusterVersion / change management hierarchy). -- `-1`: Material changes are paused indefinitely. -- `0`: Any pending changes can be initiated now (e.g. change management is disabled or inside machine schedule window). -- `> 0`: The number seconds remaining until changes can be initiated OR 1000*24*60*60 (1000 days) if no permissive window can be found within the next 1000 days (this ensures a brute force check of intersecting datetimes with hierarchy RRULEs is a valid method of calculating intersection). - -`cm_strategy_enabled` -Labels: -- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool -- object= -- system= -- strategy=MaintenanceSchedule|Manual|Assisted - -Value: -- `0`: Change management for this resource is not subject to this enabled strategy (**does** consider hierarchy based disable). -- `1`: Change management for this resource is directly subject to this enabled strategy. -- `2`: Change management for this resource is indirectly subject to this enabled strategy (i.e. only via control-plane override hierarchy). -- `3`: Change management for this resource is directly and indirectly subject to this enabled strategy. - -#### Change Management Status -Each resource which exposes a `.spec.changeManagement` stanza must also expose `.status.changeManagement` . - -```yaml -status: - changeManagement: - # Always show control-plane level strategy. Disabled if disabledUntil is true. - clusterStrategy: - # If this a worker-node related resource (e.g. MCP), show local strategy. Disabled if disabledUntil is true. - workerNodeStrategy: - # Show effective state. - effectiveState: - description: "Human readable message explaining how strategies & configuration are resulting in the effective state." - # The start of the next permissive window, taking into account the hierarchy. "N/A" for indefinite pause. - permitChangesETA: - changesPending: -``` - -#### Change Management Bypass Annotation -In some situations, it may be necessary for a MachineConfig to be applied regardless of the active change -management policy for a MachineConfigPool. In such cases, `machineconfiguration.openshift.io/bypass-change-management` -can be set to any non-empty string. The MCO will progress until MCPs which select annotated -MachineConfigs have all machines running with a desiredConfig containing that MachineConfig's current state. - -This annotation will be present on `00-master` to ensure that, once the CVO updates the MachineConfig, -the remainder of the control-plane update will be treated as a single material change. - -### Special Handling -These cases are mentioned or implied elsewhere in the enhancement documentation, but they deserve special -attention. 
#### Change Management on Master MachineConfigPool
In order to allow control-plane updates to proceed as a single material change, the MCO will only honor the change
management configuration for the master MachineConfigPool if user-generated MachineConfigs are the cause of a pending change. To accomplish this,
at least one MachineConfig updated by the CVO will have the `machineconfiguration.openshift.io/bypass-change-management` annotation
indicating that changes in the MachineConfig must be acted upon irrespective of the master MCP change management policy.

#### Limiting Overlapping Window Search / Permissive Window Calculation
An operator implementing change management for a worker-node related resource must honor the change management hierarchy when
calculating when the next permissive window will occur (referred to elsewhere in this document as the ETA). This is not
straightforward to compute when both the control-plane and worker-nodes have independent MaintenanceSchedule
configurations.

We can, however, simplify the process by reducing the number of days in the future the algorithm must search for
coinciding permissive windows. 1000 days is a proposed cut-off.

To calculate coinciding windows, the implementation can use [rrule-go](https://github.com/teambition/rrule-go)
to iteratively find permissive windows at the cluster / control-plane level. These can be added to an
[interval-tree](https://github.com/rdleal/intervalst). As dates are added, rrule calculations for the worker-nodes
can be performed. The interval-tree should be able to efficiently determine whether there is an
intersection between the permissive intervals it has stored for the control-plane and the time range tested for the
worker-nodes.

Since it is possible there is no overlap, limits must be placed on this search. Once dates more than 1000 days from
the present moment are being tested, the operator can behave as if the next window will occur in
1000 days (this prevents an infinite search for overlap).

This outcome does not need to be recomputed unless the operator restarts or one of the RRULEs involved
is modified.

If an overlap _is_ found, no additional intervals need to be added to the tree and it can be discarded.
The operator can store the start & end datetimes for the overlap and count down the seconds remaining
until it occurs. Obviously, this calculation must be repeated:
1. If either MaintenanceSchedule configuration is updated.
1. The operator is restarted.
1. At the end of a permissive window, in order to determine the next permissive window.


#### Service Delivery Option Sanitization
The range of flexible options offered by change management configurations can create risks for
inexperienced cluster lifecycle administrators. For example, setting a
standalone cluster to use the Assisted strategy and failing to trigger worker-node updates will
leave unpatched CVEs on worker-nodes much longer than necessary. It will also eventually lead to
the need to resolve version skew (Upgradeable=False will be reported by the API cluster operator).

Service Delivery understands that exposing the full range of options to cluster
lifecycle administrators could dramatically increase the overhead of managing their fleet. To
prevent this outcome, Service Delivery will only expose a subset of the change management
strategies. They will also implement sanitization of the configuration options a user can
supply to those strategies, for example by offering a simplified interface in OCM for building a
limited range of RRULEs that are compliant with Service Delivery's update policies.
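The user stories and workflow scenarios above assume that operations teams can alert on pending changes and on
an approaching permissive window. As an illustrative sketch only, an alerting rule built on the metrics proposed
earlier might look like the following (alert names, thresholds, severities, and namespace are hypothetical and
not part of this proposal):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: change-management-early-warning   # illustrative name and namespace
  namespace: openshift-monitoring
spec:
  groups:
    - name: change-management
      rules:
        - alert: WorkerNodeChangesPendingWindowSoon
          # Pending changes exist for a MachineConfigPool and the next permissive
          # window opens in less than two days (172800 seconds).
          expr: |
            cm_change_pending{kind="MachineConfigPool"} > 0
            and cm_change_eta{kind="MachineConfigPool"} > 0
            and cm_change_eta{kind="MachineConfigPool"} < (2 * 24 * 60 * 60)
          for: 15m
          labels:
            severity: info
          annotations:
            summary: "Material changes to {{ $labels.object }} will be permitted in less than 2 days."
        - alert: ChangeManagementNotEnforced
          # No change management strategy is currently enforced for the resource.
          expr: max by (kind, object) (cm_strategy_enabled) == 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "No change management strategy is enforced for {{ $labels.kind }}/{{ $labels.object }}."
```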
### Risks and Mitigations

- Given the range of operators which must implement support for change management, inconsistent behavior or reporting may make it difficult for users to navigate different profiles.
  - Mitigation: A shared library should be created and vendored for RRULE/exclude/next window calculations/metrics.
- Users familiar with the fully self-managed nature of OpenShift may be confused when material changes are not initiated because change management constraints are active.
  - Mitigation: The introduction of change management will not change the behavior of existing clusters. Users must make a configuration change.
- Users may put themselves at risk of CVEs by being too conservative with worker-node updates.
- Users leveraging change management may be more likely to reach unsupported kubelet skew configurations vs fully self-managed cluster management.

### Drawbacks

The scope of the enhancement, cutting across several operators, requires multiple, careful implementations. The enhancement
also touches code paths that have been refined for years and which assume a fully self-managed cluster approach. Upsetting these
code paths will prove challenging.

## Open Questions [optional]

1. Can the HyperShift Operator expose a metric for when changes are pending for a subset of worker nodes on the cluster if it can only interact via CAPI resources?
2. Can the MCO interrogate the ClusterVersion change management configuration in order to calculate overlapping permissive intervals in the future?

## Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
- What additional testing is necessary to support managed OpenShift service-based offerings?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

## Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:

- Maturity levels

The API extensions will be made to existing, stable APIs. `changeManagement` is an optional
field in the resources which bear it and so does not break backwards compatibility.

The lack of a change management field implies the Disabled strategy, which ensures
the existing, fully self-managed update behaviors are not constrained. That is,
unless a change management strategy is configured, the behavior of existing clusters
will not be affected.
- -### Dev Preview -> Tech Preview - -- Ability to utilize the enhancement end to end -- End user documentation, relative API stability -- Sufficient test coverage -- Gather feedback from users rather than just developers -- Enumerate service level indicators (SLIs), expose SLIs as metrics -- Write symptoms-based alerts for the component(s) - -### Tech Preview -> GA - -- More testing (upgrade, downgrade, scale) -- Sufficient time for feedback -- Available by default -- Backhaul SLI telemetry -- Document SLOs for the component -- Conduct load testing -- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) - -**For non-optional features moving to GA, the graduation criteria must include -end to end tests.** - -### Removing a deprecated feature - -- The `MachineConfigPool.spec.pause` can begin the deprecation process. Change Management strategies allow for a superset of its behaviors. -- We may consider deprecating `HostCluster.spec.pausedUntil`. HyperShift may consider retaining it with the semantics of pausing all reconciliation with CAPI resources vs just pausing material changes per the change management contract. - -## Upgrade / Downgrade Strategy - -Operators implementing support for change management will carry forward their -existing upgrade and downgrade strategies. - -## Version Skew Strategy - -Operators implementing change management for their resources will not face any -new _internal_ version skew complexities due to this enhancement, but change management -does increase the odds of prolonged and larger differential kubelet version skew. - -For example, particularly given the Manual or Assisted change management strategy, it -becomes easier for a cluster lifecycle administrator to forget to update worker-nodes -along with updates to the control-plane. - -At some point, this will manifest as the kube-apiserver presenting as Upgradeable=False, -preventing future control-plane updates. To reduce the prevalence of this outcome, -the additional responsibilities of the cluster lifecycle administrator when -employing change management strategies must be clearly documented along with SOPs -from recovering from skew issues. - -HyperShift does not have any integrated skew mitigation strategy in place today. HostedCluster -and NodePool support independent release payloads being configured and a cluster lifecycle -administrator can trivially introduce problematic skew by editing these resources. HyperShift -documentation warns against this, but we should expect a moderate increase in the condition -being reported on non-managed clusters (OCM can prevent this situation from arising by -assessing telemetry for a cluster and preventing additional upgrades while worker-node -configurations are inconsistent with the API server). - -## Operational Aspects of API Extensions - -The API extensions proposed by this enhancement should not substantially increase -the scope of work of operators implementing the change management support. The -operators will interact with the same underlying resources/CRDs but with -constraints around when changes can be initiated. As such, no significant _new_ -operational aspects are expected to be introduced. - -## Support Procedures - -Change management problems created by faulty implementations will need to be resolved by -analyzing operator logs. The operator responsible for a given resource will vary. Existing -support tooling like must-gather should capture the information necessary to understand -and fix issues. 
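For illustration, a hypothetical `status.changeManagement` stanza as it might appear on a worker
MachineConfigPool whose pending changes are currently blocked by the cluster-level schedule (all values
are invented for the example):

```yaml
status:
  changeManagement:
    clusterStrategy: MaintenanceSchedule
    workerNodeStrategy: Disabled
    effectiveState: Paused
    description: "Material changes are paused: the ClusterVersion MaintenanceSchedule next permits changes on Saturday 00:00 UTC."
    permitChangesETA: "2024-03-09T00:00:00Z"
    changesPending: true
```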
- -Change management problems where user expectations are not being met are designed to -be informed by the detailed `status` provided by the resources bearing the `changeManagement` -stanza in their `spec`. - -## Alternatives - -### Implement maintenance schedules via an external control system (e.g. ACM) -We do not have an offering in this space. ACM is presently devoted to cluster monitoring and does -not participate in cluster lifecycle. - -### Do not separate control-plane and worker-node updates into separate phases -As separating control-plane and worker-node updates into separate phases is an important motivation for this -enhancement, we could abandon this strategic direction. Reasons motivating this separation are explained -in depth in the motivation section. - -### Separate control-plane and worker-node updates into separate phases, but do not implement the change control concept -As explained in the motivation section, there is a concern that implementing this separation without -maintenance schedules will double the perceived operational overhead of OpenShift updates. - -This also creates work for our Service Delivery team without any platform support. - -### Separate control-plane and worker-node updates into separate phases, but implement a simpler MaintenanceSchedule strategy -We could implement change control without `disabledUntil`, `pausedUntil`, `exclude`, and perhaps more. However, -it is risky to impose a single opinionated workflow onto the wide variety of consumers of the platform. The workflows -described in this enhancement are not intended to be exotic or contrived but situations in which flexibility -in our configuration can achieve real world, reasonable goals. - -`disabledUntil` is designed to support our Service Delivery team who, on occasion, will need -to be able to bypass configured change controls. The feature is easy to use, does not require -deleting or restoring customer configuration (which may be error-prone), and can be safely -"forgotten" after being set to a date in the future. - -`pausedUntil`, among other interesting possibilities, offers a cluster lifecycle administrator the ability -to stop a problematic update from unfolding further. You may have watched a 100 node -cluster roll out a bad configuration change without knowing exactly how to stop the damage -without causing further disruption. This is not a moment when you want to be figuring out how to format -a date string, calculating timezones, or copying around cluster configuration so that you can restore -it after you stop the bleeding. - -### Implement change control, but do not implement the Manual and/or Assisted strategy for MachineConfigPool -Major enterprise users of our software do not update on a predictable, recurring window of time. Updates -require laborious certification processes and testing. Maintenance schedules will not serve these customers -well. However, these customers may still benefit significantly from the change management concept -- -unexpected / disruptive worker node drains and reboots have bitten even experienced OpenShift operators -(e.g. a new MachineConfig being contributed via gitops). - -These strategies inform decision-making through metrics and provide facilities for fine-grained control -over exactly when material change is rolled out to a cluster. - -The Assisted strategy is also specifically designed to provide a foundation for -the forthcoming `oc adm update worker-nodes` verbs. 
After separating the control-plane and -worker-node update phases, these verbs are intended to provide cluster lifecycle administrators the -ability to easily start, pause, cancel, and even rollback worker-node changes. - -Making accommodations for these strategies should be a subset of the overall implementation -of the MaintenanceSchedule strategy and they will enable a foundation for a range of -different persons not served by MaintenanceSchedule. - - -### Use CRON instead of RRULE -The CRON specification is typically used to describe when something should start and -does not imply when things should end. CRON also cannot, in a standard way, -express common semantics like "The first Saturday of every month." \ No newline at end of file +--- +title: change-management-and-maintenance-schedules +authors: + - @jupierce +reviewers: + - TBD +approvers: + - @sdodson + - @jharrington22 +api-approvers: + - TBD +creation-date: 2024-02-29 +last-updated: 2024-02-29 + +tracking-link: + - TBD + +--- + +# Change Management and Maintenance Schedules + +## Summary +Implement high level APIs for change management which allow +standalone and Hosted Control Plane (HCP) clusters a measure of configurable control +over when control-plane or worker-node configuration rollouts are initiated. +As a primary mode of configuring change management, implement an option +called Maintenance Schedules which define reoccurring windows of time (and specifically +excluded times) in which potentially disruptive changes in configuration can be initiated. + +Material changes not permitted by change management configuration are left in a +pending state until such time as they are permitted by the configuration. + +Change management enforcement _does not_ guarantee that all initiated +material changes are completed by the close of a permitted change window (e.g. a worker-node +may still be draining or rebooting) at the close of a maintenance schedule, +but it does prevent _additional_ material changes from being initiated. + +Change management enforcement _does not_ attempt to define or control the detailed state of the +system. It only pertains to whether controllers which support change management +will attempt to initiate material change themselves. For example, if changes are paused in the middle +of a cluster update and a node is manually rebooted, change management does not define +whether the node will rejoin the cluster with the new or old version. + +A "material change" may vary by cluster profile and subsystem. For example, a +control-plane update (all components and control-plane nodes updated) is implemented as +a single material change (e.g. the close of a scheduled permissive window +will not suspend its progress). In contrast, the rollout of worker-node updates is +more granular (you can consider it as many individual material changes) and +the end of a permitted change window will prevent additional worker-node updates +from being initiated. + +Changes vital to the continued operation of the cluster (e.g. certificate rotation) +are not considered material changes. Ignoring operational practicalities (e.g. +the need to fix critical bugs or update a cluster to supported software versions), +it should be possible to safely leave changes pending indefinitely. That said, +Service Delivery and/or higher level management systems may choose to prevent +such problematic change management settings from being applied by using +validating webhooks. 
+ +## Motivation +This enhancement is designed to improve user experience during the OpenShift +upgrade process and other key operational moments when configuration updates +may result in material changes in cluster behavior and potential disruption +for non-HA workloads. + +The enhancement offers a direct operational tool to users while indirectly +supporting a longer term separation of control-plane and worker-node updates +for **Standalone** cluster profiles into distinct concepts and phases of managing +an OpenShift cluster (HCP clusters already provide this distinction). The motivations +for both aspects will be covered, but a special focus will be made on the motivation +for separating Standalone control-plane and worker-node updates as, while not fully realized +by this enhancement alone, ultimately provides additional business value helping to +justify an investment in the new operational tool. + +### Supporting the Eventual Separation of Control-Plane and Worker-Node Updates +One of the key value propositions of this proposal pre-supposes a successful +decomposition of the existing, fully self-managed, Standalone update process into two +distinct phases as understood and controlled by the end-user: +(1) control-plane update and (2) worker-node updates. + +To some extent, Maintenance Schedules (a key supported option for change management) +are a solution to a problem that will be created by this separation: there is a perception that it would also +double the operational burden for users updating a cluster (i.e. they have +two phases to initiate and monitor instead of just one). In short, implementing the +Maintenance Schedules concept allows users to succinctly express if and how +they wish to differentiate these phases. + +Users well served by the fully self-managed update experience can disable +change management (i.e. not set an enforced maintenance schedule), specifying +that control-plane and worker node updates can take place at +any time. Users who need more control may choose to update their control-plane +regularly (e.g. to patch CVEs) with a permissive change management configuration +for the control-plane while using a tight maintenance schedule for worker-nodes +to only update during specific, low utilization, periods. + +Since separating the node update phases is such an important driver for +Maintenance Schedules, their motivations are heavily intertwined. The remainder of this +section, therefore, delves into the motivation for this separation. + +#### The Case for Control-Plane and Worker-Node Separation +From an overall platform perspective, we believe it is important to drive a distinction +between updates of the control-plane and worker-nodes. Currently, an update is initiated +and OpenShift's ostensibly fully self-managed update mechanics take over (CVO laying +out new manifests, cluster operators rolling out new operands, etc.) culminating with +worker-nodes being drained a rebooted by the machine-config-operator (MCO) to align +them with the version of OpenShift running on the control-plane. + +This approach has proven extraordinarily successful in providing a fast and reliable +control-plane update, but, in rare cases, the highly opinionated update process leads +to less than ideal outcomes. 
+
+##### Node Update Separation to Address Problems in User Perception
+Our success in making OpenShift control-plane updates reliable, exhaustive focus on quality aside,
+is also made possible by the platform's exclusive ownership of the workloads that run on the control-plane
+nodes. Worker-nodes, on the other hand, run an endless variety of non-platform, user defined workloads - many of
+which are not necessarily perfectly crafted. For example, workloads with pod disruption budgets (PDBs) that
+prevent node drains and workloads which are not fundamentally HA (i.e. where draining part of the workload creates
+disruption in the service it provides).
+
+Ultimately, we cannot solve the issue of problematic user workload configurations because
+they are intentionally representable with Kubernetes APIs (e.g. it may be the user's actual intention to prevent a pod
+from being drained, or it may be too expensive to make a workload fully HA). When confronted with
+problematic workloads, the current, fully self-managed, OpenShift update process can appear to the end-user
+to be unreliable or slow. This is because the self-managed update process takes on the end-to-end responsibility
+of updating the control-plane and worker-nodes. Given the automated and somewhat opaque nature of this
+update, it is reasonable for users to expect that the process is hands-off and will complete in a timely
+manner regardless of their workloads.
+
+When this expectation is violated because of problematic user workloads, the update process is
+often called into question. For example, if an update appears stalled after 12 hours, a
+user is likely to have a poor perception of the platform and open a support ticket before
+successfully diagnosing an underlying undrainable workload.
+
+By separating control-plane and worker-node updates into two distinct phases for an operator to consider,
+we can more clearly communicate (1) the reliability and speed of OpenShift control-plane updates and
+(2) the shared responsibility, along with the end user, of successfully updating worker-nodes.
+
+As an analogy, when you are late to work because of delays in a subway system, you blame the subway system.
+They own the infrastructure and schedules and have every reason to provide reliable and predictable transport.
+If, instead, you are late to work because you step into a fully automated car that gets stuck in traffic, you blame the
+traffic. The fully self-managed update process suggests to the end user that it is a subway -- subtly insulating
+them from the fact that they might well hit traffic (problematic user workloads). By separating the update journey into
+two parts - a subway portion (the control-plane) and a self-driving car portion (worker-nodes) - we can quickly build the
+user's intuition about their responsibilities in the latter part of the journey. For example, leaving earlier to
+avoid traffic or staying at a hotel after the subway to optimize their departure for the car ride.
+
+##### Node Update Separation to Improve Risk Mitigation Strategies
+With any cluster update, there is risk -- software is changing and even subtle differences in behavior can cause
+issues given an unlucky combination of factors. Teams responsible for cluster operations are familiar with these
+risks and owe it to their stakeholders to minimize them where possible.
+ +The current, fully self-managed, update process makes one obvious risk mitigation strategy +a relatively advanced strategy to employ: only updating the control-plane and leaving worker-nodes as-is. +It is possible by pausing machine config pools, but this is certainly not an intuitive step for users. Farther back +in OpenShift 4's history, the strategy was not even safe to perform since it could lead to worker-node +certificates to expiring. + +By separating the control-plane and worker-node updates into two separate steps, we provide a clear +and intuitive method of deferring worker-node updates: not initiating them. Leaving this to the user's +discretion, within safe skew-bounds, gives them the flexibility to make the right choices for their +unique circumstances. + +#### Enhancing Operational Control +The preceding section delved deeply into a motivation for Change Management / Maintenance Schedules based on our desire to +separate control-plane and worker-node updates without increasing operational burden on end-users. However, +Change Management, by providing control over exactly when updates & material changes to nodes in +the cluster can be initiated, provide value irrespective of this strategic direction. The benefit of +controlling exactly when changes are applied to critical systems is universally appreciated in enterprise +software. + +Since these are such well established principles, I will summarize the motivation as helping +OpenShift meet industry standard expectations with respect to limiting potentially disruptive change +outside well planned time windows. + +It could be argued that rigorous and time sensitive management of OpenShift cluster API resources could prevent +unplanned material changes, but Change Management / Maintenance Schedules introduce higher level, platform native, and more +intuitive guard rails. For example, consider the common pattern of a gitops configured OpenShift cluster. +If a user wants to introduce a change to a MachineConfig, it is simple to merge a change to the +resource without appreciating the fact that it will trigger a rolling reboot of nodes in the cluster. + +Trying to merge this change at a particular time of day and/or trying to pause and unpause a +MachineConfigPool to limit the impact of that merge to a particular time window requires +significant forethought by the user. Even with that forethought, if an enterprise wants +changes to only be applied during weekends, additional custom mechanics would need +to be employed to ensure the change merged during the weekend without needing someone present. + +Contrast this complexity with the user setting a Change Management / Maintenance Schedule on the cluster. The user +is then free to merge configuration changes and gitops can apply those changes to OpenShift +resources, but material change to the cluster will not be initiated until a time permitted +by the Maintenance Schedule. Users do not require special insight into the implications of +configuring platform resources as the top-level Maintenance Schedule control will help ensure +that potentially disruptive changes are limited to well known time windows. + +#### Reducing Service Delivery Operational Tooling +Service Delivery, operating Red Hat's Managed OpenShift offerings (OpenShift Dedicated (OSD), +Red Hat OpenShift on AWS (ROSA) and Azure Red Hat OpenShift (ARO) ) is keenly aware of +the issues motivating the Change Management / Maintenance Schedule concept. 
This is evidenced by their design
+and implementation of tooling to fill the gaps in the platform that the preceding sections
+suggest exist.
+
+Specifically, Service Delivery has developed UXs outside the platform which allow customers
+to define a preferred maintenance window. For example, when requesting an update, the user
+can specify the desired start time. This is honored by Service Delivery tooling (unless
+there are reasons to supersede the customer's preference).
+
+By acknowledging the need for scheduled maintenance in the platform, we reduce the need for Service
+Delivery to develop and maintain custom tooling to manage the platform while
+simultaneously simplifying management for all customers facing similar challenges.
+
+### User Stories
+For readability, "cluster lifecycle administrator" is used repeatedly in the user stories. This
+term can apply to different roles depending on the cluster environment and profile. In general,
+it is the person or team making most material changes to the cluster - including planning and
+choosing when to enact phases of the OpenShift platform update.
+
+For HCP, the role is called the [Cluster Service Consumer](https://hypershift-docs.netlify.app/reference/concepts-and-personas/#personas). For
+Standalone clusters, this role would normally be filled by one or more `system:admin` users. There
+may be several layers of abstraction between the cluster lifecycle administrator and changes being
+actuated on the cluster (e.g. gitops, OCM, Hive, etc.), but the role will still be concerned with limiting
+risks and disruption when rolling out changes to their environments.
+
+> "As a cluster lifecycle administrator, I want to ensure any material changes to my cluster
+> (control-plane or worker-nodes) are only initiated during well known windows of low service
+> utilization to reduce the impact of any service disruption."
+
+> "As a cluster lifecycle administrator, I want to ensure any material changes to my
+> control-plane are only initiated during well known windows of low service utilization to
+> reduce the impact of any service disruption."
+
+> "As a cluster lifecycle administrator, I want to ensure that no material changes to my
+> cluster occur during a known date range even if it falls within our
+> normal maintenance schedule due to an anticipated atypical usage (e.g. Black Friday)."
+
+> "As a cluster lifecycle administrator, I want to pause additional material changes from
+> taking place when it is no longer practical to monitor for service disruptions. For example,
+> if a worker-node update is proving to be problematic during a valid permissive window, I would
+> like to be able to pause that change manually so that the team will not have to work on the weekend."
+
+> "As a cluster lifecycle administrator, I need to stop all material changes on my cluster
+> quickly and indefinitely until I can understand a potential issue. I do not want to consider dates or
+> timezones in this delay as they are not known and are irrelevant to my immediate concern."
+
+> "As a cluster lifecycle administrator, I want to ensure any material changes to my
+> control-plane are only initiated during well known windows of low service utilization to
+> reduce the impact of any service disruption. Furthermore, I want to ensure that material
+> changes to my worker-nodes occur on a less frequent cadence because I know my workloads
+> are not HA."
+ +> "As an SRE, tasked with performing non-emergency corrective action, I want +> to be able to apply a desired configuration (e.g. PID limit change) and have that change roll out +> in a minimally disruptive way subject to the customer's configured maintenance schedule." + +> "As an SRE, tasked with performing emergency corrective action, I want to be able to +> quickly disable a configured maintenance schedule, apply necessary changes, have them roll out immediately, +> and restore the maintenance schedule to its previous configuration." + +> "As a leader within the Service Delivery organization, tasked with performing emergency corrective action +> across our fleet, I want to be able to bypass and then restore customer maintenance schedules +> with minimal technical overhead." + +> "As a cluster lifecycle administrator who is well served by a fully managed update without change management, +> I want to be minimally inconvenienced by the introduction of change management / maintenance schedules." + +> "As a cluster lifecycle administrator who is not well served by a fully managed update and needs exacting +> control over when material changes occur on my cluster where opportunities do NOT arise at reoccurring intervals, +> I want to employ a change management strategy that defers material changes until I perform a manual action." + +> "As a cluster lifecycle administrator, I want to easily determine the next time at which maintenance operations +> will be permitted to be initiated, based on the configured maintenance schedule, by looking at the +> status of relevant API resources or metrics." + +> "As a cluster lifecycle administrator, I want to easily determine whether there are material changes pending for +> my cluster, awaiting a permitted window based on the configured maintenance schedule, by looking at the +> status of relevant API resources or metrics." + +> "As a cluster lifecycle administrator, I want to easily determine whether a maintenance schedule is currently being +> enforced on my cluster by looking at the status of relevant API resources or metrics." + +> "As a cluster lifecycle administrator, I want to be able to alert my operations team when changes are pending, +> when and the number of seconds to the next permitted window approaches, or when a maintenance schedule is not being +> enforced on my cluster." + +> "As a cluster lifecycle administrator, I want to be able to diagnose why pending changes have not been applied +> if I expected them to be." + +> "As a cluster administrator or privileged user familiar with OpenShift prior to the introduction of change management, +> I want it to be clear when I am looking at the desired versus actual state of the system. For example, if I can see +> the state of the clusterversion or a machineconfigpool, it should be straightforward to understand why I am +> observing differences in the state of those resources compared to the state of the system." + +### Goals + +1. Indirectly support the strategic separation of control-plane and worker-node update phases for Standalone clusters by supplying a change control mechanism that will allow both control-plane and worker-node updates to proceed at predictable times without doubling operational overhead. +2. Directly support the strategic separation of control-plane and worker-node update phases by implementing a "manual" change management strategy where users who value the full control of the separation can manually actuate changes to them independently. +3. 
Empower OpenShift cluster lifecycle administrators with tools that simplify implementing industry standard notions of maintenance windows. +4. Provide Service Delivery a platform native feature which will reduce the amount of custom tooling necessary to provide maintenance windows for customers. +5. Deliver a consistent change management experience across all platforms and profiles (e.g. Standalone, ROSA, HCP). +6. Enable SRE to, when appropriate, make configuration changes on a customer cluster and have that change actually take effect only when permitted by the customer's change management preferences. +7. Do not subvert expectations of customers well served by the existing fully self-managed cluster update. +8. Ensure the architectural space for enabling different change management strategies in the future. + +### Non-Goals + +1. Allowing control-plane upgrades to be paused midway through an update. Control-plane updates are relatively rapid and pausing will introduce unnecessary complexity and risk. +2. Requiring the use of maintenance schedules for OpenShift upgrades (the changes should be compatible with various upgrade methodologies – including being manually triggered). +3. Allowing Standalone worker-nodes to upgrade to a different payload version than the control-plane (this is supported in HCP, but is not a goal for standalone). +4. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement. +5. Providing strict promises around the exact timing of upgrade processes. Maintenance schedules will be honored to a reasonable extent (e.g. upgrade actions will only be initiated during a window), but long-running operations may exceed the configured end of a maintenance schedule. +6. Implementing logic to defend against impractical maintenance schedules (e.g. if a customer configures a 1-second maintenance schedule every year). Service Delivery may want to implement such logic to ensure upgrade progress can be made. + +## Proposal + +### Change Management Overview +Add a `changeManagement` stanza to several resources in the OpenShift ecosystem: +- HCP's `HostedCluster`. Honored by HyperShift Operator and supported by underlying CAPI primitives. +- HCP's `NodePool`. Honored by HyperShift Operator and supported by underlying CAPI primitives. +- Standalone's `ClusterVersion`. Honored by Cluster Version Operator. +- Standalone's `MachineConfigPool`. Honored by Machine Config Operator. + +The implementation of `changeManagement` will vary by profile +and resource, however, they will share a core schema and provide a reasonably consistent user +experience across profiles. + +The schema will provide options for controlling exactly when changes to API resources on the +cluster can initiate material changes to the cluster. Changes that are not allowed to be +initiated due to a change management control will be called "pending". Subsystems responsible +for initiating pending changes will await a permitted window according to the change's +relevant `changeManagement` configuration(s). + +### Change Management Strategies +Each resource supporting change management will add the `changeManagement` stanza and support a minimal set of change management strategies. +Each strategy may require an additional configuration element within the stanza. 
For example:
+```yaml
+spec:
+  changeManagement:
+    strategy: "MaintenanceSchedule"
+    pausedUntil: false
+    disabledUntil: false
+    config:
+      maintenanceSchedule:
+        ..options to configure a detailed policy for the maintenance schedule..
+```
+
+All change management implementations must support `Disabled` and `MaintenanceSchedule`. Abstracting
+change management into strategies allows for simplified future expansion or deprecation of strategies.
+Tactically, `strategy: Disabled` provides a convenient syntax for bypassing any configured
+change management policy without permanently deleting its configuration.
+
+For example, if SRE needs to apply emergency corrective action on a cluster with a `MaintenanceSchedule` change
+management strategy configured, they can simply set `strategy: Disabled` without having to delete the existing
+`maintenanceSchedule` stanza which configures the previous strategy. Once the corrective action has been completed,
+SRE simply restores `strategy: MaintenanceSchedule` and the previous configuration begins to be enforced.
+
+Configurations for multiple management strategies can be recorded in the `config` stanza, but
+only one strategy can be active at a given time.
+
+Each strategy will support a policy for pausing or unpausing (permitting) the initiation of material changes.
+This will be referred to as the strategy's enforcement state (or just "state"). The enforcement state for a
+strategy can be either "paused" or "unpaused" (a.k.a. "permissive"). The `Disabled` strategy enforcement state
+is always permissive -- allowing material changes to be initiated (see [Change Management
+Hierarchy](#change-management-hierarchy) for caveats).
+
+All change management strategies, except `Disabled`, are subject to the following `changeManagement` fields:
+- `changeManagement.disabledUntil: <true|RFC3339 date>`: When `disabledUntil: true` or `disabledUntil: <future date>`, the interpreted strategy for
+  change management in the resource is `Disabled`. Setting a future date in `disabledUntil` offers a less invasive (i.e. no important configuration needs to be changed) method to
+  disable change management constraints (e.g. if it is critical to roll out a fix) and a method that
+  does not need to be reverted (i.e. it will naturally expire after the specified date and the configured
+  change management strategy will re-activate).
+- `changeManagement.pausedUntil: <true|RFC3339 date>`: Unless the effective active strategy is Disabled, when `pausedUntil: true` or `pausedUntil: <future date>`, change management must
+  pause material changes.
+
+### Change Management Status
+Change Management information will also be reflected in resource status. Each resource
+which contains the stanza in its `spec` will expose its current impact in its `status`.
+Common user interfaces for aggregating and displaying progress of these underlying resources
+should be updated to proxy that status information to the end users.
+
+### Change Management Metrics
+Cluster wide change management information will be made available through cluster metrics. Each resource
+containing the stanza must expose the following metrics:
+- The number of seconds until the next known permitted change window. 0 if changes can currently be initiated. -1 if changes are paused indefinitely. -2 if no permitted window can be computed.
+- Whether any change management strategy is enabled.
+- Which change management strategy is enabled.
+- If changes are pending due to change management controls.
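+
+As an illustrative sketch only (this enhancement does not define final metric or alert names; the names below
+are hypothetical examples), an operations team could combine metrics of this shape to implement the alerts
+described in the user stories, e.g. warning when pending changes will be initiated within the next 48 hours:
+
+```yaml
+# Hypothetical sketch: metric and alert names are examples, not part of this enhancement.
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: change-management-example-alerts
+spec:
+  groups:
+  - name: change-management
+    rules:
+    # Changes are pending and the next permitted window opens in less than 2 days.
+    - alert: ChangeManagementWindowApproaching
+      expr: |
+        change_management_changes_pending == 1
+        and change_management_next_window_seconds > 0
+        and change_management_next_window_seconds < 172800
+      labels:
+        severity: info
+    # No change management strategy is currently enforced for the resource.
+    - alert: ChangeManagementNotEnforced
+      expr: change_management_strategy_enabled == 0
+      labels:
+        severity: warning
+```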
+
+### Change Management Hierarchy
+Material changes to worker-nodes are constrained by change management policies in their associated resource AND
+at the control-plane resource. For example, in a standalone profile, if a MachineConfigPool's change management
+configuration apparently permits material changes to be initiated at a given moment, that is only the case
+if ClusterVersion is **also** permitting changes to be initiated at that time.
+
+The design choice is informed by a thought experiment: As a cluster lifecycle administrator for a Standalone cluster,
+who wants to achieve the simple goal of ensuring no material changes take place outside a well-defined
+maintenance schedule, do you want the challenge of keeping every MachineConfigPool's
+`changeManagement` stanza in perfect synchronization with the ClusterVersion's? What if a new MCP is created
+without your knowledge?
+
+The hierarchical approach allows a single master change management policy to be in place across
+both the control-plane and worker-nodes.
+
+Conversely, material changes CAN take place on the control-plane when permitted by its associated
+change management policy even while material changes are not being permitted by worker-node
+policies.
+
+It is thus occasionally necessary to distinguish a resource's **configured** vs **effective** change management
+state. There are two states: "paused" and "unpaused" (a.k.a. permissive; meaning that material changes can be initiated).
+For a control-plane resource, the configured and effective enforcement states are always the same. For worker-node
+resources, the configured strategy may be disabled, but the effective enforcement state can be "paused" due to
+an active strategy in the control-plane resource being in the "paused" state.
+
+| control-plane state | worker-node state | worker-node effective state | results |
+|---------------------|-------------------|-----------------------------|---------|
+| unpaused | unpaused | unpaused | Traditional, fully self-managed change rollouts. Material changes can be initiated immediately upon configuration change. |
+| paused (any strategy) | **unpaused** | **paused** | Changes to both the control-plane and worker-nodes are constrained by the control-plane strategy. |
+| unpaused | paused (any strategy) | paused | Material changes can be initiated immediately on the control-plane. Material changes on worker-nodes are subject to the worker-node policy. |
+| paused (any strategy) | paused (any strategy) | paused | Material changes to the control-plane are subject to the change control strategy for the control-plane. Material changes to the worker-nodes are subject to **both** the control-plane and worker-node strategies - if either precludes material change initiation, changes are left pending. |
+
+#### Maintenance Schedule Strategy
+The maintenance schedule strategy is supported by all resources which support change management. The strategy
+is configured by specifying an RRULE identifying permissive datetimes during which material changes can be
+initiated. The cluster lifecycle administrator can also exclude specific date ranges, during which
+material changes will be paused.
+
+#### Disabled Strategy
+This strategy indicates that no change management strategy is being enforced by the resource.
It always implies that
+the enforcement state at the resource level is unpaused / permissive. Due to change management hierarchies, this does not always
+mean that material changes are permitted. For example, a MachineConfigPool
+with `strategy: Disabled` would still be subject to a `strategy: MaintenanceSchedule` in the ClusterVersion resource.
+
+#### Assisted Strategy - MachineConfigPool
+Minimally, this strategy will be supported by MachineConfigPool. If and when the strategy is supported by other
+change management capable resources, the configuration schema for the policy may differ as the details of
+what constitutes and informs change varies between resources.
+
+This strategy is motivated by the desire to support the separation of control-plane and worker-node updates both
+conceptually for users and in real technical terms. One way to do this for users who do not benefit from the
+`MaintenanceSchedule` strategy is to ask them to initiate, pause, and resume the rollout of material
+changes to their worker nodes. Contrast this with the fully self-managed state today, where worker-nodes
+(normally) begin to be updated automatically and directly after the control-plane update.
+
+Clearly, if this were the only mode of updating worker-nodes, we could never successfully disentangle the
+concepts of control-plane vs worker-node updates in Standalone environments since one implies the other.
+
+In short (details will follow in the implementation section), the assisted strategy allows users to specify the
+exact rendered [`desiredConfig` that the MachineConfigPool](https://github.com/openshift/machine-config-operator/blob/5112d4f8e562a2b072106f0336aeab451341d7dc/docs/MachineConfigDaemon.md#coordinating-updates) should be advertising to the MachineConfigDaemon on
+nodes it is associated with. Like the `MaintenanceSchedule` strategy, it also respects the `pausedUntil`
+field.
+
+#### Manual Strategy - MachineConfigPool
+Minimally, this strategy will be supported by MachineConfigPool. If and when the strategy is supported by other
+change management capable resources, the configuration schema for the policy may differ as the details of
+what constitutes and informs change varies between resources.
+
+Like the Assisted strategy, this strategy is implemented to support the conceptual and technical separation
+of control-plane and worker-nodes. The MachineConfigPool Manual strategy allows users to explicitly specify
+their `desiredConfig` to be used for ignition of new and rebooting nodes. While the Manual strategy is enabled,
+the MachineConfigOperator will not trigger the MachineConfigDaemon to drain or reboot nodes automatically.
+
+Because the Manual strategy never initiates changes on its own behalf, `pausedUntil` has no effect. From a metrics
+perspective, this strategy reports as paused indefinitely.
+
+### Workflow Description
+
+#### OCM HCP Standard Change Management Scenario
+
+1. A [Cluster Service Consumer](https://hypershift-docs.netlify.app/reference/concepts-and-personas/#personas) requests an HCP cluster via OCM.
+1. To comply with their company policy, the service consumer configures a maintenance schedule through OCM.
+1. Their first preference, no updates at all, is rejected by OCM policy, and they are referred to service
+   delivery documentation explaining minimum requirements.
+1. The user specifies a policy which permits changes to be initiated any time Saturday UTC on the control-plane.
+1. 
To limit perceived risk, they try to specify a separate policy permitting worker-node updates only on the **first** Sunday of each month.
+1. OCM rejects the configuration because, due to change management hierarchy, worker-node maintenance schedules can only be a proper subset of control-plane maintenance schedules.
+1. The user changes their preference to a policy permitting worker-node updates only on the **first** Saturday of each month.
+1. OCM accepts the configuration.
+1. OCM configures the HCP (HostedCluster/NodePool) resources via the Service Delivery Hive deployment to contain a `changeManagement` stanza
+   and an active/configured `MaintenanceSchedule` strategy.
+1. Hive updates the associated HCP resources.
+1. Company workloads are added to the new cluster and the cluster provides value.
+1. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform.
+1. Via OCM, the service consumer requests the minor version update. They can do this at any time with confidence that the maintenance
+   schedule will be honored. They do so on Wednesday.
+1. OCM (through various layers) updates the target release payload in the HCP HostedCluster and NodePool.
+1. The HyperShift Operator detects the desired changes but recognizes that the `changeManagement` stanza
+   precludes the updates from being initiated.
+1. Curious, the service consumer checks the ClusterVersion within the HostedCluster and reviews its `status` stanza. It shows that changes are pending and the time of the next window in which changes can be initiated.
+1. Separate metrics specific to change management indicate that changes are pending for both resources.
+1. The non-Red Hat operations team has alerts set up to fire when changes are pending and the number of
+   seconds before the next permitted window is less than 2 days away.
+1. These alerts fire after Thursday UTC 00:00 to inform the operations team that changes are about to be applied to the control-plane.
+1. It is not the first week of the month, so there is no alert fired for the NodePool pending changes.
+1. The operations team is comfortable with the changes being rolled out on the control-plane.
+1. On Saturday 00:00 UTC, the HyperShift operator initiates the control-plane update.
+1. The update completes without issue.
+1. Changes remain pending for the NodePool resource.
+1. As the first Saturday of the month approaches, the operations alerts fire to inform the team of forthcoming changes.
+1. The operations team realizes that a corporate team needs to use the cluster heavily during the weekend for a business critical deliverable.
+1. The service consumer logs into OCM and adds an exclusion for the upcoming Saturday.
+1. Interpreting the new exclusion, the metric for time remaining until a permitted window increases to count down to the following month's first Saturday.
+1. A month passes and the pending changes cause the configured alerts to fire again.
+1. The operations team is comfortable with the forthcoming changes.
+1. The first Saturday of the month 00:00 UTC arrives. The HyperShift operator initiates the worker-node updates based on the pending changes in the cluster NodePool.
+1. The HCP cluster has a large number of worker nodes and draining and rebooting them is time-consuming.
+1. At 23:59 UTC Saturday night, 80% of worker-nodes have been updated. Since the maintenance schedule still permits the initiation of material changes, another worker-node begins to be updated.
+1. 
The update of this worker-node continues, but at 00:00 UTC Sunday, no further material changes are permitted by the change management policy and the worker-node update process is effectively paused.
+1. Because not all worker-nodes have been updated, changes are still reported as pending via metrics for NodePool. **TODO: Review with HyperShift. Pausing progress should be possible, but a metric indicating changes still pending may not since they interact only through CAPI.**
+1. The HCP cluster runs with worker-nodes at mixed versions throughout the month. The N-1 skew between the old kubelet versions and control-plane is supported.
+1. **TODO: Review with Service Delivery. If the user requested another minor bump to their control-plane, how does OCM prevent unsupported version skew today?**
+1. On the next first Saturday, the worker-node updates are completed.
+
+#### OCM Standalone Standard Change Management Scenario
+
+1. User interactions with OCM to configure a maintenance schedule are identical to [OCM HCP Standard Change Management Scenario](#ocm-hcp-standard-change-management-scenario).
+   This scenario differs after OCM accepts the maintenance schedule configuration. Control-plane updates are permitted to be initiated on any Saturday UTC.
+   Worker-nodes must wait until the first Saturday of the month.
+2. OCM (through various layers) configures the ClusterVersion and worker MachineConfigPool(s) (MCP) for the cluster with appropriate `changeManagement` stanzas.
+3. Company workloads are added to the new cluster and the cluster provides value.
+4. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform.
+5. Via OCM, the service consumer requests the minor version update. They can do this at any time with confidence that the maintenance
+   schedule will be honored. They do so on Wednesday.
+6. OCM (through various layers) updates the ClusterVersion resource on the cluster indicating the new release payload in `desiredUpdate`.
+7. The Cluster Version Operator (CVO) detects that its `changeManagement` stanza does not permit the initiation of the change.
+8. The CVO sets a metric indicating that changes are pending for ClusterVersion. Irrespective of pending changes, the CVO also exposes a
+   metric indicating the number of seconds until the next window in which material changes can be initiated.
+9. Since the MachineConfigs in the desired update and the current manifests likely do not match (RHCOS changes occur 100% of the time for non-hotfix updates),
+   the CVO also sets a metric indicating that MachineConfig changes are pending. This is an assumption, but the price of being wrong
+   on rare occasions is very low (pending changes will be reported, but disappear shortly after a permissive window begins).
+   This is done because the MachineConfigOperator (MCO) cannot anticipate the coming manifest changes and cannot,
+   therefore, reflect expected changes to the worker-node MCPs. Anticipating this change ahead of time is necessary for an operations
+   team to be able to set an alert with the semantics (worker-node-update changes are pending & time remaining until changes are permitted < 2d).
+   The MCO will expose its own metric for changes pending when manifests are updated. But this metric will only indicate when
+   there are machines in the pool that have not achieved the desired configuration. 
An operations team trying to implement the 2d + early warning for worker-nodes must use OR on these metrics to determine whether changes are actually pending. +10. The MCO, irrespective of pending changes, exposes a metric for each MCP to indicate the number of seconds remaining until it is + permitted to initiate changes to nodes in that MCP. +11. A privileged user on the cluster notices different options available for `changeManagement` in the ClusterVersion and MachineConfigPool + resources. They try to set them but are prevented by a validating admission controller. If they wish + to change the settings, they must update them through OCM. +12. The privileged user does an `oc describe ...` on the resources. They can see that material changes are pending in ClusterVersion for + the control-plane and for worker machine config. They can also see the date and time that the next material change will be permitted. + The MCP will not show a pending change at this time, but will show the next time at which material changes will be permitted. +13. The next Saturday is _not_ the first Saturday of the month. The CVO detects that material changes are permitted at 00:00 UTC and + begins to apply manifests. This effectively initiates the control-plane update process, which is considered a single + material change to the cluster. +14. The control-plane update succeeds. The CVO, having reconciled its state, unsets metrics suggesting changes are pending. +15. As part of updating cluster manifests, MachineConfigs have been modified. The MachineConfigOperator (MCO) re-renders a + configuration for worker-nodes. However, because the MCP maintenance schedule precludes initiating material changes, + it will not begin to update Machines with that desired configuration. +16. The MCO will set a metric indicating that desired changes are pending. +17. `oc get -o=yaml/describe` will both provide status information indicating that changes are pending for the MCP and + the time at which the next material changes can be initiated according to the maintenance schedule. +18. On the first Saturday of the next month, 00:00 UTC, the MCO determines that material changes are permitted. + Based on limits like maxUnavailable, the MCO begins to annotate nodes with the desiredConfiguration. The + MachineConfigDaemon takes over from there, draining, and rebooting nodes into the updated release. +19. There are a large number of nodes in the cluster and this process continues for more than 24 hours. On Saturday + 23:59, the MCO applies a round of desired configurations annotations to Nodes. At 00:00 on Sunday, it detects + that material changes can no longer be initiated, and pauses its activity. Node updates that have already + been initiated continue beyond the maintenance schedule window. +20. Since not all nodes have been updated, the MCO continues to expose a metric informing the system of + pending changes. +21. In the subsequent days, the cluster is scaled up to handle additional workload. The new nodes receive + the most recent, desired configuration. +22. On the first Saturday of the next month, the MCO resumes its work. In order to ensure that forward progress is + made for all nodes, the MCO will update nodes that have the oldest current configuration first. This ensures + that even if the desired configuration has changed multiple times while maintenance was not permitted, + no nodes are starved of updates. 
Consider the alternative where (a) worker-node updates required > 24h,
+    (b) updates to nodes are performed alphabetically, and (c) MachineConfigs are frequently being changed
+    during times when maintenance is not permitted. This strategy could leave nodes that sort last
+    lexicographically with no opportunity to receive updates. This scenario would eventually leave those nodes
+    more prone to version skew issues.
+23. During this window of time, all node updates are initiated, and they complete successfully.
+
+#### Service Delivery Emergency Patch
+1. SRE determines that a significant new CVE threatens the fleet.
+1. A new OpenShift release in each z-stream fixes the problem.
+1. SRE plans to override customer maintenance schedules in order to rapidly remediate the problem across the fleet.
+1. The new OpenShift release(s) are configured across the fleet. Clusters with permissive maintenance
+   schedules begin to apply the changes immediately.
+1. Clusters with change management policies precluding updates are SRE's next focus.
+1. During each region's evening hours, to limit disruption, SRE changes the `changeManagement` strategy
+   field across relevant resources to `Disabled`. Changes that were previously pending are now
+   permitted to be initiated.
+1. Cluster operators who have alerts configured to fire when there is no change management policy in place
+   will see those alerts fire.
+1. As clusters are successfully remediated, SRE restores the `MaintenanceSchedule` strategy for its resources.
+
+
+#### Service Delivery Immediate Remediation
+1. A customer raises a ticket for a problem that is eventually determined to be caused by a worker-node system configuration.
+1. SRE can address the issue with a system configuration file applied in a MachineConfig.
+1. SRE creates the MachineConfig for the customer and provides the customer the option to either (a) wait until their
+   configured maintenance schedule permits the material change to be initiated by the MachineConfigOperator
+   or (b) have SRE override the maintenance schedule, permitting its immediate application.
+1. The customer chooses immediate application.
+1. SRE applies a change to the relevant control-plane AND worker-node resource's `changeManagement` stanza
+   (both must be changed because of the change management hierarchy), setting `disabledUntil` to
+   a time 48 hours in the future. The configured change management schedule is ignored for 48 hours as the system
+   initiates all necessary node changes.
+
+#### Service Delivery Deferred Remediation
+1. A customer raises a ticket for a problem that is eventually determined to be caused by a worker-node system configuration.
+1. SRE can address the issue with a system configuration file applied in a MachineConfig.
+1. SRE creates the MachineConfig for the customer and provides the customer the option to either (a) wait until their
+   configured maintenance schedule permits the material change to be initiated by the MachineConfigOperator
+   or (b) modify change management to permit immediate application (e.g. setting `disabledUntil`).
+1. The problem is not pervasive, so the customer chooses the deferred remediation.
+1. The change is initiated and nodes are rebooted during the next permissive window.
+
+
+#### On-prem Standalone GitOps Change Management Scenario
+1. An on-prem cluster is fully managed by gitops. As changes are committed to git, those changes are applied to cluster resources.
+1. 
Configurable stanzas of the ClusterVersion and MachineConfigPool(s) resources are checked into git.
+1. The cluster lifecycle administrator configures `changeManagement` in both the ClusterVersion and worker MachineConfigPool
+   in git. The MaintenanceSchedule strategy is chosen. The policy permits control-plane and worker-node updates only after
+   19:00 Eastern US.
+1. During the working day, users may contribute and merge changes to MachineConfigs or even the `desiredUpdate` of the
+   ClusterVersion. These resources will be updated in a timely manner via GitOps.
+1. Despite the resource changes, neither the CVO nor MCO will begin to initiate the material changes on the cluster.
+1. Privileged users who may be curious as to the discrepancy between git and the cluster state can use `oc get -o=yaml/describe`
+   on the resources. They observe that changes are pending and the time at which changes will be initiated.
+1. At 19:00 Eastern, the pending changes begin to be initiated. This rollout abides by documented OpenShift constraints
+   such as the MachineConfigPool `maxUnavailable` setting.
+
+#### On-prem Standalone Manual Strategy Scenario
+1. A small, business critical cluster is being run on-prem.
+1. There are no reoccurring windows of time when the cluster lifecycle administrator can tolerate downtime.
+   Instead, updates are negotiated and planned far in advance.
+1. The cluster workloads are not HA and unplanned drains are considered a business risk.
+1. To prevent surprises, the cluster lifecycle administrator sets the Manual strategy on the worker MCP.
+1. Given the sensitivity of the operation, the lifecycle administrator wants to manually drain and reboot
+   nodes to accomplish the update.
+1. The cluster lifecycle administrator sends a company-wide notice about the period during which service may be disrupted.
+1. The user determines the most recent rendered worker configuration. They configure the `manual` change
+   management policy to use that exact configuration as the `desiredConfig`.
+1. The MCO is thus being asked to ignite any new node or rebooted node with the desired configuration, but it
+   is **not** being permitted to apply that configuration to existing nodes because change management is, in effect,
+   paused indefinitely by the manual strategy.
+1. The MCO metric for the MCP indicating the number of seconds remaining until changes can be initiated is `-1` - indicating
+   that there is presently no time in the future where it will initiate material changes. The operations team
+   has an alert configured if this value `!= -1`.
+1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running
+   the most recently rendered configuration. This is irrespective of the `desiredConfig` in the `manual`
+   policy. Abstractly, it indicates whether changes would be initiated if change management were disabled.
+1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster. As they come back online,
+   the MachineConfigServer offers them the desiredConfig requested by the manual policy.
+1. After updating all nodes, the cluster lifecycle administrator does not need to make any additional
+   configuration changes. They can leave the `changeManagement` stanza in their MCP as-is.
+
+#### On-prem Standalone Assisted Strategy Scenario
+1. A large, business critical cluster is being run on-prem.
+1. There are no reoccurring windows of time when the cluster lifecycle administrator can tolerate downtime. 
Instead, updates are negotiated and planned far in advance.
+1. The cluster workloads are not HA and unplanned drains are considered a business risk.
+1. To prevent surprises, the cluster lifecycle administrator sets the Assisted strategy on the worker MCP.
+1. In the `assisted` strategy change management policy, the lifecycle administrator configures `pausedUntil: true`
+   and the most recently rendered worker configuration in the policy's `renderedConfigsBefore` field.
+1. The MCO is being asked to ignite any new node or any rebooted node with the latest rendered configuration
+   before the present datetime. However, because of `pausedUntil: true`, it is also being asked not to
+   automatically initiate that material change for existing nodes.
+1. The MCO metric for the MCP indicating the number of seconds remaining until changes can be initiated is `-1` - indicating
+   that there is presently no time in the future where it will initiate material changes. The operations team
+   has an alert configured if this value `!= -1`.
+1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running
+   the most recently rendered configuration. This is irrespective of the `renderedConfigsBefore` in the `assisted`
+   configuration. Abstractly, it indicates whether changes would be initiated if change management were disabled.
+1. When the lifecycle administrator is ready to permit disruption, they set `pausedUntil: false`.
+1. The MCO sets the number of seconds until changes are permitted to `0`.
+1. The MCO begins to initiate worker node updates. This rollout abides by documented OpenShift constraints
+   such as the MachineConfigPool `maxUnavailable` setting.
+1. Though new rendered configurations may be created, the assisted strategy will not act until the assisted policy
+   is updated to permit a more recent creation date.
+
+### API Extensions
+
+API Extensions are CRDs, admission and conversion webhooks, aggregated API servers,
+and finalizers, i.e. those mechanisms that change the OCP API surface and behaviour.
+
+- Name the API extensions this enhancement adds or modifies.
+- Does this enhancement modify the behaviour of existing resources, especially those owned
+  by other parties than the authoring team (including upstream resources), and, if yes, how?
+  Please add those other parties as reviewers to the enhancement.
+
+  Examples:
+  - Adds a finalizer to namespaces. Namespace cannot be deleted without our controller running.
+  - Restricts the label format for objects to X.
+  - Defaults field Y on object kind Z.
+
+Fill in the operational impact of these API Extensions in the "Operational Aspects
+of API Extensions" section.
+
+### Topology Considerations
+
+#### Hypershift / Hosted Control Planes
+
+In the HCP topology, the HostedCluster and NodePool resources are enhanced to support the change management strategies
+`MaintenanceSchedule` and `Disabled`.
+
+#### Standalone Clusters
+
+In the Standalone topology, the ClusterVersion and MachineConfigPool resources are enhanced to support the change management strategies
+`MaintenanceSchedule` and `Disabled`. The MachineConfigPool also supports the `Manual` and `Assisted` strategies.
+
+#### Single-node Deployments or MicroShift
+
+The Cluster Version Operator will honor the change management field just as in a standalone profile. If those profiles
+have a MachineConfigPool, material changes to the node could be controlled with a change management policy
+in that resource. 
+ +#### OCM Managed Profiles +OpenShift Cluster Manager (OCM) should expose a user interface allowing users to manage their change management policy. +Standard Fleet clusters will expose the option to configure the MaintenanceSchedule strategy - including +only permit and exclude times. + +- Service Delivery will reserve the right to disable this strategy for emergency corrective actions. +- Service Delivery should constrain permit & exclude configurations based on their internal policies. For example, customers may be forced to enable permissive windows which amount to at least 6 hours a month. + +### Implementation Details/Notes/Constraints + +#### ChangeManagement Stanza +The change management stanza will be introduced into ClusterVersion and MachineConfigPool (for standalone profiles) +and HostedCluster and NodePool (for HCP profiles). The structure of the stanza is: + +```yaml +spec: + changeManagement: + # The active strategy for change management (unless disabled by disabledUntil). + strategy: + + # If set to true or a future date, the effective change management strategy is Disabled. Date + # must be RFC3339. + disabledUntil: + + # If set to true or a future date, all strategies other than Disabled are paused. Date + # must be RFC3339. + pausedUntil: + + # If a strategy needs additional configuration information, it can read a + # key bearing its name in the config stanza. + config: + : + ...configuration policy for the strategy... +``` + +#### MaintenanceSchedule Strategy Configuration + +```yaml +spec: + changeManagement: + strategy: MaintenanceSchedule + config: + maintenanceSchedule: + # Specifies a reoccurring permissive window. + permit: + # RRULEs (https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) are commonly used + # for calendar management metadata. Only a subset of the RFC is supported. + # See "RRULE Constraints" section for details. + # If unset, all dates are permitted and only exclude constrains permissive windows. + recurrence: + # Given the identification of a date by an RRULE, at what time (always UTC) can the + # permissive window begin. "00:00" if unset. + startTime: + # Given the identification of a date by an RRULE, after what offset from the startTime should + # the permissive window close. This can create permissive windows within days that are not + # identified in the RRULE. For example, recurrence="FREQ=Weekly;BYDAY=Sa;", + # startTime="20:00", duration="8h" would permit material change initiation starting + # each Saturday at 8pm and continuing through Sunday 4am (all times are UTC). The default + # duration is 24:00-startTime (i.e. to the end of the day). + duration: + + + # Excluded date ranges override RRULE selections. + exclude: + # Dates should be specified in YYYY-MM-DD. Each date is excluded from 00:00 UTC for 24 hours. + - fromDate: + # Non-inclusive until. If null, until defaults to the day after from (meaning a single day exclusion). + untilDate: + # Provide additional detail for status when the cluster is within an exclusion period. + reason: Optional human readable which will be included in status description. + +``` + +Permitted times (i.e. times at which the strategy enforcement state can be permissive) are specified using a +subset of the [RRULE RFC5545](https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) and, optionally, a +starting and ending time of day. https://freetools.textmagic.com/rrule-generator is a helpful tool to +review the basic semantics RRULE is capable of expressing. 
https://exontrol.com/exicalendar.jsp?config=/js#calendar +offers more complex expressions. + +**RRULE Interpretation** +RRULE supports expressions that suggest recurrence without implying an exact date. For example: +- `RRULE:FREQ=YEARLY` - An event that occurs once a year on a specific date. +- `RRULE:FREQ=WEEKLY;INTERVAL=2` - An event that occurs every two weeks. + +All such expressions shall be evaluated with a starting date of Jan 1st, 1970 00:00Z. In other +words, `RRULE:FREQ=YEARLY` would be considered permissive, for one day, at the start of each new year. + +If no `startTime` or `duration` is specified, any day selected by the RRULE will suggest a +permissive 24h window unless a date is in the `exclude` ranges. + +**RRULE Constraints** +A valid RRULE for change management: +- must identify a date, so, although RRULE supports `FREQ=HOURLY`, it will not be supported. +- cannot specify an end for the pattern. `RRULE:FREQ=DAILY;COUNT=3` suggests + an event that occurs every day for three days only. As such, neither `COUNT` nor `UNTIL` is + supported. +- cannot specify a permissive window more than 2 years away. + +**Overview of Interactions** +The MaintenanceSchedule strategy, along with `changeManagement.pausedUntil` allows a cluster lifecycle administrator to express +one of the following: + +| pausedUntil | permit | exclude | Enforcement State (Note that **effective** state must also take into account hierarchy) | +|----------------|--------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `null`/`false` | `null` | `null` | Permissive indefinitely | +| `true` | * | * | Paused indefinitely | +| `null`/`false` | set | `null` | Permissive during reoccurring windows time. Paused at all other times. | +| `null`/`false` | set | set | Permissive during reoccurring windows time modulo excluded date ranges during which it is paused. Paused at all other times. | +| `null`/`false` | `null` | set | Permissive except during excluded dates during which it is paused. | +| date | * | * | Honor permit and exclude values, but only after the specified date. For example, permit: `null` and exclude: `null` implies the strategy is indefinitely permissive after the specified date. | + + +#### MachineConfigPool Assisted Strategy Configuration + +```yaml +spec: + changeManagement: + strategy: Assisted + config: + assisted: + permit: + # The assisted strategy will allow the MCO to process any rendered configuration + # that was created before the specified datetime. + renderedConfigsBefore: + # When AllowSettings, rendered configurations after the preceding before date + # can be applied if and only if they do not contain changes to osImageURL. + policy: "AllowSettings|AllowNone" +``` + +The primary user of this strategy is `oc` with tentatively planned enhancements to include verbs +like: +```sh +$ oc adm update worker-nodes start ... +$ oc adm update worker-nodes pause ... +$ oc adm update worker-nodes rollback ... +``` + +These verbs can leverage the assisted strategy and `pausedUntil` to allow the manual initiation of worker-nodes +updates after a control-plane update. + +#### MachineConfigPool Manual Strategy Configuration + +```yaml +spec: + changeManagement: + strategy: Manual + config: + manual: + desiredConfig: +``` + +The manual strategy requests no automated initiation of updates. 
New and rebooting +nodes will only receive the desired configuration. From a metrics perspective, this strategy +is always paused state. + +#### Metrics + +`cm_change_pending` +Labels: +- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool +- object= +- system= + +Value: +- `0`: no material changes are pending. +- `1`: changes are pending but being initiated. +- `2`: changes are pending and blocked based on this resource's change management policy. +- `3`: changes are pending and blocked based on another resource in the change management hierarchy. + +`cm_change_eta` +Labels: +- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool +- object= +- system= + +Value: +- `-2`: Error determining the time at which changes can be initiated (e.g. cannot check with ClusterVersion / change management hierarchy). +- `-1`: Material changes are paused indefinitely. +- `0`: Any pending changes can be initiated now (e.g. change management is disabled or inside machine schedule window). +- `> 0`: The number seconds remaining until changes can be initiated OR 1000*24*60*60 (1000 days) if no permissive window can be found within the next 1000 days (this ensures a brute force check of intersecting datetimes with hierarchy RRULEs is a valid method of calculating intersection). + +`cm_strategy_enabled` +Labels: +- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool +- object= +- system= +- strategy=MaintenanceSchedule|Manual|Assisted + +Value: +- `0`: Change management for this resource is not subject to this enabled strategy (**does** consider hierarchy based disable). +- `1`: Change management for this resource is directly subject to this enabled strategy. +- `2`: Change management for this resource is indirectly subject to this enabled strategy (i.e. only via control-plane override hierarchy). +- `3`: Change management for this resource is directly and indirectly subject to this enabled strategy. + +#### Change Management Status +Each resource which exposes a `.spec.changeManagement` stanza must also expose `.status.changeManagement` . + +```yaml +status: + changeManagement: + # Always show control-plane level strategy. Disabled if disabledUntil is true. + clusterStrategy: + # If this a worker-node related resource (e.g. MCP), show local strategy. Disabled if disabledUntil is true. + workerNodeStrategy: + # Show effective state. + effectiveState: + description: "Human readable message explaining how strategies & configuration are resulting in the effective state." + # The start of the next permissive window, taking into account the hierarchy. "N/A" for indefinite pause. + permitChangesETA: + changesPending: +``` + +#### Change Management Bypass Annotation +In some situations, it may be necessary for a MachineConfig to be applied regardless of the active change +management policy for a MachineConfigPool. In such cases, `machineconfiguration.openshift.io/bypass-change-management` +can be set to any non-empty string. The MCO will progress until MCPs which select annotated +MachineConfigs have all machines running with a desiredConfig containing that MachineConfig's current state. + +This annotation will be present on `00-master` to ensure that, once the CVO updates the MachineConfig, +the remainder of the control-plane update will be treated as a single material change. + +### Special Handling +These cases are mentioned or implied elsewhere in the enhancement documentation, but they deserve special +attention. 

#### Change Management on Master MachineConfigPool
In order to allow control-plane updates as a single material change, the MCO will only honor the change management
configuration for the master MachineConfigPool if user generated MachineConfigs are the cause for a pending change. To accomplish this,
at least one MachineConfig updated by the CVO will have the `machineconfiguration.openshift.io/bypass-change-management` annotation
indicating that changes in the MachineConfig must be acted upon irrespective of the master MCP change management policy.

#### Limiting Overlapping Window Search / Permissive Window Calculation
An operator implementing change management for a worker-node related resource must honor the change management hierarchy when
calculating when the next permissive window will occur (referred to elsewhere in this document as the ETA). This is not
straightforward to compute when both the control-plane and worker-nodes have independent MaintenanceSchedule
configurations.

We can, however, simplify the process by reducing the number of days in the future the algorithm must search for
coinciding permissive windows. 1000 days is a proposed cut-off.

To calculate coinciding windows, the implementation can use [rrule-go](https://github.com/teambition/rrule-go)
to iteratively find permissive windows at the cluster / control-plane level. These can be added to an
[interval-tree](https://github.com/rdleal/intervalst). As dates are added, rrule calculations for the worker-nodes
can be performed. The interval-tree should be able to efficiently determine whether there is an
intersection between the permissive intervals it has stored for the control-plane and the time range tested for the
worker-nodes.

Since it is possible there is no overlap, limits must be placed on this search. Once dates >1000 days from
the present moment are being tested, the operator can behave as if the next window will occur in
1000 days (prevents infinite search for overlap).

This outcome does not need to be recomputed unless the operator restarts or one of the RRULEs involved
is modified.

If an overlap _is_ found, no additional intervals need to be added to the tree and it can be discarded.
The operator can store the start & end datetimes for the overlap and count down the seconds remaining
until it occurs. Obviously, this calculation must be repeated:
1. If either MaintenanceSchedule configuration is updated.
1. The operator is restarted.
1. At the end of a permissive window, in order to determine the next permissive window.


#### Service Delivery Option Sanitization
It is obvious that the range of flexible options provided by change management configurations
can create risks for inexperienced cluster lifecycle administrators. For example, setting a
standalone cluster to use the Assisted strategy and failing to trigger worker-node updates will
leave unpatched CVEs on worker-nodes much longer than necessary. It will also eventually lead to
the need to resolve version skew (Upgradeable=False will be reported by the API cluster operator).

Service Delivery understands that expose the full range of options to cluster
lifecycle administrators could dramatically increase the overhead of managing their fleet. To
prevent this outcome, Service Delivery will only expose a subset of the change management
strategies. They will also implement sanitization of the configuration options a use can
supply to those strategies. 
For example, a simplified interface in OCM for building a
limited range of RRULEs that are compliant with Service Delivery's update policies.

### Risks and Mitigations

- Given the range of operators which must implement support for change management, inconsistent behavior or reporting may make it difficult for users to navigate different profiles.
  - Mitigation: A shared library should be created and vendored for RRULE/exclude/next window calculations/metrics.
- Users familiar with the fully self-managed nature of OpenShift may be confused by the lack of material changes being initiated when change management constraints are active.
  - Mitigation: The introduction of change management will not change the behavior of existing clusters. Users must make a configuration change.
- Users may put themselves at risk of CVEs by being too conservative with worker-node updates.
- Users leveraging change management may be more likely to reach unsupported kubelet skew configurations vs fully self-managed cluster management.

### Drawbacks

The scope of the enhancement - cutting across several operators - requires multiple, careful implementations. The enhancement
also touches code paths that have been refined for years which assume a fully self-managed cluster approach. Upsetting these
code paths may prove challenging.

## Open Questions [optional]

1. Can the HyperShift Operator expose a metric expose a metric for when changes are pending for a subset of worker nodes on the cluster if it can only interact via CAPI resources?
2. Can the MCO interrogate the ClusterVersion change management configuration in order to calculate overlapping permissive intervals in the future?

## Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
- What additional testing is necessary to support managed OpenShift service-based offerings?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

## Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:

- Maturity levels
The API extensions will be made to existing, stable APIs. `changeManagement` is an optional
field in the resources which bear it and so does not break backwards compatibility.

The lack of a change management field implies the Disabled strategy - which ensures
the existing, fully self-managed update behaviors are not constrained. That is,
unless a change management strategy is configured, the behavior of existing clusters
will not be affected. 
+ +### Dev Preview -> Tech Preview + +- Ability to utilize the enhancement end to end +- End user documentation, relative API stability +- Sufficient test coverage +- Gather feedback from users rather than just developers +- Enumerate service level indicators (SLIs), expose SLIs as metrics +- Write symptoms-based alerts for the component(s) + +### Tech Preview -> GA + +- More testing (upgrade, downgrade, scale) +- Sufficient time for feedback +- Available by default +- Backhaul SLI telemetry +- Document SLOs for the component +- Conduct load testing +- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +**For non-optional features moving to GA, the graduation criteria must include +end to end tests.** + +### Removing a deprecated feature + +- The `MachineConfigPool.spec.pause` can begin the deprecation process. Change Management strategies allow for a superset of its behaviors. +- We may consider deprecating `HostCluster.spec.pausedUntil`. HyperShift may consider retaining it with the semantics of pausing all reconciliation with CAPI resources vs just pausing material changes per the change management contract. + +## Upgrade / Downgrade Strategy + +Operators implementing support for change management will carry forward their +existing upgrade and downgrade strategies. + +## Version Skew Strategy + +Operators implementing change management for their resources will not face any +new _internal_ version skew complexities due to this enhancement, but change management +does increase the odds of prolonged and larger differential kubelet version skew. + +For example, particularly given the Manual or Assisted change management strategy, it +becomes easier for a cluster lifecycle administrator to forget to update worker-nodes +along with updates to the control-plane. + +At some point, this will manifest as the kube-apiserver presenting as Upgradeable=False, +preventing future control-plane updates. To reduce the prevalence of this outcome, +the additional responsibilities of the cluster lifecycle administrator when +employing change management strategies must be clearly documented along with SOPs +from recovering from skew issues. + +HyperShift does not have any integrated skew mitigation strategy in place today. HostedCluster +and NodePool support independent release payloads being configured and a cluster lifecycle +administrator can trivially introduce problematic skew by editing these resources. HyperShift +documentation warns against this, but we should expect a moderate increase in the condition +being reported on non-managed clusters (OCM can prevent this situation from arising by +assessing telemetry for a cluster and preventing additional upgrades while worker-node +configurations are inconsistent with the API server). + +## Operational Aspects of API Extensions + +The API extensions proposed by this enhancement should not substantially increase +the scope of work of operators implementing the change management support. The +operators will interact with the same underlying resources/CRDs but with +constraints around when changes can be initiated. As such, no significant _new_ +operational aspects are expected to be introduced. + +## Support Procedures + +Change management problems created by faulty implementations will need to be resolved by +analyzing operator logs. The operator responsible for a given resource will vary. Existing +support tooling like must-gather should capture the information necessary to understand +and fix issues. 
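As a rough starting point (assuming the proposed `status.changeManagement` stanza described in this
enhancement; the commands themselves are standard tooling, not new verbs), a support engineer might
gather context along these lines:

```sh
# Inspect the proposed change management status on the resources involved.
oc get clusterversion version -o yaml
oc get machineconfigpool worker -o jsonpath='{.status.changeManagement}'

# Review the logs of the operator that owns the affected resource (MCO shown here).
oc logs -n openshift-machine-config-operator deployment/machine-config-controller

# Capture state for offline analysis.
oc adm must-gather
```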
+ +Change management problems where user expectations are not being met are designed to +be informed by the detailed `status` provided by the resources bearing the `changeManagement` +stanza in their `spec`. + +## Alternatives + +### Implement maintenance schedules via an external control system (e.g. ACM) +We do not have an offering in this space. ACM is presently devoted to cluster monitoring and does +not participate in cluster lifecycle. + +### Do not separate control-plane and worker-node updates into separate phases +As separating control-plane and worker-node updates into separate phases is an important motivation for this +enhancement, we could abandon this strategic direction. Reasons motivating this separation are explained +in depth in the motivation section. + +### Separate control-plane and worker-node updates into separate phases, but do not implement the change control concept +As explained in the motivation section, there is a concern that implementing this separation without +maintenance schedules will double the perceived operational overhead of OpenShift updates. + +This also creates work for our Service Delivery team without any platform support. + +### Separate control-plane and worker-node updates into separate phases, but implement a simpler MaintenanceSchedule strategy +We could implement change control without `disabledUntil`, `pausedUntil`, `exclude`, and perhaps more. However, +it is risky to impose a single opinionated workflow onto the wide variety of consumers of the platform. The workflows +described in this enhancement are not intended to be exotic or contrived but situations in which flexibility +in our configuration can achieve real world, reasonable goals. + +`disabledUntil` is designed to support our Service Delivery team who, on occasion, will need +to be able to bypass configured change controls. The feature is easy to use, does not require +deleting or restoring customer configuration (which may be error-prone), and can be safely +"forgotten" after being set to a date in the future. + +`pausedUntil`, among other interesting possibilities, offers a cluster lifecycle administrator the ability +to stop a problematic update from unfolding further. You may have watched a 100 node +cluster roll out a bad configuration change without knowing exactly how to stop the damage +without causing further disruption. This is not a moment when you want to be figuring out how to format +a date string, calculating timezones, or copying around cluster configuration so that you can restore +it after you stop the bleeding. + +### Implement change control, but do not implement the Manual and/or Assisted strategy for MachineConfigPool +Major enterprise users of our software do not update on a predictable, recurring window of time. Updates +require laborious certification processes and testing. Maintenance schedules will not serve these customers +well. However, these customers may still benefit significantly from the change management concept -- +unexpected / disruptive worker node drains and reboots have bitten even experienced OpenShift operators +(e.g. a new MachineConfig being contributed via gitops). + +These strategies inform decision-making through metrics and provide facilities for fine-grained control +over exactly when material change is rolled out to a cluster. + +The Assisted strategy is also specifically designed to provide a foundation for +the forthcoming `oc adm update worker-nodes` verbs. 
After separating the control-plane and +worker-node update phases, these verbs are intended to provide cluster lifecycle administrators the +ability to easily start, pause, cancel, and even rollback worker-node changes. + +Making accommodations for these strategies should be a subset of the overall implementation +of the MaintenanceSchedule strategy and they will enable a foundation for a range of +different persons not served by MaintenanceSchedule. + +### Use CRON instead of RRULE +The CRON specification is typically used to describe when something should start and +does not imply when things should end. CRON also cannot, in a standard way, +express common semantics like "The first Saturday of every month." + +### Use a separate CRD instead of updating ClusterVersion, MCP, ... +In this alternative, we introduce a CRD separate from ClusterVersion, MCP, HostedCluster, and NodePool. For example +an independent `UpdatePolicy` CRD where administrator preferences can be captured. This approach [was explored](https://github.com/jupierce/oc-update-design-ideas/commit/a6364ee2f2c1ebf84ed6d50bc277f9736bf793bd). +Ultimately, it felt less intuitive across profiles. Consider a CRD called `UpdatePolicy` that tries to be a central config. + +1. Is it namespaced? If it is, that feels odd for an object that that should control the cluster. If it is not namespaced, the policy feels misplaced for HCP HostedCluster resources which live in a namespace. +1. Where does someone check the status of the policy (e.g. when the next update is going to be possible on a given MCP?). If it is the UpdatePolicy object, you have multiple independent controllers controlling the status, which is an antipattern. If the UpdatePolicy controls multiple different MCPs differently, how do they independently report their status'? It also introduces the problem of having to look at multiple places (MCP and UpdatePolicy to understand what may be happening. +1. As you pack policies for control-plane, MCPs, HCP primitives into an expressive UpdatePolicy, the schema options varied from too complex, to error prone, to overly abstract, to overly limiting. +1. If you split the policy object up to simplify it, e.g. one for each MCP, you have the problem of associating it to the right MCP unambiguously. Planting the policy in the MCP itself solves this problem. + +In summary, it is possible, but the working group didn't believe the alternative was as elegant or user friendly as housing the policies directly in the resources they control. \ No newline at end of file From e4fe894f311a5eddb1f69165cdb5d981ed0c88bf Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Tue, 14 May 2024 19:53:40 -0400 Subject: [PATCH 04/12] James review --- ...ge-management-and-maintenance-schedules.md | 50 ++++++++++++++++--- 1 file changed, 42 insertions(+), 8 deletions(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index 36db623303..c3c48fbb88 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -160,6 +160,12 @@ and intuitive method of deferring worker-node updates: not initiating them. Leav discretion, within safe skew-bounds, gives them the flexibility to make the right choices for their unique circumstances. +It should also be noted that Service Delivery does not permit customers to directly modify machine config +pools. 
This means that the existing machine config pool based pause is not directly available. It is +not exposed via OCM either. This enhancement seeks to create high level abstractions supporting the +separation of control-plane and worker-nodes that will be straight-forward and intuitive options +to expose through OCM. + #### Enhancing Operational Control The preceding section delved deeply into a motivation for Change Management / Maintenance Schedules based on our desire to separate control-plane and worker-node updates without increasing operational burden on end-users. However, @@ -183,9 +189,12 @@ MachineConfigPool to limit the impact of that merge to a particular time window significant forethought by the user. Even with that forethought, if an enterprise wants changes to only be applied during weekends, additional custom mechanics would need to be employed to ensure the change merged during the weekend without needing someone present. +Even this approach is unavailable to our managed services customers who are restricted +from modifying machine config pool directly. -Contrast this complexity with the user setting a Change Management / Maintenance Schedule on the cluster. The user -is then free to merge configuration changes and gitops can apply those changes to OpenShift +Contrast this complexity with the user setting a Change Management / Maintenance Schedule +on the cluster (or indirectly via OCM when Service Delivery exposes the option for managed clusters). +The user is then free to merge configuration changes and gitops can apply those changes to OpenShift resources, but material change to the cluster will not be initiated until a time permitted by the Maintenance Schedule. Users do not require special insight into the implications of configuring platform resources as the top-level Maintenance Schedule control will help ensure @@ -378,7 +387,9 @@ should be updated to proxy that status information to the end users. ### Change Management Metrics Cluster wide change management information will be made available through cluster metrics. Each resource containing the stanza must expose the following metrics: -- The number of seconds until the next known permitted change window. 0 if changes can currently be initiated. -1 if changes are paused indefinitely. -2 if no permitted window can be computed. +- The number of seconds until the next known permitted change window. See `cm_change_eta` metric. +- The number of seconds until the current change window closes. See `cm_change_remaining` metric. +- The last datetime at which changes were permitted (can be nil). See `cm_change_last` metric (which represents this as seconds instead of a datetime). - Whether any change management strategy is enabled. - Which change management strategy is enabled. - If changes are pending due to change management controls. @@ -593,7 +604,7 @@ perspective, this strategy reports as paused indefinitely. 1. The customer chooses immediate application. 1. SRE applies a change to the relevant control-plane AND worker-node resource's `changeManagement` stanza (both must be changed because of the change management hierarchy), setting `disabledUntil` to - a time 48 hours in the future. The configured change management schedule is ignored for 48 as the system + a time 48 hours in the future. The configured change management schedule is ignored for 48 hours as the system initiates all necessary node changes. 
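For illustration, the edit described in the final step might look like the sketch below on each affected
resource (the timestamp is an arbitrary example roughly 48 hours in the future; per the change management
hierarchy, both the control-plane and worker-node resources receive the same override):

```yaml
spec:
  changeManagement:
    strategy: MaintenanceSchedule
    # Temporarily treat change management as Disabled so the remediation rolls out immediately.
    # The override expires on its own; the configured schedule does not need to be restored by hand.
    disabledUntil: "2024-06-01T19:00:00Z"
```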
#### Service Delivery Deferred Remediation @@ -797,10 +808,10 @@ permissive 24h window unless a date is in the `exclude` ranges. **RRULE Constraints** A valid RRULE for change management: -- must identify a date, so, although RRULE supports `FREQ=HOURLY`, it will not be supported. +- must identify a date, so, although RRULE supports `FREQ=HOURLY`, it will be rejected if an attempt it made to use it. - cannot specify an end for the pattern. `RRULE:FREQ=DAILY;COUNT=3` suggests an event that occurs every day for three days only. As such, neither `COUNT` nor `UNTIL` is - supported. + supported and will be rejected if an attempt is made to use them. - cannot specify a permissive window more than 2 years away. **Overview of Interactions** @@ -886,6 +897,29 @@ Value: - `0`: Any pending changes can be initiated now (e.g. change management is disabled or inside machine schedule window). - `> 0`: The number seconds remaining until changes can be initiated OR 1000*24*60*60 (1000 days) if no permissive window can be found within the next 1000 days (this ensures a brute force check of intersecting datetimes with hierarchy RRULEs is a valid method of calculating intersection). +`cm_change_remaining` +Labels: +- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool +- object= +- system= + +Value: +- `-2`: Error determining the time at which current permissive window will close. +- `-1`: Material changes are permitted indefinitely (e.g. `strategy: disabled`). +- `0`: Material changes are not presently permitted (i.e. the cluster is outside of a permissive window). +- `> 0`: The number seconds remaining in the current permissive change window (or the equivalent of 1000 days if end of window cannot be computed). + +`cm_change_last` +Labels: +- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool +- object= +- system= + +Value: +- `-1`: Datetime unknown. +- `0`: Material changes are currently permitted. +- `> 0`: The number of seconds which have elapsed since the material changes were last permitted. + `cm_strategy_enabled` Labels: - kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool @@ -974,10 +1008,10 @@ standalone cluster to use the Assisted strategy and failing to trigger worker-no leave unpatched CVEs on worker-nodes much longer than necessary. It will also eventually lead to the need to resolve version skew (Upgradeable=False will be reported by the API cluster operator). -Service Delivery understands that expose the full range of options to cluster +Service Delivery understands that exposing the full range of options to cluster lifecycle administrators could dramatically increase the overhead of managing their fleet. To prevent this outcome, Service Delivery will only expose a subset of the change management -strategies. They will also implement sanitization of the configuration options a use can +strategies. They will also implement sanitization of the configuration options a user can supply to those strategies. For example, a simplified interface in OCM for building a limited range of RRULEs that are compliant with Service Delivery's update policies. 
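To make that sanitization concrete, the sketch below shows the kind of pre-admission check OCM (or a
validating webhook) might run against a proposed recurrence rule. It reuses the rrule-go library referenced
elsewhere in this enhancement; the specific constraints encoded here (rejecting sub-daily frequencies,
COUNT/UNTIL, and rules with no occurrence in the next two years) are illustrative assumptions rather than a
settled Service Delivery policy, and the package and function names are hypothetical:

```go
package sanitize

import (
	"fmt"
	"strings"
	"time"

	"github.com/teambition/rrule-go"
)

// ValidateRecurrence applies illustrative change management constraints to a candidate RRULE string.
func ValidateRecurrence(recurrence string, now time.Time) error {
	rule, err := rrule.StrToRRule(recurrence)
	if err != nil {
		return fmt.Errorf("invalid RRULE: %w", err)
	}
	upper := strings.ToUpper(recurrence)
	// The rule must identify dates, not times within a day.
	for _, freq := range []string{"FREQ=HOURLY", "FREQ=MINUTELY", "FREQ=SECONDLY"} {
		if strings.Contains(upper, freq) {
			return fmt.Errorf("sub-daily frequencies are not permitted")
		}
	}
	// The rule must not terminate on its own.
	if strings.Contains(upper, "COUNT=") || strings.Contains(upper, "UNTIL=") {
		return fmt.Errorf("COUNT and UNTIL are not permitted")
	}
	// There must be a permissive window in the foreseeable future.
	next := rule.After(now, true)
	if next.IsZero() || next.After(now.AddDate(2, 0, 0)) {
		return fmt.Errorf("no permissive window within the next two years")
	}
	return nil
}
```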
From 89167f1a399278131a80ddf25b88a413f8364071 Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Thu, 13 Jun 2024 18:25:34 -0400 Subject: [PATCH 05/12] Joel review --- ...ge-management-and-maintenance-schedules.md | 60 ++++++++++--------- 1 file changed, 31 insertions(+), 29 deletions(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index c3c48fbb88..da445293a5 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -55,7 +55,7 @@ the need to fix critical bugs or update a cluster to supported software versions it should be possible to safely leave changes pending indefinitely. That said, Service Delivery and/or higher level management systems may choose to prevent such problematic change management settings from being applied by using -validating webhooks. +validating webhooks or admission policies. ## Motivation This enhancement is designed to improve user experience during the OpenShift @@ -178,7 +178,8 @@ Since these are such well established principles, I will summarize the motivatio OpenShift meet industry standard expectations with respect to limiting potentially disruptive change outside well planned time windows. -It could be argued that rigorous and time sensitive management of OpenShift cluster API resources could prevent +It could be argued that rigorous and time sensitive management of OpenShift API resources +(e.g. ClusterVersion, MachineConfigPool, HostedCluster, NodePool, etc.) could prevent unplanned material changes, but Change Management / Maintenance Schedules introduce higher level, platform native, and more intuitive guard rails. For example, consider the common pattern of a gitops configured OpenShift cluster. If a user wants to introduce a change to a MachineConfig, it is simple to merge a change to the @@ -246,7 +247,7 @@ risks and disruption when rolling out changes to their environments. > like to be able to pause that change manually so that the team will not have to work on the weekend." > "As a cluster lifecycle administrator, I need to stop all material changes on my cluster -> quickly and indefinitely until I can understand a potential issue. I not want to consider dates or +> quickly and indefinitely until I can understand a potential issue. I do not want to consider dates or > timezones in this delay as they are not known and irrelevant to my immediate concern." > "As a cluster lifecycle administrator, I want to ensure any material changes to my @@ -343,8 +344,8 @@ Each strategy may require an additional configuration element within the stanza. spec: changeManagement: strategy: "MaintenanceSchedule" - pausedUntil: false - disabledUntil: false + pausedUntil: "false" + disabledUntil: "false" config: maintenanceSchedule: ..options to configure a detailed policy for the maintenance schedule.. @@ -370,12 +371,12 @@ is always permissive -- allowing material changes to be initiated (see [Change M Hierarchy](#change-management-hierarchy) for caveats). All change management strategies, except `Disabled`, are subject to the following `changeManagement` fields: -- `changeManagement.disabledUntil: `: When `disabledUntil: true` or `disabledUntil: `, the interpreted strategy for +- `changeManagement.disabledUntil: ""`: When `disabledUntil: "true"` or `disabledUntil: ""`, the interpreted strategy for change management in the resource is `Disabled`. 
Setting a future date in `disabledUntil` offers a less invasive (i.e. no important configuration needs to be changed) method to disable change management constraints (e.g. if it is critical to roll out a fix) and a method that does not need to be reverted (i.e. it will naturally expire after the specified date and the configured change management strategy will re-activate). -- `changeManagement.pausedUntil: `: Unless the effective active strategy is Disabled, `pausedUntil: true` or `pausedUntil: `, change management must +- `changeManagement.pausedUntil: ""`: Unless the effective active strategy is Disabled, `pausedUntil: "true"` or `pausedUntil: ""`, change management must pause material changes. ### Change Management Status @@ -387,12 +388,13 @@ should be updated to proxy that status information to the end users. ### Change Management Metrics Cluster wide change management information will be made available through cluster metrics. Each resource containing the stanza must expose the following metrics: -- The number of seconds until the next known permitted change window. See `cm_change_eta` metric. -- The number of seconds until the current change window closes. See `cm_change_remaining` metric. -- The last datetime at which changes were permitted (can be nil). See `cm_change_last` metric (which represents this as seconds instead of a datetime). - Whether any change management strategy is enabled. -- Which change management strategy is enabled. -- If changes are pending due to change management controls. +- Which change management strategy is enabled. This can be used not notify SRE when a cluster begins using a non-standard strategy (e.g. during emergency corrective action). +- The number of seconds until the next known permitted change window. See `change_management_next_change_eta` metric. This might be used to notify an SRE team of an approaching permissive window. +- The number of seconds until the current change window closes. See `change_management_permissive_remaining` metric. +- The last datetime at which changes were permitted (can be nil). See `change_management_last_change` metric (which represents this as seconds instead of a datetime). This could be used to notify an SRE team if a cluster has not had the opportunity to update for a non-compliant period. +- If changes are pending due to change management controls. When combined with other metrics (`change_management_next_change_eta`, `change_management_permissive_remaining`), this can be used to notify SRE when an upcoming permissive window is going to initiate changes and whether changes are still pending as a permissive window closes. + ### Change Management Hierarchy Material changes to worker-nodes are constrained by change management policies in their associated resource AND @@ -643,14 +645,14 @@ perspective, this strategy reports as paused indefinitely. 1. The user determines the most recent rendered worker configuration. They configure the `manual` change management policy to use that exact configuration as the `desiredConfig`. 1. The MCO is thus being asked to ignite any new node or rebooted node with the desired configuration, but it - is **not** being permitted to apply that configuration to existing nodes because it is change management, in effect, + is **not** being permitted to apply that configuration to existing nodes because change management, in effect, is paused indefinitely by the manual strategy. 1. 
The MCO metric for the MCP indicating the number of seconds remaining until changes can be initiated is `-1` - indicating that there is presently no time in the future where it will initiate material changes. The operations team has an alert configured if this value `!= -1`. 1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running the most recently rendered configuration. This is irrespective of the `desiredConfig` in the `manual` - policy. Abstractly, it means, if change management were disabled, whether changes be initiated. + policy. Abstractly, it means, if change management were disabled, whether changes would be initiated. 1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster. As they come back online, the MachineConfigServer offers them the desiredConfig requested by the manual policy. 1. After updating all nodes, the cluster lifecycle administrator does not need make any additional @@ -662,18 +664,18 @@ perspective, this strategy reports as paused indefinitely. Instead, updates are negotiated and planned far in advance. 1. The cluster workloads are not HA and unplanned drains are considered a business risk. 1. To prevent surprises, the cluster lifecycle administrator sets the Assisted strategy on the worker MCP. -1. In the `assisted` strategy change management policy, the lifecycle administrator configures `pausedUntil: true` +1. In the `assisted` strategy change management policy, the lifecycle administrator configures `pausedUntil: "true"` and the most recently rendered worker configuration in the policy's `renderedConfigsBefore: `. 1. The MCO is being asked to ignite any new node or any rebooted node with the latest rendered configuration - before the present datetime. However, because of `pausedUntil: true`, it is also being asked not to + before the present datetime. However, because of `pausedUntil: "true"`, it is also being asked not to automatically initiate that material change for existing nodes. 1. The MCO metric for the MCP indicating the number of seconds remaining until changes can be initiated is `-1` - indicating that there is presently no time in the future where it will initiate material changes. The operations team has an alert configured if this value `!= -1`. 1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running the most recent, rendered configuration. This is irrespective of the `renderedConfigsBefore` in the `assisted` - configuration. Abstractly, it means, if change management were disabled, whether changes be initiated. -1. When the lifecycle administrator is ready to permit disruption, they set `pausedUntil: false`. + configuration. Abstractly, it means, if change management were disabled, whether changes would be initiated. +1. When the lifecycle administrator is ready to permit disruption, they set `pausedUntil: "false"`. 1. The MCO sets the number of seconds until changes are permitted to `0`. 1. The MCO begins to initiate worker node updates. This rollout abides by documented OpenShift constraints such as the MachineConfigPool `maxUnavailable` setting. @@ -736,12 +738,12 @@ spec: # The active strategy for change management (unless disabled by disabledUntil). strategy: - # If set to true or a future date, the effective change management strategy is Disabled. Date - # must be RFC3339. + # If set to "true" or a future date (represented as string), the effective change + # management strategy is Disabled. 
Date must be RFC3339. disabledUntil: - # If set to true or a future date, all strategies other than Disabled are paused. Date - # must be RFC3339. + # If set to "true" or a future date (represented as string), all strategies other + # than Disabled are paused. Date must be RFC3339. pausedUntil: # If a strategy needs additional configuration information, it can read a @@ -873,7 +875,7 @@ is always paused state. #### Metrics -`cm_change_pending` +`change_management_change_pending` Labels: - kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool - object= @@ -885,7 +887,7 @@ Value: - `2`: changes are pending and blocked based on this resource's change management policy. - `3`: changes are pending and blocked based on another resource in the change management hierarchy. -`cm_change_eta` +`change_management_next_change_eta` Labels: - kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool - object= @@ -897,7 +899,7 @@ Value: - `0`: Any pending changes can be initiated now (e.g. change management is disabled or inside machine schedule window). - `> 0`: The number seconds remaining until changes can be initiated OR 1000*24*60*60 (1000 days) if no permissive window can be found within the next 1000 days (this ensures a brute force check of intersecting datetimes with hierarchy RRULEs is a valid method of calculating intersection). -`cm_change_remaining` +`change_management_permissive_remaining` Labels: - kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool - object= @@ -909,7 +911,7 @@ Value: - `0`: Material changes are not presently permitted (i.e. the cluster is outside of a permissive window). - `> 0`: The number seconds remaining in the current permissive change window (or the equivalent of 1000 days if end of window cannot be computed). -`cm_change_last` +`change_management_last_change` Labels: - kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool - object= @@ -920,7 +922,7 @@ Value: - `0`: Material changes are currently permitted. - `> 0`: The number of seconds which have elapsed since the material changes were last permitted. -`cm_strategy_enabled` +`change_management_strategy_enabled` Labels: - kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool - object= @@ -939,9 +941,9 @@ Each resource which exposes a `.spec.changeManagement` stanza must also expose ` ```yaml status: changeManagement: - # Always show control-plane level strategy. Disabled if disabledUntil is true. + # Always show control-plane level strategy. Disabled if disabledUntil is "true". clusterStrategy: - # If this a worker-node related resource (e.g. MCP), show local strategy. Disabled if disabledUntil is true. + # If this a worker-node related resource (e.g. MCP), show local strategy. Disabled if disabledUntil is "true". workerNodeStrategy: # Show effective state. 
effectiveState: From ec9eee7d34fea72352fee695704cc598a5e5e8c1 Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Fri, 14 Jun 2024 17:03:02 -0400 Subject: [PATCH 06/12] Adding terminology section --- ...ge-management-and-maintenance-schedules.md | 52 ++++++++++++++++++- 1 file changed, 50 insertions(+), 2 deletions(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index da445293a5..3302cf9049 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -57,6 +57,47 @@ Service Delivery and/or higher level management systems may choose to prevent such problematic change management settings from being applied by using validating webhooks or admission policies. +## Definitions and Reference + +**RRULE** +RRULE, or "Recurrence Rule", is an RFC https://icalendar.org/RFC-Specifications/iCalendar-RFC-5545/ +commonly used to express reoccurring windows of time. Consider a calendar invite for a meeting that +should occur on the last Friday of every month. RRULE can express this as `FREQ=MONTHLY;INTERVAL=1;BYDAY=-1FR`. +While commonly employed for calendar data interchange, it is used in this enhancement to allow users +the ability to specify maintenance schedules. +Tools for generating RRULES: +- Simple: https://icalendar.org/rrule-tool.html +- Complex: https://exontrol.com/exicalendar.jsp?config=/js#calendar + +**Change Management Terminology** +This document uses unique terms to describe the key aspects of change management. +It is worth internalizing the meaning of these terms before reviewing sections of the document. +- "Material Change". A longer definition is provided in the Summary, but in short, any configuration + change a platform operator wishes to apply which would necessitate the reboot or replacement of one + or more nodes is considered a material change. For example, updating the control-plane version is + a material change as it requires rebooting master nodes. +- "Enabled" / "Disabled". Change management can be enabled or disabled through various configuration options. + When "Disabled" via any of those options, change management is not active and any pending material changes + will be applied. Existing versions of OpenShift, without this enhancement, are, conceptually, + always running with change management disabled. +- "Paused" / "Unpaused" are enforcement states _when change management is enabled_. "Paused" means that + material changes will be deferred / left pending. "Unpaused" means that pending material changes + can be applied. + "Disabled" supersedes "Paused". In other words, if change management is disabled, it does not matter + if a configured strategy would be enforcing a change pause or not -- because that disabled strategy + is not being considered. +- "Permissive". When change management is disabled xor enabled & unpaused, the cluster is described to be in + a permissive window. This is another way to say that material changes can be applied. When change + management is enabled & paused, material changes will be left in a pending state and the cluster + is not in a permissive window. +- "Strategy". There are multiple change management strategies proposed. Each informs a different behavior + for a controller to pause and unpause changes. "Disabled" is a special strategy that means change management + is indefinitely disabled -- meaning material changes can be freely applied. 
This will be the default + strategy for OpenShift for the foreseeable future in order to provide consistent behavior with past versions. +- "Maintenance Schedule" is one type of change management strategy. When enabled, based on a recurrence + rule (RRULE) and exclusion periods, a controller will pause or unpause changes according to the + current datetime. + ## Motivation This enhancement is designed to improve user experience during the OpenShift upgrade process and other key operational moments when configuration updates @@ -963,8 +1004,6 @@ This annotation will be present on `00-master` to ensure that, once the CVO upda the remainder of the control-plane update will be treated as a single material change. ### Special Handling -These cases are mentioned or implied elsewhere in the enhancement documentation, but they deserve special -attention. #### Change Management on Master MachineConfigPool In order to allow control-plane updates as a single material change, the MCO will only honor change the management configuration for the @@ -1017,6 +1056,15 @@ strategies. They will also implement sanitization of the configuration options a supply to those strategies. For example, a simplified interface in OCM for building a limited range of RRULEs that are compliant with Service Delivery's update policies. +#### Node Disruption Policy +https://github.com/openshift/enhancements/pull/1525 describes the addition of `nodeDisruptionPolicy` +to `MachineConfiguration`. Through this configuration, an administrator can convey that +a configuration should not trigger a node to be rebooted. + +When `nodeDisruptionPolicy` indicates that a `MachineConfiguration` should not trigger +a node reboot, it becomes a non-material change from the perspective of a maintenance +schedule. In other words, it can be applied immediately, even outside a permissive window. + ### Risks and Mitigations - Given the range of operators which must implement support for change management, inconsistent behavior or reporting may make it difficult for users to navigate different profiles. From f24a7d62b13ba86c061dd6bfe60ccf6605abd11e Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Mon, 17 Jun 2024 17:15:11 -0400 Subject: [PATCH 07/12] Add maxUnavailable note --- .../update/change-management-and-maintenance-schedules.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index 3302cf9049..0b67eb4aa8 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -490,7 +490,8 @@ This strategy is motivated by the desire to support the separation of control-pl conceptually for users and in real technical terms. One way to do this for users who do not benefit from the `MaintenanceSchedule` strategy is to ask them to initiate, pause, and resume the rollout of material changes to their worker nodes. Contrast this with the fully self-managed state today, where worker-nodes -(normally) begin to be updated automatically and directly after the control-plane update. +(normally) begin to be updated automatically and directly after the control-plane update (subject to constraints +like `maxUnavailable` in each MachineConfigPool). 
Clearly, if this was the only mode of updating worker-nodes, we could never successfully disentangle the concepts of control-plane vs worker-node updates in Standalone environments since one implies the other. From b31fd6341468c5081ab10e9729500cc66ce21192 Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Thu, 4 Jul 2024 11:22:58 -0400 Subject: [PATCH 08/12] Typos & clarifications --- .../update/change-management-and-maintenance-schedules.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index 0b67eb4aa8..4d7efc35d7 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -358,6 +358,7 @@ risks and disruption when rolling out changes to their environments. 4. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement. 5. Providing strict promises around the exact timing of upgrade processes. Maintenance schedules will be honored to a reasonable extent (e.g. upgrade actions will only be initiated during a window), but long-running operations may exceed the configured end of a maintenance schedule. 6. Implementing logic to defend against impractical maintenance schedules (e.g. if a customer configures a 1-second maintenance schedule every year). Service Delivery may want to implement such logic to ensure upgrade progress can be made. +7. Automatically initiating updates to `ClusterVersion`. This will still occur through external actors/orchestration. Maintenance schedules simply give the assurance that changes to `ClusterVersion` will not result in material changes until permitted by the defined maintenance schedules. ## Proposal @@ -430,7 +431,7 @@ should be updated to proxy that status information to the end users. Cluster wide change management information will be made available through cluster metrics. Each resource containing the stanza must expose the following metrics: - Whether any change management strategy is enabled. -- Which change management strategy is enabled. This can be used not notify SRE when a cluster begins using a non-standard strategy (e.g. during emergency corrective action). +- Which change management strategy is enabled. This can be used to notify SRE when a cluster begins using a non-standard strategy (e.g. during emergency corrective action). - The number of seconds until the next known permitted change window. See `change_management_next_change_eta` metric. This might be used to notify an SRE team of an approaching permissive window. - The number of seconds until the current change window closes. See `change_management_permissive_remaining` metric. - The last datetime at which changes were permitted (can be nil). See `change_management_last_change` metric (which represents this as seconds instead of a datetime). This could be used to notify an SRE team if a cluster has not had the opportunity to update for a non-compliant period. @@ -479,7 +480,8 @@ material changes will be paused. This strategy indicates that no change management strategy is being enforced by the resource. It always implies that the enforcement state at the resource level is unpaused / permissive. This does not always mean that material changes are permitted due to change management hierarchies. 
For example, a MachineConfigPool -with `strategy: Disabled` would still be subject to a `strategy: MaintenanceStrategy` in the ClusterVersion resource. +with `strategy: Disabled` would still be subject to a `strategy: MaintenanceSchedule` in the ClusterVersion resource. +The impact of hierarchy should always be made clear in the change management status of the MachineConfigPool. #### Assisted Strategy - MachineConfigPool Minimally, this strategy will be supported by MachineConfigPool. If and when the strategy is supported by other @@ -1083,7 +1085,7 @@ code paths prove challenging. ## Open Questions [optional] -1. Can the HyperShift Operator expose a metric expose a metric for when changes are pending for a subset of worker nodes on the cluster if it can only interact via CAPI resources? +1. Can the HyperShift Operator expose a metric for when changes are pending for a subset of worker nodes on the cluster if it can only interact via CAPI resources? 2. Can the MCO interrogate the ClusterVersion change management configuration in order to calculate overlapping permissive intervals in the future? ## Test Plan From 1c098aebbdc25f18024413d5326404debdb748c3 Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Thu, 4 Jul 2024 16:51:52 -0400 Subject: [PATCH 09/12] Fix implementation detail regarding MCS --- ...ge-management-and-maintenance-schedules.md | 59 ++++++++++--------- 1 file changed, 30 insertions(+), 29 deletions(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index 4d7efc35d7..803b84b112 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -486,7 +486,8 @@ The impact of hierarchy should always be made clear in the change management sta #### Assisted Strategy - MachineConfigPool Minimally, this strategy will be supported by MachineConfigPool. If and when the strategy is supported by other change management capable resources, the configuration schema for the policy may differ as the details of -what constitutes and informs change varies between resources. +what constitutes and informs change varies between resources (e.g. `renderedConfigsBefore` informs the assisted +strategy configuration for a MachineConfigPool, but may be meaningless for other resources). This strategy is motivated by the desire to support the separation of control-plane and worker-node updates both conceptually for users and in real technical terms. One way to do this for users who do not benefit from the @@ -498,24 +499,28 @@ like `maxUnavailable` in each MachineConfigPool). Clearly, if this was the only mode of updating worker-nodes, we could never successfully disentangle the concepts of control-plane vs worker-node updates in Standalone environments since one implies the other. -In short (details will follow in the implementation section), the assisted strategy allows users to specify the -exact rendered [`desiredConfig` the MachineConfigPool](https://github.com/openshift/machine-config-operator/blob/5112d4f8e562a2b072106f0336aeab451341d7dc/docs/MachineConfigDaemon.md#coordinating-updates) should be advertising to the MachineConfigDaemon on -nodes it is associated with. Like the `MaintenanceSchedule` strategy, it also respects the `pausedUntil` -field. 
+In short (details will follow in the implementation section), the assisted strategy allows users to +constrain which rendered machineconfig [`desiredConfig` the MachineConfigPool](https://github.com/openshift/machine-config-operator/blob/0f53196f6480481d3f5a04e217a143a56d4db79e/docs/MachineConfig.md#final-rendered-machineconfig-object) +should be advertised to the MachineConfigDaemon on nodes it is associated with. Like the `MaintenanceSchedule` +strategy, it also respects the `pausedUntil` field. + +When using this strategy, estimated time related metrics are set to 0 (e.g. eta and remaining). #### Manual Strategy - MachineConfigPool Minimally, this strategy will be supported by MachineConfigPool. If and when the strategy is supported by other change management capable resources, the configuration schema for the policy may differ as the details of what constitutes and informs change varies between resources. -Like the Assisted strategy, this strategy is implemented to support the conceptual and technical separation +Like the assisted strategy, this strategy is implemented to support the conceptual and technical separation of control-plane and worker-nodes. The MachineConfigPool Manual strategy allows users to explicitly specify their `desiredConfig` to be used for ignition of new and rebooting nodes. While the Manual strategy is enabled, the MachineConfigOperator will not trigger the MachineConfigDaemon to drain or reboot nodes automatically. -Because the Manual strategy initiates changes on its own behalf, `pausedUntil` has no effect. From a metrics +Because the Manual strategy never initiates changes on its own behalf, `pausedUntil` has no effect. From a metrics perspective, this strategy reports as paused indefinitely. +When using this strategy, estimated time related metrics are set to 0 (e.g. eta and remaining). + ### Workflow Description #### OCM HCP Standard Change Management Scenario @@ -688,17 +693,16 @@ perspective, this strategy reports as paused indefinitely. 1. The cluster lifecycle administrator sends a company-wide notice about the period during which service may be disrupted. 1. The user determines the most recent rendered worker configuration. They configure the `manual` change management policy to use that exact configuration as the `desiredConfig`. -1. The MCO is thus being asked to ignite any new node or rebooted node with the desired configuration, but it - is **not** being permitted to apply that configuration to existing nodes because change management, in effect, - is paused indefinitely by the manual strategy. -1. The MCO metric for the MCP indicating the number of seconds remaining until changes can be initiated is `-1` - indicating - that there is presently no time in the future where it will initiate material changes. The operations team - has an alert configured if this value `!= -1`. +1. The MCC is thus being asked to ignite any new node or rebooted node with the desired configuration, but it + is **not** being permitted to initiate that configuration change on existing nodes because change management, in effect, + is paused indefinitely by the manual strategy. A new annotation `applyOnReboot` will be set on + nodes selected by the MachineConfigPool. The flag indicates to the MCD will that it should only + apply the configuration before a node is rebooted (vs initiating its own drain / reboot). 1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running the most recently rendered configuration. 
This is irrespective of the `desiredConfig` in the `manual` - policy. Abstractly, it means, if change management were disabled, whether changes would be initiated. -1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster. As they come back online, - the MachineConfigServer offers them the desiredConfig requested by the manual policy. + policy. Conceptually, it means, if change management were disabled, whether changes would be initiated. +1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster. + Before the node reboots, it takes the opportunity to pivot to the `desiredConfig`. 1. After updating all nodes, the cluster lifecycle administrator does not need make any additional configuration changes. They can leave the `changeManagement` stanza in their MCP as-is. @@ -707,24 +711,21 @@ perspective, this strategy reports as paused indefinitely. 1. There are no reoccurring windows of time when the cluster lifecycle administrator can tolerate downtime. Instead, updates are negotiated and planned far in advance. 1. The cluster workloads are not HA and unplanned drains are considered a business risk. -1. To prevent surprises, the cluster lifecycle administrator sets the Assisted strategy on the worker MCP. +1. To prevent surprises, the cluster lifecycle administrator sets the assisted strategy on the worker MCP. 1. In the `assisted` strategy change management policy, the lifecycle administrator configures `pausedUntil: "true"` and the most recently rendered worker configuration in the policy's `renderedConfigsBefore: `. 1. The MCO is being asked to ignite any new node or any rebooted node with the latest rendered configuration - before the present datetime. However, because of `pausedUntil: "true"`, it is also being asked not to + before the specified datetime. However, because of `pausedUntil: "true"`, it is also being asked not to automatically initiate that material change for existing nodes. -1. The MCO metric for the MCP indicating the number of seconds remaining until changes can be initiated is `-1` - indicating - that there is presently no time in the future where it will initiate material changes. The operations team - has an alert configured if this value `!= -1`. 1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running the most recent, rendered configuration. This is irrespective of the `renderedConfigsBefore` in the `assisted` - configuration. Abstractly, it means, if change management were disabled, whether changes would be initiated. + configuration. Conceptually, it means, if change management were disabled, whether changes would be initiated. 1. When the lifecycle administrator is ready to permit disruption, they set `pausedUntil: "false"`. -1. The MCO sets the number of seconds until changes are permitted to `0`. 1. The MCO begins to initiate worker node updates. This rollout abides by documented OpenShift constraints such as the MachineConfigPool `maxUnavailable` setting. 1. Though new rendered configurations may be created, the assisted strategy will not act until the assisted policy is updated to permit a more recent creation date. + ### API Extensions @@ -938,9 +939,9 @@ Labels: - system= Value: -- `-2`: Error determining the time at which changes can be initiated (e.g. cannot check with ClusterVersion / change management hierarchy). -- `-1`: Material changes are paused indefinitely. -- `0`: Any pending changes can be initiated now (e.g. 
change management is disabled or inside machine schedule window). +- `-2`: Error determining the time at which changes can be initiated (e.g. cannot check with ClusterVersion / change management hierarchy). +- `-1`: Material changes are paused indefinitely. +- `0`: Material changes can be initiated now (e.g. change management is disabled or inside machine schedule window). Alternatively, time is not relevant to the strategy (e.g. assisted strategy). - `> 0`: The number seconds remaining until changes can be initiated OR 1000*24*60*60 (1000 days) if no permissive window can be found within the next 1000 days (this ensures a brute force check of intersecting datetimes with hierarchy RRULEs is a valid method of calculating intersection). `change_management_permissive_remaining` @@ -952,7 +953,7 @@ Labels: Value: - `-2`: Error determining the time at which current permissive window will close. - `-1`: Material changes are permitted indefinitely (e.g. `strategy: disabled`). -- `0`: Material changes are not presently permitted (i.e. the cluster is outside of a permissive window). +- `0`: Material changes are not presently permitted (i.e. the cluster is outside of a permissive window). Alternatively, time is not relevant to the strategy (e.g. assisted strategy). - `> 0`: The number seconds remaining in the current permissive change window (or the equivalent of 1000 days if end of window cannot be computed). `change_management_last_change` @@ -964,7 +965,7 @@ Labels: Value: - `-1`: Datetime unknown. - `0`: Material changes are currently permitted. -- `> 0`: The number of seconds which have elapsed since the material changes were last permitted. +- `> 0`: The number of seconds which have elapsed since the material changes were last permitted or initiated (for non-time based strategies). `change_management_strategy_enabled` Labels: @@ -1048,7 +1049,7 @@ until it occurs. Obviously, this calculation must be repeated: #### Service Delivery Option Sanitization It is obvious that the range of flexible options provided by change management configurations offers can create risks for inexperienced cluster lifecycle administrators. For example, setting a -standalone cluster to use the Assisted strategy and failing to trigger worker-node updates will +standalone cluster to use the assisted strategy and failing to trigger worker-node updates will leave unpatched CVEs on worker-nodes much longer than necessary. It will also eventually lead to the need to resolve version skew (Upgradeable=False will be reported by the API cluster operator). From 5bd373fdef4682b82f232ab3a45c3d4067b8c09c Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Fri, 5 Jul 2024 17:06:37 -0400 Subject: [PATCH 10/12] updates from yuqi-zhang review --- ...ge-management-and-maintenance-schedules.md | 33 +++++++++++++------ 1 file changed, 23 insertions(+), 10 deletions(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index 803b84b112..2bbd67fc88 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -622,7 +622,8 @@ When using this strategy, estimated time related metrics are set to 0 (e.g. eta 21. In the subsequent days, the cluster is scaled up to handle additional workload. The new nodes receive the most recent, desired configuration. 22. On the first Saturday of the next month, the MCO resumes its work. 
In order to ensure that forward progress is - made for all nodes, the MCO will update nodes that have the oldest current configuration first. This ensures + made for all nodes, the MCO will update nodes in order of oldest rendered configuration to newest rendered configuration + (among those nodes that do not already have the desired configuration). This ensures that even if the desired configuration has changed multiple times while maintenance was not permitted, no nodes are starved of updates. Consider the alternative where (a) worker-node updates required > 24h, (b) updates to nodes are performed alphabetically, and (c) MachineConfigs are frequently being changed @@ -695,8 +696,15 @@ When using this strategy, estimated time related metrics are set to 0 (e.g. eta management policy to use that exact configuration as the `desiredConfig`. 1. The MCC is thus being asked to ignite any new node or rebooted node with the desired configuration, but it is **not** being permitted to initiate that configuration change on existing nodes because change management, in effect, - is paused indefinitely by the manual strategy. A new annotation `applyOnReboot` will be set on - nodes selected by the MachineConfigPool. The flag indicates to the MCD will that it should only + is paused indefinitely by the manual strategy. + Note that this differs from the MCO's current rollout strategy where new nodes are ignited with an older + configuration until the new configuration has been used on an existing node. The rationale in this change + applies only the manual strategy. The administrator has chosen and advanced strategy and should have a + deterministic set of steps to follow. (a) make the change in the MCP. (b) reboot all extant nodes. If + new nodes were entering the system with old configurations, the list of nodes the administrator needed + to action could be increasing after they query the list of extant nodes. +1. A new annotation `applyOnReboot` will be set on + nodes selected by the MachineConfigPool. The flag indicates to the MCD that it should only apply the configuration before a node is rebooted (vs initiating its own drain / reboot). 1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running the most recently rendered configuration. This is irrespective of the `desiredConfig` in the `manual` @@ -714,15 +722,17 @@ When using this strategy, estimated time related metrics are set to 0 (e.g. eta 1. To prevent surprises, the cluster lifecycle administrator sets the assisted strategy on the worker MCP. 1. In the `assisted` strategy change management policy, the lifecycle administrator configures `pausedUntil: "true"` and the most recently rendered worker configuration in the policy's `renderedConfigsBefore: `. -1. The MCO is being asked to ignite any new node or any rebooted node with the latest rendered configuration - before the specified datetime. However, because of `pausedUntil: "true"`, it is also being asked not to - automatically initiate that material change for existing nodes. +1. Conceptually, this is like a paused MCP today, however, it is paused waiting to apply a machine configuration + that may not be latest rendered configuration. The MCO should not prune any `MachineConfig` referenced by + this strategy. 1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running the most recent, rendered configuration. This is irrespective of the `renderedConfigsBefore` in the `assisted` configuration. 
Conceptually, it means, if change management were disabled, whether changes would be initiated. 1. When the lifecycle administrator is ready to permit disruption, they set `pausedUntil: "false"`. 1. The MCO begins to initiate worker node updates. This rollout abides by documented OpenShift constraints - such as the MachineConfigPool `maxUnavailable` setting. + such as the MachineConfigPool `maxUnavailable` setting. It also abides by currently rollout rules (i.e. + new nodes igniting will receive an older configuration until at least one node has the configuration + selected by `renderedConfigsBefore`). 1. Though new rendered configurations may be created, the assisted strategy will not act until the assisted policy is updated to permit a more recent creation date. @@ -759,9 +769,12 @@ In the Standalone topology, the ClusterVersion and MachineConfigPool resources a #### Single-node Deployments or MicroShift -The ClusterVersion operator will honor the change management field just as in a standalone profile. If those profiles -have a MachineConfigPool, material changes the node could be controlled with a change management policy -in that resource. +These toplogies do not have worker nodes. Only the ClusterVersion change management policy will be relevant. +There is no logical distinction between user workloads and control-plane workloads in this case. A control-plane +update will drain user workloads and will cause workload disruption. + +Though disruption is inevitable, the maintenance schedule feature provides value in these toplogies by +ensuring that disruption happens on a specified schedule. #### OCM Managed Profiles OpenShift Cluster Manager (OCM) should expose a user interface allowing users to manage their change management policy. From 6a0c121a8cb1497c0b7e823202666bcbef4564e9 Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Sun, 7 Jul 2024 12:54:18 -0400 Subject: [PATCH 11/12] Extract ChanageManagementPolicy into external CRD --- ...ge-management-and-maintenance-schedules.md | 985 +++++++++--------- 1 file changed, 516 insertions(+), 469 deletions(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index 2bbd67fc88..e6f43090e1 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -7,6 +7,8 @@ reviewers: approvers: - @sdodson - @jharrington22 + - @joelspeed + - @yuqi-zhang api-approvers: - TBD creation-date: 2024-02-29 @@ -23,7 +25,7 @@ tracking-link: Implement high level APIs for change management which allow standalone and Hosted Control Plane (HCP) clusters a measure of configurable control over when control-plane or worker-node configuration rollouts are initiated. -As a primary mode of configuring change management, implement an option +As a primary mode of configuring change management, implement a strategy called Maintenance Schedules which define reoccurring windows of time (and specifically excluded times) in which potentially disruptive changes in configuration can be initiated. @@ -32,15 +34,9 @@ pending state until such time as they are permitted by the configuration. Change management enforcement _does not_ guarantee that all initiated material changes are completed by the close of a permitted change window (e.g. 
a worker-node -may still be draining or rebooting) at the close of a maintenance schedule, +may still be draining or rebooting) at the close of a scheduled permissive window, but it does prevent _additional_ material changes from being initiated. -Change management enforcement _does not_ attempt to define or control the detailed state of the -system. It only pertains to whether controllers which support change management -will attempt to initiate material change themselves. For example, if changes are paused in the middle -of a cluster update and a node is manually rebooted, change management does not define -whether the node will rejoin the cluster with the new or old version. - A "material change" may vary by cluster profile and subsystem. For example, a control-plane update (all components and control-plane nodes updated) is implemented as a single material change (e.g. the close of a scheduled permissive window @@ -70,9 +66,9 @@ Tools for generating RRULES: - Complex: https://exontrol.com/exicalendar.jsp?config=/js#calendar **Change Management Terminology** -This document uses unique terms to describe the key aspects of change management. +This document uses specialized terms to describe the key aspects of change management. It is worth internalizing the meaning of these terms before reviewing sections of the document. -- "Material Change". A longer definition is provided in the Summary, but in short, any configuration +- "Material Change". A longer definition is provided in the Summary, but, in short, any configuration change a platform operator wishes to apply which would necessitate the reboot or replacement of one or more nodes is considered a material change. For example, updating the control-plane version is a material change as it requires rebooting master nodes. @@ -86,17 +82,15 @@ It is worth internalizing the meaning of these terms before reviewing sections o "Disabled" supersedes "Paused". In other words, if change management is disabled, it does not matter if a configured strategy would be enforcing a change pause or not -- because that disabled strategy is not being considered. -- "Permissive". When change management is disabled xor enabled & unpaused, the cluster is described to be in - a permissive window. This is another way to say that material changes can be applied. When change - management is enabled & paused, material changes will be left in a pending state and the cluster - is not in a permissive window. -- "Strategy". There are multiple change management strategies proposed. Each informs a different behavior - for a controller to pause and unpause changes. "Disabled" is a special strategy that means change management - is indefinitely disabled -- meaning material changes can be freely applied. This will be the default - strategy for OpenShift for the foreseeable future in order to provide consistent behavior with past versions. -- "Maintenance Schedule" is one type of change management strategy. When enabled, based on a recurrence - rule (RRULE) and exclusion periods, a controller will pause or unpause changes according to the - current datetime. +- "Permissive". When change management is disabled xor (enabled & unpaused), change management is providing + a permissive window. This is another way to say that material changes can be applied. +- "Restrictive". The opposite of permissive. When change management is (enabled & paused). Material changes + to associated resources will not be initiated. +- "Strategy". 
There are different change management strategies proposed. Each informs a different behavior + for a controller to pause and unpause changes. +- "Maintenance Schedule" is one change management strategy. When enabled, based on a recurrence + rule ([RRULE](https://icalendar.org/RFC-Specifications/iCalendar-RFC-5545/)) and exclusion periods, + a change management policy will be permissive according to the current datetime. ## Motivation This enhancement is designed to improve user experience during the OpenShift @@ -119,7 +113,7 @@ decomposition of the existing, fully self-managed, Standalone update process int distinct phases as understood and controlled by the end-user: (1) control-plane update and (2) worker-node updates. -To some extent, Maintenance Schedules (a key supported option for change management) +To some extent, Maintenance Schedules (a key strategy supported for change management) are a solution to a problem that will be created by this separation: there is a perception that it would also double the operational burden for users updating a cluster (i.e. they have two phases to initiate and monitor instead of just one). In short, implementing the @@ -136,7 +130,7 @@ to only update during specific, low utilization, periods. Since separating the node update phases is such an important driver for Maintenance Schedules, their motivations are heavily intertwined. The remainder of this -section, therefore, delves into the motivation for this separation. +section, therefore, delves deeply into the motivation for this separation. #### The Case for Control-Plane and Worker-Node Separation From an overall platform perspective, we believe it is important to drive a distinction @@ -215,13 +209,13 @@ the cluster can be initiated, provide value irrespective of this strategic direc controlling exactly when changes are applied to critical systems is universally appreciated in enterprise software. -Since these are such well established principles, I will summarize the motivation as helping -OpenShift meet industry standard expectations with respect to limiting potentially disruptive change +Since these are such well established principles, the motivation can be summarized as helping +OpenShift meet industry standard expectations with respect to limiting potentially disruptive changes outside well planned time windows. It could be argued that rigorous and time sensitive management of OpenShift API resources (e.g. ClusterVersion, MachineConfigPool, HostedCluster, NodePool, etc.) could prevent -unplanned material changes, but Change Management / Maintenance Schedules introduce higher level, platform native, and more +unplanned material changes, but Change Management / Maintenance Schedules introduce higher level, platform native, and intuitive guard rails. For example, consider the common pattern of a gitops configured OpenShift cluster. If a user wants to introduce a change to a MachineConfig, it is simple to merge a change to the resource without appreciating the fact that it will trigger a rolling reboot of nodes in the cluster. @@ -245,7 +239,7 @@ that potentially disruptive changes are limited to well known time windows. #### Reducing Service Delivery Operational Tooling Service Delivery, operating Red Hat's Managed OpenShift offerings (OpenShift Dedicated (OSD), Red Hat OpenShift on AWS (ROSA) and Azure Red Hat OpenShift (ARO) ) is keenly aware of -the issues motivating the Change Management / Maintenance Schedule concept. 
This is evidenced by their design +the challenges motivating the Change Management / Maintenance Schedule concept. This is evidenced by their design and implementation of tooling to fill the gaps in the platform the preceding sections suggest exist. @@ -341,185 +335,303 @@ risks and disruption when rolling out changes to their environments. ### Goals -1. Indirectly support the strategic separation of control-plane and worker-node update phases for Standalone clusters by supplying a change control mechanism that will allow both control-plane and worker-node updates to proceed at predictable times without doubling operational overhead. -2. Directly support the strategic separation of control-plane and worker-node update phases by implementing a "manual" change management strategy where users who value the full control of the separation can manually actuate changes to them independently. -3. Empower OpenShift cluster lifecycle administrators with tools that simplify implementing industry standard notions of maintenance windows. -4. Provide Service Delivery a platform native feature which will reduce the amount of custom tooling necessary to provide maintenance windows for customers. -5. Deliver a consistent change management experience across all platforms and profiles (e.g. Standalone, ROSA, HCP). -6. Enable SRE to, when appropriate, make configuration changes on a customer cluster and have that change actually take effect only when permitted by the customer's change management preferences. -7. Do not subvert expectations of customers well served by the existing fully self-managed cluster update. -8. Ensure the architectural space for enabling different change management strategies in the future. - +1. Indirectly support the strategic separation of control-plane and worker-node update phases for Standalone clusters by + supplying a change control mechanism that will allow both control-plane and worker-node updates to proceed at predictable + times without doubling operational overhead. +1. Empower OpenShift cluster lifecycle administrators with tools that simplify implementing industry standard notions + of maintenance windows. +1. Provide Service Delivery a platform native feature which will reduce the amount of custom tooling necessary to + provide maintenance windows for customers. +1. Deliver a consistent change management experience across all platforms and profiles (e.g. Standalone, ROSA, HCP). +1. Enable SRE to, when appropriate, make configuration changes on a customer cluster and have that change actually + take effect only when permitted by the customer's change management preferences. +1. Do not subvert expectations of customers well served by the existing fully self-managed cluster update. +1. Ensure the architectural space for enabling different change management strategies in the future. +1. Directly support the strategic separation of control-plane and worker-node update phases by empowering cluster + lifecycle administrators with change management strategies that provide them fine-grained control over exactly + when and how worker-nodes are updated to a desired configuration even if no regular maintenance schedule is possible. + ### Non-Goals -1. Allowing control-plane upgrades to be paused midway through an update. Control-plane updates are relatively rapid and pausing will introduce unnecessary complexity and risk. -2. 
Requiring the use of maintenance schedules for OpenShift upgrades (the changes should be compatible with various upgrade methodologies – including being manually triggered). -3. Allowing Standalone worker-nodes to upgrade to a different payload version than the control-plane (this is supported in HCP, but is not a goal for standalone). -4. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement. -5. Providing strict promises around the exact timing of upgrade processes. Maintenance schedules will be honored to a reasonable extent (e.g. upgrade actions will only be initiated during a window), but long-running operations may exceed the configured end of a maintenance schedule. -6. Implementing logic to defend against impractical maintenance schedules (e.g. if a customer configures a 1-second maintenance schedule every year). Service Delivery may want to implement such logic to ensure upgrade progress can be made. -7. Automatically initiating updates to `ClusterVersion`. This will still occur through external actors/orchestration. Maintenance schedules simply give the assurance that changes to `ClusterVersion` will not result in material changes until permitted by the defined maintenance schedules. +1. Allowing control-plane upgrades to be paused midway through an update. Control-plane updates are relatively rapid + and pausing will introduce unnecessary complexity and risk. +1. Requiring the use of maintenance schedules for OpenShift upgrades (the changes should be compatible with various + upgrade methodologies – including being manually triggered). +1. Allowing Standalone worker-nodes to upgrade to a different payload version than the control-plane + (this is supported in HCP, but is not a goal for standalone). +1. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement. +1. Providing strict promises around the exact timing of upgrade processes. Maintenance schedules will be honored to a + reasonable extent (e.g. upgrade actions will only be initiated during a window), but long-running operations may + exceed the configured end of a maintenance schedule. +1. Implementing logic to defend against impractical maintenance schedules (e.g. if a customer configures a 1-second + maintenance schedule every year). Service Delivery may want to implement such logic to ensure upgrade progress can + be made. +1. Automatically initiating updates to `ClusterVersion`. This will still occur through external actors/orchestration. + Maintenance schedules simply give the assurance that changes to `ClusterVersion` will not result in material changes + until permitted by the defined maintenance schedules. ## Proposal ### Change Management Overview -Add a `changeManagement` stanza to several resources in the OpenShift ecosystem: -- HCP's `HostedCluster`. Honored by HyperShift Operator and supported by underlying CAPI primitives. -- HCP's `NodePool`. Honored by HyperShift Operator and supported by underlying CAPI primitives. +Establish a new namespaced Custom Resource Definition (CRD) called `ChangeManagementPolicy` which allows cluster lifecycle +administrators to capture their requirements for when resource(s) associated with the policy can initiate material +changes on the cluster. 
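As an illustrative sketch only (the authoritative schema is defined below, and the field names under `maintenanceSchedule` are placeholder assumptions since the proposal elides them), a policy permitting material changes on the first Saturday of each month might look like:

```yaml
apiVersion: config.openshift.io/v1alpha1   # assumed group/version; not finalized by this proposal
kind: ChangeManagementPolicy
metadata:
  namespace: openshift-machine-config
  name: first-saturday
spec:
  strategy: MaintenanceSchedule
  maintenanceSchedule:
    permit:
      # RRULE (RFC 5545): the first Saturday of every month.
      recurrence: "FREQ=MONTHLY;BYDAY=SA;BYSETPOS=1"
      # Assumed field: how long each permissive window remains open (ISO 8601 duration).
      duration: "PT24H"
    exclude:
      # Assumed field: specific date ranges to carve out (e.g. an audit freeze).
      - from: "2024-11-01T00:00:00Z"
        to: "2024-12-01T00:00:00Z"
```

The RRULE syntax itself is standard iCalendar recurrence; only the surrounding field names and group/version are assumptions for illustration.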
+ +Add a `changeManagement` stanza to several existing resources in the OpenShift ecosystem which can reference +the new `ChangeManagementPolicy` resource to restrict when and how their associated controllers can initiate +material changes to a cluster. +- HCP's `HostedCluster`. Honored by HyperShift Operator and supported by underlying CAPI providers. +- HCP's `NodePool`. Honored by HyperShift Operator and supported by underlying CAPI providers. - Standalone's `ClusterVersion`. Honored by Cluster Version Operator. - Standalone's `MachineConfigPool`. Honored by Machine Config Operator. -The implementation of `changeManagement` will vary by profile -and resource, however, they will share a core schema and provide a reasonably consistent user -experience across profiles. +Changes that are not allowed to be initiated due to a change management policy will be +called "pending". Controllers responsible for initiating pending changes will await a permissive window +according to each resource's relevant `changeManagement` configuration. -The schema will provide options for controlling exactly when changes to API resources on the -cluster can initiate material changes to the cluster. Changes that are not allowed to be -initiated due to a change management control will be called "pending". Subsystems responsible -for initiating pending changes will await a permitted window according to the change's -relevant `changeManagement` configuration(s). +In additional to "policies", different resource kinds may offer additional knobs in their `changeManagement` +stanzas to provide cluster lifecycle administrators additional control over the exact nature of +the changes desired by the resource's associated controller. For example, in `MachineConfigPool`'s +`changeManagement` stanza, a cluster lifecycle administrator will be able to (optionally) specify the +exact rendered configuration the controller should be working towards (during the next permissive window) vs +the traditional OpenShift model where the "latest" rendered configuration is always the destination. + +### ChangeManagementPolicy Resource -### Change Management Strategies -Each resource supporting change management will add the `changeManagement` stanza and support a minimal set of change management strategies. -Each strategy may require an additional configuration element within the stanza. For example: ```yaml +kind: ChangeManagementPolicy +metdata: + # ChangeManagementPolicies are namespaced resources. They will normally reside in the + # namespace associated with the controller initiating material changes. + # For example, in Standalone namespace/openshift-machine-config for the MCO and + # namespace/openshift-cluster-version for the CVO. + # For HCP, the ChangeManagementPolicies for will reside in the same namespace as the + # HostedCluster resource. + # This namespace can be overridden in resources being constrained by a ChangeManagementPolicy + # but RBAC for the resource's controller must permit reading the object from the non-default + # namespace. + namespace: openshift-machine-config + name: example-policy + spec: - changeManagement: - strategy: "MaintenanceSchedule" - pausedUntil: "false" - disabledUntil: "false" - config: - maintenanceSchedule: - ..options to configure a detailed policy for the maintenance schedule.. + # Supported strategy overview: + # Permissive: + # Always permissive - allows material changes. + # Restrictive: + # Always restrictive - pauses material changes. 
+ # MaintenanceSchedule: + # An RRULE and other fields will be used to specify reoccurring permissive windows + # as well as any special exclusion periods. + strategy: Permissive|Restrictive|MaintenanceSchedule + + # Difference strategies expose a configuration + # stanza that further informs their behavior. + maintenanceSchedule: + permit: + ... + +# A new ChangeManagementPolicy controller will update the status field so that +# other controllers, attempting to abide by change management, can easily +# determine whether they can initiate material changes. +status: + # Seconds remaining until changes permitted. 0 means changes + # are currently permitted. + # nil IF pausedUntil: true OR not "Ready". + nextPermissiveETA: 0 + + # Number of seconds remaining in the current permissive window. + # 0 if outside of a permissive window. + # nil if changes are permitted indefinitely or policy not "Ready". + permissiveRemaining: 3600 + + # nil if within a permissive window or policy not "Ready". + lastPermissiveDate: + + conditions: + # If a ChangeManagementPolicy has not calculated yet, it will not + # have Ready=True. + # "reason" and "message" should be set if not ready. + - type: Ready + status: "True" + # Indicates whether the policy is in a permissive mode. + # Must be False while not "Ready". + # Message must provide detailed reason when False. + - type: ChangesPaused + status: "True" + message: "Details on why..." ``` -All change management implementations must support `Disabled` and `MaintenanceSchedule`. Abstracting -change management into strategies allows for simplified future expansion or deprecation of strategies. -Tactically, `strategy: Disabled` provides a convenient syntax for bypassing any configured -change management policy without permanently deleting its configuration. - -For example, if SRE needs to apply emergency corrective action on a cluster with a `MaintenanceSchedule` change -management strategy configured, they can simply set `strategy: Disabled` without having to delete the existing -`maintenanceSchedule` stanza which configures the previous strategy. Once the correct action has been completed, -SRE simply restores `strategy: MaintenanceSchedule` and the previous configuration begins to be enforced. - -Configurations for multiple management strategies can be recorded in the `config` stanza, but -only one strategy can be active at a given time. - -Each strategy will support a policy for pausing or unpausing (permitting) material changes from being initiated. -This will be referred to as the strategy's enforcement state (or just "state"). The enforcement state for a -strategy can be either "paused" or "unpaused" (a.k.a. "permissive"). The `Disabled` strategy enforcement state -is always permissive -- allowing material changes to be initiated (see [Change Management -Hierarchy](#change-management-hierarchy) for caveats). - -All change management strategies, except `Disabled`, are subject to the following `changeManagement` fields: -- `changeManagement.disabledUntil: ""`: When `disabledUntil: "true"` or `disabledUntil: ""`, the interpreted strategy for - change management in the resource is `Disabled`. Setting a future date in `disabledUntil` offers a less invasive (i.e. no important configuration needs to be changed) method to - disable change management constraints (e.g. if it is critical to roll out a fix) and a method that - does not need to be reverted (i.e. 
it will naturally expire after the specified date and the configured - change management strategy will re-activate). -- `changeManagement.pausedUntil: ""`: Unless the effective active strategy is Disabled, `pausedUntil: "true"` or `pausedUntil: ""`, change management must - pause material changes. - -### Change Management Status -Change Management information will also be reflected in resource status. Each resource -which contains the stanza in its `spec` will expose its current impact in its `status`. -Common user interfaces for aggregating and displaying progress of these underlying resources -should be updated to proxy that status information to the end users. - -### Change Management Metrics -Cluster wide change management information will be made available through cluster metrics. Each resource -containing the stanza must expose the following metrics: -- Whether any change management strategy is enabled. -- Which change management strategy is enabled. This can be used to notify SRE when a cluster begins using a non-standard strategy (e.g. during emergency corrective action). -- The number of seconds until the next known permitted change window. See `change_management_next_change_eta` metric. This might be used to notify an SRE team of an approaching permissive window. -- The number of seconds until the current change window closes. See `change_management_permissive_remaining` metric. -- The last datetime at which changes were permitted (can be nil). See `change_management_last_change` metric (which represents this as seconds instead of a datetime). This could be used to notify an SRE team if a cluster has not had the opportunity to update for a non-compliant period. -- If changes are pending due to change management controls. When combined with other metrics (`change_management_next_change_eta`, `change_management_permissive_remaining`), this can be used to notify SRE when an upcoming permissive window is going to initiate changes and whether changes are still pending as a permissive window closes. - - -### Change Management Hierarchy -Material changes to worker-nodes are constrained by change management policies in their associated resource AND -at the control-plane resource. For example, in a standalone profile, if a MachineConfigPool's change management -configuration apparently permits material changes from being initiated at a given moment, that is only the case -if ClusterVersion is **also** permitting changes from being initiated at that time. - -The design choice is informed by a thought experiment: As a cluster lifecycle administrator for a Standalone cluster, -who wants to achieve the simple goal of ensuring no material changes take place outside a well-defined -maintenance schedule, do you want to have to the challenge of keeping every MachineConfigPool's -`changeManagement` stanza in perfect synchronization with the ClusterVersion's? What if a new MCP is created -without your knowledge? - -The hierarchical approach allows a single master change management policy to be in place across -both the control-plane and worker-nodes. - -Conversely, material changes CAN take place on the control-plane when permitted by its associated -change management policy even while material changes are not being permitted by worker-nodes -policies. - -It is thus occasionally necessary to distinguish a resource's **configured** vs **effective** change management -state. There are two states: "paused" and "unpaused" (a.k.a. permissive; meaning that material changes be initiated). 
-For a control-plane resource, the configured and effective enforcement states are always the same. For worker-node -resources, the configured strategy may be disabled, but the effective enforcement state can be "paused" due to -an active strategy in the control-plane resource being in the "paused" state. - -| control-plane state | worker-node state | worker-node effective state | results | -|-------------------------|---------------------------|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| unpaused | unpaused | unpaused | Traditional, fully self-managed change rollouts. Material changes can be initiated immediately upon configuration change. | -| paused (any strategy) | **unpaused** | **paused** | Changes to both the control-plane and worker-nodes are constrained by the control-plane strategy. | -| unpaused | paused (any strategy) | paused | Material changes can be initiated immediately on the control-plane. Material changes on worker-nodes are subject to the worker-node policy. | -| paused (any strategy) | paused (any strategy) | paused | Material changes to the control-plane are subject to change control strategy for the control-plane. Material changes to the worker-nodes are subject to **both** the control-plane and worker-node strategies - if either precludes material change initiation, changes are left pending. | +### Change Management Strategies #### Maintenance Schedule Strategy -The maintenance schedule strategy is supported by all resources which support change management. The strategy -is configured by specifying an RRULE identifying permissive datetimes during which material changes can be +The strategy is configured by specifying an RRULE identifying permissive datetimes during which material changes can be initiated. The cluster lifecycle administrator can also exclude specific date ranges, during which -material changes will be paused. +the policy will request material changes to be paused. -#### Disabled Strategy -This strategy indicates that no change management strategy is being enforced by the resource. It always implies that -the enforcement state at the resource level is unpaused / permissive. This does not always -mean that material changes are permitted due to change management hierarchies. For example, a MachineConfigPool -with `strategy: Disabled` would still be subject to a `strategy: MaintenanceSchedule` in the ClusterVersion resource. -The impact of hierarchy should always be made clear in the change management status of the MachineConfigPool. +#### Restrictive Strategy +A policy using the restrictive strategy will always request material changes to be paused. This strategy is useful +when a cluster lifecycle administrator wants tight control over when material changes are initiated on the cluster +but cannot provide a maintenance schedule (e.g. viable windows are too unpredictable). -#### Assisted Strategy - MachineConfigPool -Minimally, this strategy will be supported by MachineConfigPool. If and when the strategy is supported by other -change management capable resources, the configuration schema for the policy may differ as the details of -what constitutes and informs change varies between resources (e.g. `renderedConfigsBefore` informs the assisted -strategy configuration for a MachineConfigPool, but may be meaningless for other resources). 
+#### Permissive Strategy +A policy using the permissive strategy will always suggest a permissive window. A cluster lifecycle administrator +may want to toggle a `ChangeManagementPolicy` from the `Restrictive` to `Permissive` strategy, and back again, +as a means to implementing their own change management window mechanism. -This strategy is motivated by the desire to support the separation of control-plane and worker-node updates both -conceptually for users and in real technical terms. One way to do this for users who do not benefit from the -`MaintenanceSchedule` strategy is to ask them to initiate, pause, and resume the rollout of material -changes to their worker nodes. Contrast this with the fully self-managed state today, where worker-nodes -(normally) begin to be updated automatically and directly after the control-plane update (subject to constraints -like `maxUnavailable` in each MachineConfigPool). +### Resources Supporting Change Management -Clearly, if this was the only mode of updating worker-nodes, we could never successfully disentangle the -concepts of control-plane vs worker-node updates in Standalone environments since one implies the other. +Resources which support a reference to a `ChangeManagementPolicy` are said to support change management. +Resources which support change management will implement a `spec.changeManagement` stanza. These stanzas +must support AT LEAST the following fields: -In short (details will follow in the implementation section), the assisted strategy allows users to -constrain which rendered machineconfig [`desiredConfig` the MachineConfigPool](https://github.com/openshift/machine-config-operator/blob/0f53196f6480481d3f5a04e217a143a56d4db79e/docs/MachineConfig.md#final-rendered-machineconfig-object) -should be advertised to the MachineConfigDaemon on nodes it is associated with. Like the `MaintenanceSchedule` -strategy, it also respects the `pausedUntil` field. +#### ChangeManagement Stanzas +```yaml +kind: ClusterVersion|MachineConfigPool|HostedCluster|NodePool +spec: + changeManagement: + # If set to "true" or a future date (represented as string), the effective change + # management strategy is Disabled and the window is permissive. + # Date must be RFC3339. + # When disabled with this field, all other values of changeManagement are ignored. + # This field is modeled on HCP's existing HostedCluster.spec.pausedUntil which uses a string. + disabledUntil: str + + # If set to "true" or a future date (represented as string), the strategy will be + # paused (not permissive) unless overridden by disabledUntil. If paused with this + # setting, it overrides the "policy" stanza. + # Date must be RFC3339. + # This field is modeled on HCP's existing HostedCluster.spec.pausedUntil which uses a string. + pausedUntil: str + + # If not overridden with disabledUntil / pausedUntil, a reference to the ChangeManagementPolicy + # to determine whether material changes can be initiated. + policy: + # Namespace is optional. If not specified, the controller assumes the namespace in + # which the controller is running. + namespace: openshift-machine-config + # The name of the ChangeManagementPolicy. + name: example-policy +``` -When using this strategy, estimated time related metrics are set to 0 (e.g. eta and remaining). +At a given moment, a `changeManagement` stanza indicates to a controller responsible for a resource +whether changes should be paused (i.e. it is a restrictive window where material changes should not +be initiated) or unpaused (i.e. 
it is a permissive window where material changes can be initiated). +- `changeManagement.disabledUntil: ""`: When `disabledUntil: "true"` or `disabledUntil: ""`, + the controller completely disables change management and all changes are permitted. `disabledUntil` overrides + both `pausedUntil` and `policy` when it suggests change management should be disabled. +- `changeManagement.pausedUntil: ""`: When `pausedUntil: "true"` or `pausedUntil: ""`, + changes must be paused and the controller must stop initiating material changes. `pausedUntil` overrides + `policy` when it suggests changes should be paused. +- `changeManagement.policy`: An optional reference to a `ChangeManagementPolicy` object. If neither `disabledUntil` + or `pausedUntil` overrides it, the permissive or restrictive state suggested by the policy object will + inform the controller whether material changes can be initiated. + +While fields like `disabledUntil` or `pausedUntil` may seem to add unnecessarily complexity, they provide +simple to use knobs for SRE and cluster lifecycle administrators to confidently undertake sensitive actions. +For example, +- If SRE needs to apply emergency corrective action on a cluster with a `MaintenanceSchedule` change + management strategy configured, they can simply set `disabledUntil: ` without having to + change object references OR worry about follow-up corrective actions to restore a previous policy. +- If the cluster lifecycle administrator needs to urgently stop a problematic update, they can set + `pausedUntil: true` until issues are fully understood. In a scenario impacting business critical + applications, compare the complexity of this operation with that of trying to fiddle with policy + dates. + +#### Change Management Status & Conditions +Each resource which exposes `spec.changeManagement` must also expose change management status information +to explain its current impact. Common user interfaces for aggregating and displaying progress of these underlying +resources (e.g. OpenShift web console) must be updated to proxy that status information to end users. + +You may note that several of the fields in `status.changeManagement` can be derived directly from `ChangeManagementPolicy.status` +(unless overridden by `spec.changeManagement.pausedUntil` or `spec.changeManagement.disabledUntil`). This simplifies +the work of each controller which supports change management - they simply need to observe `ChanageManagementPolicy.status` +and the `ChanageManagmentPolicy` controller does the heavy lifting of interpreting policy (e.g. interpreting RRULEs). -#### Manual Strategy - MachineConfigPool -Minimally, this strategy will be supported by MachineConfigPool. If and when the strategy is supported by other -change management capable resources, the configuration schema for the policy may differ as the details of -what constitutes and informs change varies between resources. +```yaml +kind: ClusterVersion|MachineConfigPool|HostedCluster|NodePool +spec: + ... +status: + changeManagement: + # Seconds remaining until changes permitted. 0 means changes + # are currently permitted. + # nil IF pausedUntil: true or policy not "Ready". + nextPermissiveETA: 0 + + # Number of seconds remaining in the current permissive window. + # 0 if outside of a permissive window. + # nil if changes are permitted indefinitely or policy not "Ready". 
+ permissiveRemaining: 3600 + + message: "human readable summary" + + # Last recorded permissive window by THIS MCP (this may be different from a recently + # configured ChangeManagementPolicy's lastPermissiveDate). + lastPermissiveDate: + conditions: + - type: ChangesPaused + status: "True" + message: "Details on why..." + - type: ChangesPending + status: "True" + message: "Details on what..." +``` -Like the assisted strategy, this strategy is implemented to support the conceptual and technical separation -of control-plane and worker-nodes. The MachineConfigPool Manual strategy allows users to explicitly specify -their `desiredConfig` to be used for ignition of new and rebooting nodes. While the Manual strategy is enabled, -the MachineConfigOperator will not trigger the MachineConfigDaemon to drain or reboot nodes automatically. +#### Change Management Metrics +Change management information will be made available through cluster metrics. Each resource +containing the `spec.changeManagement` stanza must expose the following metrics: +- Whether any change management strategy is enabled. +- Which change management strategy is enabled. This can be used to notify SRE when a cluster begins using a + non-standard strategy (e.g. during emergency corrective action). +- The number of seconds until the next known permitted change window. See `change_management_next_change_eta` metric. + This might be used to notify an SRE team of an approaching permissive window. +- The number of seconds until the current change window closes. See `change_management_permissive_remaining` metric. +- The last datetime at which changes were permitted (can be nil). See `change_management_last_change` metric (which + represents this as seconds instead of a datetime). This could be used to notify an SRE team if a cluster has not + had the opportunity to update for a non-compliant period. +- If changes are pending due to change management controls. When combined with other + metrics (`change_management_next_change_eta`, `change_management_permissive_remaining`), this can be used to + notify SRE when an upcoming permissive window is going to initiate changes and whether changes are still + pending as a permissive window closes. + +### Enhanced MachineConfigPool Control +The MachineConfigOperator (MCO), like any other operator, works toward eventual consistency of the system state +with state of configured resources. This behavior offers a simple user experience wherein an administrator +can make a change to a `MachineConfig` and the MCO take over rolling out that change. It will +1. Aggregate selected `MachineConfig` objects into a new "rendered" `MachineConfig` object. +1. Work through updating nodes to use the "latest" rendered `MachineConfig` associated with them. + +However, in business critical clusters, progress towards the "latest" rendered `MachineConfig` offers less control +than may be desired for some use cases. To address this, the `MachineConfigPool` resource will be extended +to include an option to declare which rendered `MachineConfig` the MachineConfigOperator should make +progress toward instead of the "latest". -Because the Manual strategy never initiates changes on its own behalf, `pausedUntil` has no effect. From a metrics -perspective, this strategy reports as paused indefinitely. +```yaml +kind: MachineConfigPool +spec: + machineConfig: + # An administrator can identify the exact rendered MachineConfig + # the MCO should progress towards for this MCP. 
+    name:
+
+    # Validation defines when MCO will allow the use of a new
+    # configuration. Default means a node must successfully use the
+    # configuration before new nodes are ignited with the
+    # desiredConfiguration.
+    # None means no validation is necessary (new nodes will
+    # immediately ignite with the configuration).
+    validation: Default
+```
+
+### Projected ClusterVersion in HCP
+A [Cluster Instance Admin](https://hypershift-docs.netlify.app/reference/concepts-and-personas/#personas) using a
+hosted cluster cannot directly update their cluster. When they query `ClusterVersion` on their cluster, they only
+see a "projected" value. If they edit the projected value, it has no effect on the cluster.
-When using this strategy, estimated time related metrics are set to 0 (e.g. eta and remaining).
+In order to update the hosted cluster, changes must be made to its associated HCP resources on the management cluster.
+Since this proposal subjects those resources to change management control, this information must also be projected
+into the `ClusterVersion` (including change management status information) of hosted clusters so that a Cluster
+Instance Admin can understand when material changes to their cluster will take place.
 
 ### Workflow Description
 
@@ -529,22 +641,22 @@ When using this strategy, estimated time related metrics are set to 0 (e.g. eta
 1. To comply with their company policy, the service consumer configures a maintenance schedule through OCM.
 1. Their first preference, no updates at all, is rejected by OCM policy, and they are referred to service delivery
    documentation explaining minimum requirements.
-1. The user specifies a policy which permits changes to be initiated any time Saturday UTC on the control-plane.
-1. To limit perceived risk, they try to specify a separate policy permitting worker-nodes updates only on the **first** Sunday of each month.
-1. OCM rejects the configuration because, due to change management hierarchy, worker-node maintenance schedules can only be a proper subset of control-plane maintenance schedules.
-1. The user changes their preference to a policy permitting worker-nodes updates only on the **first** Saturday of each month.
+1. The user specifies a policy which permits control-plane changes to be initiated at any time on Saturdays (UTC).
+1. To limit overall workload disruption, the user changes their worker-node policy to permit updates only on the **first** Saturday of each month.
 1. OCM accepts the configuration.
 1. OCM configures the HCP (HostedCluster/NodePool) resources via the Service Delivery Hive deployment to contain a `changeManagement` stanza
-   and an active/configured `MaintenanceSchedule` strategy.
-1. Hive updates the associated HCP resources.
-1. Company workloads are added to the new cluster and the cluster provides value.
+   referring to newly created `ChangeManagementPolicy` objects using the `MaintenanceSchedule` strategy.
+1. Hive creates/updates the associated HCP resources.
+1. Company workloads are added to the new cluster and the cluster provides business value.
 1. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform.
 1. Via OCM, the service consumer requests the minor version update. They can do this at any time with confidence that the maintenance
   schedule will be honored. They do so on Wednesday.
 1. OCM (through various layers) updates the target release payload in the HCP HostedCluster and NodePool.
-1. 
The HyperShift Operator detects the desired changes but recognizes that the `changeManagement` stanza - precludes the updates from being initiated. -1. Curious, the service consumer checks the projects ClusterVersion within the HostedCluster and reviews its `status` stanza. It shows that changes are pending and the time of the next window in which changes can be initiated. +1. The HyperShift Operator detects the desired changes but recognizes that the `changeManagement` stanzas + of each resource precludes updates from being initiated. +1. Curious, the service consumer checks the projected ClusterVersion within the HostedCluster and reviews + its `status` stanza. It shows that changes are pending and the time of the next window in which changes + can be initiated. 1. Separate metrics specific to change management indicate that changes are pending for both resources. 1. The non-Red Hat operations team has alerts setup to fire when changes are pending and the number of seconds before the next permitted window is less than 2 days away. @@ -560,68 +672,63 @@ When using this strategy, estimated time related metrics are set to 0 (e.g. eta 1. Interpreting the new exclusion, the metric for time remaining until a permitted window increases to count down to the following month's first Saturday. 1. A month passes and the pending cause the configured alerts to fire again. 1. The operations team is comfortable with the forthcoming changes. -1. The first Saturday of the month 00:00 UTC arrives. The HyperShift operator initiates the worker-node updates based on the pending changes in the cluster NodePool. +1. The first Saturday of the month 00:00 UTC arrives. The HyperShift operator initiates the worker-node updates based + on the pending changes in the cluster NodePool. 1. The HCP cluster has a large number of worker nodes and draining and rebooting them is time-consuming. -1. At 23:59 UTC Saturday night, 80% of worker-nodes have been updated. Since the maintenance schedule still permits the initiation of material changes, another worker-node begins to be updated. -1. The update of this worker-node continues, but at 00:00 UTC Sunday, no further material changes are permitted by the change management policy and the worker-node update process is effectively paused. -1. Because not all worker-nodes have been updated, changes are still reported as pending via metrics for NodePool. **TODO: Review with HyperShift. Pausing progress should be possible, but a metric indicating changes still pending may not since they interact only through CAPI.** +1. At 23:59 UTC Saturday night, 80% of worker-nodes have been updated. Since the maintenance schedule still permits + the initiation of material changes, another worker-node begins to be updated. +1. The update of this worker-node continues, but at 00:00 UTC Sunday, no further material changes are permitted by + the change management policy and the worker-node update process is effectively paused. +1. Because not all worker-nodes have been updated, changes are still reported as pending via metrics for + NodePool. 1. The HCP cluster runs with worker-nodes at mixed versions throughout the month. The N-1 skew between the old kubelet versions and control-plane is supported. -1. **TODO: Review with Service Delivery. If the user requested another minor bump to their control-plane, how does OCM prevent unsupported version skew today?** -1. On the next first Saturday, the worker-nodes updates are completed. +1. 
On the next first Saturday of the month, the worker-nodes updates are completed. #### OCM Standalone Standard Change Management Scenario 1. User interactions with OCM to configure a maintenance schedule are identical to [OCM HCP Standard Change Management Scenario](#ocm-hcp-standard-change-management-scenario). This scenario differs after OCM accepts the maintenance schedule configuration. Control-plane updates are permitted to be initiated to any Saturday UTC. Worker-nodes must wait until the first Saturday of the month. -2. OCM (through various layers) configures the ClusterVersion and worker MachineConfigPool(s) (MCP) for the cluster with appropriate `changeManagement` stanzas. -3. Company workloads are added to the new cluster and the cluster provides value. -4. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform. -5. Via OCM, the service consumer requests the minor version update. They can do this at any time with confidence that the maintenance +1. OCM (through various layers) creates ChangeManagementPolicies and configures the ClusterVersion and + worker MachineConfigPool(s) (MCP) for the cluster with appropriate `changeManagement` stanzas. +1. Company workloads are added to the new cluster and the cluster provides business value. +1. To leverage a new feature in OpenShift, the service consumer plans to update the minor version of the platform. +1. Via OCM, the service consumer requests the minor version update. They can do this at any time with confidence that the maintenance schedule will be honored. They do so on Wednesday. -6. OCM (through various layers) updates the ClusterVersion resource on the cluster indicating the new release payload in `desiredUpdate`. -7. The Cluster Version Operator (CVO) detects that its `changeManagement` stanza does not permit the initiation of the change. -8. The CVO sets a metric indicating that changes are pending for ClusterVersion. Irrespective of pending changes, the CVO also exposes a +1. OCM (through various layers) updates the ClusterVersion resource on the cluster indicating the new release payload in `desiredUpdate`. +1. The Cluster Version Operator (CVO) detects that its `changeManagement` stanza does not permit the initiation of the change. +1. The CVO sets a metric indicating that changes are pending for ClusterVersion. Irrespective of pending changes, the CVO also exposes a metric indicating the number of seconds until the next window in which material changes can be initiated. -9. Since MachineConfigs likely do not match in the desired update and the current manifests (RHCOS changes occur 100% of the time for non-hotfix updates), - the CVO also sets a metric indicating that MachineConfig changes are pending. This is an assumption, but the price of being wrong - on rare occasions is very low (pending changes will be reported, but disappear shortly after a permissive window begins). - This is done because the MachineConfigOperator (MCO) cannot anticipate the coming manifest changes and cannot, - therefore, reflect expected changes to the worker-node MCPs. Anticipating this change ahead of time is necessary for an operations - team to be able to set an alert with the semantics (worker-node-update changes are pending & time remaining until changes are permitted < 2d). - The MCO will expose its own metric for changes pending when manifests are updated. But this metric will only indicate when - there are machines in the pool that have not achieved the desired configuration. 
An operations team trying to implement the 2d
-    early warning for worker-nodes must use OR on these metrics to determine whether changes are actually pending.
-10. The MCO, irrespective of pending changes, exposes a metric for each MCP to indicate the number of seconds remaining until it is
+1. The MCO, irrespective of pending changes, exposes a metric for each MCP to indicate the number of seconds remaining until it is
    permitted to initiate changes to nodes in that MCP.
-11. A privileged user on the cluster notices different options available for `changeManagement` in the ClusterVersion and MachineConfigPool
-    resources. They try to set them but are prevented by a validating admission controller. If they wish
+1. A privileged user on the cluster notices different options available for `changeManagement` in the ClusterVersion and MachineConfigPool
+   resources. They try to set them but are prevented by an SRE validating admission controller. If they wish
    to change the settings, they must update them through OCM.
-12. The privileged user does an `oc describe ...` on the resources. They can see that material changes are pending in ClusterVersion for
+1. The privileged user does an `oc describe ...` on the resources. They can see that material changes are pending in ClusterVersion for
    the control-plane and for worker machine config. They can also see the date and time that the next material change will be
    permitted. The MCP will not show a pending change at this time, but will show the next time at which material changes will be
    permitted.
-13. The next Saturday is _not_ the first Saturday of the month. The CVO detects that material changes are permitted at 00:00 UTC and
+1. The next Saturday is _not_ the first Saturday of the month. The CVO detects that material changes are permitted at 00:00 UTC and
    begins to apply manifests. This effectively initiates the control-plane update process, which is considered
    a single material change to the cluster.
-14. The control-plane update succeeds. The CVO, having reconciled its state, unsets metrics suggesting changes are pending.
-15. As part of updating cluster manifests, MachineConfigs have been modified. The MachineConfigOperator (MCO) re-renders a
-    configuration for worker-nodes. However, because the MCP maintenance schedule precludes initiating material changes,
-    it will not begin to update Machines with that desired configuration.
-16. The MCO will set a metric indicating that desired changes are pending.
-17. `oc get -o=yaml/describe` will both provide status information indicating that changes are pending for the MCP and
+1. The control-plane update succeeds. The CVO, having reconciled its state, unsets metrics suggesting changes are pending.
+1. As part of updating cluster manifests, MachineConfigs have been modified. The MachineConfigOperator (MCO) re-renders a
+   configuration for worker-nodes. However, because the MCP `changeManagement` stanza precludes initiating material changes,
+   it will not yet begin to update Machines with that desired configuration.
+1. The MCO will set a metric indicating that desired changes are pending.
+1. `oc get -o=yaml/describe` will both provide status information indicating that changes are pending for the MCP and
    the time at which the next material changes can be initiated according to the maintenance schedule.
-18. On the first Saturday of the next month, 00:00 UTC, the MCO determines that material changes are permitted.
+1. 
On the first Saturday of the next month, 00:00 UTC, the MCO determines that material changes are permitted. Based on limits like maxUnavailable, the MCO begins to annotate nodes with the desiredConfiguration. The MachineConfigDaemon takes over from there, draining, and rebooting nodes into the updated release. -19. There are a large number of nodes in the cluster and this process continues for more than 24 hours. On Saturday +1. There are a large number of nodes in the cluster and this process continues for more than 24 hours. On Saturday 23:59, the MCO applies a round of desired configurations annotations to Nodes. At 00:00 on Sunday, it detects that material changes can no longer be initiated, and pauses its activity. Node updates that have already been initiated continue beyond the maintenance schedule window. -20. Since not all nodes have been updated, the MCO continues to expose a metric informing the system of +1. Since not all nodes have been updated, the MCO continues to expose a metric informing the system of pending changes. -21. In the subsequent days, the cluster is scaled up to handle additional workload. The new nodes receive +1. In the subsequent days, the cluster is scaled up to handle additional workload. The new nodes receive the most recent, desired configuration. -22. On the first Saturday of the next month, the MCO resumes its work. In order to ensure that forward progress is +1. On the first Saturday of the next month, the MCO resumes its work. In order to ensure that forward progress is made for all nodes, the MCO will update nodes in order of oldest rendered configuration to newest rendered configuration (among those nodes that do not already have the desired configuration). This ensures that even if the desired configuration has changed multiple times while maintenance was not permitted, @@ -630,22 +737,26 @@ When using this strategy, estimated time related metrics are set to 0 (e.g. eta during times when maintenance is not permitted. This strategy could leave nodes sorting last lexicographically no opportunity to receive updates. This scenario would eventually leave those nodes more prone to version skew issues. -23. During this window of time, all node updates are initiated, and they complete successfully. +1. During this window of time, all node updates are initiated, and they complete successfully. #### Service Delivery Emergency Patch 1. SRE determines that a significant new CVE threatens the fleet. -1. A new OpenShift release in each z-stream fixes the problem. +1. A new OpenShift release in each z-stream fixes the problem. Only the control-plane needs to be updated + to remediate the CVE. 1. SRE plans to override customer maintenance schedules in order to rapidly remediate the problem across the fleet. -1. The new OpenShift release(s) are configured across the fleet. Clusters with permissive maintenance - schedules begin to apply the changes immediately. -1. Clusters with change management policies precluding updates are SRE's next focus. -1. During each region's evening hours, to limit disruption, SRE changes the `changeManagement` strategy - field across relevant resources to `Disabled`. Changes that were previously pending are now - permitted to be initiated. -1. Cluster operators who have alerts configured to fire when there is no change management policy in place +1. The new OpenShift release(s) are configured across the fleet. Clusters with permissive change management + policies begin to apply the changes immediately. +1. 
Clusters with `ClusterVersion` change management policies precluding updates are SRE's next focus. +1. During each region's evening hours, to limit disruption, SRE changes the `ClusterVersion.spec.changeManagement.disabledUntil` field + to the current datetime+24h. Changes that were previously pending are now permitted to be initiated for + 24 hours since any configured change management policy is disabled for that period of time. +1. Clusters which have alerts configured to fire when there is no change management policy in place will do so. -1. As clusters are successfully remediated, SRE restores the `MaintenanceSchedule` strategy for its resources. - +1. SRE continues to monitor the rollout but does not need to remove `changeManagement.disabledUntil` since it will + automatically deactivate in 24 hours. +1. Clusters with change management policies setup for their worker-nodes are not updated automatically after the + control-plane update. MCPs will report pending changes, but the MachineConfigOperator will await a + permissive window for each MCP to apply the potentially disruptive update. #### Service Delivery Immediate Remediation 1. A customer raises a ticket for a problem that is eventually determined to be caused by a worker-node system configuration. @@ -654,10 +765,10 @@ When using this strategy, estimated time related metrics are set to 0 (e.g. eta configured maintenance schedule permits the material change from being initiated by the MachineConfigOperator or (b) having SRE override the maintenance schedule and permitting its immediate application. 1. The customer chooses immediate application. -1. SRE applies a change to the relevant control-plane AND worker-node resource's `changeManagement` stanza - (both must be changed because of the change management hierarchy), setting `disabledUntil` to - a time 48 hours in the future. The configured change management schedule is ignored for 48 hours as the system - initiates all necessary node changes. +1. SRE applies a change to the relevant worker-node resource's `changeManagement` stanza, setting `disabledUntil` to + a time 48 hours in the future. The configured change management policy is ignored for 48 hours as the system + initiates all necessary node changes to worker-nodes. +1. If unrelated changes were pending for the control-plane, they will remain pending throughout this process. #### Service Delivery Deferred Remediation 1. A customer raises a ticket for a problem that is eventually determined to be caused by a worker-node system configuration. @@ -672,71 +783,79 @@ When using this strategy, estimated time related metrics are set to 0 (e.g. eta #### On-prem Standalone GitOps Change Management Scenario 1. An on-prem cluster is fully managed by gitops. As changes are committed to git, those changes are applied to cluster resources. 1. Configurable stanzas of the ClusterVersion and MachineConfigPool(s) resources are checked into git. -1. The cluster lifecycle administrator configures `changeManagement` in both the ClusterVersion and worker MachineConfigPool - in git. The MaintenanceSchedule strategy is chosen. The policy permits control-plane and worker-node updates only after - 19:00 Eastern US. +1. The cluster lifecycle administrator configures `spec.changeManagement` in both the ClusterVersion and worker MachineConfigPool + in git. A policy using the MaintenanceSchedule strategy is chosen. The policy permits control-plane and worker-node updates only after + 19:00 UTC. 1. 
During the working day, users may contribute and merge changes to MachineConfigs or even the `desiredUpdate` of the
-   ClusterVersion. These resources will be updated in a timeline manner via GitOps.
+   ClusterVersion. These resources will be updated on the cluster in a timely manner via GitOps.
 1. Despite the resource changes, neither the CVO nor MCO will begin to initiate the material changes on the cluster.
-1. Privileged users who may be curious as to the discrepancy between git and the cluster state can use `oc get -o=yaml/describe`
+1. Privileged users, who may be curious as to the discrepancy between git and the cluster state, can use `oc get -o=yaml/describe`
   on the resources. They observe that changes are pending and the time at which changes will be initiated.
-1. At 19:00 Eastern, the pending changes begin to be initiated. This rollout abides by documented OpenShift constraints
+1. At 19:00 UTC, the pending changes begin to be initiated. This rollout abides by documented OpenShift constraints
   such as the MachineConfigPool `maxUnavailable` setting.
 
-#### On-prem Standalone Manual Strategy Scenario
+#### On-prem Standalone Manual Change Rollout Scenario
 1. A small, business critical cluster is being run on-prem.
 1. There are no reoccurring windows of time when the cluster lifecycle administrator can tolerate downtime. Instead,
    updates are negotiated and planned far in advance.
-1. The cluster workloads are not HA and unplanned drains are considered a business risk.
-1. To prevent surprises, the cluster lifecycle administrator sets the Manual strategy on the worker MCP.
+1. The cluster workloads are not HA and unplanned drains are a business risk.
+1. To prevent surprises, the cluster lifecycle administrator configures the worker-node MCP to refer to a `ChangeManagementPolicy`
+   using the `Restrictive` strategy (which never permits changes to be initiated).
 1. Given the sensitivity of the operation, the lifecycle administrator wants to manually drain and reboot nodes to
    accomplish the update.
 1. The cluster lifecycle administrator sends a company-wide notice about the period during which service may be
   disrupted.
-1. The user determines the most recent rendered worker configuration. They configure the `manual` change
-   management policy to use that exact configuration as the `desiredConfig`.
-1. The MCC is thus being asked to ignite any new node or rebooted node with the desired configuration, but it
-   is **not** being permitted to initiate that configuration change on existing nodes because change management, in effect,
-   is paused indefinitely by the manual strategy.
-   Note that this differs from the MCO's current rollout strategy where new nodes are ignited with an older
-   configuration until the new configuration has been used on an existing node. The rationale in this change
-   applies only the manual strategy. The administrator has chosen and advanced strategy and should have a
-   deterministic set of steps to follow. (a) make the change in the MCP. (b) reboot all extant nodes. If
-   new nodes were entering the system with old configurations, the list of nodes the administrator needed
-   to action could be increasing after they query the list of extant nodes.
-1. A new annotation `applyOnReboot` will be set on
-   nodes selected by the MachineConfigPool. The flag indicates to the MCD that it should only
-   apply the configuration before a node is rebooted (vs initiating its own drain / reboot).
+1. The user determines the most recent rendered worker `MachineConfig`.
+1. 
They configure the `MachineConfigPool.spec.machineConfig.name` field to specify that exact configuration as the
+   target configuration. They also set `MachineConfigPool.spec.machineConfig.validation` to `None`.
+   By bypassing normal validation, the MCO is being asked to ignite any new node with the specified `MachineConfig`
+   even without first ensuring an existing node can use the configuration. At the same time, the `Restrictive`
+   change management policy that is in place is telling the MCO that it is not permitted to initiate
+   changes on existing nodes.
 1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running
-   the most recently rendered configuration. This is irrespective of the `desiredConfig` in the `manual`
-   policy. Conceptually, it means, if change management were disabled, whether changes would be initiated.
-1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster.
+   the most recently rendered configuration. Conceptually, it means, if change management were disabled,
+   whether changes would be initiated.
+1. The cluster lifecycle administrator scales in a new node. It receives the specified configuration. They
+   validate the node's functionality.
+1. Satisfied with the new machine configuration, the cluster lifecycle administrator begins manually draining
+   and rebooting existing nodes on the cluster.
    Before the node reboots, it takes the opportunity to pivot to the `desiredConfig`.
 1. After updating all nodes, the cluster lifecycle administrator does not need to make any additional
    configuration changes. They can leave the `changeManagement` stanza in their MCP as-is.
 
-#### On-prem Standalone Assisted Strategy Scenario
+#### On-prem Standalone Assisted Rollout Scenario
 1. A large, business critical cluster is being run on-prem.
 1. There are no reoccurring windows of time when the cluster lifecycle administrator can tolerate downtime. Instead,
    updates are negotiated and planned far in advance.
-1. The cluster workloads are not HA and unplanned drains are considered a business risk.
-1. To prevent surprises, the cluster lifecycle administrator sets the assisted strategy on the worker MCP.
-1. In the `assisted` strategy change management policy, the lifecycle administrator configures `pausedUntil: "true"`
-   and the most recently rendered worker configuration in the policy's `renderedConfigsBefore: `.
-1. Conceptually, this is like a paused MCP today, however, it is paused waiting to apply a machine configuration
-   that may not be latest rendered configuration. The MCO should not prune any `MachineConfig` referenced by
-   this strategy.
+1. The cluster workloads are not HA and unplanned drains are a business risk.
+1. To prevent surprises, the cluster lifecycle administrator sets `changeManagement` on the worker MCP
+   to refer to a `ChangeManagementPolicy` using the `Restrictive` strategy.
+1. Various cluster updates have been applied to the control-plane, but the cluster lifecycle administrator
+   has not updated worker-nodes.
 1. The MCO metric for the MCP indicating that changes are pending is set because not all nodes are running
-   the most recent, rendered configuration. This is irrespective of the `renderedConfigsBefore` in the `assisted`
-   configuration. Conceptually, it means, if change management were disabled, whether changes would be initiated.
-1. When the lifecycle administrator is ready to permit disruption, they set `pausedUntil: "false"`.
-1. The MCO begins to initiate worker node updates.
+   the desired `MachineConfig`.
+1. The cluster lifecycle administrator wants to choose the exact `MachineConfig` to apply as well as the exact
+   times during which material changes will be made to worker-nodes. However, they do not want to manually
+   reboot nodes. To have the MCO assist them in the rollout of the change, they set
+   `MachineConfigPool.spec.machineConfig.name=` and
+   `MachineConfigPool.spec.changeManagement.disabledUntil=`
+   to allow the MCO to begin initiating material changes and make progress towards the specified configuration.
+   The MCO will not prune any `MachineConfig` referenced by an MCP's `machineConfig` stanza.
+1. The MCO begins to initiate worker-node updates. 
This rollout abides by documented OpenShift constraints + the desired `MachineConfig`. +1. The cluster lifecycle administrator wants to choose the exact `MachineConfig` to apply as well as the exact + times during which material changes will be made to worker-nodes. However, they do not want to manually + reboot nodes. To have the MCO assist them in the rollout of the change, they set + `MachineConfigPool.spec.machineConfig.name=` and + `MachineConfigPool.spec.changeManagement.disabledUntil=` + to allow the MCO to begin initiating material changes and make progress towards the specified configuration. + The MCO will not prune any `MachineConfig` referenced by an MCP's `machineConfig` stanza. +1. The MCO begins to initiate worker-node updates. This rollout abides by documented OpenShift constraints such as the MachineConfigPool `maxUnavailable` setting. It also abides by currently rollout rules (i.e. - new nodes igniting will receive an older configuration until at least one node has the configuration - selected by `renderedConfigsBefore`). -1. Though new rendered configurations may be created, the assisted strategy will not act until the assisted policy - is updated to permit a more recent creation date. - - + new nodes igniting will receive an older configuration until at least one node has successfully applied + the latest rendered `MachineConfig`). +1. Though new rendered configurations may be created during this process (e.g. another control-plane update + is applied while the worker-nodes are making progress), the MCO will ignore them since it is being asked + to apply `MachineConfigPool.spec.machineConfig.name` as the desired configuration. + +There are other paths through which a similar outcome could be achieved. +- Without creating a `ChangeManagementPolicy` object. The cluster lifecycle administrator could + leave `MachineConfigPool.spec.changeManagement.pausedUntil=True` to achieve the same net result + as a `Restrictive` policy object. +- By toggling the strategy configured in a `ChangeManagementPolicy` referenced by their MCPs + from `strategy: Restrictive` to `strategy: Permissive` and back again after their worker-nodes are + updated. + ### API Extensions API Extensions are CRDs, admission and conversion webhooks, aggregated API servers, @@ -759,13 +878,13 @@ of API Extensions" section. #### Hypershift / Hosted Control Planes -In the HCP topology, the HostedCluster and NodePool resources are enhanced to support the change management strategies -`MaintenanceSchedule` and `Disabled`. +In the HCP topology, the `HostedCluster` and `NodePool` resources are enhanced to support the `spec.changeManagement` +stanza. #### Standalone Clusters -In the Standalone topology, the ClusterVersion and MachineConfigPool resources are enhanced to support the change management strategies -`MaintenanceSchedule` and `Disabled`. The MachineConfigPool also supports the `Manual` and `Assisted` strategies. +In the Standalone topology, the ClusterVersion and MachineConfigPool resources are enhanced to support the +`spec.changeManagement` stanza. #### Single-node Deployments or MicroShift @@ -774,84 +893,67 @@ There is no logical distinction between user workloads and control-plane workloa update will drain user workloads and will cause workload disruption. Though disruption is inevitable, the maintenance schedule feature provides value in these toplogies by -ensuring that disruption happens on a specified schedule. 
+giving the cluster lifecycle administrator high level controls on exactly when it occurs. #### OCM Managed Profiles OpenShift Cluster Manager (OCM) should expose a user interface allowing users to manage their change management policy. -Standard Fleet clusters will expose the option to configure the MaintenanceSchedule strategy - including -only permit and exclude times. +Standard Fleet clusters will expose the option to configure the control-plane and worker-node `ChangeManagementPolicy` +objects with the `MaintenanceSchedule` strategy - including permit and exclude times. -- Service Delivery will reserve the right to disable this strategy for emergency corrective actions. -- Service Delivery should constrain permit & exclude configurations based on their internal policies. For example, customers may be forced to enable permissive windows which amount to at least 6 hours a month. +- Service Delivery will reserve the right to disable this policy for emergency corrective actions. +- Service Delivery should constrain permit & exclude configurations based on their internal policies. + For example, customers may be forced to enable permissive windows which amount to at least 6 hours a month. ### Implementation Details/Notes/Constraints #### ChangeManagement Stanza -The change management stanza will be introduced into ClusterVersion and MachineConfigPool (for standalone profiles) -and HostedCluster and NodePool (for HCP profiles). The structure of the stanza is: - -```yaml -spec: - changeManagement: - # The active strategy for change management (unless disabled by disabledUntil). - strategy: - - # If set to "true" or a future date (represented as string), the effective change - # management strategy is Disabled. Date must be RFC3339. - disabledUntil: - - # If set to "true" or a future date (represented as string), all strategies other - # than Disabled are paused. Date must be RFC3339. - pausedUntil: - - # If a strategy needs additional configuration information, it can read a - # key bearing its name in the config stanza. - config: - : - ...configuration policy for the strategy... -``` +The `spec.changeManagement` stanza will be introduced into ClusterVersion and MachineConfigPool (for standalone profiles) +and HostedCluster and NodePool (for HCP profiles). #### MaintenanceSchedule Strategy Configuration +When a `ChangeManagementPolicy` is defined to use the `MaintenanceSchedule` strategy, +a `maintenanceSchedule` stanza should also be provided to configure the strategy +(if it is not, the policy is functionally identical to `Restrictive`). + ```yaml +kind: ChanageManagementPolicy spec: - changeManagement: - strategy: MaintenanceSchedule - config: - maintenanceSchedule: - # Specifies a reoccurring permissive window. - permit: - # RRULEs (https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) are commonly used - # for calendar management metadata. Only a subset of the RFC is supported. - # See "RRULE Constraints" section for details. - # If unset, all dates are permitted and only exclude constrains permissive windows. - recurrence: - # Given the identification of a date by an RRULE, at what time (always UTC) can the - # permissive window begin. "00:00" if unset. - startTime: - # Given the identification of a date by an RRULE, after what offset from the startTime should - # the permissive window close. This can create permissive windows within days that are not - # identified in the RRULE. 
For example, recurrence="FREQ=Weekly;BYDAY=Sa;", - # startTime="20:00", duration="8h" would permit material change initiation starting - # each Saturday at 8pm and continuing through Sunday 4am (all times are UTC). The default - # duration is 24:00-startTime (i.e. to the end of the day). - duration: - - - # Excluded date ranges override RRULE selections. - exclude: - # Dates should be specified in YYYY-MM-DD. Each date is excluded from 00:00 UTC for 24 hours. - - fromDate: - # Non-inclusive until. If null, until defaults to the day after from (meaning a single day exclusion). - untilDate: - # Provide additional detail for status when the cluster is within an exclusion period. - reason: Optional human readable which will be included in status description. - + strategy: MaintenanceSchedule + + maintenanceSchedule: + # Specifies a reoccurring permissive window. + permit: + # RRULEs (https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) are commonly used + # for calendar management metadata. Only a subset of the RFC is supported. + # See "RRULE Constraints" section for details. + # If unset, all dates are permitted and only exclude constrains permissive windows. + recurrence: + # Given the identification of a date by an RRULE, at what time (always UTC) can the + # permissive window begin. "00:00" if unset. + startTime: + # Given the identification of a date by an RRULE, after what offset from the startTime should + # the permissive window close. This can create permissive windows within days that are not + # identified in the RRULE. For example, recurrence="FREQ=Weekly;BYDAY=Sa;", + # startTime="20:00", duration="8h" would permit material change initiation starting + # each Saturday at 8pm and continuing through Sunday 4am (all times are UTC). The default + # duration is 24:00-startTime (i.e. to the end of the day). + duration: + + + # Excluded date ranges override RRULE selections. + exclude: + # Dates should be specified in YYYY-MM-DD. Each date is excluded from 00:00 UTC for 24 hours. + - fromDate: + # Non-inclusive until. If null, until defaults to the day after from (meaning a single day exclusion). + untilDate: + # Provide additional detail for status when the cluster is within an exclusion period. + reason: Optional human readable which will be included in status description. ``` -Permitted times (i.e. times at which the strategy enforcement state can be permissive) are specified using a +Permissive windows are specified using a subset of the [RRULE RFC5545](https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) and, optionally, a -starting and ending time of day. https://freetools.textmagic.com/rrule-generator is a helpful tool to +starting time and duration. https://freetools.textmagic.com/rrule-generator is a helpful tool to review the basic semantics RRULE is capable of expressing. https://exontrol.com/exicalendar.jsp?config=/js#calendar offers more complex expressions. @@ -875,56 +977,60 @@ A valid RRULE for change management: - cannot specify a permissive window more than 2 years away. 
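+
+For illustration only, the following sketch (it assumes the supported RRULE subset includes monthly
+`BYDAY` rules such as `1SA`; the object name is hypothetical) expresses a policy that permits changes
+for eight hours starting at 20:00 UTC on the first Saturday of each month, except during an end-of-year
+change freeze:
+
+```yaml
+kind: ChangeManagementPolicy
+metadata:
+  name: worker-first-saturday
+spec:
+  strategy: MaintenanceSchedule
+  maintenanceSchedule:
+    permit:
+      # First Saturday of each month.
+      recurrence: "FREQ=MONTHLY;BYDAY=1SA"
+      # The permissive window opens at 20:00 UTC...
+      startTime: "20:00"
+      # ...and closes 8 hours later (04:00 UTC Sunday).
+      duration: "8h"
+    exclude:
+      # Material changes are not initiated during the freeze, even if a permissive
+      # window from the recurrence falls inside this range.
+      - fromDate: "2024-12-21"
+        untilDate: "2025-01-02"
+        reason: "End of year change freeze"
+```
+
+A resource whose `spec.changeManagement.policy` references this object would only have material changes
+initiated inside those windows.
+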
**Overview of Interactions** -The MaintenanceSchedule strategy, along with `changeManagement.pausedUntil` allows a cluster lifecycle administrator to express +The MaintenanceSchedule strategy, along with `.spec.changeManagement.pausedUntil` allows a cluster lifecycle administrator to express one of the following: | pausedUntil | permit | exclude | Enforcement State (Note that **effective** state must also take into account hierarchy) | |----------------|--------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `null`/`false` | `null` | `null` | Permissive indefinitely | -| `true` | * | * | Paused indefinitely | -| `null`/`false` | set | `null` | Permissive during reoccurring windows time. Paused at all other times. | -| `null`/`false` | set | set | Permissive during reoccurring windows time modulo excluded date ranges during which it is paused. Paused at all other times. | -| `null`/`false` | `null` | set | Permissive except during excluded dates during which it is paused. | +| `nil` | `null` | `null` | Restrictive indefinitely | +| `nil` | set | `null` | Permissive during reoccurring windows time. Paused at all other times. | +| `nil` | set | set | Permissive during reoccurring windows time modulo excluded date ranges during which it is paused. Paused at all other times. | +| `nil` | `null` | set | Permissive except during excluded dates during which it is paused. | | date | * | * | Honor permit and exclude values, but only after the specified date. For example, permit: `null` and exclude: `null` implies the strategy is indefinitely permissive after the specified date. | +| `true` | * | * | Paused/restrictive indefinitely | -#### MachineConfigPool Assisted Strategy Configuration +#### CLI Assisted Rollout Scenarios +Once this proposal has been implemented, it is expected that `oc` will be enhanced to permit users to +trigger assisted rollouts scenarios for worker-nodes (i.e. where they control the timing of a rollout +completely but progress is made using MCO automation). -```yaml -spec: - changeManagement: - strategy: Assisted - config: - assisted: - permit: - # The assisted strategy will allow the MCO to process any rendered configuration - # that was created before the specified datetime. - renderedConfigsBefore: - # When AllowSettings, rendered configurations after the preceding before date - # can be applied if and only if they do not contain changes to osImageURL. - policy: "AllowSettings|AllowNone" -``` - -The primary user of this strategy is `oc` with tentatively planned enhancements to include verbs -like: +`oc adm update` will be enhanced with `worker-node` subverbs to initiate and pause MCO work. ```sh $ oc adm update worker-nodes start ... -$ oc adm update worker-nodes pause ... +$ oc adm update worker-nodes pause/resume ... $ oc adm update worker-nodes rollback ... ``` -These verbs can leverage the assisted strategy and `pausedUntil` to allow the manual initiation of worker-nodes -updates after a control-plane update. +These verbs can leverage `MachineConfigPool.spec.changeManagement` to achieve their goals. 
+For example, if the MCP is as follows: -#### MachineConfigPool Manual Strategy Configuration +```yaml +kind: MachineConfigPool +spec: + machineConfig: + name: + validation: Default +``` + +- `worker-nodes start` can set a target `spec.machineConfig.name` to initiate progress toward a new update's + recently rendered `MachineConfig`. +- `worker-nodes pause/resume` can toggle `spec.changeManagement.pausedUntil=true/false`. +- `worker-nodes rollback` can restore a previous `spec.machineConfig.name` is backed up in an MCP annotation. + +#### Manual Rollout Scenario + +Cluster lifecycle administrators desired yet more control can initiate node updates & drains themselves. ```yaml +kind: MachineConfigPool spec: changeManagement: - strategy: Manual - config: - manual: - desiredConfig: + pausedUntil: true + + machineConfig: + name: + validation: None ``` The manual strategy requests no automated initiation of updates. New and rebooting @@ -943,7 +1049,6 @@ Value: - `0`: no material changes are pending. - `1`: changes are pending but being initiated. - `2`: changes are pending and blocked based on this resource's change management policy. -- `3`: changes are pending and blocked based on another resource in the change management hierarchy. `change_management_next_change_eta` Labels: @@ -954,8 +1059,9 @@ Labels: Value: - `-2`: Error determining the time at which changes can be initiated (e.g. cannot check with ClusterVersion / change management hierarchy). - `-1`: Material changes are paused indefinitely. -- `0`: Material changes can be initiated now (e.g. change management is disabled or inside machine schedule window). Alternatively, time is not relevant to the strategy (e.g. assisted strategy). -- `> 0`: The number seconds remaining until changes can be initiated OR 1000*24*60*60 (1000 days) if no permissive window can be found within the next 1000 days (this ensures a brute force check of intersecting datetimes with hierarchy RRULEs is a valid method of calculating intersection). +- `0`: Material changes can be initiated now (e.g. change management is disabled or inside machine schedule window). + Alternatively, time is not relevant to the strategy (e.g. restrictive). +- `> 0`: The number seconds remaining until changes can be initiated. `change_management_permissive_remaining` Labels: @@ -965,9 +1071,10 @@ Labels: Value: - `-2`: Error determining the time at which current permissive window will close. -- `-1`: Material changes are permitted indefinitely (e.g. `strategy: disabled`). -- `0`: Material changes are not presently permitted (i.e. the cluster is outside of a permissive window). Alternatively, time is not relevant to the strategy (e.g. assisted strategy). -- `> 0`: The number seconds remaining in the current permissive change window (or the equivalent of 1000 days if end of window cannot be computed). +- `-1`: Material changes are permitted indefinitely (e.g. `strategy: Permissive`). +- `0`: Material changes are not presently permitted (i.e. the cluster is outside of a permissive window). + Alternatively, time is not relevant to the strategy (e.g. restrictive strategy). +- `> 0`: The number seconds remaining in the current permissive change window. `change_management_last_change` Labels: @@ -978,38 +1085,18 @@ Labels: Value: - `-1`: Datetime unknown. - `0`: Material changes are currently permitted. -- `> 0`: The number of seconds which have elapsed since the material changes were last permitted or initiated (for non-time based strategies). 
+- `> 0`: The number of seconds which have elapsed since the material changes were last permitted or initiated. `change_management_strategy_enabled` Labels: - kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool - object= - system= -- strategy=MaintenanceSchedule|Manual|Assisted +- strategy=MaintenanceSchedule|Permissive|Restrictive Value: -- `0`: Change management for this resource is not subject to this enabled strategy (**does** consider hierarchy based disable). -- `1`: Change management for this resource is directly subject to this enabled strategy. -- `2`: Change management for this resource is indirectly subject to this enabled strategy (i.e. only via control-plane override hierarchy). -- `3`: Change management for this resource is directly and indirectly subject to this enabled strategy. - -#### Change Management Status -Each resource which exposes a `.spec.changeManagement` stanza must also expose `.status.changeManagement` . - -```yaml -status: - changeManagement: - # Always show control-plane level strategy. Disabled if disabledUntil is "true". - clusterStrategy: - # If this a worker-node related resource (e.g. MCP), show local strategy. Disabled if disabledUntil is "true". - workerNodeStrategy: - # Show effective state. - effectiveState: - description: "Human readable message explaining how strategies & configuration are resulting in the effective state." - # The start of the next permissive window, taking into account the hierarchy. "N/A" for indefinite pause. - permitChangesETA: - changesPending: -``` +- `0`: Change management for this resource is not subject to this enabled strategy. +- `1`: Change management for this resource is subject to this enabled strategy. #### Change Management Bypass Annotation In some situations, it may be necessary for a MachineConfig to be applied regardless of the active change @@ -1023,46 +1110,15 @@ the remainder of the control-plane update will be treated as a single material c ### Special Handling #### Change Management on Master MachineConfigPool -In order to allow control-plane updates as a single material change, the MCO will only honor change the management configuration for the +In order to allow control-plane updates as a single material change, the MCO will only honor change management configuration for the master MachineConfigPool if user generated MachineConfigs are the cause for a pending change. To accomplish this, at least one MachineConfig updated by the CVO will have the `machineconfiguration.openshift.io/bypass-change-management` annotation indicating that changes in the MachineConfig must be acted upon irrespective of the master MCP change management policy. -#### Limiting Overlapping Window Search / Permissive Window Calculation -An operator implementing change management for a worker-node related resource must honor the change management hierarchy when -calculating when the next permissive window will occur (called elsewhere in the document, ETA). This is not -straightforward to compute when both the control-plane and worker-nodes have independent MaintenanceSchedule -configurations. - -We can, however, simplify the process by reducing the number of days in the future the algorithm must search for -coinciding permissive windows. 1000 days is a proposed cut-off. - -To calculate coinciding windows then, the implementation can use [rrule-go](https://github.com/teambition/rrule-go) -to iteratively find permissive windows at the cluster / control-plane level. 
These can be added to an -[interval-tree](https://github.com/rdleal/intervalst) . As dates are added, rrule calculations for the worker-nodes -can be performed. The interval-tree should be able to efficiently determine whether there is an -intersection between the permissive intervals it has stored for the control-plane and the time range tested for the -worker-nodes. - -Since it is possible there is no overlap, limits must be placed on this search. Once dates >1000 days from -the present moment are being tested, the operator can behave as if the next window will occur in -1000 days (prevents infinite search for overlap). - -This outcome does not need to be recomputed unless the operator restarts Or one of the RRULE involved -is modified. - -If an overlap _is_ found, no additional intervals need to be added to the tree and it can be discarded. -The operator can store the start & end datetimes for the overlap and count down the seconds remaining -until it occurs. Obviously, this calculation must be repeated: -1. If either MaintenanceSchedule configuration is updated. -1. The operator is restarted. -1. At the end of a permissive window, in order to determine the next permissive window. - - #### Service Delivery Option Sanitization It is obvious that the range of flexible options provided by change management configurations offers can create risks for inexperienced cluster lifecycle administrators. For example, setting a -standalone cluster to use the assisted strategy and failing to trigger worker-node updates will +standalone cluster to use the restrictive strategy and failing to trigger worker-node updates will leave unpatched CVEs on worker-nodes much longer than necessary. It will also eventually lead to the need to resolve version skew (Upgradeable=False will be reported by the API cluster operator). @@ -1084,23 +1140,26 @@ schedule. In other words, it can be applied immediately, even outside a permissi ### Risks and Mitigations -- Given the range of operators which must implement support for change management, inconsistent behavior or reporting may make it difficult for users to navigate different profiles. +- Given the range of operators which must implement support for change management, inconsistent behavior or reporting + may make it difficult for users to navigate different profiles. - Mitigation: A shared library should be created and vendored for RRULE/exclude/next window calculations/metrics. -- Users familiar with the fully self-managed nature of OpenShift are confused by the lack of material changes be initiated when change management constraints are active. - - Mitigation: The introduction of change management will not change the behavior of existing clusters. Users must make a configuration change. +- Users familiar with the fully self-managed nature of OpenShift are confused by the lack of material changes be + initiated when change management constraints are active. + - Mitigation: The introduction of change management will not change the behavior of existing clusters. + Users must make a configuration change. - Users may put themselves at risk of CVEs by being too conservative with worker-node updates. -- Users leveraging change management may be more likely to reach unsupported kubelet skew configurations vs fully self-managed cluster management. +- Users leveraging change management may be more likely to reach unsupported kubelet skew configurations + vs fully self-managed cluster management. 
### Drawbacks The scope of the enhancement - cutting across several operators requires multiple, careful implementations. The enhancement also touches code paths that have been refined for years which assume a fully self-managed cluster approach. Upsetting these -code paths prove challenging. +code paths may prove challenging. ## Open Questions [optional] 1. Can the HyperShift Operator expose a metric for when changes are pending for a subset of worker nodes on the cluster if it can only interact via CAPI resources? -2. Can the MCO interrogate the ClusterVersion change management configuration in order to calculate overlapping permissive intervals in the future? ## Test Plan @@ -1178,7 +1237,7 @@ Operators implementing change management for their resources will not face any new _internal_ version skew complexities due to this enhancement, but change management does increase the odds of prolonged and larger differential kubelet version skew. -For example, particularly given the Manual or Assisted change management strategy, it +For example, particularly given long-lived Restrictive strategies, it becomes easier for a cluster lifecycle administrator to forget to update worker-nodes along with updates to the control-plane. @@ -1250,38 +1309,26 @@ without causing further disruption. This is not a moment when you want to be fig a date string, calculating timezones, or copying around cluster configuration so that you can restore it after you stop the bleeding. -### Implement change control, but do not implement the Manual and/or Assisted strategy for MachineConfigPool +### Implement change management, but only support a MaintenanceSchedule Major enterprise users of our software do not update on a predictable, recurring window of time. Updates require laborious certification processes and testing. Maintenance schedules will not serve these customers well. However, these customers may still benefit significantly from the change management concept -- unexpected / disruptive worker node drains and reboots have bitten even experienced OpenShift operators (e.g. a new MachineConfig being contributed via gitops). -These strategies inform decision-making through metrics and provide facilities for fine-grained control +Alternative strategies inform decision-making through metrics and provide facilities for fine-grained control over exactly when material change is rolled out to a cluster. -The Assisted strategy is also specifically designed to provide a foundation for +The "assisted" rollout scenario described in this proposal is also specifically designed to provide a foundation for the forthcoming `oc adm update worker-nodes` verbs. After separating the control-plane and worker-node update phases, these verbs are intended to provide cluster lifecycle administrators the ability to easily start, pause, cancel, and even rollback worker-node changes. Making accommodations for these strategies should be a subset of the overall implementation of the MaintenanceSchedule strategy and they will enable a foundation for a range of -different persons not served by MaintenanceSchedule. +different personas not served by MaintenanceSchedule. ### Use CRON instead of RRULE The CRON specification is typically used to describe when something should start and does not imply when things should end. CRON also cannot, in a standard way, express common semantics like "The first Saturday of every month." - -### Use a separate CRD instead of updating ClusterVersion, MCP, ... 
-In this alternative, we introduce a CRD separate from ClusterVersion, MCP, HostedCluster, and NodePool. For example -an independent `UpdatePolicy` CRD where administrator preferences can be captured. This approach [was explored](https://github.com/jupierce/oc-update-design-ideas/commit/a6364ee2f2c1ebf84ed6d50bc277f9736bf793bd). -Ultimately, it felt less intuitive across profiles. Consider a CRD called `UpdatePolicy` that tries to be a central config. - -1. Is it namespaced? If it is, that feels odd for an object that that should control the cluster. If it is not namespaced, the policy feels misplaced for HCP HostedCluster resources which live in a namespace. -1. Where does someone check the status of the policy (e.g. when the next update is going to be possible on a given MCP?). If it is the UpdatePolicy object, you have multiple independent controllers controlling the status, which is an antipattern. If the UpdatePolicy controls multiple different MCPs differently, how do they independently report their status'? It also introduces the problem of having to look at multiple places (MCP and UpdatePolicy to understand what may be happening. -1. As you pack policies for control-plane, MCPs, HCP primitives into an expressive UpdatePolicy, the schema options varied from too complex, to error prone, to overly abstract, to overly limiting. -1. If you split the policy object up to simplify it, e.g. one for each MCP, you have the problem of associating it to the right MCP unambiguously. Planting the policy in the MCP itself solves this problem. - -In summary, it is possible, but the working group didn't believe the alternative was as elegant or user friendly as housing the policies directly in the resources they control. \ No newline at end of file From e1092ab020e4522119fe3a334b51c4997a6e1353 Mon Sep 17 00:00:00 2001 From: Justin Pierce Date: Thu, 12 Sep 2024 19:49:09 -0400 Subject: [PATCH 12/12] Review inspired additions --- ...ge-management-and-maintenance-schedules.md | 22 ++++++++++++++----- 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/enhancements/update/change-management-and-maintenance-schedules.md b/enhancements/update/change-management-and-maintenance-schedules.md index e6f43090e1..57a6872a7a 100644 --- a/enhancements/update/change-management-and-maintenance-schedules.md +++ b/enhancements/update/change-management-and-maintenance-schedules.md @@ -373,9 +373,12 @@ risks and disruption when rolling out changes to their environments. ## Proposal ### Change Management Overview -Establish a new namespaced Custom Resource Definition (CRD) called `ChangeManagementPolicy` which allows cluster lifecycle +Establish a new, namespaced Custom Resource Definition (CRD) called `ChangeManagementPolicy` which allows cluster lifecycle administrators to capture their requirements for when resource(s) associated with the policy can initiate material -changes on the cluster. +changes on the cluster. The new resource is namespaced because it will be used in both Standalone and Hosted Control +Plane (HCP) environments. In HCP, `HostedCluster` (representing hosted control-plane information) and +`NodePool` (representing worker-node information) CRDs are namespaced on the management cluster. It is natural then, +for these namespaced resources to refer to a `ChangeManagementPolicy` in the same namespace. 
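+
+As an illustrative sketch only (the `clusters` namespace and object names are hypothetical, and unrelated
+fields are elided), a `NodePool` on the management cluster could reference a policy in its own namespace:
+
+```yaml
+kind: ChangeManagementPolicy
+metadata:
+  # Same namespace as the HostedCluster/NodePool on the management cluster.
+  namespace: clusters
+  name: hcp-worker-maintenance
+spec:
+  strategy: MaintenanceSchedule
+  ...
+---
+kind: NodePool
+metadata:
+  namespace: clusters
+  name: example-nodepool
+spec:
+  ...
+  changeManagement:
+    policy:
+      # Shown explicitly for clarity; see the namespace defaulting rules in the schema below.
+      namespace: clusters
+      name: hcp-worker-maintenance
+```
+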
Add a `changeManagement` stanza to several existing resources in the OpenShift ecosystem which can reference the new `ChangeManagementPolicy` resource to restrict when and how their associated controllers can initiate @@ -403,14 +406,14 @@ kind: ChangeManagementPolicy metdata: # ChangeManagementPolicies are namespaced resources. They will normally reside in the # namespace associated with the controller initiating material changes. - # For example, in Standalone namespace/openshift-machine-config for the MCO and + # For example, in Standalone namespace/openshift-machine-config-operator for the MCO and # namespace/openshift-cluster-version for the CVO. # For HCP, the ChangeManagementPolicies for will reside in the same namespace as the # HostedCluster resource. # This namespace can be overridden in resources being constrained by a ChangeManagementPolicy # but RBAC for the resource's controller must permit reading the object from the non-default # namespace. - namespace: openshift-machine-config + namespace: openshift-machine-config-operator name: example-policy spec: @@ -508,7 +511,7 @@ spec: policy: # Namespace is optional. If not specified, the controller assumes the namespace in # which the controller is running. - namespace: openshift-machine-config + namespace: openshift-machine-config-operator # The name of the ChangeManagementPolicy. name: example-policy ``` @@ -1099,7 +1102,7 @@ Value: - `1`: Change management for this resource is subject to this enabled strategy. #### Change Management Bypass Annotation -In some situations, it may be necessary for a MachineConfig to be applied regardless of the active change +In some situations, it may be desirable for a MachineConfig to be applied regardless of the active change management policy for a MachineConfigPool. In such cases, `machineconfiguration.openshift.io/bypass-change-management` can be set to any non-empty string. The MCO will progress until MCPs which select annotated MachineConfigs have all machines running with a desiredConfig containing that MachineConfig's current state. @@ -1107,6 +1110,13 @@ MachineConfigs have all machines running with a desiredConfig containing that Ma This annotation will be present on `00-master` to ensure that, once the CVO updates the MachineConfig, the remainder of the control-plane update will be treated as a single material change. +Rolling out critical machine config changes for worker nodes also is made easier with this annotation. +Instead of, for example, trying to predict a `disabledUntil` date, an SRE/operations team can use this +annotation to specify their goal with precision. Compare this with `disabledUntil`, which +(a) an operations team would generally overestimate in order to ensure that updates complete and +(b) may cause subsequent, non-critical, machine configuration changes to cause further workload disruption +(node reboots) that are unwarranted. + ### Special Handling #### Change Management on Master MachineConfigPool