Incorporate review feedback
- All times are now UTC.
- Clarity on not defining system state -- only initiation.
- Change from "endTime" to "duration" to easily create windows that span
  days.
Justin Pierce committed Mar 8, 2024
1 parent bdb8ffd commit 4415b72
Showing 1 changed file with 30 additions and 18 deletions.
48 changes: 30 additions & 18 deletions enhancements/update/change-management-and-maintenance-schedules.md
@@ -35,6 +35,12 @@ material changes are completed by the close of a permitted change window (e.g. a
may still be draining or rebooting) at the close of a maintenance schedule,
but it does prevent _additional_ material changes from being initiated.

Change management enforcement _does not_ attempt to define or control the detailed state of the
system. It only pertains to whether controllers which support change management
will attempt to initiate material change themselves. For example, if changes are paused in the middle
of a cluster update and a node is manually rebooted, change management does not define
whether the node will rejoin the cluster with the new or old version.

A "material change" may vary by cluster profile and subsystem. For example, a
control-plane update (all components and control-plane nodes updated) is implemented as
a single material change (e.g. the close of a scheduled permissive window
@@ -186,7 +192,8 @@ configuring platform resources as the top-level Maintenance Schedule control wil
that potentially disruptive changes are limited to well known time windows.

#### Reducing Service Delivery Operational Tooling
Service Delivery, as part of our OpenShift Dedicated, ROSA and other offerings is keenly aware of
Service Delivery, operating Red Hat's Managed OpenShift offerings (OpenShift Dedicated (OSD),
Red Hat OpenShift Service on AWS (ROSA), and Azure Red Hat OpenShift (ARO)), is keenly aware of
the issues motivating the Change Management / Maintenance Schedule concept. This is evidenced by their design
and implementation of tooling to fill the gaps in the platform the preceding sections
suggest exist.
@@ -198,7 +205,7 @@ there are reasons to supersede the customer's preference).

By acknowledging the need for scheduled maintenance in the platform, we reduce the need for Service
Delivery to develop and maintain custom tooling to manage the platform while
simultaneously reducing simplifying management for all customer facing similar challenges.
simultaneously simplifying management for all customers facing similar challenges.

### User Stories
For readability, "cluster lifecycle administrator" is used repeatedly in the user stories. This
@@ -524,7 +531,7 @@ perspective, this strategy reports as paused indefinitely.
1. The MCO, irrespective of pending changes, exposes a metric for each MCP to indicate the number of seconds remaining until it is
permitted to initiate changes to nodes in that MCP.
1. A privileged user on the cluster notices different options available for `changeManagement` in the ClusterVersion and MachineConfigPool
resources. They try to set them but are prevented by either RBAC or an admission webhook (details for Service Delivery). If they wish
resources. They try to set them but are prevented by a validating admission controller. If they wish
to change the settings, they must update them through OCM.
1. The privileged user does an `oc describe ...` on the resources. They can see that material changes are pending in ClusterVersion for
the control-plane and for worker machine config. They can also see the date and time that the next material change will be permitted.
@@ -592,7 +599,7 @@ perspective, this strategy reports as paused indefinitely.
1. SRE can address the issue with a system configuration file applied in a MachineConfig.
1. SRE creates the MachineConfig for the customer and provides the customer the option to either (a) wait until their
configured maintenance schedule permits the material change to be initiated by the MachineConfigOperator
or (b) having SRE override the maintenance schedule and permitting its immediate application.
or (b) modify change management to permit immediate application (e.g. setting `disabledUntil`).
1. The problem is not pervasive, so the customer chooses the deferred remediation.
1. The change is initiated and nodes are rebooted during the next permissive window.

@@ -742,27 +749,29 @@ spec:
# Specifies a reoccurring permissive window.
permit:
# RRULEs (https://www.rfc-editor.org/rfc/rfc5545#section-3.3.10) are commonly used
# for calendar management metadata. Only a subset of the RFC is supported. If
# unset, all dates are permitted and only exclude constrains permissive windows.
# for calendar management metadata. Only a subset of the RFC is supported.
# See "RRULE Constraints" section for details.
# If unset, all dates are permitted and only exclude constrains permissive windows.
recurrence: <rrule|null>
# Given the identification of a date by an RRULE, at what time (relative to timezoneOffset) can the
# Given the identification of a date by an RRULE, at what time (always UTC) can the
# permissive window begin. "00:00" if unset.
startTime: <time-of-day|null>
# Given the identification of a date by an RRULE, at what time (relative to timezoneOffset) should the
# permissive window end. "23:59:59" if unset.
endTime: <time-of-day|null>
# Given the identification of a date by an RRULE, after what offset from the startTime should
# the permissive window close. This can create permissive windows within days that are not
# identified in the RRULE. For example, recurrence="FREQ=WEEKLY;BYDAY=SA",
# startTime="20:00", duration="8h" would permit material change initiation starting
# each Saturday at 8pm and continuing through Sunday 4am (all times are UTC). The default
# duration is 24:00-startTime (i.e. to the end of the day).
duration: <duration|null>
# Excluded date ranges override RRULE selections.
exclude:
# Dates should be specified in YYYY-MM-DD. Each date is excluded from 00:00<timezoneOffset> for 24 hours.
# Dates should be specified in YYYY-MM-DD. Each date is excluded from 00:00 UTC for 24 hours.
- fromDate: <date>
# Non-inclusive until. If null, until defaults to the day after from (meaning a single day exclusion).
untilDate: <date|null>
# Specifies an RFC3339 style timezone offset to be applied across their datetime selections.
# "-07:00" indicates negative 7 hour offset from UTC. "+03:00" indicates positive 3 hour offset. If not set, defaults to "+00:00" (UTC).
timezoneOffset: <null|str>
```
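
The `startTime` / `duration` semantics described in the comments above can be sketched in a few lines. This is only an illustration, not the controllers' implementation: the helper names `permissive_window` and `is_excluded` are hypothetical, durations are passed as `timedelta` values rather than parsed from strings such as `"8h"`, and how an exclusion interacts with a window that spills into the following day is not covered here.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Optional, Tuple


def permissive_window(day: date, start_time: str = "00:00",
                      duration: Optional[timedelta] = None) -> Tuple[datetime, datetime]:
    """Return (opens, closes) in UTC for a day selected by the RRULE.

    Hypothetical helper: start_time is "HH:MM" and is always interpreted as UTC.
    """
    hour, minute = (int(part) for part in start_time.split(":"))
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    opens = midnight + timedelta(hours=hour, minutes=minute)
    if duration is None:
        # Default duration is 24:00 - startTime, i.e. to the end of the selected day.
        closes = midnight + timedelta(days=1)
    else:
        # An explicit duration may carry the window into days the RRULE did not select.
        closes = opens + duration
    return opens, closes


def is_excluded(moment: datetime, from_date: date,
                until_date: Optional[date] = None) -> bool:
    """Exclusions begin at 00:00 UTC on from_date; until_date is non-inclusive."""
    if until_date is None:
        # A null untilDate means a single-day exclusion.
        until_date = from_date + timedelta(days=1)
    return from_date <= moment.astimezone(timezone.utc).date() < until_date


# Saturday 2024-03-09 with startTime="20:00" and an 8 hour duration.
opens, closes = permissive_window(date(2024, 3, 9), "20:00", timedelta(hours=8))
print(opens.isoformat(), closes.isoformat())
```

Expressing the close of the window as an offset from `startTime` is what lets a single Saturday selection cover early Sunday without adding Sunday to the RRULE.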

Permitted times (i.e. times at which the strategy enforcement state can be permissive) are specified using a
@@ -776,10 +785,10 @@ RRULE supports expressions that suggest recurrence without implying an exact dat
- `RRULE:FREQ=YEARLY` - An event that occurs once a year on a specific date.
- `RRULE:FREQ=WEEKLY;INTERVAL=2` - An event that occurs every two weeks.

All such expressions shall be evaluated with a starting date of Jan 1st, 1970 00:00<timezoneOffset>. In other
All such expressions shall be evaluated with a starting date of Jan 1st, 1970 00:00Z. In other
words, `RRULE:FREQ=YEARLY` would be considered permissive, for one day, at the start of each new year.

If no `startTime` or `endTime` is specified, any day selected by the RRULE will suggest a
If no `startTime` or `duration` is specified, any day selected by the RRULE will suggest a
permissive 24h window unless a date is in the `exclude` ranges.
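
To make the fixed starting date concrete, here is a minimal sketch of how the two example expressions could be evaluated against Jan 1st, 1970 00:00Z. It assumes that a rule without `BYDAY` recurs on the starting date's weekday or calendar date, as in RFC 5545; the function names are hypothetical and only these two rule shapes are handled.

```python
from datetime import date

# Starting date stated in the text; 1970-01-01 was a Thursday.
EPOCH = date(1970, 1, 1)


def selected_by_weekly(day: date, interval: int = 1) -> bool:
    """True if `day` is an occurrence of FREQ=WEEKLY;INTERVAL=<interval> anchored at EPOCH."""
    return (day - EPOCH).days % (7 * interval) == 0


def selected_by_yearly(day: date) -> bool:
    """True if `day` is an occurrence of FREQ=YEARLY anchored at EPOCH (every January 1st)."""
    return (day.month, day.day) == (EPOCH.month, EPOCH.day)


print(selected_by_weekly(date(2024, 3, 14), interval=2))
print(selected_by_yearly(date(2025, 1, 1)))
```

Because the anchor is fixed rather than taken from the time the resource was created, every cluster evaluating the same expression selects the same days.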

**RRULE Constraints**
@@ -854,6 +863,7 @@ Labels:
- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool
- object=<object-name>
- system=<control-plane|worker-nodes>

Value:
- `0`: no material changes are pending.
- `1`: changes are pending but being initiated.
@@ -865,6 +875,7 @@ Labels:
- kind=ClusterVersion|MachineConfigPool|HostedCluster|NodePool
- object=<object-name>
- system=<control-plane|worker-nodes>

Value:
- `-2`: Error determining the time at which changes can be initiated (e.g. cannot check with ClusterVersion / change management hierarchy).
- `-1`: Material changes are paused indefinitely OR no permissive window can be found within the next 1000 days (the latter ensures a brute force check of intersecting datetimes with hierarchy RRULEs is a valid method of calculating intersection).
@@ -877,14 +888,15 @@ Labels:
- object=<object-name>
- system=<control-plane|worker-nodes>
- strategy=MaintenanceSchedule|Manual|Assisted

Value:
- `0`: Change management for this resource is not subject to this enabled strategy (**does** consider hierarchy based disable).
- `1`: Change management for this resource is directly subject to this enabled strategy.
- `2`: Change management for this resource is indirectly subject to this enabled strategy (i.e. only via control-plane override hierarchy).
- `3`: Change management for this resource is directly and indirectly subject to this enabled strategy.
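
One way to read the four values above is as the sum of two independent flags, direct (1) and indirect (2). The sketch below is an interpretation of the list rather than anything the proposal states, and `strategy_metric_value` is a hypothetical name.

```python
def strategy_metric_value(direct: bool, indirect: bool) -> int:
    """Combine the two ways a resource can be subject to an enabled strategy.

    direct:   the strategy is enabled on the resource itself.
    indirect: the strategy applies only via the control-plane override hierarchy.
    """
    return (1 if direct else 0) + (2 if indirect else 0)


for direct in (False, True):
    for indirect in (False, True):
        print(direct, indirect, strategy_metric_value(direct, indirect))
```

Under this reading, an alert that should fire for any indirect enforcement would match the values `2` and `3`.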

#### Change Management Status
Each resource which exposes a `.spec.changeManagement` stanza should also expose `.status.changeManagement` .
Each resource which exposes a `.spec.changeManagement` stanza must also expose `.status.changeManagement`.

```yaml
status: