
dogswatch: Update Coordinator (Kubernetes Operator and Friends) #184

Closed
jahkeup opened this issue Aug 22, 2019 · 12 comments
Labels
type/enhancement New feature or request

jahkeup (Member) commented Aug 22, 2019

This is a digested design proposal for the collective "dogswatch" system: Thar update coordinators for workload orchestrators.

Overview / Problem Space

Upgrades have the potential to be disruptive in a general purpose cluster
running Thar that is configured to use phased upgrades. This is because the host
would not coordinate to drain its workload and would, instead, directly remove
itself from the compute pool. It is worth noting that this problem isn't unique
to contemporary orchestrators; it applies just as much to bespoke clustered
applications that have their own mechanisms for scaling up and down.

It is possible to reduce, or outright prevent, this impact by exposing the on-host
update controls to the orchestrator. The exact architecture would depend
on the orchestrator itself; a Kubernetes example architecture is outlined below.
These tools apply revisions in the way their respective policies
dictate and necessarily serve as the bridge between the host - using an
interface provided by Thar - and the orchestrator - using the primitives
provided there. Collectively, the implementations make up dogswatch and are
intended to resemble one another, as they'll be standardized around an update
interface with a small surface area.

Kubernetes

For Kubernetes, this design will closely resemble the one used for the
CoreOS Container Linux Update Operator,
taking hints and inspiration from Weaveworks'
Kured,
amongst other similar projects.

The primitives provided by Kubernetes naturally lead to the Operator pattern
pioneered by CoreOS, all the more so given the investments Kubernetes has made
since then.

Thar's Update Operator has two components:

  • dogswatch-controller - cluster-level coordinator
  • dogswatch-node - node-level agent

dogswatch-controller

The dogswatch-controller process watches each node for state changes and applies
administrator-supplied policy when deciding the next steps to take once an
update is available and ready to be applied on a node. The controller process
depends on the node process to gather and communicate the update metadata and
state it needs.

Kubernetes' annotation facilities are used to report state from the node (see
dogswatch-node), informing the controller of its state. The controller uses
these annotations to make a coordinated decision based on the configured policy.

Responsibilities:

  • watch nodes for annotation updates

dogswatch-node

The dogswatch-node process handles on-host signaling regarding the availability
and application of updates. This process does not directly control its workload,
nor does it act of its own accord to stop workloads that are running on the
host. Its primary function is to communicate state and respond to well-known
bidirectional indicators.
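
As a sketch of what those well-known indicators might look like, the node and controller could exchange state and intent through a small set of annotation keys alongside the scheduling label. The key names below are hypothetical and only illustrate the shape of the contract:

// Hypothetical keys for node <-> controller signaling; names are illustrative only.
package marker

const (
    // Label applied at service startup so the controller can select Thar nodes server-side.
    NodeLabel = "thar.amazonaws.com/dogswatch"

    // Node -> controller: what the host reports about itself.
    AnnotationUpdateAvailable = "thar.amazonaws.com/update-available" // "true" / "false"
    AnnotationNodeState       = "thar.amazonaws.com/node-state"       // "idle", "preparing", "rebooting", ...

    // Controller -> node: what the controller wants the host to do next.
    AnnotationIntent = "thar.amazonaws.com/intent" // "prepare", "apply", "boot-update"
)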

Policy

To start, a single policy will be provided: a configurable time window in which
to execute updates. There will not be a parallelism control to begin with, as
the logic should account for Pods' replication, with further configurables for
administrators to tune that behavior added later.
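
A minimal sketch of how such a time-window check could look follows; the configuration shape (daily window expressed as offsets from midnight UTC) is an assumption, not part of this proposal:

// Sketch of a maintenance-window check; field names are illustrative.
package policy

import "time"

// Window describes a daily maintenance window in which updates may proceed.
type Window struct {
    Start time.Duration // offset from midnight UTC, e.g. 2 * time.Hour
    End   time.Duration // offset from midnight UTC, e.g. 6 * time.Hour
}

// Permits reports whether an update may be started at time t.
func (w Window) Permits(t time.Time) bool {
    tu := t.UTC()
    midnight := time.Date(tu.Year(), tu.Month(), tu.Day(), 0, 0, 0, 0, time.UTC)
    offset := tu.Sub(midnight)
    if w.Start <= w.End {
        return offset >= w.Start && offset < w.End
    }
    // Window wraps past midnight, e.g. 22:00-02:00.
    return offset >= w.Start || offset < w.End
}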

Incremental Policies (in no specific order):

  • Update source/repository

  • Update wave/phasing

  • Rollback retries and failure tolerances

  • Rollback telemetry opt-in

Update Integration

It remains to be determined exactly what the mechanism and transport will be
to facilitate:

  • querying of update information
  • controls for update process (download, apply, etc.)
  • method dispatch

For contrast and consideration, the related projects have shown that the
components could:

  1. interface directly with the system using commands like yum update -y && reboot
  2. interface indirectly with the system using D-Bus and a daemon such as logind
  3. interface indirectly with another all-in-one update daemon over D-Bus or a socket

There's no obvious deficiency with 2 or 3, but 1 is infeasible and out of
line with Thar's core competencies and its tenets of being minimal and opaque to
its workloads. Both remaining options, 2 and 3, imply that a dedicated, well-defined,
and stable interface be exposed for dogswatch.

Methods (bare minimum):

  • query update information
  • commit update
  • reboot to update

Metadata:

  • boot data (partition details)

  • boot state (partition booted, update and rollback inferential data)

  • update data (update availability and details)

  • update state ("state machine" state in update flow)

There's another issue for discussing these details and their relevance; see #184.

Follow up work and questions

Cluster (by policy) configuration of updates:

The methods outlined in Update Integration offer the bare
minimum of functionality required to power dogswatch. Ideally, this
integration would be extended and enriched to allow cluster administrators to
control policy at the cluster and node level by ensuring nodes are correctly
configured to participate in the cluster's update strategy. This requires
one of the following in a future iteration:

  • Additional specialized API surface area

An additional adaptation of the on-host component to provide an appropriate set
of configurables to dogswatch-node for it to reconfigure update settings as
needed.

  • Exposed Thar API endpoint and client implementation

A dogswatch-node process would be given access to the Thar API socket to be
able to configure and commit update settings consistent with the cluster's
settings.

jahkeup changed the title from "Update Coordinator (Kubernetes Operator and Friends)" to "dogswatch: Update Coordinator (Kubernetes Operator and Friends)" on Aug 22, 2019
jahkeup (Member, Author) commented Aug 22, 2019

In the context of a Kubernetes Operator: I think we'll need to pair node labels applied at service startup with our annotations to scope queries and "watches" (with informers and/or listers) to these nodes. I wasn't able to get a watcher to filter well on annotations - you can filter on them client-side, but labels are filtered server-side - so pairing with labels lets the controller provide a selector addressing systems that report a designated label (like thar.amazonaws.com/dogswatch), which can then justifiably be expected to report update annotations at some point in their lifetime.
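
A sketch of that pairing with client-go, using the thar.amazonaws.com/dogswatch label mentioned above; the informer wiring is illustrative rather than a committed design:

// Sketch: server-side filtering of Node watches by label, with annotation
// handling done client-side in the event handlers.
package main

import (
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

func watchTharNodes(client kubernetes.Interface, stop <-chan struct{}) {
    factory := informers.NewSharedInformerFactoryWithOptions(
        client, 30*time.Second,
        // Labels filter server-side; annotations cannot, hence the pairing.
        informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
            opts.LabelSelector = "thar.amazonaws.com/dogswatch"
        }),
    )

    nodeInformer := factory.Core().V1().Nodes().Informer()
    nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(_, newObj interface{}) {
            node := newObj.(*corev1.Node)
            // Inspect update annotations reported by dogswatch-node here.
            _ = node.Annotations
        },
    })

    factory.Start(stop)
    factory.WaitForCacheSync(stop)
}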

sam-aws (Contributor) commented Aug 22, 2019

Methods (bare minimum):

  • query update information
  • commit update
  • reboot to update

1 & 2 (possibly 3) are definitely in the current scope of Updog.
Assuming something like dogswatch-controller is handling the update of the update metadata (bottlerocket-os/bottlerocket-update-operator#17) then Updog should be pretty Coordinator-agnostic. We'd just have something like dogswatch-node call Updog directly, which we could extend with commands like "get update info", "get update wave" etc to validate versions, timing, etc.

To start, a single policy will be provided: a configurable time window in which
to execute updates.

Is this something that would be covered via "wave" information in the update metadata? (e.g. #103 / bottlerocket-os/bottlerocket-update-operator#17)

jahkeup (Member, Author) commented Aug 22, 2019

Assuming something like dogswatch-controller is handling the update of the update metadata (bottlerocket-os/bottlerocket-update-operator#17) then Updog should be pretty Coordinator-agnostic.

Yes! This is the intended scoping in this design - dogswatch-node would, as you've noted, directly interface with the host side (be that updog or otherwise) and communicate host-level information to dogswatch-controller. It remains to be seen exactly what "primitives" are needed here. I think the list here covers the majority of its uses and can really be satisfied by adding subcommands or any other reachable interface. To be clear, these processes will be running inside of containers that are placed on nodes by the orchestrator, so whatever the interface is, it will have to be reachable and usable in the context of the container.

Is this something that would be covered via "wave" information in the update metadata?

Yes and no, at least in the sense that I envision the policies here being completely tunable by the cluster's administrators, who know their workload best and can specify a finer-grained schedule. This may be a subset of the wave phase or a superset - pending discussion and incidental data structuring.

I think the notion of a maintenance time window doesn't preclude the usefulness of the waves, in that the waves may still limit the extent of the rollout. The administrator's policy may cause a cluster to be even more conservative or block until a greater number of its hosts are "eligible" for update, which may or may not be desirable! Maybe these admins DO want their clusters to make as much update progress as they can during their maintenance windows for business reasons, or perhaps they even want to update a single box first with a bake time to gate the cluster-wide acceptance of an update.

jahkeup (Member, Author) commented Aug 23, 2019

Interesting, in my initial survey I missed that the CLUO implementation relies on both labels and annotations: https://github.com/coreos/container-linux-update-operator/blob/4bb1486f482bc9c365c71e126129e806b5a0fc97/pkg/constants/constants.go#L13-L15 .

The major difference between the usage in CLUO (where these are used for pinpointing peer nodes) and what's proposed here is that the Controller (Operator) in our case will be initiating updates rather than broadcasting and observing intent from a Pod's labels and annotations.

jahkeup (Member, Author) commented Aug 27, 2019

The DaemonSet used for dogswatch-node should be targeted at nodes running Thar; this would imply that the kubelet exposes some details about the host as labels to enable targeted scheduling. This setup could prime the details exposed about the host, which would then be maintained by dogswatch-node during its lifetime.
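
As an illustration, the DaemonSet's pod template could carry a nodeSelector so the scheduler only places dogswatch-node onto hosts advertising the Thar label. The label key/value and image reference below are placeholders, not decided names:

// Sketch of a DaemonSet spec that restricts dogswatch-node to labeled hosts.
package main

import (
    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func dogswatchNodeDaemonSet() *appsv1.DaemonSet {
    labels := map[string]string{"app": "dogswatch-node"}
    return &appsv1.DaemonSet{
        ObjectMeta: metav1.ObjectMeta{Name: "dogswatch-node"},
        Spec: appsv1.DaemonSetSpec{
            Selector: &metav1.LabelSelector{MatchLabels: labels},
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: labels},
                Spec: corev1.PodSpec{
                    // Only schedule onto nodes that advertise the Thar label
                    // applied at service startup (key and value are illustrative).
                    NodeSelector: map[string]string{
                        "thar.amazonaws.com/dogswatch": "true",
                    },
                    Containers: []corev1.Container{{
                        Name:  "dogswatch-node",
                        Image: "dogswatch:latest", // placeholder image reference
                    }},
                },
            },
        },
    }
}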

jahkeup (Member, Author) commented Aug 28, 2019

As part of dogswatch-node's responsibilities, the node process will need to be able to act on the host to make it progress with an update as directed by its dogswatch-controller overlord - the data suggested here is closely related to #103.

I propose that, for the near term, a utility be stood up that provides a stable interface for instrumenting updates on a Thar host. The set of functionality consolidates as much of the linear update process as possible while still retaining granular control over the progress made for a given update. In the implementation, these steps would represent a complete update wherein migrations are run at the appropriate stages and updates are applied to their respective partitions by their authoritative tools.


Interface

N.B. updatectl is a stand-in for the real tool, whatever dog it may be - though, this being a customer-facing utility, maybe it isn't a pooch?

Query Update state progression

To inspect the current state of affairs when --wait wasn't provided, the utility must be able to report its current activity and update application status:

updatectl status
{
    "status": "pending",
    "action": "boot-update",
    "id": "thar-aws-eks-x86_64-1.13.0-m1.20191006"
}

Query Update Information

updatectl list-available

This is roughly what's already being identified for use by the update system, augmented with metadata pertaining to the host's own update state.

{
    "schema": "1.0.0",
    "host": {
        "wave": "42"
    },
    "updates": [
        {
            "id": "thar-aws-eks-x86_64-1.13.0-m1.20191006",
            "applicable": false,

            "flavor": "thar-aws-eks",
            "arch": "x86_64",
            "version": "1.13.0",
            "status": "Ready",
            "max_version": "1.20.0",
            "waves": {
                "0": "2019-10-06T15:00:00Z",
                "500":"2019-10-07T15:00:00Z",
                "1024":"2019-10-08T15:00:00Z"
            },
            "images": {
                "boot": "stuff-boot-thar-aws-eks-1.13-m1.20191006.img",
                "root": "stuff-boot-thar-aws-eks-1.13-m1.20191006.img",
                "hash": "stuff-boot-thar-aws-eks-1.13-m1.20191006.img"
            }
        }
    ],
    "migrations": {
        "(0.1, 1.0)": ["migrate_1.0_foo"],
        "(1.0, 1.1)": ["migrate_1.1_foo", "migrate_1.1_bar"]
    },
    "datastore_versions": {
        "1.11.0": "0.1",
        "1.12.0": "1.0",
        "1.13.0": "1.1"
    }
}
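
For illustration, dogswatch-node could deserialize this document into types along these lines. This is only a sketch; the field names simply follow the JSON above and are not a committed schema:

// Sketch of types matching the list-available output shown above.
package updatectl

type Manifest struct {
    Schema            string              `json:"schema"`
    Host              Host                `json:"host"`
    Updates           []Update            `json:"updates"`
    Migrations        map[string][]string `json:"migrations"`
    DatastoreVersions map[string]string   `json:"datastore_versions"`
}

type Host struct {
    Wave string `json:"wave"`
}

type Update struct {
    ID         string            `json:"id"`
    Applicable bool              `json:"applicable"`
    Flavor     string            `json:"flavor"`
    Arch       string            `json:"arch"`
    Version    string            `json:"version"`
    Status     string            `json:"status"`
    MaxVersion string            `json:"max_version"`
    Waves      map[string]string `json:"waves"`
    Images     Images            `json:"images"`
}

type Images struct {
    Boot string `json:"boot"`
    Root string `json:"root"`
    Hash string `json:"hash"`
}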

There will need to be a unique token or composite token (a la N-V-R - let's call it F-A-V-R for $flavor-$arch-$version-$release, pronounced "favor") to address the update consistently without having to plumb the contents of the intended update through every invocation. So, a mutating update command would require an id obtained from updatectl query, passing that identifier to the subsequent calls. The FAVR would remain valid as long as it is available in a repository - the scenario where the update is yanked is yet to be handled within this scheme.
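
A small sketch of how such a composite token could be built and taken apart, splitting from the right since the flavor itself contains dashes; this is illustrative, not a committed format:

// Sketch of a FAVR ($flavor-$arch-$version-$release) identifier helper.
package favr

import (
    "fmt"
    "strings"
)

type FAVR struct {
    Flavor, Arch, Version, Release string
}

func (f FAVR) String() string {
    return fmt.Sprintf("%s-%s-%s-%s", f.Flavor, f.Arch, f.Version, f.Release)
}

// Parse splits from the right because the flavor (e.g. "thar-aws-eks")
// may itself contain dashes.
func Parse(id string) (FAVR, error) {
    parts := strings.Split(id, "-")
    if len(parts) < 4 {
        return FAVR{}, fmt.Errorf("malformed FAVR identifier: %q", id)
    }
    n := len(parts)
    return FAVR{
        Flavor:  strings.Join(parts[:n-3], "-"),
        Arch:    parts[n-3],
        Version: parts[n-2],
        Release: parts[n-1],
    }, nil
}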

Actions to be taken on an invalidated, half-committed update are yet to be defined.

Prepare host for an update

updatectl prepare-update --id $favr [--wait]
{
    "status": "prepared",
    "id": "thar-aws-eks-x86_64-1.13.0-m1.20191006"
}

Apply update to host

updatectl apply-update --id $favr [--wait]
{
    "status": "applied",
    "id": "thar-aws-eks-x86_64-1.13.0-m1.20191006"
}

Use update on next boot (and reboot)

updatectl boot-update --id $favr [--wait] [--reboot]
{
    "status": "bootable",
    "id": "thar-aws-eks-x86_64-1.13.0-m1.20191006"
}
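
Assuming the commands above end up invokable from inside the dogswatch-node container (a big assumption this issue still needs to settle), the node agent's drive loop could be as simple as shelling out in sequence. This is a sketch against the hypothetical updatectl interface, not a real tool:

// Sketch: dogswatch-node driving a host update via the (hypothetical)
// updatectl commands described above.
package main

import (
    "fmt"
    "os/exec"
)

func driveUpdate(id string) error {
    steps := [][]string{
        {"updatectl", "prepare-update", "--id", id, "--wait"},
        {"updatectl", "apply-update", "--id", id, "--wait"},
        // Reboot is last; the orchestrator should have drained the node first.
        {"updatectl", "boot-update", "--id", id, "--wait", "--reboot"},
    }
    for _, step := range steps {
        out, err := exec.Command(step[0], step[1:]...).CombinedOutput()
        if err != nil {
            return fmt.Errorf("%v failed: %v: %s", step, err, out)
        }
    }
    return nil
}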

cc: @sam-aws @tjkirch

sam-aws (Contributor) commented Aug 28, 2019

This all sounds about in line with my understanding, some thoughts:

Query Update state progression

Do we think the tool would be running on its own or only be called as required?

Query Update Information

The tool could just pass the metadata back directly I suppose, or maybe augment it as needed. Is there any extra information needed here that isn't mentioned in #103?

Use update on next boot (and reboot)

Not directly related but I suppose we should split out the update-image and update-boot-flags steps.

In general this all looks and smells like Updog - do we have scenarios that could influence whether we would need something else?

jahkeup (Member, Author) commented Aug 28, 2019

Do we think the tool would be running on its own or only be called as required?

It would primarily be run on demand, invoked by dogswatch-node in this case.

Is there any extra information needed here that isn't mentioned in #103?

I've added a few keys to encourage consistent usage and interpretation of the metadata:

In jq access paths, these are the keys added: .schema (for versioning), .host.wave, .updates[].id and .updates[].applicable.

I think these are easily derived by the callee but shouldn't be interpreted ad hoc by the caller without a contract on the construction of each value agreed ahead of time. (I've opted for provided opaque values to "encourage" their use; nothing will stop other implementations from calculating these themselves, I suppose 😄.)

Not directly related but I suppose we should split out the update-image and update-boot-flags steps.

I didn't expand on it above, but I had it in mind that the apply-update step would write to disk and boot-update would write boot flags (and optionally reboot as well). Does that align with what you were saying here or did I miss it?

In general this all looks and smells like Updog

Awesome! We should enumerate the requirements to make sure the caller has the right environment to run the listed commands. I can think of all kinds of terrible things we can do to make it work from within the Pod's container; knowing the list of accesses and permissions needed will help us identify the right approach.

do we have scenarios that could influence whether we would need something else?

I don't think there are others at the moment - I did call out that there's potentially undefined behavior for an update that's removed (in whatever way it would no longer be available or possible to continue with) from the repository while we're performing these operations.

The remaining aspects of integration, as they are currently known, are asks to control some settings for a cluster to enforce "policy" but that's all hand-wavy and not within the scope of the extension to updog outlined here.

sam-aws (Contributor) commented Aug 28, 2019

I've added a few keys to encourage consistent usage and interpretation of the metadata:

In jq access paths, these are the keys added: .schema (for versioning), .host.wave, .updates[].id and .updates[].applicable.

I think these are easily derived by the callee but shouldn't be interpreted ad hoc by the caller without a contract on the construction of each value agreed ahead of time. (I've opted for provided opaque values to "encourage" their use; nothing will stop other implementations from calculating these themselves, I suppose 😄.)

I agree that (bar schema) these are all things that should be determined by the callee/Updog. So we may have two types of metadata: the one read by Updog, and another form that it outputs to callers, even if it's just an augmented form.

Not directly related but I suppose we should split out the update-image and update-boot-flags steps.

I didn't expand on it above, but I had it in mind that the apply-update step would write to disk and boot-update would write boot flags (and optionally reboot as well). Does that align with what you were saying here or did I miss it?

Yep that's what I'm thinking of.

tjkirch added the type/enhancement (New feature or request) label on Oct 2, 2019
tjkirch added this to the Public Preview milestone on Oct 24, 2019
tjkirch closed this as completed on Jan 22, 2020
jahkeup (Member, Author) commented Jan 22, 2020

It is worth noting that this issue isn't fully resolved; there are some aspects that are still in need of development and discussion. Some are captured here and should be reviewed during those discussions.

tjkirch (Contributor) commented Jan 22, 2020

It is worth noting that this issue isn't fully resolved; there are some aspects that are still in need of development and discussion. Some are captured here and should be reviewed during those discussions.

@jahkeup Do you think we can split out issues for what remains? Dogswatch was a huge project and this is a huge issue; plus, we accomplished what we needed for the milestone. I think it'd be easier to track now with smaller issues (or just one if you want) that we can assign to the next proper milestone.

jahkeup (Member, Author) commented Jan 22, 2020

Yes, there are definitely smaller issues to be had here, but that really depends on having a higher-level discussion about the future of Dogswatch before we can get the "next steps" jotted down. I'll take some time to review the state of development and my thoughts on the path forward, and work on putting together a coherent set of discussion points in a different issue.
