diff --git a/docs/proposals/003-refresh-certs.md b/docs/proposals/003-refresh-certs.md
new file mode 100644
index 00000000..02c0caf9
--- /dev/null
+++ b/docs/proposals/003-refresh-certs.md
@@ -0,0 +1,373 @@
+
+
+# Proposal information
+
+
+- **Index**: 003
+
+
+- **Status**: **ACCEPTED**
+
+
+
+- **Name**: ClusterAPI Certificates Refresh
+
+
+- **Owner**: Mateo Florido [@mateoflorido](https://github.com/mateoflorido)
+
+
+# Proposal Details
+
+## Summary
+
+
+The proposal aims to enhance Canonical Kubernetes Cluster API Providers by
+enabling administrators to refresh or renew certificates on cluster nodes
+without the need for a rolling upgrade. This feature is particularly beneficial
+in resource-constrained environments, such as private or edge clouds, where
+performing a full node replacement may not be feasible.
+
+## Rationale
+
+
+Currently, Cluster API lacks a mechanism for refreshing certificates on cluster
+nodes without triggering a full rolling update. For example, while the Kubeadm
+provider offers the ability to renew certificates, it requires a rolling update
+of the cluster nodes or manual intervention before the certificates expire.
+
+This proposal aims to address this gap by enabling certificate renewal on
+cluster nodes without requiring a rolling update. By providing administrators
+with the ability to refresh certificates independently of node upgrades, this
+feature improves cluster operation, especially in environments with limited
+resources, such as private or edge clouds.
+
+It will enhance the user experience by minimizing downtime, reducing the need
+for additional resources, and simplifying certificate management. This is
+particularly valuable for users who need to maintain continuous availability
+or operate in environments where rolling updates are not practical due to
+resource constraints.
+
+
+## User facing changes
+
+
+Administrators will be able to renew certificates on cluster nodes without
+triggering a full rolling update.
+This can be achieved by annotating the Machine
+object, which will initiate the certificate renewal process:
+
+```
+kubectl annotate machine {machine-name} v1beta2.k8sd.io/refresh-certificates={expires-in}
+```
+
+`expires-in` specifies how long the certificate will remain valid. It can be
+expressed in years, months, and days, in addition to the other time units
+supported by `time.ParseDuration`.
+
+For tracking the validity of certificates, the Machine object will include a
+`machine.cluster.x-k8s.io/certificates-expiry` annotation that indicates the
+expiry date of the certificates. This annotation will be added when the cluster
+is deployed and updated when certificates are renewed. The value of this
+annotation will be an RFC 3339 timestamp.
+
+## Alternative solutions
+
+
+**Kubeadm Control Plane provider (KCP)** automates certificate rotations for
+control plane machines by triggering a machine rollout when certificates are
+close to expiration.
+
+### How to configure:
+- In the KCP configuration, set the `rolloutBefore.certificatesExpiryDays`
+field. This tells KCP how long before certificate expiry to trigger the rollout:
+
+```yaml
+spec:
+  rolloutBefore:
+    certificatesExpiryDays: 21 # Trigger rollout when certificates expire within 21 days
+```
+
+### How it works:
+- **Automatic Rollouts**: KCP monitors the certificate expiry dates of control
+plane machines using `Machine.Status.CertificatesExpiryDate`. If
+certificates are about to expire (based on the configured threshold), KCP
+triggers a machine rollout to refresh them.
+- **Certificate Expiry Check**: The expiry date is sourced from the
+`machine.cluster.x-k8s.io/certificates-expiry` annotation on the Machine or
+Bootstrap Config object.
+
+For manual rotations, the administrator should run the `kubeadm certs renew`
+command, ensure all control plane components are restarted, and remove the
+expiry annotation so that KCP detects the updated certificate expiry date.
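The threshold check described under "How it works" can be illustrated with a short Go sketch. It is a simplified illustration under stated assumptions, not KCP's actual code: `needsRollout` and `mustTime` are hypothetical helpers, and the only inputs taken from the text are the annotation name, the RFC 3339 value format, and a threshold in days.

```go
package main

import (
	"fmt"
	"time"
)

// expiryAnnotation is the annotation named in the text above.
const expiryAnnotation = "machine.cluster.x-k8s.io/certificates-expiry"

// needsRollout reports whether the expiry recorded in the annotations falls
// within thresholdDays of now. Hypothetical helper, for illustration only.
func needsRollout(annotations map[string]string, thresholdDays int, now time.Time) (bool, error) {
	raw, ok := annotations[expiryAnnotation]
	if !ok {
		return false, fmt.Errorf("annotation %q is not set", expiryAnnotation)
	}
	expiry, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return false, fmt.Errorf("parse expiry %q: %w", raw, err)
	}
	// The latest expiry that still counts as "about to expire".
	deadline := now.Add(time.Duration(thresholdDays) * 24 * time.Hour)
	return !expiry.After(deadline), nil
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustTime("2025-01-01T00:00:00Z")
	machine := map[string]string{expiryAnnotation: "2025-01-15T00:00:00Z"}
	due, err := needsRollout(machine, 21, now)
	fmt.Println(due, err)
}
```

Passing `now` as a parameter rather than calling `time.Now()` inside the helper keeps the check deterministic and easy to test.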
+
+
+## Out of scope
+
+
+This proposal does not cover the orchestration of certificate renewal for the
+whole cluster. It focuses on renewing certificates on individual cluster nodes
+via the Machine object.
+
+Rolling updates of the cluster nodes are out of scope. This proposal aims to
+renew certificates without triggering a full rolling update of the cluster.
+
+External certificate authorities (CAs) are also out of scope. This proposal
+focuses on renewing self-signed certificates generated by Canonical Kubernetes.
+
+# Implementation Details
+
+## API Changes
+
+
+### `GET /k8sd/certificates-expiry`
+
+This endpoint will return the expiry date of the certificates on a specific
+cluster node. The response will include the expiry date of the certificates
+in RFC 3339 format. The value will be sourced from the Kubernetes API server
+certificate.
+
+```go
+type CertificatesExpiryResponse struct {
+    // ExpiryDate is the expiry date of the certificates on the node.
+    ExpiryDate string `json:"expiry-date"`
+}
+```
+
+### `POST /x/capi/request-certificates`
+
+This endpoint will create the necessary Certificate Signing Request (CSR) for
+a worker node. The request will include the duration after which the
+certificates will expire.
+
+```go
+type RequestCertificatesRequest struct {
+    // ExpirationSeconds is the number of seconds after which the certificates will expire.
+    ExpirationSeconds int `json:"expiration-seconds"`
+}
+```
+
+### `POST /x/capi/refresh-certificates/plan`
+
+This endpoint returns the renewal plan for certificates on a specific node. The
+response will include the seed used to generate the Certificate Signing Request
+(CSR) and a list of CSRs that need to be approved (for worker nodes).
+
+This endpoint uses the same structures and endpoints as
+`POST /k8sd/refresh-certs/plan`.
+
+```go
+type RefreshCertificatesPlanResponse struct {
+    // Seed should be passed by clients to the RefreshCertificatesRun RPC.
+    Seed int `json:"seed"`
+    // CertificateSigningRequests is a list of names of the CertificateSigningRequests that need to be signed externally (for worker nodes).
+    CertificateSigningRequests []string `json:"certificate-signing-requests"`
+}
+```
+
+### `POST /x/capi/refresh-certificates/run`
+
+This endpoint will trigger the renewal of certificates on a specific node.
+The request will include the duration after which the certificates will expire
+and a list of additional Subject Alternative Names (SANs) to include in the
+certificate.
+
+This endpoint is applicable to both control plane and worker nodes. For worker
+nodes, the request will include the seed used to generate the CSR. This
+endpoint uses the same structures and endpoints as
+`POST /k8sd/refresh-certs/run`.
+
+```go
+type RefreshCertificatesRequest struct {
+    // Seed is the seed used to generate the CSR, as returned by the plan endpoint.
+    Seed int `json:"seed"`
+    // ExpirationSeconds is the number of seconds after which the certificates will expire.
+    ExpirationSeconds int `json:"expiration-seconds"`
+    // ExtraSANs is a list of additional Subject Alternative Names to include in the certificate.
+    ExtraSANs []string `json:"extra-sans"`
+}
+```
+
+### `POST /x/capi/approve-certificates`
+
+This endpoint will approve the renewal of certificates for a worker node and
+will be run by a control plane node. The request will include the seed used to
+generate the CSR.
+
+```go
+type ApproveCertificatesRequest struct {
+    // Seed is the seed used to generate the CSR.
+    Seed int `json:"seed"`
+}
+```
+
+## Bootstrap Provider Changes
+
+
+A controller called `CertificatesController` will be added to the bootstrap
+provider. This controller will watch for the `v1beta2.k8sd.io/refresh-certificates`
+annotation on the Machine object and trigger the certificate renewal process
+when the annotation is present.
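Once the controller observes the annotation, its value has to be turned into the `ExpirationSeconds` field used by the endpoints above. The following Go sketch shows one way this could look. The `y`, `mo`, and `d` suffixes are assumptions made for illustration, since the proposal only says that years, months, and days are accepted alongside `time.ParseDuration` units; `expirationSeconds`, `expiryTime`, and `mustTime` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// refreshAnnotation is the annotation the controller watches for.
const refreshAnnotation = "v1beta2.k8sd.io/refresh-certificates"

// expiryTime resolves an expires-in value relative to now.
// ASSUMPTION: years, months and days are written with the suffixes
// "y", "mo" and "d". Any other value is handed to time.ParseDuration.
func expiryTime(now time.Time, expiresIn string) (time.Time, error) {
	units := []struct {
		suffix              string
		years, months, days int
	}{
		{"mo", 0, 1, 0},
		{"y", 1, 0, 0},
		{"d", 0, 0, 1},
	}
	for _, u := range units {
		if !strings.HasSuffix(expiresIn, u.suffix) {
			continue
		}
		n, err := strconv.Atoi(strings.TrimSuffix(expiresIn, u.suffix))
		if err != nil {
			return time.Time{}, fmt.Errorf("invalid expires-in %q: %w", expiresIn, err)
		}
		return now.AddDate(n*u.years, n*u.months, n*u.days), nil
	}
	d, err := time.ParseDuration(expiresIn)
	if err != nil {
		return time.Time{}, fmt.Errorf("invalid expires-in %q: %w", expiresIn, err)
	}
	return now.Add(d), nil
}

// expirationSeconds converts an expires-in value into whole seconds from now.
func expirationSeconds(now time.Time, expiresIn string) (int, error) {
	expiry, err := expiryTime(now, expiresIn)
	if err != nil {
		return 0, err
	}
	if !expiry.After(now) {
		return 0, fmt.Errorf("expires-in %q does not lie in the future", expiresIn)
	}
	return int(expiry.Sub(now).Seconds()), nil
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	annotations := map[string]string{refreshAnnotation: "10d"}
	now := mustTime("2024-01-01T00:00:00Z")
	secs, err := expirationSeconds(now, annotations[refreshAnnotation])
	fmt.Println(secs, err)
}
```

Calendar units go through `AddDate` rather than a fixed number of hours, so `1y` lands on the same calendar date a year later even across a leap year.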
+
+### Control Plane Nodes
+
+The controller will use the value of the
+`v1beta2.k8sd.io/refresh-certificates` annotation to determine the duration
+after which the certificates will expire. It will then call the
+`POST /x/capi/refresh-certificates/run` endpoint to trigger the certificate
+renewal process.
+
+The controller will share the status of the certificate renewal process by
+adding events to the Machine object. The events will indicate the progress of
+the renewal process following this pattern:
+
+- `RefreshCertsInProgress`: The certificate renewal process is in progress. The
+  event will include the `Refreshing certificates in progress` message.
+- `RefreshCertsDone`: The certificate renewal process is complete. The event
+  will include the `Certificates have been refreshed` message.
+- `RefreshCertsFailed`: The certificate renewal process has failed. The event
+  will include the `Certificates renewal failed: {reason}` message.
+
+After the certificate renewal process is complete, the controller will update
+the `machine.cluster.x-k8s.io/certificates-expiry` annotation on the Machine
+object with the new expiry date of the certificates.
+
+Finally, the controller will remove the `v1beta2.k8sd.io/refresh-certificates`
+annotation from the Machine object to indicate that the certificate renewal
+process is complete.
+
+### Worker Nodes
+
+The controller will use the value of the
+`v1beta2.k8sd.io/refresh-certificates` annotation to determine the duration
+after which the certificates will expire. It will then call the
+`POST /x/capi/request-certificates` endpoint to create the Certificate Signing
+Request (CSR) for the worker node.
+
+Using the `k8sd` proxy, the controller can call the
+`POST /x/capi/approve-certificates` endpoint with the seed generated in the
+previous step to approve the CSRs for the worker node.
+
+The controller will report status in the same way as for control plane nodes,
+by emitting events to the `Machine` object.
+The events will indicate the progress
+of the renewal process, following the same pattern as for control plane
+nodes.
+
+After the CSR approval process is complete, the worker node will call the
+`POST /x/capi/refresh-certificates/run` endpoint to trigger the certificate
+renewal process, using the generated seed to recover the certificates from the
+CSR resources.
+
+After the certificate renewal process is complete, the controller will update
+the `machine.cluster.x-k8s.io/certificates-expiry` annotation on the Machine
+object with the new expiry date of the certificates.
+
+Finally, the controller will remove the `v1beta2.k8sd.io/refresh-certificates`
+annotation from the Machine object to indicate that the certificate renewal
+process is complete.
+
+## ControlPlane Provider Changes
+
+
+None
+
+## Configuration Changes
+
+
+None
+
+## Documentation Changes
+
+
+This implementation will require adding the following documentation:
+- How-to guide for renewing certificates on cluster nodes
+- Reference page for the `v1beta2.k8sd.io/refresh-certificates` annotation
+
+## Testing
+
+
+Integration tests will be added to the current test suite. The tests will
+create a cluster, annotate the Machine object with the
+`v1beta2.k8sd.io/refresh-certificates` annotation, and verify that the
+certificates are renewed on the target node.
+
+## Considerations for backwards compatibility
+
+
+None
+
+## Implementation notes and guidelines
+
+
+We can leverage the existing certificate renewal logic in k8s-snap.
+For worker nodes, we need to modify the existing code to avoid blocking
+the request until the certificates have been approved and issued. Instead,
+we can use a multi-step process: generate the CSRs, approve them, and
+then trigger the certificate renewal process.
+
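The multi-step worker flow (request, approve, refresh) might be orchestrated roughly as in the Go sketch below. Everything beyond the endpoint paths and event reasons is an assumption: the `workerCerts` interface, its method signatures, the integer seed returned by the request step, and the in-memory `fakeNode` are hypothetical and exist only to make the example runnable.

```go
package main

import (
	"errors"
	"fmt"
)

// workerCerts abstracts the three calls made for a worker node.
// ASSUMPTION: method names, signatures and the int seed are illustrative.
type workerCerts interface {
	// POST /x/capi/request-certificates (on the worker node)
	RequestCertificates(expirationSeconds int) (seed int, err error)
	// POST /x/capi/approve-certificates (on a control plane node)
	ApproveCertificates(seed int) error
	// POST /x/capi/refresh-certificates/run (on the worker node)
	RefreshCertificates(seed, expirationSeconds int) error
}

// refreshWorker runs the steps in order and reports progress through emit,
// using the event reasons and messages listed in the proposal.
func refreshWorker(c workerCerts, expirationSeconds int, emit func(reason, message string)) error {
	emit("RefreshCertsInProgress", "Refreshing certificates in progress")
	err := func() error {
		seed, err := c.RequestCertificates(expirationSeconds)
		if err != nil {
			return fmt.Errorf("request certificates: %w", err)
		}
		if err := c.ApproveCertificates(seed); err != nil {
			return fmt.Errorf("approve certificates: %w", err)
		}
		if err := c.RefreshCertificates(seed, expirationSeconds); err != nil {
			return fmt.Errorf("refresh certificates: %w", err)
		}
		return nil
	}()
	if err != nil {
		emit("RefreshCertsFailed", fmt.Sprintf("Certificates renewal failed: %v", err))
		return err
	}
	emit("RefreshCertsDone", "Certificates have been refreshed")
	return nil
}

// fakeNode is an in-memory stand-in that records the order of calls.
type fakeNode struct {
	calls       []string
	failApprove bool
}

func (f *fakeNode) RequestCertificates(int) (int, error) {
	f.calls = append(f.calls, "request")
	return 42, nil
}

func (f *fakeNode) ApproveCertificates(int) error {
	f.calls = append(f.calls, "approve")
	if f.failApprove {
		return errors.New("CSR denied")
	}
	return nil
}

func (f *fakeNode) RefreshCertificates(int, int) error {
	f.calls = append(f.calls, "refresh")
	return nil
}

func main() {
	node := &fakeNode{}
	err := refreshWorker(node, 3600, func(reason, message string) {
		fmt.Println(reason+":", message)
	})
	fmt.Println(node.calls, err)
}
```

Because each step is a separate call, a failed approval stops the flow before the refresh step and surfaces as a `RefreshCertsFailed` event instead of a blocked request.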