Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Make cert lifetime configurable #306

Merged
merged 35 commits into from
Oct 25, 2023
Merged
Show file tree
Hide file tree
Changes from 24 commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
5aed4de
feat: Make cert lifetime configurable
sbernauer Aug 25, 2023
9dce1e5
refactoring
sbernauer Aug 25, 2023
42c4f69
fix serde stuff
sbernauer Aug 25, 2023
94d3857
wording
sbernauer Aug 25, 2023
7f264aa
Merge remote-tracking branch 'origin/main' into feat/make-cert-lifeti…
sbernauer Sep 13, 2023
1cb479a
Add test
sbernauer Sep 13, 2023
f4fb679
Add docs
sbernauer Sep 13, 2023
eaa0c15
fix docs
sbernauer Sep 13, 2023
40f5b41
address arch meeting feedback
sbernauer Sep 13, 2023
4209194
docs
sbernauer Sep 13, 2023
7866101
Document flow
sbernauer Sep 13, 2023
30663e0
charts
sbernauer Sep 13, 2023
c1f4558
Use new stackable_operator::duration::Duration struct
sbernauer Sep 21, 2023
4d8c4d4
wording
sbernauer Sep 21, 2023
3a2e838
fix test
sbernauer Sep 21, 2023
db0a758
doc fixes
sbernauer Sep 21, 2023
b3d2fe9
Apply suggestions from code review
sbernauer Sep 27, 2023
3d95a5f
Update docs/modules/secret-operator/pages/secretclass.adoc
sbernauer Sep 27, 2023
64f4904
let it compile
sbernauer Sep 27, 2023
7e98482
Merge branch 'main' into feat/make-cert-lifetime-configurable
sbernauer Sep 27, 2023
7d55892
fix docs
sbernauer Sep 27, 2023
ffc7e17
fix docs, part 2
sbernauer Sep 27, 2023
bfe2526
bump: operator-rs 0.53.0
sbernauer Oct 9, 2023
0d7330c
Use time interop
Techassi Oct 13, 2023
839eea0
Apply suggestions from code review
sbernauer Oct 18, 2023
a422edd
Merge branch 'main' into feat/make-cert-lifetime-configurable
sbernauer Oct 18, 2023
3331e92
use operator-rs 0.55.0
sbernauer Oct 18, 2023
462fea7
fixup
sbernauer Oct 18, 2023
28b7b5a
newline
sbernauer Oct 18, 2023
82e8472
docs
sbernauer Oct 18, 2023
c9c39e6
newlines everywhere
sbernauer Oct 18, 2023
e846eed
link to concepts page
sbernauer Oct 18, 2023
ae39005
fix rust doc
sbernauer Oct 18, 2023
1fcfda3
docs: Point to concepts guide
sbernauer Oct 18, 2023
c54cfba
docs: Improve wording
sbernauer Oct 18, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
772 changes: 371 additions & 401 deletions Cargo.lock

Large diffs are not rendered by default.

8 changes: 5 additions & 3 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -30,9 +30,10 @@ prost-types = "0.11"
rand = "0.8"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
serde_yaml = "0.9"
snafu = "0.7"
socket2 = { version = "0.5", features = ["all"] }
stackable-operator = { git = "https://github.com/stackabletech/operator-rs.git", tag = "0.46.0" }
stackable-operator = { git = "https://github.com/stackabletech/operator-rs.git", tag = "0.53.0", features = ["time"]}
strum = { version = "0.25", features = ["derive"] }
sys-mount = { version = "2.1", default-features = false }
tempfile = "3.3"
Expand All @@ -51,5 +52,6 @@ yasna = "0.5"
# Workaround for https://github.com/hyperium/tonic/issues/243
h2 = { git = "https://github.com/stackabletech/h2.git", branch = "feature/grpc-uds" }

# [patch."https://github.com/stackabletech/operator-rs.git"]
# stackable-operator = { git = "https://github.com/stackabletech//operator-rs.git", branch = "main" }
# TODO (Techassi): Comment patch out again
[patch."https://github.com/stackabletech/operator-rs.git"]
stackable-operator = { git = "https://github.com/stackabletech//operator-rs.git", branch = "feat/duration-improvs"}
4 changes: 4 additions & 0 deletions deploy/helm/secret-operator/crds/crds.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,10 @@ spec:
required:
- secret
type: object
maxCertificateLifetime:
default: 15d
description: Maximum lifetime the created certificates are allowed to have. In case consumers request a longer lifetime than allowed by this setting, the lifetime will be the minimum of both, so this setting takes precedence.
type: string
required:
- ca
type: object
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,4 @@ spec:
name: secret-provisioner-tls-ca
namespace: default
autoGenerate: true
maxCertificateLifetime: 15d
23 changes: 23 additions & 0 deletions docs/modules/secret-operator/pages/secretclass.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,27 @@ CAUTION: Attributes of the certificate (such as the expiration date, fingerprint

xref:scope.adoc[Scopes] are used to populate the claims (such as `subjectAlternateName`) of the provisioned certificates.

==== Certificate lifetime
We generally aim to use as short-lived certificates as possible.
sbernauer marked this conversation as resolved.
Show resolved Hide resolved
The problem is, that short lifetime periods imply frequent restart of services, which may be totally unnoticeable for some products,
sbernauer marked this conversation as resolved.
Show resolved Hide resolved
annoying for others or devastation for products with a single point of failure and no recovery from outages (e.g. Trino coordinator).
sbernauer marked this conversation as resolved.
Show resolved Hide resolved

We have spent some time thinking about this and did reach the following compromise:
sbernauer marked this conversation as resolved.
Show resolved Hide resolved

1. To enforce security constraints, cluster administrators can set a maximum allowed certificate lifetime
on the SecretClass issuing the certificates (defaults to 15d).
sbernauer marked this conversation as resolved.
Show resolved Hide resolved
2. We default to a certificate lifetime of 24h.
sbernauer marked this conversation as resolved.
Show resolved Hide resolved
3. Pods consuming the certificate can request a longer lifetime or shutdown expiration buffer via annotations
on the xref:volume.adoc[Volume]. If they request a longer lifetime than the SecretClass allows,
it will be shortened to the maximum allowed lifetime.
4. The Pods will be evicted 6h before the certificate expires, to ensure that no Pods are left running with expired secrets. Consumers can override this buffer using Volume annotations.
sbernauer marked this conversation as resolved.
Show resolved Hide resolved
This buffer must be long enough that the product is guaranteed to gracefully shut down.

Most of our product operators will not add any annotation to the Volume, so the (short) default certificate lifetime applies.
Techassi marked this conversation as resolved.
Show resolved Hide resolved
In case an operator sets a higher lifetime, a tracking issue must be created to document and track the steps to reduce the certificate lifetime.

Users can use podOverrides to extend the certificate lifetime if really needed by adding volume annotations (we might add this to product CRDs in the future).
sbernauer marked this conversation as resolved.
Show resolved Hide resolved

==== Reference

[source,yaml]
Expand All @@ -48,13 +69,15 @@ spec:
name: secret-provisioner-tls-ca
namespace: default
autoGenerate: true
maxCertificateLifetime: 15d # optional
----

`autoTls`:: Declares that the `autoTls` backend is used.
`autoTls.ca`:: Configures the certificate authority used to issue `Pod` certificates.
`autoTls.ca.secret`:: Reference (`name` and `namespace`) to a K8s `Secret` object where the CA certificate and key is stored in the keys `ca.crt`
and `ca.key` respectively.
`autoTls.ca.autoGenerate`:: Whether the certificate authority should be provisioned if it can not be found.
`autoTls.maxCertificateLifetime`:: Maximum lifetime the created certificates are allowed to have. In case consumers request a longer lifetime than allowed by this setting, the lifetime will be the minimum of both.

[#backend-kerberoskeytab]
=== `kerberosKeytab`
Expand Down
28 changes: 28 additions & 0 deletions docs/modules/secret-operator/pages/volume.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,34 @@

This can be either the default output format of the xref:secretclass.adoc#backend[backend], or a format that it defines a conversion into.

=== `secrets.stackable.tech/backend.autotls.cert.lifetime`

*Required*: false

*Default value*: `1d`

*Backends*: xref:secretclass.adoc#backend-autotls[]

The lifetime of the created certificate.
Please note that you can not request a lifetime longer than allowed by `maxCertificateLifetime` on the SecretClass.
If you do so, `maxCertificateLifetime` will be used as the certificate lifetime, to adhere to the requirements
set by the admin that created the SecretClass.
sbernauer marked this conversation as resolved.
Show resolved Hide resolved
The format is `1d2h3m4s` (components are optional, `1m` is also a valid value).

Check notice on line 64 in docs/modules/secret-operator/pages/volume.adoc

View workflow job for this annotation

GitHub Actions / LanguageTool

[LanguageTool] docs/modules/secret-operator/pages/volume.adoc#L64

Insert a space between the numerical value and the unit symbol. (UNIT_SPACE[1]) Suggestions: `1 m` Rule: https://community.languagetool.org/rule/show/UNIT_SPACE?lang=en-US&subId=1 Category: PUNCTUATION
Raw output
docs/modules/secret-operator/pages/volume.adoc:64:52: Insert a space between the numerical value and the unit symbol. (UNIT_SPACE[1])
 Suggestions: `1 m`
 Rule: https://community.languagetool.org/rule/show/UNIT_SPACE?lang=en-US&subId=1
 Category: PUNCTUATION
Techassi marked this conversation as resolved.
Show resolved Hide resolved

=== `secrets.stackable.tech/backend.autotls.cert.restart-buffer`

*Required*: false

*Default value*: `6h`

*Backends*: xref:secretclass.adoc#backend-autotls[]

The amount of time the Pod using the cert gets restarted before the cert expires.
Keep in mind that there can be multiple Pods - such as 80 datanodes - trying to
shut down at the same time. It can take some hours until all Pods are restarted
in a rolling fashion.
The format is `1d2h3m4s` (components are optional, `1m` is also a valid value).

Check notice on line 78 in docs/modules/secret-operator/pages/volume.adoc

View workflow job for this annotation

GitHub Actions / LanguageTool

[LanguageTool] docs/modules/secret-operator/pages/volume.adoc#L78

Insert a space between the numerical value and the unit symbol. (UNIT_SPACE[1]) Suggestions: `1 m` Rule: https://community.languagetool.org/rule/show/UNIT_SPACE?lang=en-US&subId=1 Category: PUNCTUATION
Raw output
docs/modules/secret-operator/pages/volume.adoc:78:52: Insert a space between the numerical value and the unit symbol. (UNIT_SPACE[1])
 Suggestions: `1 m`
 Rule: https://community.languagetool.org/rule/show/UNIT_SPACE?lang=en-US&subId=1
 Category: PUNCTUATION

=== `secrets.stackable.tech/kerberos.service.names`

*Required*: false
Expand Down
2 changes: 1 addition & 1 deletion rust/operator-binary/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ uuid.workspace = true
yasna.workspace = true

[dev-dependencies]
time.workspace = true
serde_yaml.workspace = true

[build-dependencies]
built.workspace = true
Expand Down
10 changes: 8 additions & 2 deletions rust/operator-binary/src/backend/dynamic.rs
Original file line number Diff line number Diff line change
Expand Up @@ -102,9 +102,15 @@ pub async fn from_class(
secret,
auto_generate,
},
max_certificate_lifetime,
}) => from(
super::TlsGenerate::get_or_create_k8s_certificate(client, &secret, auto_generate)
.await?,
super::TlsGenerate::get_or_create_k8s_certificate(
client,
&secret,
auto_generate,
max_certificate_lifetime,
)
.await?,
),
crd::SecretClassBackend::KerberosKeytab(crd::KerberosKeytabBackend {
realm_name,
Expand Down
38 changes: 35 additions & 3 deletions rust/operator-binary/src/backend/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,10 @@ pub mod tls;

use async_trait::async_trait;
use serde::{Deserialize, Deserializer};
use stackable_operator::k8s_openapi::chrono::{DateTime, FixedOffset};
use stackable_operator::{
k8s_openapi::chrono::{DateTime, FixedOffset},
time::Duration,
};
use std::{collections::HashSet, convert::Infallible};

pub use dynamic::Dynamic;
Expand All @@ -22,6 +25,8 @@ use scope::SecretScope;

use crate::format::{SecretData, SecretFormat};

use self::tls::{DEFAULT_CERT_LIFETIME, DEFAULT_CERT_RESTART_BUFFER};

/// Configuration provided by the `Volume` selecting what secret data should be provided
///
/// Fields beginning with `csi.storage.k8s.io/` are provided by the Kubelet
Expand Down Expand Up @@ -68,8 +73,8 @@ pub struct SecretVolumeSelector {
/// The Kerberos service names (`SERVICE_NAME/hostname@realm`)
#[serde(
rename = "secrets.stackable.tech/kerberos.service.names",
default = "SecretVolumeSelector::default_kerberos_service_names",
deserialize_with = "SecretVolumeSelector::deserialize_str_vec"
deserialize_with = "SecretVolumeSelector::deserialize_str_vec",
default = "SecretVolumeSelector::default_kerberos_service_names"
)]
pub kerberos_service_names: Vec<String>,

Expand All @@ -83,6 +88,33 @@ pub struct SecretVolumeSelector {
default
)]
pub compat_tls_pkcs12_password: Option<String>,

/// The TLS cert lifetime (`24h`, `1d` or `7d`).
#[serde(
rename = "secrets.stackable.tech/backend.autotls.cert.lifetime",
default = "default_cert_lifetime"
)]
pub autotls_cert_lifetime: Duration,

/// The amount of time the Pod using the cert gets restarted before the cert expires.
/// Keep in mind that there can be multiple Pods - such as 80 datanodes - trying to
/// shut down at the same time. It can take some hours until all Pods are restarted
/// in a rolling fashion.
///
/// The format is `1d2h3m4s` (components are optional, `1m` is also a valid value).
Techassi marked this conversation as resolved.
Show resolved Hide resolved
#[serde(
rename = "secrets.stackable.tech/backend.autotls.cert.restart-buffer",
default = "default_cert_restart_buffer"
)]
pub autotls_cert_restart_buffer: Duration,
}

fn default_cert_restart_buffer() -> Duration {
DEFAULT_CERT_RESTART_BUFFER
}

fn default_cert_lifetime() -> Duration {
DEFAULT_CERT_LIFETIME
}

impl SecretVolumeSelector {
Expand Down
66 changes: 57 additions & 9 deletions rust/operator-binary/src/backend/tls.rs
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
//! Dynamically provisions TLS certificates

use std::cmp::min;

use async_trait::async_trait;
use openssl::{
asn1::{Asn1Integer, Asn1Time},
Expand All @@ -26,8 +28,9 @@ use stackable_operator::{
ByteString,
},
kube::runtime::reflector::ObjectRef,
time::Duration,
};
use time::{Duration, OffsetDateTime};
use time::OffsetDateTime;

use crate::format::{well_known, SecretData, WellKnownSecretData};

Expand All @@ -36,6 +39,22 @@ use super::{
SecretBackend, SecretBackendError, SecretContents,
};

/// As the Pods will be evicted [`DEFAULT_CERT_RESTART_BUFFER`] before
/// the cert actually expires, this results in a restart in approx every 2 weeks,
/// which matches the rolling re-deploy of k8s nodes of e.g.:
/// * 1 week for IONOS
/// * 2 weeks for some on-prem k8s clusters
pub const DEFAULT_MAX_CERT_LIFETIME: Duration = Duration::from_days_unchecked(15);

/// Default lifetime of certs when no annotations are set on the Volume.
pub const DEFAULT_CERT_LIFETIME: Duration = Duration::from_hours_unchecked(24);

/// When a StatefulSet has many Pods (e.g. 80 HDFS datanodes or Trino workers) a rolling
/// redeployment can take multiple hours. When the certificates of all datanodes expire approximately
/// at the same time and PodDisruptionBudgets are in place, Pods can need to this time to properly shut down,
Techassi marked this conversation as resolved.
Show resolved Hide resolved
/// so we need to evict them enough time in advance
pub const DEFAULT_CERT_RESTART_BUFFER: Duration = Duration::from_hours_unchecked(6);

#[derive(Debug, Snafu)]
pub enum Error {
#[snafu(display("failed to generate certificate key"))]
Expand Down Expand Up @@ -71,6 +90,11 @@ pub enum Error {
},
#[snafu(display("invalid certificate lifetime"))]
InvalidCertLifetime { source: DateTimeOutOfBoundsError },
#[snafu(display("certificate expiring at {expires_at} would schedule the pod to be restarted at {restart_at}, which is in the past (and we don't have a time machine (yet))"))]
TooShortCertLifetimeRequiresTimeTravel {
expires_at: OffsetDateTime,
restart_at: OffsetDateTime,
},
}
type Result<T, E = Error> = std::result::Result<T, E>;

Expand All @@ -92,17 +116,19 @@ impl SecretBackendError for Error {
Error::SerializeCertificate { .. } => tonic::Code::FailedPrecondition,
Error::SaveCaCertificate { .. } => tonic::Code::Unavailable,
Error::InvalidCertLifetime { .. } => tonic::Code::Internal,
Error::TooShortCertLifetimeRequiresTimeTravel { .. } => tonic::Code::InvalidArgument,
}
}
}

pub struct TlsGenerate {
ca_cert: X509,
ca_key: PKey<Private>,
max_cert_lifetime: Duration,
}

impl TlsGenerate {
pub fn new_self_signed() -> Result<Self> {
pub fn new_self_signed(max_cert_lifetime: Duration) -> Result<Self> {
let subject_name = X509NameBuilder::new()
.and_then(|mut name| {
name.append_entry_by_nid(Nid::COMMONNAME, "secret-operator self-signed")?;
Expand All @@ -111,8 +137,8 @@ impl TlsGenerate {
.context(BuildCertificateSnafu { tpe: CertType::Ca })?
.build();
let now = OffsetDateTime::now_utc();
let not_before = now - Duration::minutes(5);
let not_after = now + Duration::days(2 * 365);
let not_before = now - Duration::from_minutes_unchecked(5);
let not_after = now + Duration::from_days_unchecked(2 * 365);
let conf = Conf::new(ConfMethod::default()).unwrap();
let ca_key = Rsa::generate(2048)
.and_then(PKey::try_from)
Expand Down Expand Up @@ -153,7 +179,11 @@ impl TlsGenerate {
})
.context(BuildCertificateSnafu { tpe: CertType::Ca })?
.build();
Ok(Self { ca_key, ca_cert })
Ok(Self {
ca_key,
ca_cert,
max_cert_lifetime,
})
}

/// Check if a signing CA has already been instantiated in a specified Kubernetes secret - if
Expand All @@ -166,6 +196,7 @@ impl TlsGenerate {
client: &stackable_operator::client::Client,
secret_ref: &SecretReference,
auto_generate_if_missing: bool,
max_cert_lifetime: Duration,
) -> Result<Self> {
let (k8s_secret_name, k8s_ns) = match secret_ref {
SecretReference {
Expand Down Expand Up @@ -193,11 +224,12 @@ impl TlsGenerate {
&ca_data.get("ca.crt").context(MissingCaCertificateSnafu)?.0,
)
.context(LoadCertificateSnafu { tpe: CertType::Ca })?,
max_cert_lifetime,
}
}
Err(_) if auto_generate_if_missing => {
// Failed to get existing cert, try to create a new self-signed one
let ca = Self::new_self_signed()?;
let ca = Self::new_self_signed(max_cert_lifetime)?;
// Use create rather than apply so that we crash and retry on conflicts (to avoid creating spurious certs that we throw away immediately)
client
.create(&Secret {
Expand Down Expand Up @@ -251,9 +283,25 @@ impl SecretBackend for TlsGenerate {
pod_info: PodInfo,
) -> Result<SecretContents, Self::Error> {
let now = OffsetDateTime::now_utc();
let not_before = now - Duration::minutes(5);
let not_after = now + Duration::days(1);
let expire_pod_after = not_after - Duration::minutes(30);
let not_before = now - Duration::from_minutes_unchecked(5);

// Extract and convert consumer input from the Volume annotations.
let cert_lifetime = selector.autotls_cert_lifetime;
let cert_restart_buffer = selector.autotls_cert_restart_buffer;

// We need to check that the cert lifetime it is not longer than allowed,
// by capping it to the maximum configured at the SecretClass.
let cert_lifetime = min(cert_lifetime, self.max_cert_lifetime);
let not_after = now + cert_lifetime;
let expire_pod_after = not_after - cert_restart_buffer;
if expire_pod_after <= now {
TooShortCertLifetimeRequiresTimeTravelSnafu {
expires_at: not_after,
restart_at: expire_pod_after,
}
.fail()?;
}

let conf = Conf::new(ConfMethod::default()).unwrap();
let pod_key = Rsa::generate(2048)
.and_then(PKey::try_from)
Expand Down
Loading