Skip to content

Commit

Permalink
fix(tenant/timeline metrics): race condition during shutdown + recrea…
Browse files Browse the repository at this point in the history
…tion (#7064)

Tenant::shutdown or Timeline::shutdown completes and becomes externally
observable before the corresponding Tenant/Timeline object is dropped.

For example, after observing a Tenant::shutdown to complete, we could
attach the same tenant_id again. The shut down Tenant object might still
be around at the time of the attach.

The race is then the following:
- old object's metrics are still around
- new object uses with_label_values
- old object calls remove_label_values

The outcome is that the new object will have the metric objects (they're
an Arc internall) but the metrics won't be part of the internal registry
and hence they'll be missing in `/metrics`.

Later, when the new object gets shut down and tries to
remove_label_value, it will observe an error because
the metric was already removed by the old object.

Changes
-------

This PR moves metric removal to `shutdown()`.

An alternative design would be to multi-version the metrics using a
distinguishing label, or, to use a better metrics crate that allows
removing metrics from the registry through the locally held metric
handle instead of interacting with the (globally shared) registry.

refs #7051
  • Loading branch information
problame authored Mar 11, 2024
1 parent 2b0f354 commit 8224580
Show file tree
Hide file tree
Showing 3 changed files with 5 additions and 8 deletions.
4 changes: 1 addition & 3 deletions pageserver/src/metrics.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2017,10 +2017,8 @@ impl TimelineMetrics {
pub(crate) fn resident_physical_size_get(&self) -> u64 {
self.resident_physical_size_gauge.get()
}
}

impl Drop for TimelineMetrics {
fn drop(&mut self) {
pub(crate) fn shutdown(&self) {
let tenant_id = &self.tenant_id;
let timeline_id = &self.timeline_id;
let shard_id = &self.shard_id;
Expand Down
7 changes: 2 additions & 5 deletions pageserver/src/tenant.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1846,6 +1846,8 @@ impl Tenant {
// Wait for any in-flight operations to complete
self.gate.close().await;

remove_tenant_metrics(&self.tenant_shard_id);

Ok(())
}

Expand Down Expand Up @@ -3557,11 +3559,6 @@ async fn run_initdb(
Ok(())
}

impl Drop for Tenant {
fn drop(&mut self) {
remove_tenant_metrics(&self.tenant_shard_id);
}
}
/// Dump contents of a layer file to stdout.
pub async fn dump_layerfile_from_path(
path: &Utf8Path,
Expand Down
2 changes: 2 additions & 0 deletions pageserver/src/tenant/timeline.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1257,6 +1257,8 @@ impl Timeline {

// Finally wait until any gate-holders are complete
self.gate.close().await;

self.metrics.shutdown();
}

pub(crate) fn set_state(&self, new_state: TimelineState) {
Expand Down

1 comment on commit 8224580

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2576 tests run: 2440 passed, 2 failed, 134 skipped (full report)


Failures on Postgres 15

  • test_ts_of_lsn_api: debug

Failures on Postgres 14

  • test_bulk_insert[neon-github-actions-selfhosted]: release
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_bulk_insert[neon-release-pg14-github-actions-selfhosted] or test_ts_of_lsn_api[debug-pg15]"
Flaky tests (1)

Postgres 14

  • test_compute_pageserver_connection_stress: debug

Test coverage report is not available

The comment gets automatically updated with the latest test results
8224580 at 2024-03-11T15:58:49.282Z :recycle:

Please sign in to comment.