Commit
fix(tenant/timeline metrics): race condition during shutdown + recreation (#7064)

Tenant::shutdown or Timeline::shutdown completes and becomes externally observable before the corresponding Tenant/Timeline object is dropped. For example, after observing a Tenant::shutdown complete, we could attach the same tenant_id again. The shut-down Tenant object might still be around at the time of the attach.

The race is then the following:

- old object's metrics are still around
- new object uses with_label_values
- old object calls remove_label_values

The outcome is that the new object will have the metric objects (they're an Arc internally), but the metrics won't be part of the internal registry and hence they'll be missing in `/metrics`. Later, when the new object gets shut down and tries to call remove_label_values, it will observe an error because the metric was already removed by the old object.

Changes
-------

This PR moves metric removal to `shutdown()`.

An alternative design would be to multi-version the metrics using a distinguishing label, or to use a better metrics crate that allows removing metrics from the registry through the locally held metric handle instead of interacting with the (globally shared) registry.

refs #7051
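The race described above can be reproduced with a toy model that uses only the standard library. This is a sketch under stated assumptions, not the pageserver code or the real metrics crate: `ToyCounterVec`, `exported_labels`, `RaceOutcome`, and both `simulate_*` functions are hypothetical names invented for illustration. Only the method names `with_label_values` and `remove_label_values` come from the commit message, and their behavior here is a simplification.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a label-keyed metric vector whose children
/// live in a shared registry (assumption: not the real crate's type).
struct ToyCounterVec {
    children: Mutex<HashMap<String, Arc<AtomicU64>>>,
}

impl ToyCounterVec {
    fn new() -> Self {
        Self { children: Mutex::new(HashMap::new()) }
    }

    /// Get or create the child for `label` and hand out a shared handle.
    fn with_label_values(&self, label: &str) -> Arc<AtomicU64> {
        let mut children = self.children.lock().unwrap();
        children
            .entry(label.to_string())
            .or_insert_with(|| Arc::new(AtomicU64::new(0)))
            .clone()
    }

    /// Remove the child for `label`; an error means it was already gone.
    fn remove_label_values(&self, label: &str) -> Result<(), String> {
        let mut children = self.children.lock().unwrap();
        match children.remove(label) {
            Some(_) => Ok(()),
            None => Err(format!("missing metric for label {label}")),
        }
    }

    /// Labels a scrape of the registry would currently see.
    fn exported_labels(&self) -> Vec<String> {
        let mut labels: Vec<String> =
            self.children.lock().unwrap().keys().cloned().collect();
        labels.sort();
        labels
    }
}

struct RaceOutcome {
    visible_in_registry: bool,
    handle_value: u64,
    second_removal_failed: bool,
}

/// Old object removes its labels late (at drop), after the new object attached.
fn simulate_race() -> RaceOutcome {
    let vec = ToyCounterVec::new();
    let _old_handle = vec.with_label_values("tenant-1"); // old object created
    let new_handle = vec.with_label_values("tenant-1"); // re-attach, same id
    let _ = vec.remove_label_values("tenant-1"); // old object finally dropped
    new_handle.fetch_add(1, Ordering::Relaxed); // handle still usable
    let visible_in_registry = vec.exported_labels().contains(&"tenant-1".to_string());
    let second_removal_failed = vec.remove_label_values("tenant-1").is_err();
    RaceOutcome {
        visible_in_registry,
        handle_value: new_handle.load(Ordering::Relaxed),
        second_removal_failed,
    }
}

/// Removal happens in shutdown, i.e. before a re-attach can be observed.
fn simulate_removal_at_shutdown() -> bool {
    let vec = ToyCounterVec::new();
    let _old_handle = vec.with_label_values("tenant-1");
    vec.remove_label_values("tenant-1").unwrap(); // done in shutdown()
    let _new_handle = vec.with_label_values("tenant-1"); // re-attach afterwards
    vec.exported_labels().contains(&"tenant-1".to_string())
}
```

In `simulate_race`, the new object's handle keeps counting but its label is gone from the registry, and the later removal fails. In `simulate_removal_at_shutdown`, the label is removed before the re-attach, so the new entry stays visible.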
8224580
2576 tests run: 2440 passed, 2 failed, 134 skipped (full report)

Failures on Postgres 15
- test_ts_of_lsn_api: debug

Failures on Postgres 14
- test_bulk_insert[neon-github-actions-selfhosted]: release

Flaky tests (1)
Postgres 14
- test_compute_pageserver_connection_stress: debug

Test coverage report is not available

8224580 at 2024-03-11T15:58:49.282Z