pageserver: stuck detach operation #7427

Open · 16 of 19 tasks
problame opened this issue Apr 18, 2024 · 10 comments
Labels: c/storage/pageserver (Component: storage: pageserver), p/high (High priority: use for bugs that need prompt attention, such as crashes or possible corruptions), t/bug (Issue Type: Bug), t/on_call_followup, triaged (bugs that were already triaged)

Comments

problame commented Apr 18, 2024

Stuck /location_config operation while transitioning from attached to secondary (i.e. `Tenant::shutdown`); clearly a bug in the pageserver. We mitigated by restarting the pageserver, but we should debug this.

https://neondb.slack.com/archives/C03H1K0PGKH/p1713453288707669?thread_ts=1713444234.032949&cid=C03H1K0PGKH

This occurred during a /location_config call transitioning a pageserver from attached to secondary.

Root Cause Analysis

#7427 (comment)

tl;dr: no direct smoking gun, but the symptoms so far point to page_service keeping the gate open for too long.

Fixing It

Refactor page_service so that the gate guard lifetime is easy to understand and obviously bounded to the requests that are in flight at the time of timeline shutdown.

Sketch: #8286
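
As a rough illustration of the intended scoping, here is a minimal sketch. It models the gate with `tokio::sync::RwLock` and uses made-up function names; the real pageserver `Gate`/`GateGuard` types and the actual `page_service` request loop differ.

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, RwLock};

/// Stand-in for Timeline::gate: read guards model gate guards held by
/// in-flight requests; a write lock models closing the gate during shutdown.
type Gate = Arc<RwLock<()>>;

async fn handle_pagerequests(gate: Gate, mut requests: mpsc::Receiver<u64>) {
    while let Some(req) = requests.recv().await {
        // Hold the gate guard only for the duration of one request ...
        let _guard = gate.read().await;
        handle_one_request(req).await;
        // ... and drop it here, before awaiting the next request. If the
        // guard were held across `requests.recv().await`, an idle connection
        // would keep the gate open and block timeline shutdown indefinitely.
    }
}

async fn handle_one_request(_req: u64) {
    // read the page and send the response
}

async fn timeline_shutdown(gate: Gate) {
    // Shutdown waits for all in-flight request guards to be dropped,
    // analogous to waiting for the gate to close.
    let _closed = gate.write().await;
}
```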

Follow-Ups

problame added the t/bug (Issue Type: Bug), c/storage/pageserver (Component: storage: pageserver), and t/on_call_followup labels on Apr 18, 2024
problame commented:

Happened again during that migration.


jcsp commented Apr 18, 2024

Suspected regression from recent changes around shutdown: #7233

jcsp changed the title from "stuck detach operation" to "pageserver: stuck detach operation" on Apr 18, 2024
jcsp added the triaged label (bugs that were already triaged) on May 2, 2024

jcsp commented May 2, 2024

Next steps: look carefully at the timeline shutdown code.


jcsp commented Jun 20, 2024

We have plenty of examples of "kept the gate from closing" messages in the field.

We see them in `handle_get_page_at_lsn_request` and in `initial_size_calculation{...}:logical_size_calculation_task:get_or_maybe_download`.

Those are signals that something in those code paths is not promptly respecting a cancellation token. However, they may not be responsible for the actual hangs: that message is printed when a gate guard eventually releases the gate, not when it holds the gate forever.
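
For context, the anti-pattern those messages point at has roughly this shape (a sketch with hypothetical function names, using tokio_util's `CancellationToken`; this is not the actual `get_or_maybe_download` code):

```rust
use tokio_util::sync::CancellationToken;

async fn download_layer(cancel: &CancellationToken) -> Result<(), &'static str> {
    // Problematic shape: awaiting the slow operation directly means any gate
    // guard held by the caller is released only once the download finishes:
    //
    //     slow_download().await
    //
    // Cancellation-aware shape: race the operation against shutdown so the
    // guard is released promptly once the token fires.
    tokio::select! {
        res = slow_download() => res,
        _ = cancel.cancelled() => Err("cancelled, shutting down"),
    }
}

async fn slow_download() -> Result<(), &'static str> {
    tokio::time::sleep(std::time::Duration::from_secs(10)).await;
    Ok(())
}
```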

jcsp added the p/high label (High priority: use for bugs that need prompt attention, such as crashes or possible corruptions) on Jun 20, 2024

jcsp commented Jun 28, 2024

Let's invest some time next week in looking for issues that might have been introduced in #7233 -- I think a couple of hours of quality time staring at the code might yield results.

This week I spent some time writing a test (https://github.com/neondatabase/neon/tree/jcsp/shutdown-under-load-test) to try to reproduce shutdown hangs, without any success. This isn't entirely surprising: in the field we see roughly one hang per ~10,000 migrations, so it's clearly something quite niche.

problame added a commit that referenced this issue Jul 2, 2024
Generally counter pairs are preferred over gauges.
In this case, I found myself asking what the typical rate of accepted
page_service connections on a pageserver is, and I couldn't answer it
with the gauge metric.

There are a few dashboards using this metric:
https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_live_connections&type=code

I'll convert them to use the new metric once this PR reaches prod.

refs #7427
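
As a rough illustration of the counter-pair pattern described above (a sketch using the `prometheus` crate directly; the metric names here are made up, and the pageserver uses its own metrics wrapper):

```rust
use prometheus::{IntCounter, IntGauge, Registry};

fn register_connection_metrics(reg: &Registry) -> prometheus::Result<(IntCounter, IntCounter)> {
    // A gauge only exposes the current number of live connections; the
    // accept *rate* cannot be recovered from it after the fact.
    let live = IntGauge::new("pageserver_live_connections", "open page_service connections")?;
    reg.register(Box::new(live))?;

    // A counter pair is monotonic: rate(started_total) gives the accept rate,
    // and started_total - finished_total still yields the live count.
    let started = IntCounter::new("pageserver_connections_started_total", "accepted connections")?;
    let finished = IntCounter::new("pageserver_connections_finished_total", "closed connections")?;
    reg.register(Box::new(started.clone()))?;
    reg.register(Box::new(finished.clone()))?;
    Ok((started, finished))
}
```
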
problame added a commit that referenced this issue Jul 2, 2024
Before this PR, during timeline shutdown, we'd occasionally see
log lines like this one:

```
2024-06-26T18:28:11.063402Z  INFO initial_size_calculation{tenant_id=$TENANT,shard_id=0000 timeline_id=$TIMELINE}:logical_size_calculation_task:get_or_maybe_download{layer=000000000000000000000000000000000000-000000067F0001A3950001C1630100000000__0000000D88265898}: layer file download failed, and caller has been cancelled: Cancelled, shutting down
Stack backtrace:
   0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1964:27
      pageserver::tenant::remote_timeline_client::RemoteTimelineClient::download_layer_file::{{closure}}
             at /home/nonroot/pageserver/src/tenant/remote_timeline_client.rs:531:13
      pageserver::tenant::storage_layer::layer::LayerInner::download_and_init::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1136:14
      pageserver::tenant::storage_layer::layer::LayerInner::download_init_and_wait::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1082:74
```

We can eliminate the anyhow backtrace with no loss of information
because the conversion to anyhow::Error happens in exactly one place.

refs #7427
problame added a commit that referenced this issue Jul 5, 2024
…ncellationToken

Preliminary refactoring while working on #7427
and specifically #8286
problame added further commits that referenced this issue on Jul 5, 2024 (same commit messages as above).
VladLazar pushed commits that referenced this issue on Jul 8, 2024 (same commit messages as above).
problame added a commit that referenced this issue Jul 9, 2024
…ncellationToken (#8295)

Preliminary refactoring while working on
#7427
and specifically #8286
problame added a commit that referenced this issue Jul 9, 2024
…to mgmt API (#8292)

I want to fix bugs in `page_service` ([issue](#7427)), and the `import basebackup` / `import wal` code paths stand in the way and make the refactoring more complicated.

We don't use these methods in practice anyway, but there have been some objections to removing the functionality completely.

So, this PR preserves the existing functionality but moves it into the
HTTP management API.

Note that I don't try to fix existing bugs in the code; specifically, I am not fixing:
* it only ever worked correctly for unsharded tenants
* it doesn't clean up on error

All errors are mapped to `ApiError::InternalServerError`.
problame commented:

After a lengthy investigation, I have no definitive smoking gun, but the hottest lead, based on log-message correlation, is that `HandlerTimeline` holds the `Timeline::gate` open.

More details here (internal doc): https://www.notion.so/neondatabase/2024-07-01-stuck-detach-investigation-f0d4ae0247b347ab9bff95355fce1b25?pvs=4#c93179d978bc41d1a47f4b3821b12b81
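
For illustration, the suspected shape looks roughly like this (a sketch with stand-in types; the actual `HandlerTimeline` and gate implementation differ):

```rust
use std::sync::Arc;
use tokio::sync::{OwnedRwLockReadGuard, RwLock};

struct Timeline {
    gate: Arc<RwLock<()>>, // stand-in for Timeline::gate
}

/// Per-connection cache of a resolved timeline.
struct HandlerTimeline {
    timeline: Arc<Timeline>,
    // Suspected issue: this guard lives as long as the *connection*, not a
    // single request, so an idle connection keeps Timeline::gate open and
    // Tenant::shutdown / detach blocks waiting for the gate to close.
    _gate_guard: OwnedRwLockReadGuard<()>,
}

async fn cache_timeline(timeline: Arc<Timeline>) -> HandlerTimeline {
    let guard = timeline.gate.clone().read_owned().await;
    HandlerTimeline { timeline, _gate_guard: guard }
}
```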

problame commented:

Also, the check for the first tenant (as in `ShardSelector::First`) in `handle_pagerequests` is insufficient if a pageserver hosts multiple shards of a tenant. For example, if we're hosting shard 1 and shard 2, and `::First` resolved to shard 1, but we want to detach shard 2, then we're checking the wrong cancellation token.
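
Roughly, the point is that cancellation must be checked against the shard each request actually targets (a sketch with hypothetical types, not the actual `handle_pagerequests` code):

```rust
use tokio_util::sync::CancellationToken;

struct TenantShard {
    cancel: CancellationToken,
}

// Problematic shape: resolve ShardSelector::First once per connection and
// keep checking that shard's token for every subsequent request.
//
// Per-request shape: resolve the shard the request targets and check *its*
// cancellation token, so detaching shard 2 is observed even if the
// connection initially resolved to shard 1.
fn request_is_cancelled(shards: &[TenantShard], target_shard: usize) -> bool {
    shards[target_shard].cancel.is_cancelled()
}
```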

problame added a commit that referenced this issue Jul 11, 2024
`trace_read_requests` is a per `Tenant`-object option.
But the `handle_pagerequests` loop doesn't know which
`Tenant` object (i.e., which shard) the request is for.

The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple
shards](#7427 (comment)).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](#8286) for the high-level idea.

Note that we can always bring back the tracing functionality if we need it.
But since it's actually about logging the `page_service` wire bytes,
it should be a `page_service`-level config option, not per-Tenant.
And for enabling tracing on a single connection, we can implement
a `set pageserver_trace_connection;` option.
problame added a commit that referenced this issue Jul 11, 2024
The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple shards](#7427 (comment)).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](#8286) for the high-level idea.
skyzh pushed commits that referenced this issue on Jul 15, 2024 (same commit messages as above).

problame commented Jul 15, 2024

This week: finish the preliminary refactorings and get the big PR ready for review so we can soak it in staging next week.

problame commented:

This week:

  • PR got reviewed and merged on Wednesday
  • Next: observe staging and watch out for growing connection count metrics, slow detaches, or slow deletions

Next week:

  • PR will roll into prod.
  • Should be safe to revert.

Week after:


problame commented Aug 12, 2024

Status update:

This week:
