
Code RFC: decouple page_service from Mgr/Tenant/Timeline lifecycle #8286

Draft
problame wants to merge 1 commit into base: main
Conversation

@problame problame (Contributor) commented Jul 5, 2024

Context for this is

After lengthy investigation, I have no definitive smoking gun, but the hottest lead based on log message correlation is that HandlerTimeline holds the Timeline's GateGuard open.

More details here (internal doc): https://www.notion.so/neondatabase/2024-07-01-stuck-detach-investigation-f0d4ae0247b347ab9bff95355fce1b25?pvs=4#c93179d978bc41d1a47f4b3821b12b81

Also, while reading the page_service code, I found other minor issues with page_service design.

Therefore, I think we should invest some time to decouple the page_service from the Mgr/Tenant/Timeline objects' lifecycles.

The sketch in this PR illustrates what that would look like.
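For orientation, a rough sketch of that direction (not the PR's actual code): `Timeline`, its `gate`, `GateGuard`, and the pagestream message types are the pageserver types discussed in this thread, while `resolve_timeline_for` and `process_request` are hypothetical stand-ins for the real dispatch.

```rust
use std::sync::Arc;

async fn serve_one_request(msg: PagestreamFeMessage) -> anyhow::Result<PagestreamBeMessage> {
    // Resolve the shard/timeline for *this* message only, instead of keeping
    // long-lived `HandlerTimeline` entries (gate guard + Arc<Timeline>) cached
    // for the lifetime of the connection.
    let timeline: Arc<Timeline> = resolve_timeline_for(&msg).await?;
    // Enter the gate per request; the guard drops when this function returns,
    // so Timeline shutdown never has to wait on an idle connection.
    let _gate_guard = timeline.gate.enter()?;
    process_request(&timeline, msg).await
}
```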

Discuss concrete questions about the code using the PR review feature.
Discuss the big idea on Slack: https://neondb.slack.com/archives/C033RQ5SPDH/p1720195874156919

github-actions bot commented Jul 5, 2024

3042 tests run: 2926 passed, 1 failed, 115 skipped (full report)


Failures on Postgres 15

  • test_lsn_lease_size[False]: debug
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_lsn_lease_size[debug-pg15-False]"
Flaky tests (7)

Postgres 16

  • test_delete_timeline_client_hangup: debug

Postgres 15

  • test_ondemand_wal_download_in_replication_slot_funcs: debug
  • test_tenant_creation_fails: debug

Postgres 14

  • test_statvfs_pressure_usage: debug
  • test_subscriber_restart: release
  • test_tenant_creation_fails: debug
  • test_ancestor_detach_branched_from[False-False-earlier]: debug

Test coverage report is not available

The comment gets automatically updated with the latest test results
bc534e8 at 2024-07-05T16:21:08.576Z ♻️

problame added a commit that referenced this pull request Jul 5, 2024
…ncellationToken

Preliminary refactoring while working on #7427
and specifically #8286
problame added a commit that referenced this pull request Jul 5, 2024
…ncellationToken

Preliminary refactoring while working on #7427
and specifically #8286
@koivunej koivunej (Member) left a comment

Wouldn't it be simpler to:

  • keep caching logic
  • enter cached timeline's gate for the duration of the request
  • on timeout (smaller than read timeout) do cache cleanup
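A minimal sketch of this alternative, under a few assumptions: the per-connection cache keeps plain `Arc<Timeline>` entries without gate guards, `Gate::enter()` fails once Timeline shutdown has begun, and `Connection::read_message`, `lookup_timeline`, and `handle_pagestream_request` are hypothetical helpers, not the pageserver's actual API.

```rust
use std::{collections::HashMap, sync::Arc, time::Duration};

async fn connection_loop(
    mut conn: Connection,
    mut cache: HashMap<ShardTimelineId, Arc<Timeline>>,
) -> anyhow::Result<()> {
    // Cleanup timeout smaller than the read timeout, per the suggestion above,
    // so an idle connection releases its cached entries promptly.
    const CACHE_CLEANUP: Duration = Duration::from_secs(10);
    loop {
        let msg = match tokio::time::timeout(CACHE_CLEANUP, conn.read_message()).await {
            Ok(res) => res?,
            Err(_idle) => {
                cache.clear(); // idle: drop the cached Arc<Timeline>s
                continue;
            }
        };
        let key = msg.shard_timeline_id();
        let timeline = match cache.get(&key) {
            Some(tl) => Arc::clone(tl),
            None => {
                let tl = lookup_timeline(key).await?; // go to the tenant manager once
                cache.insert(key, Arc::clone(&tl));
                tl
            }
        };
        // Hold the gate only while this one request is in flight.
        let _gate_guard = match timeline.gate.enter() {
            Ok(guard) => guard,
            Err(_closed) => {
                // Timeline is shutting down: the cache entry is stale, evict it.
                cache.remove(&key);
                continue;
            }
        };
        handle_pagestream_request(&timeline, msg).await?;
    }
}
```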

--

After lengthy investigation, I have no definitive smoking gun, but the hottest lead based on log message correlation is that HandlerTimeline holds the Timeline's GateGuard open.

But don't we already have logic that would detect these situations? If that were the case, it would mean that the current logic does not work:

let pre = handlers.len();
handlers.retain(|handler| !Arc::ptr_eq(handler, &cached));
let post = handlers.len();
info!("Removed {} handlers", pre - post);

This would remove 0 or 1 handlers per call and be done N times; better to collect all upgradeable handlers and retain once.
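For illustration, one way to read "collect all upgradeable handlers and retain once"; the names mirror the snippet quoted above, and `collect_upgradeable_handlers` is a hypothetical helper, not the PR's code.

```rust
// Build the list of entries that should survive (e.g. those whose weak
// reference still upgrades), then prune in a single O(N) pass instead of
// calling `retain` once per cached handler.
let keep: Vec<Arc<HandlerTimeline>> = collect_upgradeable_handlers(&cache);
let pre = handlers.len();
handlers.retain(|handler| keep.iter().any(|k| Arc::ptr_eq(handler, k)));
info!("Removed {} handlers", pre - handlers.len());
```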

Comment on lines +82 to +85
struct HandlerTimeline {
    _gate_guard: utils::sync::gate::GateGuard,
    timeline: Arc<Timeline>,
}

This would be an `Arc<(Arc<_>, Arc<_>)>`, which certainly sounds like something is wrong.

problame added a commit that referenced this pull request Jul 9, 2024
…ncellationToken (#8295)

Preliminary refactoring while working on
#7427
and specifically #8286
problame added a commit that referenced this pull request Jul 11, 2024
`trace_read_requests` is a per `Tenant`-object option.
But the `handle_pagerequests` loop doesn't know which
`Tenant` object (i.e., which shard) the request is for.

The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple
shards](#7427 (comment)).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](#8286) for the high-level idea.

Note that we can always bring back the tracing functionality if we need it.
But since it's actually about logging the `page_service` wire bytes,
it should be a `page_service`-level config option, not per-Tenant.
And for enabling tracing on a single connection, we can implement
a `set pageserver_trace_connection;` option.
problame added a commit that referenced this pull request Jul 11, 2024
The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple shards](#7427 (comment)).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](#8286) for the high-level idea.
skyzh pushed a commit that referenced this pull request Jul 15, 2024
…ncellationToken (#8295)

skyzh pushed a commit that referenced this pull request Jul 15, 2024
`trace_read_requests` is a per `Tenant`-object option.
problame added a commit that referenced this pull request Jul 31, 2024
…shutdown (#8339)

Since the introduction of sharding, the protocol handling loop in
`handle_pagerequests` cannot know anymore which concrete
`Tenant`/`Timeline` object any of the incoming `PagestreamFeMessage`
resolves to.
In fact, one message might resolve to one `Tenant`/`Timeline` while
the next one may resolve to another one.

To avoid going to the tenant manager, we added `shard_timelines`, which
acted as an ever-growing cache that held timeline gate guards open for
the lifetime of the connection.
The consequence of holding the gate guards open was that we had to be
sensitive to every cached `Timeline::cancel` on each interaction with
the network connection, so that Timeline shutdown would not have to wait
for network connection interaction.

We can do better than that: more efficiency and better abstraction.
I proposed a sketch for it in

* #8286

and this PR implements an evolution of that sketch.

The main idea is that `mod page_service` shall be solely concerned
with the following:
1. receiving requests by speaking the protocol / pagestream subprotocol
2. dispatching the request to a corresponding method on the correct
shard/`Timeline` object
3. sending response by speaking the protocol / pagestream subprotocol.

The cancellation sensitivity responsibilities are clear-cut:
* while in `page_service` code, sensitivity to page_service cancellation
is sufficient
* while in `Timeline` code, sensitivity to `Timeline::cancel` is
sufficient

To enforce these responsibilities, we introduce the notion of a
`timeline::handle::Handle` to a `Timeline` object that is checked out
from a `timeline::handle::Cache` for **each request**.
The `Handle` derefs to `Timeline` and is supposed to be used for a
single async method invocation on `Timeline`.
See the lengthy doc comment in `mod handle` for details of the design.
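For orientation, a heavily simplified sketch of that shape; the real `timeline::handle::{Cache, Handle}` carry considerably more state (shard routing, weak references), and the request/response and method names below are illustrative assumptions.

```rust
use std::{ops::Deref, sync::Arc};

/// Checked out from the per-connection `Cache` for exactly one request.
pub struct Handle {
    // Held only while the request runs, so Timeline shutdown waits for
    // in-flight requests, never for idle connections.
    _gate_guard: utils::sync::gate::GateGuard,
    timeline: Arc<Timeline>,
}

impl Deref for Handle {
    type Target = Timeline;
    fn deref(&self) -> &Timeline {
        &self.timeline
    }
}

// In the pagestream loop: one checkout per request.
async fn handle_get_page(cache: &mut Cache, req: GetPageRequest) -> anyhow::Result<GetPageResponse> {
    // The checkout fails fast if the Timeline is shutting down, so inside
    // page_service only page_service cancellation needs to be watched.
    let timeline: Handle = cache.get(req.shard_selector()).await?;
    // `Handle` derefs to `Timeline`: a single async method invocation, then
    // the handle (and its gate guard) is dropped.
    timeline.get_page(req).await
}
```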
arpad-m pushed a commit that referenced this pull request Aug 5, 2024
…shutdown (#8339)
