
control_plane: revise compute_hook locking (don't serialise all calls) #7088

Merged: 4 commits into main on Apr 6, 2024

Conversation

@jcsp (Collaborator) commented Mar 11, 2024

Problem

  • Previously, an async mutex was held for the duration of `ComputeHook::notify`. This served multiple purposes:
    • Ensure updates to a given tenant are sent in the proper order
    • Prevent concurrent calls into `neon_local` endpoint updates in test environments (`neon_local` is not safe to call concurrently)
    • Protect the inner `ComputeHook::state` hashmap that is used to calculate when to send notifications.

This worked, but had a major downside: while we were waiting for a compute hook request to the control plane to succeed, we could not notify about any other tenants. Notifications block progress of live migrations, so this was a problem.
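
For illustration, here is a minimal sketch of the old shape (assumed and simplified, not the actual code): a single async mutex guarding both the state map and the outbound request.

```rust
// Simplified sketch of the previous locking (an assumption, not the real
// code): one async mutex held across the whole notification, including IO.
use std::collections::HashMap;
use tokio::sync::Mutex as AsyncMutex;

struct TenantState; // placeholder for per-tenant notification state

struct OldComputeHook {
    // One lock guards everything: state updates AND the outbound request.
    state: AsyncMutex<HashMap<u64, TenantState>>,
}

impl OldComputeHook {
    async fn notify(&self, tenant_id: u64) {
        let mut state = self.state.lock().await;
        state.entry(tenant_id).or_insert(TenantState);
        // The lock is still held here, so a slow or retrying control plane
        // request blocks notifications for every other tenant.
        send_to_control_plane(tenant_id).await;
    }
}

async fn send_to_control_plane(_tenant_id: u64) { /* stand-in for the HTTP call */ }
```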

Summary of changes

  • Protect `ComputeHook::state` with a sync lock instead of an async lock (the overall layout is sketched just after this list).
  • Use a separate async lock (`ComputeHook::neon_local_lock`) to prevent concurrent calls into `neon_local`, and only take this lock in the `neon_local` code path.
  • Add per-tenant async locks in `ShardedComputeHookTenant`, and use these to ensure that only one remote notification per tenant can be in flight at a time. If several shards update concurrently, their updates will be coalesced.
  • Add an explicit semaphore that limits the concurrency of calls into the cloud control plane.
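
A rough structural sketch of that layout (field names such as `send_lock` and `api_concurrency` are assumptions; the real types are more involved):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex}; // std Mutex: its guard cannot be held across an .await in a Send task
use tokio::sync::{Mutex as AsyncMutex, Semaphore};

struct ShardedComputeHookTenant {
    // ...latest per-shard state, used to build the next notification...
    // Per-tenant async lock: at most one remote notification in flight per
    // tenant; concurrent shard updates coalesce into a single request.
    send_lock: Arc<AsyncMutex<()>>,
}

struct ComputeHook {
    // Sync lock protecting only the map; never held while waiting on IO.
    state: Mutex<HashMap<u64, ShardedComputeHookTenant>>, // keyed by tenant ID (placeholder key type)
    // Taken only on the neon_local code path, since neon_local is not
    // safe to call concurrently.
    neon_local_lock: AsyncMutex<()>,
    // Caps concurrent requests to the cloud control plane.
    api_concurrency: Semaphore,
}
```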

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist

@jcsp added the t/feature (Issue type: feature, for new features or requests) and c/storage/controller (Component: Storage Controller) labels on Mar 11, 2024
@jcsp requested a review from a team as a code owner on March 11, 2024 14:30
@jcsp requested a review from arpad-m on March 11, 2024 14:30

github-actions bot commented Mar 11, 2024

2802 tests run: 2678 passed, 0 failed, 124 skipped (full report)


Flaky tests (3)

Postgres 14

  • test_timeline_init_break_before_checkpoint: debug
  • test_empty_branch_remote_storage_upload: debug
  • test_branched_from_many_empty_parents_size: debug

Code coverage* (full report)

  • functions: 28.0% (6406 of 22857 functions)
  • lines: 46.9% (45098 of 96230 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
cf1b476 at 2024-04-06T20:07:07.354Z

@jcsp marked this pull request as draft on March 28, 2024 21:09
@jcsp force-pushed the jcsp/storcon-compute-hook-concurrency branch from b139bd9 to f049ea1 on March 28, 2024 21:10
@jcsp force-pushed the jcsp/storcon-compute-hook-concurrency branch from f049ea1 to 283dbbc on April 2, 2024 16:55
@jcsp marked this pull request as ready for review on April 2, 2024 16:55
@jcsp (Collaborator, Author) commented Apr 2, 2024

This didn't reduce down as neatly as I had hoped, but I think the logic is sound: we try to acquire a ComputeHookTenant's async lock before sending an update, and if we can't acquire it we drop out, wait for the lock, and try again. The request we send is always derived from latest state of the ComputeHookTenant, so there is no longer any sensitivity to which order notify() calls complete in.

I considered using an async lock for ComputeHook::state to make the control flow a bit less tortured, but on balance I prefer to use a sync lock so that the compiler enforces that we do not hold it while waiting for IO.
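
A hedged sketch of that acquire-or-wait flow (the function and lock names here are assumptions, not the actual code):

```rust
use std::sync::Arc;
use tokio::sync::Mutex as AsyncMutex;

// Sketch of the acquire-or-wait pattern: try the per-tenant send lock
// without blocking; if another task holds it, wait for it and then send
// a request rebuilt from the *latest* tenant state.
async fn notify_one_tenant(send_lock: Arc<AsyncMutex<()>>) {
    let _guard = match send_lock.try_lock() {
        Ok(guard) => guard,
        // Another notification is in flight; drop out and wait our turn.
        Err(_) => send_lock.lock().await,
    };
    // Build the payload from the current ComputeHookTenant state and send
    // it. Because the payload always reflects the latest state, the order
    // in which notify() calls complete no longer matters.
}

#[tokio::main]
async fn main() {
    notify_one_tenant(Arc::new(AsyncMutex::new(()))).await;
}
```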

@jcsp requested a review from problame on April 5, 2024 10:43
@problame (Contributor) left a comment

I think this code has no locking bugs, but it's all quite brittle / hard to follow.

My only strong request for changes is #7088 (comment).

The naming changes would be pretty nice, but if you don't have time for them, I can do them in a follow-up.

5 resolved review comments on control_plane/attachment_service/src/compute_hook.rs (outdated)
@jcsp (Collaborator, Author) commented Apr 5, 2024

Comments on names/structs are all quite reasonable, amended in cf1b476

@jcsp enabled auto-merge (squash) on April 5, 2024 17:00
@jcsp merged commit 74b2314 into main on Apr 6, 2024
53 checks passed
@jcsp deleted the jcsp/storcon-compute-hook-concurrency branch on April 6, 2024 19:52
@jcsp added a commit that referenced this pull request on Apr 29, 2024
…7495)

## Problem

Previously, we tried to send compute notifications in startup_reconcile
before completing that function, with a time limit. Any notifications
that didn't complete within the time limit left tenants with their
`pending_compute_notification` flag set, which causes them to spawn a
Reconciler next time the background reconciler loop runs.

This causes two problems:
- Spawning a lot of reconcilers after startup caused a spike in memory
(this is addressed in #7493)
- After #7493, spawning lots of
reconcilers will block some other operations, e.g. a tenant creation
might fail due to lack of reconciler semaphore units while the
controller is busy running all the Reconcilers for its startup compute
notifications.

When the code was first written, ComputeHook didn't have internal
ordering logic to ensure that notifications for a shard were sent in the
right order. Since that was added in
#7088, we can use it to avoid
waiting for notifications to complete in startup_reconcile.

Related to: #7460

## Summary of changes

- Add a `notify_background` method to ComputeHook.
- Call this from startup_reconcile instead of doing notifications inline
- Process completions from `notify_background` in `process_results`, and
if a notification failed then set the `pending_compute_notification`
flag on the shard.

The result is that we will only spawn lots of Reconcilers if the compute
notifications _fail_, not just because they take some significant amount
of time.
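
A rough sketch of that flow under assumed names and signatures (`NotifyError`, `send_notification`, and the channel-based plumbing are illustrative, not the actual API):

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
struct NotifyError; // placeholder error type

type TenantShardId = u64; // placeholder for the real ID type

// Sketch of notify_background: instead of awaiting each notification
// inline during startup_reconcile, spawn them and report outcomes over a
// channel for process_results to consume later.
fn notify_background(
    shards: Vec<TenantShardId>,
) -> mpsc::UnboundedReceiver<(TenantShardId, Result<(), NotifyError>)> {
    let (tx, rx) = mpsc::unbounded_channel();
    for shard in shards {
        let tx = tx.clone();
        tokio::spawn(async move {
            let result = send_notification(shard).await;
            let _ = tx.send((shard, result)); // receiver may be gone; ignore
        });
    }
    rx
}

async fn send_notification(_shard: TenantShardId) -> Result<(), NotifyError> {
    Ok(()) // stand-in for the real compute hook request
}

// In process_results: only failed notifications set the pending flag, so
// Reconcilers are spawned only when notifications actually fail.
async fn process_results(
    mut rx: mpsc::UnboundedReceiver<(TenantShardId, Result<(), NotifyError>)>,
) {
    while let Some((_shard, result)) = rx.recv().await {
        if result.is_err() {
            // ...set pending_compute_notification on this shard...
        }
    }
}
```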

Test coverage for this case is in
#7475