-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
control_plane: revise compute_hook locking (don't serialise all calls) #7088
Conversation
2802 tests run: 2678 passed, 0 failed, 124 skipped (full report)Flaky tests (3)Postgres 14Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
cf1b476 at 2024-04-06T20:07:07.354Z :recycle: |
b139bd9
to
f049ea1
Compare
f049ea1
to
283dbbc
Compare
This didn't reduce down as neatly as I had hoped, but I think the logic is sound: we try to acquire a ComputeHookTenant's async lock before sending an update, and if we can't acquire it we drop out, wait for the lock, and try again. The request we send is always derived from latest state of the ComputeHookTenant, so there is no longer any sensitivity to which order notify() calls complete in. I considered using an async lock for ComputeHook::state to make the control flow a bit less tortured, but on balance I prefer to use a sync lock so that the compiler enforces that we do not hold it while waiting for IO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code has no locking bugs, but, it's all quite brittle / hard to follow.
My only strong request for chagnes is the #7088 (comment)
The naming changes would be pretty nice, but, if you don't have time for them, I can do them in a follow-up.
Comments on names/structs are all quite reasonable, amended in cf1b476 |
…7495) ## Problem Previously, we try to send compute notifications in startup_reconcile before completing that function, with a time limit. Any notifications that don't happen within the time limit result in tenants having their `pending_compute_notification` flag set, which causes them to spawn a Reconciler next time the background reconciler loop runs. This causes two problems: - Spawning a lot of reconcilers after startup caused a spike in memory (this is addressed in #7493) - After #7493, spawning lots of reconcilers will block some other operations, e.g. a tenant creation might fail due to lack of reconciler semaphore units while the controller is busy running all the Reconcilers for its startup compute notifications. When the code was first written, ComputeHook didn't have internal ordering logic to ensure that notifications for a shard were sent in the right order. Since that was added in #7088, we can use it to avoid waiting for notifications to complete in startup_reconcile. Related to: #7460 ## Summary of changes - Add a `notify_background` method to ComputeHook. - Call this from startup_reconcile instead of doing notifications inline - Process completions from `notify_background` in `process_results`, and if a notification failed then set the `pending_compute_notification` flag on the shard. The result is that we will only spawn lots of Reconcilers if the compute notifications _fail_, not just because they take some significant amount of time. Test coverage for this case is in #7475
Problem
ComputeHook::notify
. This served multiple purposes:This worked, but had the major downside that while we're waiting for a compute hook request to the control plane to succeed, we can't notify about any other tenants. Notifications block progress of live migrations, so this is a problem.
Summary of changes
ComputeHook::state
with a sync lock instead of an async lockComputeHook::neon_local_lock
) for preventing concurrent calls into neon_local, and only take this in the neon_local code path.Checklist before requesting a review
Checklist before merging