Implement timeline_manager for safekeeper background tasks #7768
Conversation
3096 tests run: 2969 passed, 0 failed, 127 skipped (full report). Flaky tests (2): Postgres 16. Code coverage collected from Rust tests only. This comment gets automatically updated with the latest test results: e030838 at 2024-05-21T23:14:08.491Z
Force-pushed 1e7f2c6 to 14af1f0
+1, I dislike these "wal backup launcher notifications must happen outside the lock" patterns, so I thought about something similar. The simplest thing that first came to my mind was to remove the launcher entirely and spin up / stop the task directly from any code updating the status (under lock). However, this is problematic because we don't want two running tasks of the same type at the same time, so there would have to be some awaiting when we stop a task, and that shouldn't happen under the global timeline lock due to the potential deadlock danger. A separate manager task solves this because 1) notifying the manager doesn't need an await, thanks to a watch channel instead of a usual one, and 2) the manager itself can keep its state locally, or under a different lock than the timeline lock. It should probably still publish task state in some shared memory to allow outside code to wait for task termination (pause, probably freezing when the timeline is being evicted), but that's a different subject. I also think we'd likely benefit from generalizing the task itself to have common code for cancelling and waiting for termination.
Force-pushed b3c4735 to 9bf2384
Ok, so I've done everything I wanted to in this PR. Now I'll look into the test failures and that should be it.
Sure, one step at a time and we'll get there.
Force-pushed f0a5beb to 78459ae
LGTM. Pushed a minor fix for logging backup task stop, and a logging context for broker pull (I wanted to learn who calls `write_shared_state`, and the context was missing there).
Also, a high-level comment for timeline_manager.rs would be nice (a partial copy of the PR description would work).
Structurally this looks like a good direction. A couple of general, non-PR-blocking thoughts as we continue to evolve this code:
Force-pushed 1246d53 to e030838
Yes, this should be done later. It also makes sense to better (re)organize the safekeeper code into directories.
Agreed, but this needs more refactoring to make all tasks use the gate. Created issue #7831.
This change looks good; I'd like to commit it right after this PR.
I looked at the metrics from #7768 on staging, and it seems that the manager does too many iterations. This is probably caused by the background job `remove_wal.rs`, which iterates over all timelines and tries to remove WAL and persist the control file. This causes shared state updates and wakes up the manager. The fix is to skip notifying about updates if nothing actually changed.
In safekeepers we have several background tasks. Previously the WAL backup task was spawned by another task called `wal_backup_launcher`. That task received notifications via `wal_backup_launcher_rx` and decided whether to spawn or kill the backup task associated with the timeline. This was inconvenient because every code segment that touched shared state was responsible for pushing a notification into the `wal_backup_launcher_tx` channel. This was error prone: a notification was easy to miss, and in some cases pushing notifications in the wrong order could lead to deadlock.

We had a similar issue with the `is_active` timeline flag. That flag was calculated from the state, and any code modifying the state had to call a function to update the flag. We had a few bugs where we forgot to update `is_active` in places where it could change.

To fix these issues, this PR adds a new `timeline_manager` background task associated with each timeline. This task is responsible for managing all background tasks, including the `is_active` flag which is used for pushing broker messages. It subscribes to timeline state updates in a loop and decides when to spawn or kill background tasks.

There is a new structure called `TimelinesSet`. It stores a set of `Arc<Timeline>` and allows copying the set so it can be iterated without holding the mutex. This replaces the `is_active` flag for the broker: the broker push task now holds a reference to a `TimelinesSet` with the active timelines and uses it instead of iterating over all timelines and filtering by the `is_active` flag.

Also added some metrics for manager iterations and active backup tasks. Ideally the manager should not do too many iterations, and we should not have many backup tasks spawned at the same time.
Fixes #7751