Refactor safekeepers to support special timeline modes #6337

Closed
Tracked by #6220
petuhovskiy opened this issue Jan 11, 2024 · 2 comments · Fixed by #8022
Labels: c/storage/safekeeper (Component: storage: safekeeper)

@petuhovskiy
Member

petuhovskiy commented Jan 11, 2024

In safekeeper, we have multiple per-timeline and global background tasks. Some of these tasks only need to know the latest LSNs; others must be able to read WAL from disk to function.

The idea is to keep track of all per-timeline tasks and add a timeline state that can be Loaded | Offloaded. After this, the behaviour should be as follows:

  • If a task wants to read WAL and the current state is Offloaded, we should download WAL from S3 and change the state before running the task.
  • If the current state is Loaded, no tasks have wanted to read WAL recently, and partial backup has uploaded the single local WAL segment, change the state to Offloaded.
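
The two rules above can be sketched as a small state machine. This is only an illustration, not the safekeeper code: `TimelineWal`, its fields, the idle threshold, and the `download` callback are all hypothetical names introduced here.

```rust
use std::time::{Duration, Instant};

/// Whether a timeline's WAL is available on local disk (names follow the issue text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WalState {
    Loaded,
    Offloaded,
}

/// Hypothetical per-timeline bookkeeping; none of these fields come from the real code.
pub struct TimelineWal {
    pub state: WalState,
    /// Last time any task asked to read WAL.
    pub last_wal_read: Instant,
    /// Number of WAL segments currently on local disk.
    pub local_segments: usize,
    /// Whether partial backup has uploaded the remaining local segment.
    pub partial_segment_uploaded: bool,
}

impl TimelineWal {
    /// Called before a WAL-reading task runs. Returns true if a download was needed.
    pub fn ensure_loaded(&mut self, now: Instant, download: impl FnOnce()) -> bool {
        self.last_wal_read = now;
        if self.state == WalState::Offloaded {
            download(); // stand-in for fetching WAL from S3
            self.state = WalState::Loaded;
            true
        } else {
            false
        }
    }

    /// Periodic check: offload when idle long enough and the single
    /// local segment has been uploaded. Returns true if the state changed.
    pub fn maybe_offload(&mut self, now: Instant, idle: Duration) -> bool {
        let is_idle = now.saturating_duration_since(self.last_wal_read) >= idle;
        if self.state == WalState::Loaded
            && is_idle
            && self.local_segments == 1
            && self.partial_segment_uploaded
        {
            self.state = WalState::Offloaded;
            true
        } else {
            false
        }
    }
}
```

What counts as "recently" is left as a parameter here; the issue does not specify a threshold.
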
@petuhovskiy
Member Author

Ok, I gave it more thought and I'm currently not happy with how the backup task is spawned and how the is_active flag is updated. Going to fix this first, before adding a new mode for offloaded timelines.
#7751

petuhovskiy added a commit that referenced this issue May 31, 2024
This is a preparation for
#6337.

The idea is to add FullAccessTimeline, which will act as a guard for
tasks requiring access to WAL files. Eviction will be blocked on these
tasks and WAL won't be deleted from disk as long as there is at least
one active FullAccessTimeline.

To get FullAccessTimeline, tasks call `tli.full_access_guard().await?`.
After eviction is implemented, this function will be responsible for
downloading any missing WAL files and waiting until the download finishes.
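
The guard idea can be illustrated with a plain RAII counter. This is a synchronous sketch under stated assumptions, not the actual implementation: `Timeline`, `FullAccessGuard`, and `can_evict` are hypothetical stand-ins, and the real `full_access_guard()` is async and fallible.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a timeline; it only tracks how many guards are alive.
#[derive(Default)]
pub struct Timeline {
    active_guards: Arc<AtomicUsize>,
}

/// Stand-in for `FullAccessTimeline`: while it lives, WAL must stay on disk.
pub struct FullAccessGuard {
    active_guards: Arc<AtomicUsize>,
}

impl Timeline {
    /// Synchronous stand-in for `tli.full_access_guard().await?`.
    pub fn full_access_guard(&self) -> FullAccessGuard {
        self.active_guards.fetch_add(1, Ordering::SeqCst);
        FullAccessGuard {
            active_guards: Arc::clone(&self.active_guards),
        }
    }

    /// Eviction (and WAL deletion) is allowed only when no guard is alive.
    pub fn can_evict(&self) -> bool {
        self.active_guards.load(Ordering::SeqCst) == 0
    }
}

impl Drop for FullAccessGuard {
    fn drop(&mut self) {
        // Releasing the guard unblocks eviction once the count reaches zero.
        self.active_guards.fetch_sub(1, Ordering::SeqCst);
    }
}
```

A bare counter leaves a race between checking `can_evict()` and a new guard being taken, so a real design would need to make "check and evict" a single serialized step.
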

This commit also contains other small refactorings:
- Separate `get_tenant_dir` and `get_timeline_dir` functions for
building a local path. This is useful for looking at usages and finding
tasks requiring access to local filesystem.
- `timeline_manager` is now responsible for spawning all background
tasks
- WAL removal task is now spawned immediately after the horizon is updated
a-masterov pushed a commit that referenced this issue Jun 3, 2024
@petuhovskiy
Member Author

The latest plan to implement eviction:

  1. Add eviction_state to TimelinePersistentState (state.rs).
  2. Make the partial_backup task return the uploaded segment once it has fully finished uploading.
  3. The manager should detect the condition where the up-to-date state is uploaded to S3 and there are no active tasks at all.
  4. The manager then moves the timeline to the evicted state; the Safekeeper in SharedState is replaced with a shorter state without WAL access.
  5. Fix everything to be able to use the shorter state.
  6. Add download/verify to un-evict timelines from S3.
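
Steps 1, 3, and 4 of this plan could look roughly like the sketch below. It assumes that "uploaded to S3" means the uploaded partial segment covers the current flush LSN; `ManagerView`, `ready_to_evict`, and `try_evict` are hypothetical names, and `Lsn` is simplified to an integer.

```rust
/// Log sequence number, simplified to a plain integer for this sketch.
pub type Lsn = u64;

/// Persisted eviction state (step 1). Variant names are an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvictionState {
    Present,
    Offloaded(Lsn),
}

/// Simplified stand-in for `TimelinePersistentState`.
pub struct PersistentState {
    pub eviction_state: EvictionState,
}

/// What the manager observes about a timeline (hypothetical fields).
pub struct ManagerView {
    pub flush_lsn: Lsn,
    /// Flush LSN covered by the last fully uploaded partial segment, if any.
    pub uploaded_partial_flush_lsn: Option<Lsn>,
    /// Number of tasks currently requiring WAL access.
    pub active_tasks: usize,
}

/// Step 3: remote storage is up to date and nothing is using WAL.
pub fn ready_to_evict(view: &ManagerView) -> bool {
    view.active_tasks == 0 && view.uploaded_partial_flush_lsn == Some(view.flush_lsn)
}

/// Step 4: switch to the evicted state. Returns true if the state changed.
pub fn try_evict(state: &mut PersistentState, view: &ManagerView) -> bool {
    if state.eviction_state == EvictionState::Present && ready_to_evict(view) {
        state.eviction_state = EvictionState::Offloaded(view.flush_lsn);
        true
    } else {
        false
    }
}
```
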

conradludgate pushed a commit that referenced this issue Jun 27, 2024
Fixes #6337

Add safekeeper support to switch between `Present` and
`Offloaded(flush_lsn)` states. The offloading is disabled by default,
but can be controlled using new cmdline arguments:

```
      --enable-offload
          Enable automatic switching to offloaded state
      --delete-offloaded-wal
          Delete local WAL files after offloading. When disabled, they will be left on disk
      --control-file-save-interval <CONTROL_FILE_SAVE_INTERVAL>
          Pending updates to control file will be automatically saved after this interval [default: 300s]
```

Manager watches state updates and detects when there is no activity on
the timeline and the partial backup upload in remote storage is up to
date. When all conditions are met, the state can be switched to offloaded.

In `timeline.rs` there is `StateSK` enum to support switching between
states. When offloaded, code can access only control file structure and
cannot use `SafeKeeper` to accept new WAL.
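
As a rough illustration of how such an enum restricts access, consider the sketch below. The variant names, `ControlFile`, `FullSafekeeper`, and `append_wal` are assumptions made for this example and may not match `timeline.rs`.

```rust
/// Simplified control-file contents, available in both states.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ControlFile {
    pub commit_lsn: u64,
}

/// Stand-in for the full `SafeKeeper`, which can also accept WAL.
pub struct FullSafekeeper {
    pub control: ControlFile,
    pub flush_lsn: u64,
}

/// Sketch of a `StateSK`-like enum; variant names are assumed.
pub enum StateSk {
    Loaded(FullSafekeeper),
    Offloaded(ControlFile),
}

impl StateSk {
    /// The control file can be read in either state.
    pub fn control(&self) -> &ControlFile {
        match self {
            StateSk::Loaded(sk) => &sk.control,
            StateSk::Offloaded(cf) => cf,
        }
    }

    /// Accepting WAL works only while loaded; returns the new flush LSN.
    pub fn append_wal(&mut self, bytes: u64) -> Result<u64, &'static str> {
        match self {
            StateSk::Loaded(sk) => {
                sk.flush_lsn += bytes;
                Ok(sk.flush_lsn)
            }
            StateSk::Offloaded(_) => Err("timeline is offloaded; WAL cannot be accepted"),
        }
    }
}
```

Encoding the two modes as enum variants means the compiler forces every caller to handle the offloaded case explicitly.
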

`FullAccessTimeline` is now renamed to `WalResidentTimeline`. This
struct contains a guard that notifies the manager about active tasks
requiring on-disk WAL access. All guards are issued by the manager, and
all requests are sent via a channel using `ManagerCtl`. When the manager
receives a request to issue a guard, it unevicts the timeline if it is
currently evicted.
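
The request/reply flow described here can be sketched with standard-library channels and a thread in place of the async manager task. Everything below is hypothetical: the message enum, `spawn`, `wal_residence_guard`, and `status` are illustrative names, and un-eviction is reduced to flipping a flag.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Messages handled by the manager loop (hypothetical protocol).
enum ManagerMsg {
    /// A task asks for a guard; the manager replies with a guard id.
    GuardRequest(Sender<u64>),
    /// A previously issued guard was dropped.
    GuardDrop(u64),
    /// Query the current (evicted, active guard count) pair.
    Status(Sender<(bool, usize)>),
}

/// Handle used by tasks to talk to the manager, loosely mirroring `ManagerCtl`.
#[derive(Clone)]
pub struct ManagerCtl {
    tx: Sender<ManagerMsg>,
}

/// Stand-in for the guard inside `WalResidentTimeline`.
pub struct WalResidentGuard {
    id: u64,
    tx: Sender<ManagerMsg>,
}

impl Drop for WalResidentGuard {
    fn drop(&mut self) {
        // Ignore errors: the manager may have already shut down.
        let _ = self.tx.send(ManagerMsg::GuardDrop(self.id));
    }
}

impl ManagerCtl {
    /// Starts the manager loop on a background thread.
    pub fn spawn(initially_evicted: bool) -> ManagerCtl {
        let (tx, rx) = channel::<ManagerMsg>();
        thread::spawn(move || {
            let mut evicted = initially_evicted;
            let mut next_id = 0u64;
            let mut active = 0usize;
            for msg in rx {
                match msg {
                    ManagerMsg::GuardRequest(reply) => {
                        // Un-evict first; a real manager would download WAL here.
                        evicted = false;
                        active += 1;
                        next_id += 1;
                        let _ = reply.send(next_id);
                    }
                    ManagerMsg::GuardDrop(_id) => active = active.saturating_sub(1),
                    ManagerMsg::Status(reply) => {
                        let _ = reply.send((evicted, active));
                    }
                }
            }
        });
        ManagerCtl { tx }
    }

    /// Asks the manager for a guard; returns None if the manager is gone.
    pub fn wal_residence_guard(&self) -> Option<WalResidentGuard> {
        let (reply_tx, reply_rx) = channel();
        self.tx.send(ManagerMsg::GuardRequest(reply_tx)).ok()?;
        let id = reply_rx.recv().ok()?;
        Some(WalResidentGuard { id, tx: self.tx.clone() })
    }

    /// Returns (evicted, active guard count) as seen by the manager.
    pub fn status(&self) -> Option<(bool, usize)> {
        let (reply_tx, reply_rx) = channel();
        self.tx.send(ManagerMsg::Status(reply_tx)).ok()?;
        reply_rx.recv().ok()
    }
}
```

Routing every request through one loop means un-eviction and guard accounting are serialized in a single place, which avoids racing an eviction against a newly issued guard.
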

Fixed a bug in partial WAL backup: it previously used `term` instead of
`last_log_term`.

After this commit is merged, the next step is to roll this change out, as
described in issue #6338.