layer file creation: fsync timeline directories using `VirtualFile::sync_all()` #6986

problame · 2024-03-01T12:48:57Z

Except for the involvement of the VirtualFile fd cache, this is
equivalent to what happened before at runtime.

Future PR #6378 will implement VirtualFile::sync_all() using
tokio-epoll-uring if that's configured as the io engine.
This PR is preliminary work for that.

part of #6663

The `writer.finish()` methods already fsync the inode, using `VirtualFile::sync_all()`. All that the callers need to do is fsync their directory, i.e., the timeline directory.

Note that there's a call in the new compaction code that is apparently dead at runtime, so I couldn't fix up any fsyncs there ([link](https://github.com/neondatabase/neon/blob/502b69b33bbd4ad1b0647e921a9c665249a2cd62/pageserver/src/tenant/timeline/compaction.rs#L204-L211)).

In the grand scheme of things, layer durability probably doesn't matter anymore because the remote storage is authoritative at all times as of #5198. But let's not break the discipline in this commit.

part of #6663
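The two-step discipline described in this commit message (fsync the inode, then fsync the containing directory) can be sketched with plain `std::fs`. This is a minimal illustration, not the pageserver code: `create_file_durably` is a hypothetical helper, it uses `std::fs::File` rather than `VirtualFile`, and it assumes a Unix-like platform where a directory can be opened and fsynced.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: write `contents` to `dir/name` and make both the
/// file and its directory entry durable.
pub fn create_file_durably(dir: &Path, name: &str, contents: &[u8]) -> io::Result<()> {
    let path = dir.join(name);
    let mut file = File::create(&path)?;
    file.write_all(contents)?;
    // Step 1: fsync the inode (the file's data and metadata).
    // In the PR, this is what the `writer.finish()` methods already do.
    file.sync_all()?;
    // Step 2: fsync the parent directory so that the new directory entry
    // survives a crash. In the PR, this is the caller's job.
    // Assumption: Unix-like OS; opening a directory this way fails on Windows.
    File::open(dir)?.sync_all()
}
```

Skipping step 2 leaves a window in which the file's contents are on disk but the name pointing at them may not be.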


As pointed out in the comments added in this PR: the in-memory state of the filesystem already has the layer file in its final place. If the fsync fails but pageserver continues to execute, it's quite easy for subsequent pageserver code to observe the file being there and assume it's durable when it really isn't.

It can happen that we get ENOSPC during the fsync. However,

1. the timeline dir is small (remember, the big layer _file_ has already been synced). Small data means ENOSPC due to delayed allocation races etc. is less likely.
2. what else are we going to do in that case? If we decide to bubble up the error, the file remains on disk. We could try to unlink it and fsync after the unlink. If that fails, we would _definitely_ need to error out. Is it worth the trouble though?

Side note: all this logic about not carrying on after fsync failure implies that we `sync` the filesystem successfully before we restart the pageserver. Our systemd unit currently does not do that, but should.
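The reasoning above amounts to treating a failed directory fsync as fatal rather than as an error to bubble up. A rough sketch of that policy follows, under stated assumptions: `DirSyncOutcome`, `sync_timeline_dir`, and `sync_timeline_dir_or_abort` are hypothetical names invented for illustration, and the actual PR may structure its fatal-error handling differently.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Hypothetical outcome type: either the directory entry is durable,
/// or the process must not continue.
#[derive(Debug, PartialEq)]
pub enum DirSyncOutcome {
    Durable,
    Fatal(io::ErrorKind),
}

/// fsync a directory (assumes a Unix-like OS where this is supported).
fn fsync_dir(dir: &Path) -> io::Result<()> {
    File::open(dir)?.sync_all()
}

/// Classify the result instead of returning a recoverable error: the file
/// is already visible in the in-memory filesystem state, so later code
/// could wrongly assume it is durable if we carried on.
pub fn sync_timeline_dir(dir: &Path) -> DirSyncOutcome {
    match fsync_dir(dir) {
        Ok(()) => DirSyncOutcome::Durable,
        Err(e) => DirSyncOutcome::Fatal(e.kind()),
    }
}

/// Abort the process on failure rather than continuing with a
/// possibly non-durable directory entry.
pub fn sync_timeline_dir_or_abort(dir: &Path) {
    if let DirSyncOutcome::Fatal(kind) = sync_timeline_dir(dir) {
        eprintln!("fatal: fsync of {} failed: {:?}", dir.display(), kind);
        std::process::abort();
    }
}
```

Aborting trades availability for a simpler invariant: any layer file that running code can observe has had its directory entry synced.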


github-actions · 2024-03-01T13:26:00Z

2484 tests run: 2361 passed, 0 failed, 123 skipped (full report)

Flaky tests (1)

Postgres 14

test_compute_pageserver_connection_stress: debug

Code coverage* (full report)

functions: 28.7% (6933 of 24172 functions)
lines: 47.2% (42515 of 90097 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
c654263 at 2024-03-04T13:20:20.883Z

koivunej

Looking good!

pageserver/src/tenant.rs


part of #6663

See that epic for more context & related commits.

Problem
-------

Before this PR, the layer-file-creating code paths were using VirtualFile, but under the hood these were still blocking system calls. Generally this meant we'd stall the executor thread, unless the caller "knew" and used the following pattern instead:

```
spawn_blocking(|| {
    Handle::block_on(async {
        VirtualFile::....().await;
    })
}).await
```

Solution
--------

This PR adopts `tokio-epoll-uring` on the layer-file-creating code paths in pageserver.

Note that on-demand downloads still use `tokio::fs`; these will be converted in a future PR.

Design: Avoiding Regressions With `std-fs`
------------------------------------------

If we make the VirtualFile write path truly async using `tokio-epoll-uring`, should we then remove the `spawn_blocking` + `Handle::block_on` usage upstack in the same commit?

No, because if we're still using the `std-fs` io engine, we'd then block the executor in those places where previously we were protecting ourselves from that through the `spawn_blocking`.

So, if we want to see benefits from `tokio-epoll-uring` on the write path while also preserving the ability to switch between `tokio-epoll-uring` and `std-fs`, where `std-fs` will behave identically to what we have now, we need to **conditionally** use `spawn_blocking` + `Handle::block_on`.

I.e., in the places where we use that now, we'll need to make it conditional on the currently configured io engine.

It boils down to investigating all the places where we do `spawn_blocking(... block_on(... VirtualFile::...))`.

Detailed [write-up of that investigation in Notion](https://neondatabase.notion.site/Surveying-VirtualFile-write-path-usage-wrt-tokio-epoll-uring-integration-spawn_blocking-Handle-bl-5dc2270dbb764db7b2e60803f375e015?pvs=4), made publicly accessible.

tl;dr: Preceding PRs addressed the relevant call sites:

- `metadata` file: turns out we could simply remove it (#6777, #6769, #6775)
- `create_delta_layer()`: made sensitive to `virtual_file_io_engine` in #6986

NB: once we are switched over to `tokio-epoll-uring` everywhere in production, we can deprecate `std-fs`; to keep macOS support, we can use `tokio::fs` instead. That will remove this whole headache.

Code Changes In This PR
-----------------------

- VirtualFile API changes
  - `VirtualFile::write_at`
    - implement an `ioengine` operation and switch `VirtualFile::write_at` to it
  - `VirtualFile::metadata()`
    - curiously, we only use it from the layer writers' `finish()` methods
    - introduce a wrapper `Metadata` enum because `std::fs::Metadata` cannot be constructed by code outside rust std
  - `VirtualFile::sync_all()` and, for completeness' sake, add `VirtualFile::sync_data()`

Testing & Rollout
-----------------

Before merging this PR, we ran the CI with both io engines.

Additionally, the changes will soak in staging.

We could have a feature gate / add a new io engine `tokio-epoll-uring-write-path` to do a gradual rollout. However, that's not part of this PR.

Future Work
-----------

There's still some use of `std::fs` and/or `tokio::fs` for directory namespace operations, e.g. `std::fs::rename`.

We're not addressing those in this PR, as we'll need to add the support in tokio-epoll-uring first. Note that rename itself is usually fast if the directory is in the kernel dentry cache, and only the fsync after rename is slow. These fsyncs are using tokio-epoll-uring, so the impact should be small.
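To illustrate why a wrapper `Metadata` enum is needed: `std::fs::Metadata` has no public constructor, so an io engine that obtains file attributes by other means has to carry them in its own type. The following is a sketch only. `RawStat` and the variant names are hypothetical stand-ins and do not reflect the actual types used in pageserver or tokio-epoll-uring.

```rust
use std::fs;

/// Hypothetical stand-in for attributes obtained without going through
/// `std::fs`, e.g. from a `statx`-style call issued by another io engine.
#[derive(Debug, Clone, Copy)]
pub struct RawStat {
    pub size: u64,
}

/// Wrapper that hides which io engine produced the metadata.
#[derive(Debug)]
pub enum Metadata {
    /// Produced by the std-based engine.
    StdFs(fs::Metadata),
    /// Produced by an engine that cannot construct `std::fs::Metadata`.
    Raw(RawStat),
}

impl Metadata {
    /// File size in bytes, regardless of which engine produced the value.
    pub fn len(&self) -> u64 {
        match self {
            Metadata::StdFs(m) => m.len(),
            Metadata::Raw(s) => s.size,
        }
    }
}

impl From<fs::Metadata> for Metadata {
    fn from(m: fs::Metadata) -> Self {
        Metadata::StdFs(m)
    }
}
```

Callers such as a layer writer's `finish()` method would then call `len()` on the wrapper without needing to know which engine is configured.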

problame added 6 commits March 1, 2024 11:56

Merge remote-tracking branch 'origin/main' into problame/integrate-to…

ce251ac

…kio-epoll-uring/layer-write-path-fsync-cleanups

Merge branch 'problame/integrate-tokio-epoll-uring/layer-write-path-f…

c4f7a19

…sync-cleanups' into problame/integrate-tokio-epoll-uring/create-layer-fatal-err-on-fsync

rebase on fatal_err changes

1fe80d7

problame mentioned this pull request Mar 1, 2024

Epic: adopt tokio-epoll-uring on the write path #6663

Closed

problame requested a review from koivunej March 1, 2024 14:49

problame marked this pull request as ready for review March 1, 2024 14:49

problame requested a review from a team as a code owner March 1, 2024 14:49

koivunej approved these changes Mar 1, 2024

pageserver/src/tenant.rs (resolved)

problame added 3 commits March 1, 2024 15:25

Merge remote-tracking branch 'origin/main' into problame/integrate-to…

5528b16

…kio-epoll-uring/layer-write-path-fsync-cleanups

Merge branch 'problame/integrate-tokio-epoll-uring/layer-write-path-f…

c972d17

…sync-cleanups' into problame/integrate-tokio-epoll-uring/create-layer-fatal-err-on-fsync

Merge branch 'problame/integrate-tokio-epoll-uring/create-layer-fatal…

7299a0a

…-err-on-fsync' into problame/integrate-tokio-epoll-uring/ioengine-par-fsync

problame mentioned this pull request Mar 1, 2024

tokio-epoll-uring: use it on the layer-creating code paths #6378

Merged

Base automatically changed from problame/integrate-tokio-epoll-uring/create-layer-fatal-err-on-fsync to main March 4, 2024 12:18

Merge branch 'main' into problame/integrate-tokio-epoll-uring/ioengin…

c654263

…e-par-fsync

problame enabled auto-merge (squash) March 4, 2024 12:31

problame merged commit 944cac9 into main Mar 4, 2024
53 checks passed

problame deleted the problame/integrate-tokio-epoll-uring/ioengine-par-fsync branch March 4, 2024 13:31
