
perf(walingest): mitigate bulk ingest throughput regression through larger EphemeralFile in-memory buffer #7273

Closed
wants to merge 41 commits

Conversation

problame (Contributor) commented Mar 28, 2024

refs #7124

Problem

The test_bulk_ingest benchmark shows about 2x lower throughput with tokio-epoll-uring compared to std-fs.
That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system without memory pressure and thus std-fs writes don't block on disk IO but only copy the data into the kernel page cache.
tokio-epoll-uring cannot beat that at this time, and possibly never will.
(However, under memory pressure, std-fs would stall the executor thread on kernel page cache writeback disk IO. That's why we want to use tokio-epoll-uring. And we likely want to use O_DIRECT in the future, at which point std-fs becomes an absolute show-stopper.)

Further, bulk walingest is a pathological case because the InMemoryLayer into which ingest happens uses EphemeralFile. EphemeralFile flushes its mutable_tail every 8k of WAL. And it does not do double-buffering / pipelining, i.e., while flushing the buffer, no new WAL can be ingested.
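To make the stall concrete, here is a minimal sketch (not the actual EphemeralFile code; `TailBuffer` and `flush_to_disk` are hypothetical names) of why a small buffer hurts: with an 8 KiB buffer, roughly every 8 KiB of WAL forces an awaited flush, and no new WAL can be appended while that flush is in flight.

```rust
const BUF_SZ: usize = 8 * 1024; // this PR raises the real buffer to 64 KiB

struct TailBuffer {
    buf: Vec<u8>,
}

impl TailBuffer {
    async fn append(&mut self, mut src: &[u8]) -> std::io::Result<()> {
        while !src.is_empty() {
            let free = BUF_SZ - self.buf.len();
            let n = free.min(src.len());
            self.buf.extend_from_slice(&src[..n]);
            src = &src[n..];
            if self.buf.len() == BUF_SZ {
                // No double-buffering / pipelining: ingest is blocked
                // until this flush completes.
                self.flush_to_disk().await?;
                self.buf.clear();
            }
        }
        Ok(())
    }

    async fn flush_to_disk(&mut self) -> std::io::Result<()> {
        // stand-in for the real write-out path
        Ok(())
    }
}
```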

More elaborate analysis: https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

Changes

As a medium-term fix, increase the buffer size of EphemeralFile to 64k. Longer-term, we probably want to do double-buffering / pipelined IO.

While thinking about this problem, I had the insight that we can generalize the recently added owned_buffers_io abstractions a bit, eliminating the special Writer inside the EphemeralFile::write_blob function.
That's what I did in this PR.
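For illustration, a hedged sketch of the kind of generalization meant here (trait and method names are made up, not the exact owned_buffers_io API): a buffered writer that is generic over any sink accepting owned buffers, which makes a special-purpose Writer inside EphemeralFile::write_blob unnecessary.

```rust
trait OwnedAsyncWriter {
    // takes the buffer by value and hands it back, io_uring-style
    async fn write_all_owned(&mut self, buf: Vec<u8>) -> std::io::Result<Vec<u8>>;
}

struct BufferedWriter<W: OwnedAsyncWriter> {
    inner: W,
    buf: Vec<u8>,
    cap: usize, // 64 KiB with this PR, up from 8 KiB
}

impl<W: OwnedAsyncWriter> BufferedWriter<W> {
    // writes larger than the buffer are elided for brevity
    async fn write_buffered(&mut self, src: &[u8]) -> std::io::Result<()> {
        if self.buf.len() + src.len() > self.cap {
            let full = std::mem::take(&mut self.buf);
            let mut recycled = self.inner.write_all_owned(full).await?;
            recycled.clear();
            self.buf = recycled; // reuse the allocation
        }
        self.buf.extend_from_slice(src);
        Ok(())
    }
}
```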

We lose the pageserver PageCache priming aspect that the old EphemeralFile had. The commit history contains an attempt & revert to add it, but it junks up the owned_buffers_io interface even further.

  • decide/experiment with whether we want to abandon the pre-warming
    • this could be a preliminary PR that we can ship to prod & measure impact
  • re-enable the benchmark
  • run this PR with benchmarks enabled, for both std-fs and tokio-epoll-uring, ensure it passes, compare results

Resource Usage

We currently do not flush the buffer when freezing the InMemoryLayer.
That means a single Timeline can have multiple 64k mutable tails alive, especially if flushing is slow.

We should either bound the number of frozen layers, or change the freezing code to flush the buffer and drop the allocation (roughly as in the sketch below).
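A rough sketch of that second option, with hypothetical names throughout: on freeze, flush whatever is still buffered and replace the buffer with a zero-capacity Vec so frozen InMemoryLayers no longer each pin a 64k mutable tail.

```rust
trait OwnedSink {
    async fn write_all_owned(&mut self, buf: Vec<u8>) -> std::io::Result<Vec<u8>>;
}

struct FrozenableBuffer<W> {
    sink: W,      // something that accepts owned buffers (e.g. a VirtualFile wrapper)
    buf: Vec<u8>, // the 64 KiB mutable tail
}

impl<W: OwnedSink> FrozenableBuffer<W> {
    async fn freeze(&mut self) -> std::io::Result<()> {
        if !self.buf.is_empty() {
            let tail = std::mem::take(&mut self.buf);
            self.sink.write_all_owned(tail).await?;
        }
        // Replace with a zero-capacity Vec so the 64 KiB allocation is released.
        self.buf = Vec::new();
        Ok(())
    }
}
```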

Performance

I last ran the test_bulk_ingest benchmark when I still had the page-cache pre-warming.
Report in commit 261f116a2d47b395005858d818bfa9a64b98dc30.

test_bulk_insert[neon-release-pg14-std-fs].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 8.381 s

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 11.760 s

So, still somewhat of a regression.

TODO:

  • experiment with 128k
  • re-run once we have decided on whether we need the pre-warming or not.

Alternative

The alternative would be to keep the write_blob's Writer and duplicate the buffered-writer functionality from owned_buffers_io into it.

…never used (same for get_vectored_impl)"

This reverts commit 5b8888c.
With this

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 16.053 s

down from

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 17.669 s

but the regression is from baseline

test_bulk_insert[neon-release-pg14-std-fs].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 9.335 s
…ode paths (#6378)"

Unchanged

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 9.194 s

This reverts commit 3da410c.
…a spawn_blocking single-threaded runtime

This makes things worse with tokio-epoll-uring

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 19.574 s

This is a partial revert of 3da410c
…in a spawn_blocking thread single-threaded runtime

builds on top of the previous commit

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 13.153 s
together with previous commits, this brings us back down to
pre-regression

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 9.991 s
…he page cache

... by forcing each write system call to go to disk

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 346 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 92.559 s

test_bulk_insert[neon-release-pg14-std-fs].wal_written: 346 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 81.998 s

=> ~10% worse instead of 2x
… engine in a spawn_blocking thread single-threaded runtime"

This reverts commit 4a8e7f8.
… inside a spawn_blocking single-threaded runtime"

This reverts commit 72a8e09.
… performs better because it hits the page cache

... by forcing each write system call to go to disk

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 346 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 93.417 s

test_bulk_insert[neon-release-pg14-std-fs].wal_written: 346 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 86.009 s

=> ~8% instead of 2x difference
…t std-fs performs better because it hits the page cache"

This reverts commit d66ccba.
The OwnedAsyncWrite stuff is based on the code in
tokio-epoll-uring on-demand download PR (#6992), which hasn't merged
yet.
…ile speciality of reading zeroes past end-of-file

This makes it diverge semantically from what's in the tokio-epoll-uring
download PR :(
test_bulk_insert[neon-release-pg14-std-fs].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 8.381 s

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 11.760 s
…arger-buffers

Conflicts:
	control_plane/src/local_env.rs
	control_plane/src/pageserver.rs
the above have since landed on `main` as preliminary changes
	pageserver/src/virtual_file.rs
	pageserver/src/virtual_file/owned_buffers_io/util/size_tracking_writer.rs
	pageserver/src/virtual_file/owned_buffers_io/write.rs
strategy: pick ours (likely breaks the build, will fix subsequently)
It would require plumbing through the RequestContext, and its utility is
somewhat dubious anyway.
github-actions bot commented Mar 28, 2024
2760 tests run: 2642 passed, 0 failed, 118 skipped (full report)


Code coverage* (full report)

  • functions: 28.1% (6466 of 23041 functions)
  • lines: 46.4% (44771 of 96571 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results (last update: 9281f54 at 2024-04-17T14:19:43.543Z).

problame requested a review from VladLazar on April 3, 2024 at 11:51
problame (Contributor, Author) commented Apr 3, 2024

Talked to @jcsp about this yesterday, he also thinks the page-cache pre-warming is probably counter-productive.

VladLazar (Contributor) left a comment:

Some thoughts:

  • Why do we even need the borrowed write interface? The upstream call stack does borrowed writes, but I don't see
    a strong reason why it has to do that (perhaps something related to the atomicity of Timeline::flush|commit -
    could be worked around if that's the case).
  • My intuition is that we don't need to pre-warm the cache. Monitoring get_page latency on Peter's benches
    should give us some clues there.
  • Having multiple tails alive for each timeline isn't great, but that was the case previously, right? Although
    the tail was shorter.
  • The stacking of abstractions in the ephemeral file is very clever, but I'm wondering if a separate "all-in-one" abstraction would be easier on the eye (I don't feel strongly about this).

}

fn bytes_total(&self) -> usize {
self.written // ?
VladLazar (Contributor):
Shouldn't this be N?

problame (Contributor, Author):

I guess, yeah.

Tangentially, I wonder why IoBuf needs the bytes_total in the first place. For IoBufMut, sure, it needs to know the bounds of where it needs to write. But for IoBuf, which is the equivalent of &[u8], I don't understand why it would need to know the length including uninitialized memory.
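For context, roughly the shape of the tokio-uring-style buffer traits this refers to (paraphrased from memory, not the exact tokio-epoll-uring definitions): bytes_init() is the initialized, readable prefix; bytes_total() is the full capacity including the uninitialized tail, which is what only the writable side (IoBufMut) really needs.

```rust
unsafe trait IoBuf: Unpin + 'static {
    fn stable_ptr(&self) -> *const u8;
    fn bytes_init(&self) -> usize;  // initialized, safe-to-read bytes
    fn bytes_total(&self) -> usize; // total capacity, incl. uninitialized tail
}

unsafe trait IoBufMut: IoBuf {
    fn stable_mut_ptr(&mut self) -> *mut u8;
    // caller asserts that the first `pos` bytes are now initialized
    unsafe fn set_init(&mut self, pos: usize);
}
```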

/// We keep the last page, which can still be modified, in [`Self::mutable_tail`].
/// The other pages, which can no longer be modified, are accessed through the page cache.
/// We sandwich the buffered writer between two size-tracking writers.
/// This allows us to "elegantly" track in-memory bytes vs flushed bytes,
jcsp (Collaborator):
"elegantly" :-D

Future: I think this could be made elegant though, if the sandwich was wrapped up in a separate type, and that separate type implemented a read() method that handled the read from the buffer. Although we may find it makes more sense to just flush to disk if the data we want to read is still buffered, since that should be a pretty rare case (we might find there are frequent callers from wal ingest that should really be caching something at a higher level rather than reading-back very recent writes)

problame (Contributor, Author):

So, there are really two different asks in this comment.

Future: I think this could be made elegant though, if the sandwich was wrapped up in a separate type, and that separate type implemented a read() method that handled the read from the buffer.

This is cleanliness / encapsulation for the "read from buffer" approach that we find in this PR.
I can see myself implementing that before merging this PR.

Although we may find it makes more sense to just flush to disk if the data we want to read is still buffered, since that should be a pretty rare case (we might find there are frequent callers from wal ingest that should really be caching something at a higher level rather than reading-back very recent writes)

I like the idea of not serving reads from memory faster than from disk.

Are you suggesting to write out < 64KiB blocks in that case?
That would be a different PR on top of this one though.

(I think before going that route, we could just add a 100us tokio sleep before serving a read from memory).
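For illustration, a hedged sketch of the encapsulation idea discussed above (field and method names are made up): one type owns the writer sandwich and also answers reads, deciding per request whether the bytes are already flushed or still sitting in the in-memory buffer.

```rust
struct SandwichedFile {
    flushed_up_to: u64, // bytes durably written (outer size-tracking writer)
    buffered: Vec<u8>,  // not-yet-flushed tail (the buffered writer's contents)
    // file: VirtualFile, // flushed bytes would be read from here
}

impl SandwichedFile {
    // bounds checks and reads straddling the flush boundary elided
    fn read_at(&self, offset: u64, dst: &mut [u8]) -> std::io::Result<()> {
        if offset >= self.flushed_up_to {
            // Read-back of very recent writes: serve from the buffer.
            // (The alternative discussed above is to flush first instead.)
            let start = (offset - self.flushed_up_to) as usize;
            dst.copy_from_slice(&self.buffered[start..start + dst.len()]);
            Ok(())
        } else {
            // read the flushed range from the underlying file
            unimplemented!("read from the flushed portion of the file")
        }
    }
}
```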

/// None <=> IO is ongoing.
/// Size is fixed to PAGE_SZ at creation time and must not be changed.
mutable_tail: Option<BytesMut>,
/// TODO: longer-term, we probably want to get rid of this in favor
jcsp (Collaborator):

Yes -- I'd be inclined to build the pipelining into BufferedWriter
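A rough sketch of what building the pipelining into a buffered writer could look like (all names here are hypothetical, not the crate's API): when the front buffer fills, hand it to an in-flight flush task and keep ingesting into the recycled buffer.

```rust
use tokio::task::JoinHandle;

struct DoubleBuffered {
    front: Vec<u8>,
    in_flight: Option<JoinHandle<std::io::Result<Vec<u8>>>>,
    cap: usize,
}

impl DoubleBuffered {
    async fn append(&mut self, src: &[u8]) -> std::io::Result<()> {
        self.front.extend_from_slice(src);
        if self.front.len() >= self.cap {
            // Wait for the previous flush (if any) to hand its buffer back.
            let recycled = match self.in_flight.take() {
                Some(h) => h.await.expect("flush task panicked")?,
                None => Vec::with_capacity(self.cap),
            };
            let full = std::mem::replace(&mut self.front, recycled);
            self.front.clear();
            // Flush in the background; ingest continues into `self.front`.
            self.in_flight = Some(tokio::spawn(write_out(full)));
        }
        Ok(())
    }
}

async fn write_out(buf: Vec<u8>) -> std::io::Result<Vec<u8>> {
    // stand-in for the real owned write; returns the buffer for reuse
    Ok(buf)
}
```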

jcsp (Collaborator) commented Apr 17, 2024

In a microbenchmark (#7409, case ingest-big-values/ingest 128MB/8k seq, no delta) of 8k put_value to inmemory layer, this PR increases throughput by an order of magnitude when writing to memory (~1.6GB/s -> 7.5 GB/s on my workstation), so it clearly removes a major CPU bottleneck.

I like that this change removes the mutable_tail in favor of a generic buffered writer. The "sandwich" of writers is a bit strange on first read, but seems fairly robust.

self.len += 4;
}
self.len += srcbuf.len() as u64;
// TODO: bring back pre-warming of page cache, using another sandwich layer
jcsp (Collaborator):

I think we can just drop this TODO -- warming page cache with very recently written data is contrary to the I/O flow we expect, as you hint in the PR messages.


#[inline(always)]
async fn write_all_borrowed(&mut self, _buf: &[u8]) -> std::io::Result<usize> {
// TODO: ensure this through the type system
jcsp (Collaborator):

Could you expand the comment? It's not obvious why this should be illegal: why can't one write from a &[u8] into a VirtualFile? Once done, this can probably lose the TODO prefix unless we have a specific idea for how to do it.

problame (Contributor, Author):

why can't one write from a &[u8] into a VirtualFile?

Because a &[u8] is not owned, which is incompatible with Rust's async model & io_uring.
Rust's ownership / borrowing system can't model that the kernel holds a reference to the memory pointed to by &[u8] while the io_uring operation is ongoing.
Hence Rust allows dropping the future while the operation is still ongoing, which then leads to memory corruption.
That's why I converted the whole write (and read!) path to owned IO.
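An illustrative sketch of the owned-buffer calling convention this describes (signatures are paraphrased, not the exact VirtualFile API): the buffer is moved into the operation and handed back on completion, so dropping the future cannot leave the kernel writing into freed memory.

```rust
struct OwnedFile;

impl OwnedFile {
    async fn write_all_at(&self, buf: Vec<u8>, offset: u64) -> (std::io::Result<()>, Vec<u8>) {
        // With io_uring, `buf` is owned by the in-flight operation; if this
        // future is dropped, the runtime still owns `buf` until the kernel is
        // done with it. A borrowed &[u8] cannot give that guarantee.
        let _ = offset;
        (Ok(()), buf)
    }
}
```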

Once done, this can probably lose the TODO prefix unless we have a specific idea for how to do it.

I think for the purposes of this comment, it's sufficient to state that

The VirtualFile API requires owned buffers, not slices, because that's what tokio_epoll_uring requires.

Perhaps the write_all_borrowed should be an extension trait that is closer to the InMemoryLayer where we actually need it.

problame (Contributor, Author) commented:

I split up this PR into a bunch of refactorings that can be reviewed independently.
Careful review is required because this is sensitive code where we can't be wrong.

See the parent issue #7124's task list "Short-Term: Increase Buffer Sizes".

problame closed this Apr 23, 2024