perf(walingest): mitigate bulk ingest throughput regression through larger EphemeralFile in-memory buffer #7273
Conversation
…ed (same for get_vectored_impl)
…never used (same for get_vectored_impl)" This reverts commit 5b8888c.
This reverts commit e85a631.
With this:
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 16.053 s
down from:
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 17.669 s
but the regression is from baseline:
  test_bulk_insert[neon-release-pg14-std-fs].wal_written: 345 MB
  test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 9.335 s
This reverts commit 57241c1.
…a spawn_blocking single-threaded runtime
This makes things worse with tokio-epoll-uring:
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 19.574 s
This is a partial revert of 3da410c
…in a spawn_blocking thread single-threaded runtime
Builds on top of the previous commit.
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 13.153 s
Together with the previous commits, this brings us back down to pre-regression:
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 9.991 s
…he page cache ... by forcing each write system call to go to disk
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 346 MB
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 92.559 s
  test_bulk_insert[neon-release-pg14-std-fs].wal_written: 346 MB
  test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 81.998 s
=> 10%ish worse instead of 2x
…the page cache" This reverts commit 2edbc07.
This reverts commit c8c04c0.
… engine in a spawn_blocking thread single-threaded runtime" This reverts commit 4a8e7f8.
… inside a spawn_blocking single-threaded runtime" This reverts commit 72a8e09.
… performs better because it hits the page cache ... by forcing each write system call to go to disk
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 346 MB
  test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 93.417 s
  test_bulk_insert[neon-release-pg14-std-fs].wal_written: 346 MB
  test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 86.009 s
=> ~8% instead of 2x difference
…t std-fs performs better because it hits the page cache" This reverts commit d66ccba.
The OwnedAsyncWrite stuff is based on the code in the tokio-epoll-uring on-demand download PR (#6992), which hasn't merged yet.
…ile speciality of reading zeroes past end-of-file. This makes it diverge semantically from what's in the tokio-epoll-uring download PR :(
…ious commit, yet more churn -,-
test_bulk_insert[neon-release-pg14-std-fs].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 8.381 s
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 11.760 s
…arger-buffers
Conflicts:
  control_plane/src/local_env.rs
  control_plane/src/pageserver.rs
(the above have landed in `main` via preliminary PRs since)
  pageserver/src/virtual_file.rs
  pageserver/src/virtual_file/owned_buffers_io/util/size_tracking_writer.rs
  pageserver/src/virtual_file/owned_buffers_io/write.rs
strategy: pick ours (likely breaks the build, will fix subsequently)
It would require plumbing through the RequestContext, and its utility is somewhat dubious anyway.
2760 tests run: 2642 passed, 0 failed, 118 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
This comment gets automatically updated with the latest test results.
9281f54 at 2024-04-17T14:19:43.543Z :recycle:
Talked to @jcsp about this yesterday, he also thinks the page-cache pre-warming is probably counter-productive.
Some thoughts:
- Why do we even need the borrowed write interface? The upstream call stack does borrowed writes, but I don't see a strong reason why it has to do that (perhaps something related to the atomicity of Timeline::flush|commit - could be worked around if that's the case).
- My intuition is that we don't need to pre-warm the cache. Monitoring get_page latency on Peter's benches should give us some clues there.
- Having multiple tails alive for each timeline isn't great, but that was the case previously, right? Although the tail was shorter.
- The stacking of abstractions in the ephemeral file is very clever, but I'm wondering if a separate "all-in-one" abstraction would be easier on the eye (I don't feel strongly about this).
}

fn bytes_total(&self) -> usize {
    self.written // ?
Shouldn't this be N?
I guess, yeah.
Tangentially, I wonder why IoBuf needs the bytes_total in the first place. For IoBufMut, sure, it needs to know the bounds of where it needs to write. But for IoBuf, which is the equivalent of &[u8], I don't understand why it would need to know the length including uninitialized memory.
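For context, the owned-buffer traits look roughly like this; this is a paraphrase modeled on tokio-uring's IoBuf/IoBufMut, and tokio-epoll-uring's variants may differ in detail:

```rust
/// Paraphrase of the tokio-uring-style owned-buffer traits;
/// not the exact tokio-epoll-uring definitions.
pub trait IoBuf: Unpin + 'static {
    /// Pointer that stays valid even if the owning value is moved.
    fn stable_ptr(&self) -> *const u8;
    /// Number of initialized (readable) bytes.
    fn bytes_init(&self) -> usize;
    /// Total capacity, including uninitialized memory -- the method
    /// whose purpose for read-only buffers is questioned above.
    fn bytes_total(&self) -> usize;
}

pub trait IoBufMut: IoBuf {
    fn stable_mut_ptr(&mut self) -> *mut u8;
    /// Caller asserts the first `pos` bytes are now initialized,
    /// e.g. after the kernel completed a read into the buffer.
    unsafe fn set_init(&mut self, pos: usize);
}
```

For IoBufMut the total capacity clearly bounds how far the kernel may write; for a read-only IoBuf, bytes_init alone would seem to suffice, which is exactly the question raised here.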
/// We keep the last page, which can still be modified, in [`Self::mutable_tail`].
/// The other pages, which can no longer be modified, are accessed through the page cache.
/// We sandwich the buffered writer between two size-tracking writers.
/// This allows us to "elegantly" track in-memory bytes vs flushed bytes,
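For illustration, a minimal synchronous sketch of such a sandwich; the names are hypothetical and the real implementation is async with owned buffers (under pageserver/src/virtual_file/owned_buffers_io/):

```rust
use std::io::{self, Write};

/// Counts every byte that passes through to the wrapped writer.
struct SizeTrackingWriter<W: Write> {
    inner: W,
    bytes_written: u64,
}

impl<W: Write> Write for SizeTrackingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.bytes_written += n as u64;
        Ok(n)
    }
    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

fn main() -> io::Result<()> {
    // The sandwich: outer tracker -> buffered writer -> inner tracker -> sink.
    let inner = SizeTrackingWriter { inner: io::sink(), bytes_written: 0 };
    let buffered = io::BufWriter::with_capacity(64 * 1024, inner);
    let mut outer = SizeTrackingWriter { inner: buffered, bytes_written: 0 };

    for chunk in [0u8; 100_000].chunks(8 * 1024) {
        outer.write_all(chunk)?;
    }
    // The outer writer counts bytes accepted, the inner one bytes flushed;
    // the difference is exactly what still sits in the 64KiB buffer.
    let in_buffer = outer.bytes_written - outer.inner.get_ref().bytes_written;
    println!("buffered, not yet flushed: {in_buffer} bytes");
    Ok(())
}
```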
"elegantly" :-D
Future: I think this could be made elegant though, if the sandwich was wrapped up in a separate type, and that separate type implemented a read() method that handled the read from the buffer. Although we may find it makes more sense to just flush to disk if the data we want to read is still buffered, since that should be a pretty rare case (we might find there are frequent callers from wal ingest that should really be caching something at a higher level rather than reading-back very recent writes)
So, there are really two different asks in this comment.
Future: I think this could be made elegant though, if the sandwich was wrapped up in a separate type, and that separate type implemented a read() method that handled the read from the buffer.
This is cleanliness / encapsulation for the "read from buffer" approach that we find in this PR.
I can see myself implementing that before merging this PR.
Although we may find it makes more sense to just flush to disk if the data we want to read is still buffered, since that should be a pretty rare case (we might find there are frequent callers from wal ingest that should really be caching something at a higher level rather than reading-back very recent writes)
I like the idea of not serving reads from memory faster than from disk.
Are you suggesting to write out <64KiB blocks in that case?
That would be a different PR on top of this one though.
(I think before going that route, we could just add a 100us tokio sleep before serving a read from memory).
/// None <=> IO is ongoing.
/// Size is fixed to PAGE_SZ at creation time and must not be changed.
mutable_tail: Option<BytesMut>,
/// TODO: longer-term, we probably want to get rid of this in favor
Yes -- I'd be inclined to build the pipelining into BufferedWriter.

In a microbenchmark (#7409, case …)

I like that this change removes the mutable_tail in favor of a generic buffered writer. The "sandwich" of writers is a bit strange on first read, but seems fairly robust.
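A rough sketch of what pipelining built into a BufferedWriter could look like (all names hypothetical; a spawned task with a sleep stands in for an owned-buffer io_uring write):

```rust
use tokio::task::JoinHandle;

/// Double-buffered writer sketch: while one buffer is in flight,
/// the other keeps accepting new data, so ingest does not stall on IO.
struct DoubleBufferedWriter {
    fill: Vec<u8>,                         // buffer currently accepting writes
    inflight: Option<JoinHandle<Vec<u8>>>, // buffer currently being flushed
    cap: usize,
}

impl DoubleBufferedWriter {
    fn new(cap: usize) -> Self {
        Self { fill: Vec::with_capacity(cap), inflight: None, cap }
    }

    async fn write_all(&mut self, mut data: &[u8]) {
        while !data.is_empty() {
            if self.fill.len() == self.cap {
                self.flip().await;
            }
            let n = (self.cap - self.fill.len()).min(data.len());
            self.fill.extend_from_slice(&data[..n]);
            data = &data[n..];
        }
    }

    /// Kick off a flush of the full buffer; we only wait here if the
    /// *previous* flush hasn't completed yet. That overlap is the win.
    async fn flip(&mut self) {
        let mut next = match self.inflight.take() {
            Some(handle) => {
                let mut buf = handle.await.expect("flush task panicked");
                buf.clear(); // recycle the allocation
                buf
            }
            None => Vec::with_capacity(self.cap),
        };
        std::mem::swap(&mut self.fill, &mut next);
        let full = next; // after the swap, this holds the filled buffer
        // Stand-in for an owned-buffer write: a real implementation would
        // submit `full` to tokio-epoll-uring and get the buffer back on
        // completion.
        self.inflight = Some(tokio::spawn(async move {
            tokio::time::sleep(std::time::Duration::from_millis(1)).await;
            full
        }));
    }
}

#[tokio::main]
async fn main() {
    let mut w = DoubleBufferedWriter::new(64 * 1024);
    w.write_all(&vec![0u8; 256 * 1024]).await;
}
```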
    self.len += 4;
}
self.len += srcbuf.len() as u64;
// TODO: bring back pre-warming of page cache, using another sandwich layer
I think we can just drop this TODO -- warming page cache with very recently written data is contrary to the I/O flow we expect, as you hint in the PR messages.
#[inline(always)]
async fn write_all_borrowed(&mut self, _buf: &[u8]) -> std::io::Result<usize> {
    // TODO: ensure this through the type system
Could you expand the comment? It's not obvious why this should be illegal: why can't one write from a &[u8] into a VirtualFile? Once done, this can probably lose the TODO prefix unless we have a specific idea for how to do it.
why can't one write from a &[u8] into a VirtualFile?

Because a &[u8] is not owned, which is incompatible with Rust's async model & io_uring. Rust's ownership / borrowing system can't model that the kernel holds a reference to the memory pointed to by &[u8] while the io_uring operation is ongoing. Hence Rust allows dropping the future while the operation is still ongoing, which then leads to memory corruption. That's why I converted the whole write (and read!) path to owned IO.
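To make the calling convention concrete, here is a sketch of the owned-IO shape, modeled on the BufResult pattern from tokio-uring; FakeFile and the exact signature are illustrative, not the actual VirtualFile API:

```rust
use std::io;

/// The owned-IO convention: the buffer travels by value into the call and
/// is handed back together with the result, so it provably outlives the
/// kernel-side operation. With a borrowed &[u8], dropping the future while
/// io_uring still holds the pointer would be a use-after-free.
type BufResult<T, B> = (io::Result<T>, B);

/// Illustrative stand-in; not the actual VirtualFile type.
struct FakeFile {
    contents: Vec<u8>,
}

impl FakeFile {
    async fn write_at<B: AsRef<[u8]>>(&mut self, buf: B, offset: u64) -> BufResult<usize, B> {
        let n = {
            let data = buf.as_ref();
            let end = offset as usize + data.len();
            if self.contents.len() < end {
                self.contents.resize(end, 0);
            }
            self.contents[offset as usize..end].copy_from_slice(data);
            data.len()
        };
        (Ok(n), buf) // the caller gets the buffer back for reuse
    }
}

#[tokio::main]
async fn main() {
    let mut f = FakeFile { contents: Vec::new() };
    let (res, buf) = f.write_at(vec![1u8, 2, 3], 0).await;
    assert_eq!(res.unwrap(), 3);
    assert_eq!(buf, vec![1, 2, 3]); // ownership returned to the caller
}
```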
Once done, this can probably lose the TODO prefix unless we have a specific idea for how to do it.

I think for the purposes of this comment, it's sufficient to state that the VirtualFile API requires owned buffers, not slices, because that's what tokio_epoll_uring requires. Perhaps write_all_borrowed should be an extension trait that is closer to the InMemoryLayer where we actually need it.
I split up this PR into a bunch of refactorings that can be reviewed independently. See the parent issue #7124's task list "Short-Term: Increase Buffer Sizes".
refs #7124
Problem
The test_bulk_ingest benchmark shows about 2x lower throughput with tokio-epoll-uring compared to std-fs. That's why we temporarily disabled it in #7238.
The reason for this regression is that the benchmark runs on a system without memory pressure and thus std-fs writes don't block on disk IO but only copy the data into the kernel page cache. tokio-epoll-uring cannot beat that at this time, and possibly never.
(However, under memory pressure, std-fs would stall the executor thread on kernel page cache writeback disk IO. That's why we want to use tokio-epoll-uring. And we likely want to use O_DIRECT in the future, at which point std-fs becomes an absolute show-stopper.)
Further, bulk walingest is a pathological case because the InMemoryLayer into which ingest happens uses EphemeralFile. EphemeralFile flushes its mutable_tail every 8k of WAL. And it does not do double-buffering / pipelining, i.e., while flushing the buffer, no new WAL can be ingested.
More elaborate analysis: https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4
Changes
As a medium-term fix, increase the buffer size of EphemeralFile to 64k. Longer-term, we probably want to do double-buffering / pipelined IO.
While thinking about this problem, I had the insight that we can generalize the recently added owned_buffers_io abstractions a bit, eliminating the special Writer inside the EphemeralFile::write_blob function. That's what I did in this PR.
We lose the pageserver PageCache priming aspect that the old EphemeralFile had. The commit history contains an attempt & revert to add it, but it junks up the owned_buffers_io interface even further.
Resource Usage
We currently do not flush the buffer when freezing the InMemoryLayer. That means a single Timeline can have multiple 64k mutable tails alive, especially if flushing is slow. We should either bound the number of frozen layers, or change the freezing code to flush the buffer and drop the allocation.
Performance
I last ran the test_bulk_ingest benchmark when I still had the page-cache pre-warming. Report in commit 261f116a2d47b395005858d818bfa9a64b98dc30. So, still somewhat of a regression.
TODO:
Alternative
The alternative would be to keep write_blob's Writer and duplicate the buffered-writer functionality from owned_buffers_io into it.