
perf(walingest): mitigate bulk ingest throughput regression through larger EphemeralFile in-memory buffer #7273

Closed
wants to merge 41 commits

Conversation

problame (Contributor) commented Mar 28, 2024

refs #7124

Problem

The test_bulk_ingest benchmark shows about 2x lower throughput with tokio-epoll-uring compared to std-fs.
That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system without memory pressure and thus std-fs writes don't block on disk IO but only copy the data into the kernel page cache.
tokio-epoll-uring cannot beat that at this time, and possibly never will.
(However, under memory pressure, std-fs would stall the executor thread on kernel page cache writeback disk IO. That's why we want to use tokio-epoll-uring. And we likely want to use O_DIRECT in the future, at which point std-fs becomes an absolute show-stopper.)

Further, bulk walingest is a pathological case because the InMemoryLayer into which ingest happens uses EphemeralFile. EphemeralFile flushes its mutable_tail every 8k of WAL. And it does not do double-buffering / pipelining, i.e., while flushing the buffer, no new WAL can be ingested.
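To make the stall concrete, here is a minimal sketch (not the actual EphemeralFile code; `TailBuffer` and `flush_to_disk` are hypothetical names) of why a small buffer hurts: with an 8 KiB buffer, roughly every 8 KiB of WAL forces an awaited flush, and no new WAL can be appended while that flush is in flight.

```rust
const BUF_SZ: usize = 8 * 1024; // this PR raises the real buffer to 64 KiB

struct TailBuffer {
    buf: Vec<u8>,
}

impl TailBuffer {
    async fn append(&mut self, mut src: &[u8]) -> std::io::Result<()> {
        while !src.is_empty() {
            let free = BUF_SZ - self.buf.len();
            let n = free.min(src.len());
            self.buf.extend_from_slice(&src[..n]);
            src = &src[n..];
            if self.buf.len() == BUF_SZ {
                // No double-buffering / pipelining: ingest is blocked
                // until this flush completes.
                self.flush_to_disk().await?;
                self.buf.clear();
            }
        }
        Ok(())
    }

    async fn flush_to_disk(&mut self) -> std::io::Result<()> {
        // stand-in for the real write-out path
        Ok(())
    }
}
```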

More elaborate analysis: https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

Changes

As a medium-term fix, increase the buffer size of EphemeralFile to 64k. Longer-term, we probably want to do double-buffering / pipelined IO.

While thinking about this problem, I had the insight that we can generalize the recently added owned_buffers_io abstractions a bit, eliminating the special Writer inside the EphemeralFile::write_blob function.
That's what I did in this PR.
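For illustration, a hedged sketch of the kind of generalization meant here (trait and method names are made up, not the exact owned_buffers_io API): a buffered writer that is generic over any sink accepting owned buffers, which makes a special-purpose Writer inside EphemeralFile::write_blob unnecessary.

```rust
trait OwnedAsyncWriter {
    // takes the buffer by value and hands it back, io_uring-style
    async fn write_all_owned(&mut self, buf: Vec<u8>) -> std::io::Result<Vec<u8>>;
}

struct BufferedWriter<W: OwnedAsyncWriter> {
    inner: W,
    buf: Vec<u8>,
    cap: usize, // 64 KiB with this PR, up from 8 KiB
}

impl<W: OwnedAsyncWriter> BufferedWriter<W> {
    // writes larger than the buffer are elided for brevity
    async fn write_buffered(&mut self, src: &[u8]) -> std::io::Result<()> {
        if self.buf.len() + src.len() > self.cap {
            let full = std::mem::take(&mut self.buf);
            let mut recycled = self.inner.write_all_owned(full).await?;
            recycled.clear();
            self.buf = recycled; // reuse the allocation
        }
        self.buf.extend_from_slice(src);
        Ok(())
    }
}
```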

We lose the pageserver PageCache priming aspect that the old EphemeralFile had. The commit history contains an attempt & revert to add it, but it junks up the owned_buffers_io interface even further.

  • decide/experiment with whether we want to abandon the pre-warming
    • this could be a preliminary PR that we can ship to prod & measure impact
  • re-enable the benchmark
  • run this PR with benchmarks enabled, for both std-fs and tokio-epoll-uring, ensure it passes, compare results

Resource Usage

We currently do not flush the buffer when freezing the InMemoryLayer.
That means a single Timeline can have multiple 64k mutable tails alive, especially if flushing is slow.

We should either bound the number of frozen layers, or change the freezing code to flush the buffer and drop the allocation (roughly as in the sketch below).
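A rough sketch of that second option, with hypothetical names throughout: on freeze, flush whatever is still buffered and replace the buffer with a zero-capacity Vec so frozen InMemoryLayers no longer each pin a 64k mutable tail.

```rust
trait OwnedSink {
    async fn write_all_owned(&mut self, buf: Vec<u8>) -> std::io::Result<Vec<u8>>;
}

struct FrozenableBuffer<W> {
    sink: W,      // something that accepts owned buffers (e.g. a VirtualFile wrapper)
    buf: Vec<u8>, // the 64 KiB mutable tail
}

impl<W: OwnedSink> FrozenableBuffer<W> {
    async fn freeze(&mut self) -> std::io::Result<()> {
        if !self.buf.is_empty() {
            let tail = std::mem::take(&mut self.buf);
            self.sink.write_all_owned(tail).await?;
        }
        // Replace with a zero-capacity Vec so the 64 KiB allocation is released.
        self.buf = Vec::new();
        Ok(())
    }
}
```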

Performance

I last ran the test_bulk_ingest benchmark when I still had the page-cache pre-warming.
Report in commit 261f116a2d47b395005858d818bfa9a64b98dc30.

test_bulk_insert[neon-release-pg14-std-fs].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 8.381 s

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 11.760 s

So, still somewhat of a regression.

TODO:

  • experiment with 128k
  • re-run once we have decided on whether we need the pre-warming or not.

Alternative

The alternative would be to keep the write_blob's Writer and duplicate the buffered-writer functionality from owned_buffers_io into it.

…never used (same for get_vectored_impl)"

This reverts commit 5b8888c.
With this

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 16.053 s

down from

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 17.669 s

but the regression is from baseline

test_bulk_insert[neon-release-pg14-std-fs].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 9.335 s
…ode paths (#6378)"

Unchanged

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 9.194 s

This reverts commit 3da410c.
…a spawn_blocking single-threaded runtime

This makes things worse with tokio-epoll-uring

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 19.574 s

This is a partial revert of 3da410c
…in a spawn_blocking thread single-threaded runtime

builds on top of the previous commit

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 13.153 s
together with previous commits, this brings us back down to
pre-regression

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 9.991 s
…he page cache

... by forcing each write system call to go to disk

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 346 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 92.559 s

test_bulk_insert[neon-release-pg14-std-fs].wal_written: 346 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 81.998 s

=> ~10% worse instead of 2x
… engine in a spawn_blocking thread single-threaded runtime"

This reverts commit 4a8e7f8.
… inside a spawn_blocking single-threaded runtime"

This reverts commit 72a8e09.
… performs better because it hits the page cache

... by forcing each write system call to go to disk

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 346 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 93.417 s

test_bulk_insert[neon-release-pg14-std-fs].wal_written: 346 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 86.009 s

=> ~8% instead of 2x difference
…t std-fs performs better because it hits the page cache"

This reverts commit d66ccba.
The OwnedAsyncWrite stuff is based on the code in
tokio-epoll-uring on-demand download PR (#6992), which hasn't merged
yet.
…ile speciality of reading zeroes past end-of-file

This makes it diverge semantically from what's in the tokio-epoll-uring
download PR :(
test_bulk_insert[neon-release-pg14-std-fs].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-std-fs].wal_recovery: 8.381 s

test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_written: 345 MB
test_bulk_insert[neon-release-pg14-tokio-epoll-uring].wal_recovery: 11.760 s
…arger-buffers

Conflicts:
	control_plane/src/local_env.rs
	control_plane/src/pageserver.rs
the above have since landed on `main` as preliminary changes
	pageserver/src/virtual_file.rs
	pageserver/src/virtual_file/owned_buffers_io/util/size_tracking_writer.rs
	pageserver/src/virtual_file/owned_buffers_io/write.rs
strategy: pick ours (likely breaks the build, will fix subsequently)
It would require plumbing through the RequestContext, and its utility is
somewhat dubious anyway.
github-actions bot commented Mar 28, 2024
2760 tests run: 2642 passed, 0 failed, 118 skipped (full report)


Code coverage* (full report)

  • functions: 28.1% (6466 of 23041 functions)
  • lines: 46.4% (44771 of 96571 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results (last update: 9281f54 at 2024-04-17T14:19:43.543Z).

problame requested a review from VladLazar on April 3, 2024 at 11:51
problame (Contributor, Author) commented Apr 3, 2024

Talked to @jcsp about this yesterday, he also thinks the page-cache pre-warming is probably counter-productive.

VladLazar (Contributor) left a comment:

Some thoughts:

  • Why do we even need the borrowed write interface? The upstream call stack does borrowed writes, but I don't see
    a strong reason why it has to do that (perhaps something related to the atomicity of Timeline::flush|commit -
    could be worked around if that's the case).
  • My intuition is that we don't need to pre-warm the cache. Monitoring get_page latency on Peter's benches
    should give us some clues there.
  • Having multiple tails alive for each timeline isn't great, but that was the case previously, right? Although
    the tail was shorter.
  • The stacking of abstractions in the ephemeral file is very clever, but I'm wondering if a separate "all-in-one" abstraction would be easier on the eye (I don't feel strongly about this).

}

fn bytes_total(&self) -> usize {
self.written // ?
VladLazar (Contributor):
Shouldn't this be N?

problame (Contributor, Author):

I guess, yeah.

Tangentially, I wonder why IoBuf needs the bytes_total in the first place. For IoBufMut, sure, it needs to know the bounds of where it needs to write. But for IoBuf, which is the equivalent of &[u8], I don't understand why it would need to know the length including uninitialized memory.
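For context, roughly the shape of the tokio-uring-style buffer traits this refers to (paraphrased from memory, not the exact tokio-epoll-uring definitions): bytes_init() is the initialized, readable prefix; bytes_total() is the full capacity including the uninitialized tail, which is what only the writable side (IoBufMut) really needs.

```rust
unsafe trait IoBuf: Unpin + 'static {
    fn stable_ptr(&self) -> *const u8;
    fn bytes_init(&self) -> usize;  // initialized, safe-to-read bytes
    fn bytes_total(&self) -> usize; // total capacity, incl. uninitialized tail
}

unsafe trait IoBufMut: IoBuf {
    fn stable_mut_ptr(&mut self) -> *mut u8;
    // caller asserts that the first `pos` bytes are now initialized
    unsafe fn set_init(&mut self, pos: usize);
}
```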

/// We keep the last page, which can still be modified, in [`Self::mutable_tail`].
/// The other pages, which can no longer be modified, are accessed through the page cache.
/// We sandwich the buffered writer between two size-tracking writers.
/// This allows us to "elegantly" track in-memory bytes vs flushed bytes,
jcsp (Collaborator):
"elegantly" :-D

Future: I think this could be made elegant though, if the sandwich was wrapped up in a separate type, and that separate type implemented a read() method that handled the read from the buffer. Although we may find it makes more sense to just flush to disk if the data we want to read is still buffered, since that should be a pretty rare case (we might find there are frequent callers from wal ingest that should really be caching something at a higher level rather than reading-back very recent writes)

problame (Contributor, Author):

So, there are really two different asks in this comment.

Future: I think this could be made elegant though, if the sandwich was wrapped up in a separate type, and that separate type implemented a read() method that handled the read from the buffer.

This is cleanliness / encapsulation for the "read from buffer" approach that we find in this PR.
I can see myself implementing that before merging this PR.

Although we may find it makes more sense to just flush to disk if the data we want to read is still buffered, since that should be a pretty rare case (we might find there are frequent callers from wal ingest that should really be caching something at a higher level rather than reading-back very recent writes)

I like the idea of not serving reads from memory faster than from disk.

Are you suggesting to write out < 64KiB blocks in that case?
That would be a different PR on top of this one though.

(I think before going that route, we could just add a 100us tokio sleep before serving a read from memory).
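For illustration, a hedged sketch of the encapsulation idea discussed above (field and method names are made up): one type owns the writer sandwich and also answers reads, deciding per request whether the bytes are already flushed or still sitting in the in-memory buffer.

```rust
struct SandwichedFile {
    flushed_up_to: u64, // bytes durably written (outer size-tracking writer)
    buffered: Vec<u8>,  // not-yet-flushed tail (the buffered writer's contents)
    // file: VirtualFile, // flushed bytes would be read from here
}

impl SandwichedFile {
    // bounds checks and reads straddling the flush boundary elided
    fn read_at(&self, offset: u64, dst: &mut [u8]) -> std::io::Result<()> {
        if offset >= self.flushed_up_to {
            // Read-back of very recent writes: serve from the buffer.
            // (The alternative discussed above is to flush first instead.)
            let start = (offset - self.flushed_up_to) as usize;
            dst.copy_from_slice(&self.buffered[start..start + dst.len()]);
            Ok(())
        } else {
            // read the flushed range from the underlying file
            unimplemented!("read from the flushed portion of the file")
        }
    }
}
```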

/// None <=> IO is ongoing.
/// Size is fixed to PAGE_SZ at creation time and must not be changed.
mutable_tail: Option<BytesMut>,
/// TODO: longer-term, we probably want to get rid of this in favor
jcsp (Collaborator):

Yes -- I'd be inclined to build the pipelining into BufferedWriter
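A rough sketch of what building the pipelining into a buffered writer could look like (all names here are hypothetical, not the crate's API): when the front buffer fills, hand it to an in-flight flush task and keep ingesting into the recycled buffer.

```rust
use tokio::task::JoinHandle;

struct DoubleBuffered {
    front: Vec<u8>,
    in_flight: Option<JoinHandle<std::io::Result<Vec<u8>>>>,
    cap: usize,
}

impl DoubleBuffered {
    async fn append(&mut self, src: &[u8]) -> std::io::Result<()> {
        self.front.extend_from_slice(src);
        if self.front.len() >= self.cap {
            // Wait for the previous flush (if any) to hand its buffer back.
            let recycled = match self.in_flight.take() {
                Some(h) => h.await.expect("flush task panicked")?,
                None => Vec::with_capacity(self.cap),
            };
            let full = std::mem::replace(&mut self.front, recycled);
            self.front.clear();
            // Flush in the background; ingest continues into `self.front`.
            self.in_flight = Some(tokio::spawn(write_out(full)));
        }
        Ok(())
    }
}

async fn write_out(buf: Vec<u8>) -> std::io::Result<Vec<u8>> {
    // stand-in for the real owned write; returns the buffer for reuse
    Ok(buf)
}
```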

jcsp (Collaborator) commented Apr 17, 2024

In a microbenchmark (#7409, case ingest-big-values/ingest 128MB/8k seq, no delta) of 8k put_value to inmemory layer, this PR increases throughput by an order of magnitude when writing to memory (~1.6GB/s -> 7.5 GB/s on my workstation), so it clearly removes a major CPU bottleneck.

I like that this change removes the mutable_tail in favor of a generic buffered writer. The "sandwich" of writers is a bit strange on first read, but seems fairly robust.

self.len += 4;
}
self.len += srcbuf.len() as u64;
// TODO: bring back pre-warming of page cache, using another sandwich layer
jcsp (Collaborator):

I think we can just drop this TODO -- warming page cache with very recently written data is contrary to the I/O flow we expect, as you hint in the PR messages.


#[inline(always)]
async fn write_all_borrowed(&mut self, _buf: &[u8]) -> std::io::Result<usize> {
// TODO: ensure this through the type system
jcsp (Collaborator):

Could you expand the comment? It's not obvious why this should be illegal: why can't one write from a &[u8] into a VirtualFile? Once done, this can probably lose the TODO prefix unless we have a specific idea for how to do it.

problame (Contributor, Author):

why can't one write from a &[u8] into a VirtualFile?

Because a &[u8] is not owned, which is incompatible with Rust's async model & io_uring.
Rust's ownership / borrowing system can't model that the kernel holds a reference to the memory pointed to by &[u8] while the io_uring operation is ongoing.
Hence Rust allows dropping the future while the operation is still ongoing, which then leads to memory corruption.
That's why I converted the whole write (and read!) path to owned IO.
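An illustrative sketch of the owned-buffer calling convention this describes (signatures are paraphrased, not the exact VirtualFile API): the buffer is moved into the operation and handed back on completion, so dropping the future cannot leave the kernel writing into freed memory.

```rust
struct OwnedFile;

impl OwnedFile {
    async fn write_all_at(&self, buf: Vec<u8>, offset: u64) -> (std::io::Result<()>, Vec<u8>) {
        // With io_uring, `buf` is owned by the in-flight operation; if this
        // future is dropped, the runtime still owns `buf` until the kernel is
        // done with it. A borrowed &[u8] cannot give that guarantee.
        let _ = offset;
        (Ok(()), buf)
    }
}
```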

Once done, this can probably lose the TODO prefix unless we have a specific idea for how to do it.

I think for the purposes of this comment, it's sufficient to state that

The VirtualFile API requires owned buffers, not slices, because that's what tokio_epoll_uring requires.

Perhaps the write_all_borrowed should be an extension trait that is closer to the InMemoryLayer where we actually need it.

problame (Contributor, Author) commented:

I split up this PR into a bunch of refactorings that can be reviewed independently.
Careful review is required because this is sensitive code where we can't be wrong.

See the parent issue #7124's task list "Short-Term: Increase Buffer Sizes".

problame closed this Apr 23, 2024