L0 flush: opt-in mechanism to bypass PageCache reads and writes #8190

problame · 2024-06-27T18:18:13Z

part of #7418

Motivation

(reproducing #7418)

When we do an InMemoryLayer::write_to_disk, there is a tremendous amount of random read I/O, as deltas from the ephemeral file (written in LSN order) are written out to the delta layer in key order.

In benchmarks (#7409) we can see that this delta layer writing phase is substantially more expensive than the initial ingest of data, and that within the delta layer write a significant amount of the CPU time is spent traversing the page cache.

High-Level Changes

Add a new mode for L0 flush that works as follows:

- Read the full ephemeral file into memory -- layers are much smaller than total memory, so this is affordable.
- Do all the random reads directly from this in-memory buffer instead of using blob IO/page cache/disk reads.
- Add a semaphore to limit how many timelines may concurrently do this (limit peak memory).
- Make the semaphore configurable via PS config.
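
The first two steps can be sketched roughly as follows. This is a minimal illustration and not the pageserver code: `IndexEntry`, `append`, and `flush_in_key_order` are hypothetical names, keys and LSNs are simplified to `u64`, and a plain `Vec<u8>` stands in for the ephemeral file contents after they have been read into memory.

```rust
/// Hypothetical index record: where the value for `key` at `lsn`
/// lives inside the ephemeral file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IndexEntry {
    pub key: u64,
    pub lsn: u64,
    pub offset: usize,
    pub len: usize,
}

/// Ingest path: append `value` in arrival (LSN) order and remember its location.
pub fn append(buf: &mut Vec<u8>, index: &mut Vec<IndexEntry>, key: u64, lsn: u64, value: &[u8]) {
    index.push(IndexEntry { key, lsn, offset: buf.len(), len: value.len() });
    buf.extend_from_slice(value);
}

/// Flush path: visit entries in (key, lsn) order and serve every random read
/// from the in-memory copy of the file instead of going through a page cache.
/// Returns `None` if an index entry points outside the buffer.
pub fn flush_in_key_order<'a>(
    file_contents: &'a [u8],
    index: &[IndexEntry],
) -> Option<Vec<(u64, u64, &'a [u8])>> {
    let mut sorted = index.to_vec();
    sorted.sort_by_key(|e| (e.key, e.lsn));
    sorted
        .iter()
        .map(|e| {
            let end = e.offset.checked_add(e.len)?;
            let value = file_contents.get(e.offset..end)?;
            Some((e.key, e.lsn, value))
        })
        .collect()
}
```

The point of the sketch is that the sort makes the access pattern random with respect to file offsets, but each access is only a bounds-checked slice of a buffer that is already resident.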

Implementation Details

The new BlobReaderRef::Slice is a temporary hack until we can ditch blob_io for InMemoryLayer => Plan for this is laid out in #8183

Correctness

The correctness of this change is quite obvious to me: we do what we did before (blob_io) but read from memory instead of going to disk.

The highest bug potential is in doing owned-buffers IO. I refactored the API a bit in preliminary PR #8186 to make it less error-prone, but still, careful review is requested.

Performance

I manually measured single-client ingest performance from pgbench -i ....

Full report: https://neondatabase.notion.site/2024-06-28-benchmarking-l0-flush-performance-e98cff3807f94cb38f2054d8c818fe84?pvs=4

tl;dr:

- no speed improvements during ingest, but
- significantly lower pressure on PS PageCache (eviction rate drops to 1/3)
  - (that's why I'm working on this)
- noticeable but modestly lower CPU time

This is good enough for merging this PR because the changes require opt-in.

We'll do more testing in staging & pre-prod.

Stability / Monitoring

memory consumption: there's no hard limit on max InMemoryLayer size (aka "checkpoint distance"), hence there's no hard limit on the memory allocation we do for flushing. In practice, a) we log a warning (https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L5741-L5743) when we flush oversized layers, so we'd know which tenant is to blame, and b) if we were to put a hard limit in place, we would have to decide what to do with an InMemoryLayer that exceeds the limit.
It seems like a better option to guarantee a max size for frozen layers, dependent on checkpoint_distance, and then limit concurrency based on that.
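
If such a maximum frozen-layer size were guaranteed, the concurrency limit could be derived from a memory budget. The following is only a sketch of that arithmetic: `max_concurrent_flushes` and both of its parameters are hypothetical and do not correspond to existing PS config options.

```rust
/// Largest number of concurrent in-memory flushes that fits into
/// `memory_budget_bytes`, assuming every frozen layer is guaranteed to be at
/// most `max_frozen_layer_bytes` (both values are hypothetical inputs).
/// Returns `None` if not even a single flush fits, or if the cap is zero.
pub fn max_concurrent_flushes(memory_budget_bytes: u64, max_frozen_layer_bytes: u64) -> Option<u64> {
    if max_frozen_layer_bytes == 0 {
        return None;
    }
    match memory_budget_bytes / max_frozen_layer_bytes {
        0 => None,
        n => Some(n),
    }
}
```

For example, with an assumed budget of 1 GiB and an assumed layer cap of 256 MiB, this yields `Some(4)`, i.e. peak flush memory stays at or below the budget as long as at most four timelines flush at once.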

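metrics: we do have the flush_time_histo (https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L3725-L3726), but that includes the wait time for the semaphore. We could add a separate metric for the time spent after acquiring the semaphore, so one can infer the wait time. Seems unnecessary at this point, though.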
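
To illustrate what separating the two durations could look like, here is a std-only sketch. It is an assumption-laden stand-in, not the PR's implementation: `FlushSemaphore`, `FlushPermit`, and `timed_flush` are hypothetical names, and the semaphore blocks the calling thread, whereas an async codebase would use an async semaphore instead.

```rust
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical blocking counting semaphore limiting concurrent flushes.
pub struct FlushSemaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

/// RAII permit: returns itself to the semaphore when dropped.
pub struct FlushPermit<'a> {
    sem: &'a FlushSemaphore,
}

impl FlushSemaphore {
    pub fn new(max_concurrency: usize) -> Self {
        Self { permits: Mutex::new(max_concurrency), available: Condvar::new() }
    }

    /// Block until a permit is free; also report how long we waited for it.
    pub fn acquire(&self) -> (FlushPermit<'_>, Duration) {
        let started = Instant::now();
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
        (FlushPermit { sem: self }, started.elapsed())
    }

    pub fn available_permits(&self) -> usize {
        *self.permits.lock().unwrap()
    }
}

impl Drop for FlushPermit<'_> {
    fn drop(&mut self) {
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.available.notify_one();
    }
}

/// Run `flush` while holding a permit. Returns the flush result, the time
/// spent waiting for the permit, and the time spent in the flush itself.
pub fn timed_flush<T>(sem: &FlushSemaphore, flush: impl FnOnce() -> T) -> (T, Duration, Duration) {
    let (_permit, waited) = sem.acquire();
    let work_started = Instant::now();
    let out = flush();
    (out, waited, work_started.elapsed())
}
```

With the two durations reported separately, an existing end-to-end histogram would not need to change; the wait time could also be inferred as the difference, as suggested above.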

github-actions · 2024-06-27T18:29:32Z

3000 tests run: 2885 passed, 0 failed, 115 skipped (full report)

Code coverage* (full report)

functions: 32.7% (6922 of 21173 functions)
lines: 50.0% (54264 of 108471 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: c5bc214 at 2024-07-02T11:42:57.331Z

problame · 2024-06-28T08:53:14Z

the remote_storage changes will hopefully land before this PR in separate PR #8193

problame · 2024-06-28T10:18:36Z

Did some manual perf testing. Updated PR description, report here: https://www.notion.so/neondatabase/2024-06-28-benchmarking-l0-flush-performance-e98cff3807f94cb38f2054d8c818fe84?pvs=4

problame · 2024-06-28T10:18:53Z

Stability / Monitoring

TODO

Existing page cache metrics are sufficient to quantify the impact on PageCache.

Do we want metrics on semaphore wait queue length or are higher-level metrics sufficient?

jcsp · 2024-07-01T10:17:25Z

> Do we want metrics on semaphore wait queue length or are higher-level metrics sufficient?

If it's easy, then a queue depth stat is a nice thing to have in our back pocket. Not mandatory though.

problame · 2024-07-02T10:55:41Z

@jcsp I addressed your review comments, see latest pushes.

Also, given that I plumbed through the l0_flush::L0FlushGlobalState, maybe we want to move the GlobalResourceUnits there, so it no longer is a global lazy static? (Obviously in a follow-up PR)

neon/pageserver/src/tenant/storage_layer/inmemory_layer.rs

Lines 116 to 121 in 511d664

    
    // Per-timeline RAII struct for its contribution to [`GlobalResources`]
    struct GlobalResourceUnits {
        // How many dirty bytes have I added to the global dirty_bytes: this guard object is responsible
        // for decrementing the global counter by this many bytes when dropped.
        dirty_bytes: u64,
    }
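
For illustration, a guard of this shape that holds a handle to shared state instead of reaching for a global lazy static might look like the sketch below. All names here (`SharedFlushState`, `ResourceUnits`, `add_dirty_bytes`) are hypothetical and are not the actual types in `inmemory_layer.rs` or `l0_flush`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical shared state that is plumbed through explicitly
/// rather than stored in a global lazy static.
#[derive(Default)]
pub struct SharedFlushState {
    dirty_bytes: AtomicU64,
}

impl SharedFlushState {
    pub fn dirty_bytes(&self) -> u64 {
        self.dirty_bytes.load(Ordering::Relaxed)
    }
}

/// Per-timeline RAII guard: tracks this timeline's contribution to the
/// shared counter and subtracts it again when dropped.
pub struct ResourceUnits {
    shared: Arc<SharedFlushState>,
    dirty_bytes: u64,
}

impl ResourceUnits {
    pub fn new(shared: Arc<SharedFlushState>) -> Self {
        Self { shared, dirty_bytes: 0 }
    }

    /// Record `bytes` more dirty bytes, both locally and in the shared counter.
    pub fn add_dirty_bytes(&mut self, bytes: u64) {
        self.dirty_bytes += bytes;
        self.shared.dirty_bytes.fetch_add(bytes, Ordering::Relaxed);
    }
}

impl Drop for ResourceUnits {
    fn drop(&mut self) {
        // Undo exactly what this guard added.
        self.shared.dirty_bytes.fetch_sub(self.dirty_bytes, Ordering::Relaxed);
    }
}
```

Passing the shared state in through the constructor keeps the guard testable in isolation, which is one possible argument for the follow-up suggested above.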


problame added 26 commits June 25, 2024 14:08

WIP

b67c760

wip

14eb9a1

Merge branch 'main' into problame/fast-delta-layer-writes

fc8ece0

read_exact_at_impl: accept a BoundedBuf

282a633

WIP

79a33dc

WIP

c7fc169

it compiles

27cae3c

Merge branch 'main' into problame/virtualfile-use-boundedbuf

1e5c126

Merge branch 'problame/virtualfile-use-boundedbuf' into problame/fast…

8070fc8

…-delta-layer-writes + some hacking

hack hack

5031f41

refactor: generalize toml_edit::Item deserialization

fdeff87

hack hack hack

eb593a7

hack hack hack

38608b3

fix some bugs

d4f419a

don't prewarm page cache on EphemeralFile write and add TODO comment …

9cdbc7b

…for read path page-caching

self-review

2f8d12b

Merge branch 'main' into problame/virtualfile-use-boundedbuf

f49b32f

get rid of read_exact_at alltogether

647f084

re-add read_exact_at

928c1dc

finish & fix some ub with std-fs (will pull this into a preliminary)

df56595

Merge branch 'main' into problame/virtualfile-use-boundedbuf

b481740

Merge branch 'problame/virtualfile-use-boundedbuf' into problame/fast…

562c091

…-delta-layer-writes

fix test and pretty up

98d8721

Merge branch 'problame/virtualfile-use-boundedbuf' into problame/fast…

2dbcabe

…-delta-layer-writes

Merge branch 'main' into problame/virtualfile-use-boundedbuf

9fa4f47

Merge branch 'problame/virtualfile-use-boundedbuf' into problame/fast…

d0cbb3d

…-delta-layer-writes

problame mentioned this pull request Jun 28, 2024

bypass PageCache for L0 flush #7418

Open

fixups for config stuff

1391b07

Base automatically changed from problame/virtualfile-use-boundedbuf to main June 28, 2024 09:20

Merge branch 'main' into problame/fast-delta-layer-writes

c8a636a

jcsp reviewed Jun 28, 2024

pageserver/src/tenant/ephemeral_file/page_caching.rs (outdated, resolved)

jcsp reviewed Jun 28, 2024

pageserver/src/tenant/ephemeral_file/page_caching.rs (outdated, resolved)

jcsp reviewed Jun 28, 2024

pageserver/src/tenant/ephemeral_file/page_caching.rs (outdated, resolved)

jcsp reviewed Jun 28, 2024

pageserver/src/tenant/timeline/layer_manager.rs (outdated, resolved)

problame added 7 commits July 1, 2024 10:23

naming: #8190 (comment)

7378276

naming: #8190 (comment)

dc34ac3

fmt and linters

5a34d2b

less awkwardness: #8190 (comment)

740344e

use ::default() in all the places

ad004b7

infer prewarm_on_write from PageServerConf; #8190 (comment)

511d664

Merge branch 'main' into problame/fast-delta-layer-writes

c5bc214

problame marked this pull request as ready for review July 2, 2024 10:54

problame requested a review from a team as a code owner July 2, 2024 10:54

problame requested a review from VladLazar July 2, 2024 10:54

jcsp approved these changes Jul 2, 2024

problame merged commit 5de896e into main Jul 2, 2024
65 checks passed

problame deleted the problame/fast-delta-layer-writes branch July 2, 2024 14:29
