pageserver: backpressure on layer freeze/flush #7317

Closed
jcsp opened this issue Apr 4, 2024 · 0 comments · Fixed by #8550

jcsp commented Apr 4, 2024

Currently the flush loop can flush any number of layers: they get enqueued for upload but we don't wait for the upload to complete.

If S3 uploads are slower than we are ingesting data, then we can build up rather large buffers of flushed-but-not-uploaded layers: more data than we can upload within our clean shutdown time budget, and so much data that the next startup is quite stressful as we try to re-ingest it all.

We can apply simple backpressure by adding a wait in the flush loop: immediately after we schedule the index upload, wait for the remote timeline client to catch up.

The resulting buffer size for each tenant would then be two layers: the in-memory layer currently being written to, and one frozen/flushed layer waiting to be uploaded to S3.
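
A minimal sketch of what that wait could look like, using hypothetical stub types in place of the pageserver's remote timeline client (the real API and method names differ):

```rust
use std::time::Duration;

// Hypothetical stand-in for the pageserver's remote timeline client.
struct RemoteTimelineClient;

impl RemoteTimelineClient {
    fn schedule_index_upload(&self) {
        // In the real code this enqueues an index upload and returns immediately.
    }

    async fn wait_completion(&self) {
        // In the real code this would resolve once the upload queue has drained.
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
}

async fn flush_one_frozen_layer(remote_client: &RemoteTimelineClient) {
    // ... write the frozen layer to a local delta layer and enqueue its upload ...
    remote_client.schedule_index_upload();

    // Backpressure: don't let the flush loop proceed until uploads have
    // caught up, bounding each timeline to roughly one open in-memory
    // layer plus one flushed layer awaiting upload.
    remote_client.wait_completion().await;
}

#[tokio::main]
async fn main() {
    let client = RemoteTimelineClient;
    flush_one_frozen_layer(&client).await;
    println!("flush (including upload wait) complete");
}
```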

jcsp added the c/storage/pageserver (Component: storage: pageserver) and a/tech_debt (Area: related to tech debt) labels on Apr 4, 2024
problame added a commit that referenced this issue Apr 26, 2024
part of #7124

# Problem

(Re-stating the problem from #7124 for posterity)

The `test_bulk_ingest` benchmark shows about 2x lower throughput with
`tokio-epoll-uring` compared to `std-fs`.
That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system
without memory pressure and thus std-fs writes don't block on disk IO
but only copy the data into the kernel page cache.
`tokio-epoll-uring` cannot beat that at this time, and possibly never will.
(However, under memory pressure, std-fs would stall the executor thread
on kernel page cache writeback disk IO. That's why we want to use
`tokio-epoll-uring`. And we likely want to use O_DIRECT in the future,
at which point std-fs becomes an absolute show-stopper.)

More elaborate analysis:
https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

# Changes

This PR increases the buffer size of `blob_io` and `EphemeralFile` from
PAGE_SZ=8k to 64k.

Longer-term, we probably want to do double-buffering / pipelined IO.
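
For illustration, a rough stand-in using std's `BufWriter` (the pageserver's own buffered writer in `EphemeralFile`/`blob_io` is different, and the file path here is made up):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Previously the write buffer was PAGE_SZ = 8 KiB; this PR raises it to 64 KiB.
const WRITE_BUF_SZ: usize = 64 * 1024;

fn main() -> std::io::Result<()> {
    let file = File::create("/tmp/ephemeral_file_demo.bin")?;
    // A larger buffer means fewer, larger write submissions to the kernel
    // (or to tokio-epoll-uring), amortizing the per-write overhead.
    let mut writer = BufWriter::with_capacity(WRITE_BUF_SZ, file);
    for _ in 0..10_000 {
        writer.write_all(&[0u8; 100])?; // ~100-byte values, as in bench_ingest
    }
    writer.flush()
}
```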

# Resource Usage

We currently do not flush the buffer when freezing the InMemoryLayer.
That means a single Timeline can have multiple 64k buffers alive, especially if
flushing is slow.
This poses an OOM risk.

We should either bound the number of frozen layers (#7317), or change the
freezing code to flush the buffer and drop the allocation.

However, that's future work.
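
A sketch of what the second option (flush and drop the buffer on freeze) could look like, with hypothetical types; the real `EphemeralFile`/`InMemoryLayer` code is different:

```rust
use std::io::Write;

// Hypothetical buffered writer standing in for the ephemeral file.
struct BufferedLayerWriter {
    file: std::fs::File,
    // Some(_) while the layer is open for writes; None once frozen.
    buf: Option<Vec<u8>>,
}

impl BufferedLayerWriter {
    fn freeze(&mut self) -> std::io::Result<()> {
        if let Some(buf) = self.buf.take() {
            // Write out whatever is still buffered...
            self.file.write_all(&buf)?;
            // ...and let `buf` drop here, releasing the 64k allocation so
            // frozen-but-not-yet-flushed layers don't keep buffers alive.
        }
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let mut writer = BufferedLayerWriter {
        file: std::fs::File::create("/tmp/frozen_layer_demo.bin")?,
        buf: Some(Vec::with_capacity(64 * 1024)),
    };
    writer.buf.as_mut().unwrap().extend_from_slice(b"some buffered bytes");
    writer.freeze()?;
    assert!(writer.buf.is_none()); // the buffer allocation is gone after freezing
    Ok(())
}
```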

# Performance

(Measurements done on i3en.3xlarge.)

The `test_bulk_insert.py` benchmark is too noisy, even with instance storage:
it varies by 30-40%. I suspect that's due to compaction. Raising the amount of
data by 10x doesn't help with the noisiness.

So, I used `bench_ingest` from @jcsp's #7409, specifically the
`ingest-small-values/ingest 128MB/100b seq` and
`ingest-small-values/ingest 128MB/100b seq, no delta` benchmarks.

| buffer | engine            | seq | seq, no delta |
|--------|-------------------|-----|---------------|
| 8k  | std-fs            | 55  | 165           |
| 8k  | tokio-epoll-uring | 37  | 107           |
| 64k | std-fs            | 55  | 180           |
| 64k | tokio-epoll-uring | 48  | 164           |

The `8k` rows are from before this PR; the `64k` rows are with this PR.
The values are the throughput reported by the benchmark (MiB/s).

We see that this PR gets `tokio-epoll-uring` from 67% to 87% of `std-fs`
performance in the `seq` benchmark. Notably, `seq` appears to hit some
other bottleneck at `55 MiB/s`. CC'ing #7418 due to the apparent
bottlenecks in writing delta layers.

For `seq, no delta`, this PR gets `tokio-epoll-uring` from 64% to 91% of
`std-fs` performance.
arpad-m added a commit that referenced this issue Aug 2, 2024
Makes `flush_frozen_layer` add a barrier to the upload queue and wait for
that barrier to be reached before the flush is considered complete.

This gives us backpressure and ensures that writes can't build up in an
unbounded fashion.

Fixes #7317
arpad-m added a commit that referenced this issue Aug 5, 2024
Makes `flush_frozen_layer` add a barrier to the upload queue and wait for
that barrier to be reached before the flush is considered complete.

This gives us backpressure and ensures that writes can't build up in an
unbounded fashion.

Fixes #7317
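
For illustration, a self-contained sketch of the barrier idea against a simplified upload queue; the actual pageserver types, method names, and error handling differ:

```rust
use tokio::sync::{mpsc, oneshot};

enum QueueItem {
    // A layer file to upload to remote storage.
    Upload(String),
    // Resolved once every item enqueued before it has been processed.
    Barrier(oneshot::Sender<()>),
}

async fn upload_worker(mut rx: mpsc::UnboundedReceiver<QueueItem>) {
    while let Some(item) = rx.recv().await {
        match item {
            QueueItem::Upload(name) => {
                // Simulate a (possibly slow) S3 upload.
                println!("uploading {name}");
                tokio::time::sleep(std::time::Duration::from_millis(50)).await;
            }
            QueueItem::Barrier(done) => {
                let _ = done.send(());
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::unbounded_channel();
    let worker = tokio::spawn(upload_worker(rx));

    // Flush side: enqueue the flushed layer, then a barrier, then wait.
    tx.send(QueueItem::Upload("frozen_layer_0001".into())).unwrap();
    let (done_tx, done_rx) = oneshot::channel();
    tx.send(QueueItem::Barrier(done_tx)).unwrap();

    // Backpressure: the flush is only considered complete once the queue
    // has drained past the barrier, i.e. the upload actually finished.
    done_rx.await.unwrap();
    println!("flush complete");

    drop(tx);
    worker.await.unwrap();
}
```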