test(pageserver): quantify compaction outcome #7867

skyzh · 2024-05-23T19:04:06Z

Problem

This PR adds a simple API that collects statistics after compaction, so the result is easy to understand.

The tool reads the layer map and analyzes it range by range instead of doing single-key operations, which is more efficient than running a benchmark to collect the result. It currently computes two key metrics:

Latest data access efficiency: how many delta layers / image layers the system needs to iterate over before returning any key in a key range.
(Approximate) PiTR efficiency, as in quantification of compaction algorithms #7770: simply the number of delta files in the range. The reasoning is that, assuming no image layer is created, PiTR efficiency is the cost of collecting records from the delta layers plus the replay time. The number of delta files (or, in the future, the estimated size of reads) is a simple yet effective way of estimating how much effort the page server needs to reconstruct a page.
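
To make the range-by-range idea concrete, here is a minimal sketch rather than the code from this PR. It assumes a simplified model in which a layer is a rectangle in key x LSN space; `Kind`, `Layer`, `image`, `delta` and `deltas_above_image` are hypothetical names, not the pageserver's real types. The key space is split at every layer boundary, and for each resulting interval the sketch counts the delta layers whose LSN range ends after the newest image layer covering that interval.

```rust
use std::ops::Range;

/// Hypothetical, simplified layer kinds (not the pageserver's real types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Image,
    Delta,
}

/// A layer as a rectangle in key x LSN space. Both ranges are half-open.
#[derive(Debug, Clone)]
pub struct Layer {
    pub keys: Range<u64>,
    pub lsns: Range<u64>,
    pub kind: Kind,
}

/// Assumption: an image layer holds page versions at exactly one LSN.
pub fn image(keys: Range<u64>, lsn: u64) -> Layer {
    Layer { keys, lsns: lsn..lsn + 1, kind: Kind::Image }
}

pub fn delta(keys: Range<u64>, lsns: Range<u64>) -> Layer {
    Layer { keys, lsns, kind: Kind::Delta }
}

/// For each elementary key interval, count the delta layers whose LSN range
/// ends after the newest covering image layer (all covering deltas if the
/// interval has no image layer).
pub fn deltas_above_image(layers: &[Layer]) -> Vec<(Range<u64>, usize)> {
    // Split the key space at every layer boundary so that each interval is
    // either fully covered or not covered at all by any given layer.
    let mut cuts: Vec<u64> = layers
        .iter()
        .flat_map(|l| [l.keys.start, l.keys.end])
        .collect();
    cuts.sort_unstable();
    cuts.dedup();

    cuts.windows(2)
        .filter_map(|w| {
            let (start, end) = (w[0], w[1]);
            let covering: Vec<&Layer> = layers
                .iter()
                .filter(|l| l.keys.start <= start && end <= l.keys.end)
                .collect();
            if covering.is_empty() {
                return None; // gap in the key space, nothing to report
            }
            // LSN of the newest image layer covering this interval, if any.
            let image_lsn = covering
                .iter()
                .filter(|l| l.kind == Kind::Image)
                .map(|l| l.lsns.start)
                .max();
            let deltas = covering
                .iter()
                .filter(|l| {
                    l.kind == Kind::Delta
                        && image_lsn.map_or(true, |lsn| l.lsns.end > lsn)
                })
                .count();
            Some((start..end, deltas))
        })
        .collect()
}
```

Taking the maximum of the per-interval counts gives a single number for the whole layer map, which is the kind of summary a test could assert on after compaction.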

Summary of changes

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

github-actions · 2024-05-23T19:47:40Z

3268 tests run: 3116 passed, 0 failed, 152 skipped (full report)

Flaky tests (1)

Postgres 15

test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

functions: 31.5% (6596 of 20943 functions)
lines: 48.5% (51038 of 105301 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results
45cad40 at 2024-06-05T19:26:39.506Z

Signed-off-by: Alex Chi Z <chi@neon.tech>

skyzh · 2024-05-31T20:02:54Z

Updated the tool to become an HTTP interface so that the Python tests can read it.

Signed-off-by: Alex Chi Z <chi@neon.tech>

skyzh · 2024-06-04T16:02:33Z

I believe CI will pass this time, and this is ready for review :) Thanks! @arpad-m @problame

problame

This PR adds code to determine, for a given layer map snapshot, the number of delta layers that need to be visited before we hit an image layer when reconstructing any key in the layer map.

That metric is what I'd loosely call delta layer stack height.

It is a rough proxy metric for random getpage@lsn IO amplification under the assumption of uniform density along the key and LSN dimensions among all delta layers in the layer map. I.e., the probability of finding X amount of information about a random (key, lsn) \in key-lsn-range(L) of a given layer L is the same for all layers L.

While this is useful, we wanted the point-in-time & total space efficiency metrics.

I suppose to calculate worst-case point-in-time space usage, we'd need a similar analysis but along the LSN dimension.

In addition to the missing metrics, I suggest moving the analysis code into a sub-module of mod timeline that extends impl Timeline.

E.g., like we did for the compaction code:

neon/pageserver/src/tenant/timeline/compaction.rs

Line 46 in 3860bc9

impl Timeline {

Lastly, what about branching? Not covered in this PR.

I suggest the way forward:

Apply renaming to submodule in this PR, then let's get it merged.
Another PR to add branching support (just build a temporary LayerMap instance that contains all the layers from all recursive ancestors)
Another PR to extend the analysis for point-in-time space efficiency.
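
For the branching step, a rough sketch of "collect all the layers from all recursive ancestors" could look like the following. `TimelineSketch` and `layers_with_ancestors` are hypothetical names for illustration only, and layers are reduced to plain names. A real version would operate on the pageserver's timeline and LayerMap types, and would probably also need to limit ancestor layers to the LSNs visible at each branch point, which this sketch does not model.

```rust
/// Hypothetical stand-in for a timeline: its own layers (by name) plus an
/// optional ancestor it was branched from.
pub struct TimelineSketch {
    pub layers: Vec<String>,
    pub ancestor: Option<Box<TimelineSketch>>,
}

/// Walk the ancestor chain and gather every layer, child first.
/// The result could then be fed into a temporary layer map for analysis.
pub fn layers_with_ancestors(timeline: &TimelineSketch) -> Vec<&str> {
    let mut all = Vec::new();
    let mut current = Some(timeline);
    while let Some(t) = current {
        all.extend(t.layers.iter().map(String::as_str));
        // Move one step up the ancestor chain; `None` ends the walk.
        current = t.ancestor.as_deref();
    }
    all
}
```

The walk is iterative rather than recursive, so a long ancestor chain does not grow the call stack.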

pageserver/src/http/routes.rs

problame · 2024-06-04T18:08:54Z

Hm, and one more thought: The max_num_of_deltas_above_image is only for @latest LSN, but what we (also) care about is worst-case max_num_of_deltas_above_image at random LSN.
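
To illustrate the difference between the @latest number and the worst case over LSNs, here is a small sketch under simplifying assumptions; it is not the code in this PR. `Layer`, `deltas_to_visit` and `worst_case_deltas_to_visit` are hypothetical names, and a delta layer is counted as soon as its LSN range starts at or below the read LSN and ends after the newest usable image.

```rust
use std::ops::Range;

/// Hypothetical, simplified layers; all ranges are half-open.
pub enum Layer {
    Image { keys: Range<u64>, lsn: u64 },
    Delta { keys: Range<u64>, lsns: Range<u64> },
}

/// Number of delta layers a read of `key` at `read_lsn` has to visit before
/// it reaches an image layer (or runs out of layers).
pub fn deltas_to_visit(layers: &[Layer], key: u64, read_lsn: u64) -> usize {
    // Newest image layer that covers the key and is not newer than the read.
    let image_lsn = layers
        .iter()
        .filter_map(|l| match l {
            Layer::Image { keys, lsn } if keys.contains(&key) && *lsn <= read_lsn => Some(*lsn),
            _ => None,
        })
        .max();
    layers
        .iter()
        .filter(|l| match l {
            Layer::Delta { keys, lsns } => {
                keys.contains(&key)
                    && lsns.start <= read_lsn
                    && image_lsn.map_or(true, |img| lsns.end > img)
            }
            _ => false,
        })
        .count()
}

/// Worst case over all read LSNs. The count can only grow when the read LSN
/// reaches the start of a delta layer, so checking those LSNs is enough.
pub fn worst_case_deltas_to_visit(layers: &[Layer], key: u64) -> usize {
    layers
        .iter()
        .filter_map(|l| match l {
            Layer::Delta { lsns, .. } => Some(lsns.start),
            _ => None,
        })
        .map(|lsn| deltas_to_visit(layers, key, lsn))
        .max()
        .unwrap_or(0)
}
```

With three stacked delta layers below an image layer and one delta layer above it, a read at the latest LSN visits a single delta layer, while a read just below the image visits all three, which is the gap the comment above points at.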

Co-authored-by: Christian Schwarz <christian@neon.tech>

Signed-off-by: Alex Chi Z <chi@neon.tech>

skyzh · 2024-06-05T18:06:31Z

but what we (also) care about is worst-case max_num_of_deltas_above_image at random LSN.

Yep, that makes sense. I will submit a separate pull request for that.

skyzh · 2024-06-05T18:07:15Z

Ready for review again :) Hopefully I've resolved all the concerns. There is still quite a bit of future work before this analysis code becomes really useful.

problame · 2024-06-10T08:43:23Z

but what we (also) care about is worst-case max_num_of_deltas_above_image at random LSN.

Yep, that makes sense. I will submit a separate pull request for that.

I'll continue to work on the quantification efforts, implementing my asks above, while @skyzh will work on #7948 and follow-ups.

skyzh requested a review from arpad-m May 23, 2024 19:04

skyzh requested a review from a team as a code owner May 23, 2024 19:04

problame mentioned this pull request May 28, 2024

quantification of compaction algorithms #7770

Closed

test(pageserver): quantify compaction outcome

fd46dc5

Signed-off-by: Alex Chi Z <chi@neon.tech>

skyzh force-pushed the skyzh/compaction-estimation-tool branch from c16c201 to fd46dc5 Compare May 31, 2024 20:02

skyzh changed the title ~~feat(pagectl): tool to estimate compaction outcome~~ test(pageserver): quantify compaction outcome May 31, 2024

skyzh added the run-benchmarks label (indicates to the CI that benchmarks should be run for PRs marked with this label) May 31, 2024

skyzh added 2 commits June 3, 2024 16:18

fix clippy

ecca5d7

Signed-off-by: Alex Chi Z <chi@neon.tech>

fix metrics collection

f8ed5a7

Signed-off-by: Alex Chi Z <chi@neon.tech>

skyzh force-pushed the skyzh/compaction-estimation-tool branch from daf73bb to f8ed5a7 Compare June 4, 2024 15:43

problame reviewed Jun 4, 2024

pageserver/src/http/routes.rs Outdated

pageserver/src/http/routes.rs Outdated

skyzh and others added 2 commits June 4, 2024 15:30

Update pageserver/src/http/routes.rs

b49e97a

Co-authored-by: Christian Schwarz <christian@neon.tech>

resolve comments

45cad40

Signed-off-by: Alex Chi Z <chi@neon.tech>

skyzh requested a review from problame June 5, 2024 18:06

problame approved these changes Jun 10, 2024

problame merged commit 3e63d0f into main Jun 10, 2024
68 checks passed

problame deleted the skyzh/compaction-estimation-tool branch June 10, 2024 08:42
