test(pageserver): quantify compaction outcome #7867

Merged
merged 5 commits into from
Jun 10, 2024

Conversation

skyzh
Member

@skyzh skyzh commented May 23, 2024

Problem

We need a simple API that collects some statistics after compaction, so that the result is easy to understand.

The tool reads the layer map and analyzes it range by range instead of doing single-key operations, which is more efficient than running a benchmark to collect the result. It currently computes two key metrics:

  • Latest data access efficiency, which finds how many delta layers / image layers the system needs to iterate before returning any key in a key range.
  • (Approximate) PiTR efficiency, as in "quantification of compaction algorithms" (#7770), which is simply the number of delta files in the range. The reasoning is that, assuming no image layer is created, PiTR efficiency is simply the cost of collecting records from the delta layers plus the replay time. The number of delta files (or, in the future, the estimated size of reads) is a simple yet efficient way of estimating how much effort the page server needs to reconstruct a page.
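
As a rough illustration of the range-by-range analysis described above, here is a minimal sketch. It is not the code from this PR: `Layer`, `LayerKind`, `elementary_ranges`, `deltas_above_image`, and `analyze` are hypothetical names, keys and LSNs are simplified to `u64`, and an image layer is assumed to be stored with `lsn_range = lsn..lsn+1`. A delta layer is counted as "above" an image when its end LSN is greater than the image LSN, which is a simplification.

```rust
use std::ops::Range;

// Hypothetical, simplified layer description; the real pageserver types differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LayerKind {
    Delta,
    Image,
}

#[derive(Debug, Clone)]
struct Layer {
    kind: LayerKind,
    key_range: Range<u64>,
    lsn_range: Range<u64>,
}

// True if the layer's key range fully contains `range`.
fn covers(layer: &Layer, range: &Range<u64>) -> bool {
    layer.key_range.start <= range.start && range.end <= layer.key_range.end
}

// Split the key space at every layer boundary, so that each resulting range
// is either fully covered by a layer or not touched by it at all.
fn elementary_ranges(layers: &[Layer]) -> Vec<Range<u64>> {
    let mut points: Vec<u64> = layers
        .iter()
        .flat_map(|l| [l.key_range.start, l.key_range.end])
        .collect();
    points.sort_unstable();
    points.dedup();
    points.windows(2).map(|w| w[0]..w[1]).collect()
}

// Number of delta layers above the newest image layer for one elementary range
// (all covering delta layers are counted if no image layer covers the range).
fn deltas_above_image(layers: &[Layer], range: &Range<u64>) -> usize {
    let newest_image = layers
        .iter()
        .filter(|l| l.kind == LayerKind::Image && covers(l, range))
        .map(|l| l.lsn_range.start)
        .max();
    layers
        .iter()
        .filter(|l| l.kind == LayerKind::Delta && covers(l, range))
        .filter(|l| newest_image.map_or(true, |img| l.lsn_range.end > img))
        .count()
}

// Returns (worst-case deltas above image over all ranges, total delta layer count).
fn analyze(layers: &[Layer]) -> (usize, usize) {
    let max_stack = elementary_ranges(layers)
        .iter()
        .map(|r| deltas_above_image(layers, r))
        .max()
        .unwrap_or(0);
    let num_deltas = layers.iter().filter(|l| l.kind == LayerKind::Delta).count();
    (max_stack, num_deltas)
}
```

The first element of the tuple stands in for the latest data access metric, the second for the approximate PiTR metric. Because the analysis works on whole key ranges, its cost depends on the number of layers rather than the number of keys.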

Summary of changes

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@skyzh skyzh requested a review from arpad-m May 23, 2024 19:04
@skyzh skyzh requested a review from a team as a code owner May 23, 2024 19:04

github-actions bot commented May 23, 2024

3268 tests run: 3116 passed, 0 failed, 152 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

  • functions: 31.5% (6596 of 20943 functions)
  • lines: 48.5% (51038 of 105301 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
45cad40 at 2024-06-05T19:26:39.506Z

Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh skyzh force-pushed the skyzh/compaction-estimation-tool branch from c16c201 to fd46dc5 Compare May 31, 2024 20:02
@skyzh skyzh changed the title feat(pagectl): tool to estimate compaction outcome test(pageserver): quantify compaction outcome May 31, 2024
@skyzh
Member Author

skyzh commented May 31, 2024

Updated the tool to become an HTTP interface so that the Python tests can read it.

@skyzh skyzh added the run-benchmarks label (indicates to the CI that benchmarks should be run for PRs marked with this label) May 31, 2024
skyzh added 2 commits June 3, 2024 16:18
Signed-off-by: Alex Chi Z <chi@neon.tech>
Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh skyzh force-pushed the skyzh/compaction-estimation-tool branch from daf73bb to f8ed5a7 Compare June 4, 2024 15:43
@skyzh
Member Author

skyzh commented Jun 4, 2024

I believe CI will pass this time, and this is ready for review :) Thanks! @arpad-m @problame

Contributor

@problame problame left a comment

This PR adds code to determine, for a given layer map snapshot, the number of delta layers that need to be visited before we hit an image layer when reconstructing any key in the layer map.

That metric is what I'd loosely call delta layer stack height.

It is a rough proxy metric for random getpage@lsn IO amplification, under the assumption of uniform density along the key and LSN dimensions among all delta layers in the layer map. I.e., the probability of finding X amount of information about a random (key, lsn) ∈ key-lsn-range(L) of a given layer L is the same for all layers L.
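
One way to write the stack height down more precisely is sketched below. The notation is an assumption made for illustration and does not come from the PR: D is the set of delta layers, I the set of image layers, keys(L) the key range of a layer, and i(k) the LSN of the newest image layer covering key k.

```latex
% Assumed notation, for illustration only.
\[
  i(k) = \max\{\, \mathrm{lsn}(M) : M \in I,\ k \in \mathrm{keys}(M) \,\}
  \quad (\text{or } -\infty \text{ if no image layer covers } k)
\]
\[
  h(k) = \bigl|\{\, L \in D : k \in \mathrm{keys}(L),\ \mathrm{lsn\_end}(L) > i(k) \,\}\bigr|
\]
\[
  \text{reported metric} \approx \max_{k} h(k)
\]
```

Under the uniform-density assumption, every layer counted in h(k) is equally likely to contribute records to a read, so h(k) is roughly proportional to the number of layer visits for a getpage@lsn request at the latest LSN.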

While this is useful, we wanted the point-in-time & total space efficiency metrics.

I suppose to calculate worst-case point-in-time space usage, we'd need a similar analysis but along the LSN dimension.


In addition to the missing metrics, I suggest moving the analysis code into a sub-module of mod timeline that extends impl Timeline.

E.g., like we did for the compaction code:


Lastly, what about branching? Not covered in this PR.


I suggest the following way forward:

  1. Apply renaming to submodule in this PR, then let's get it merged.
  2. Another PR to add branching support (just build a temporary LayerMap instance that contains all the layers from all recursive ancestors)
  3. Another PR to extend the analysis for point-in-time space efficiency.
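
The branching idea in step 2 could look roughly like the sketch below. All names here (`Layer`, `Timeline`, `layers_with_ancestors`) are hypothetical stand-ins, not the pageserver's real types, and the result is a plain `Vec` rather than a real LayerMap. The filter that hides ancestor layers starting at or after the branch point is an added assumption; the suggestion above only says to include the ancestors' layers.

```rust
// Hypothetical, simplified types for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Layer {
    name: String,
    lsn_start: u64,
}

struct Timeline {
    layers: Vec<Layer>,
    // Parent timeline and the LSN at which this timeline branched off it.
    ancestor: Option<(Box<Timeline>, u64)>,
}

// Collect the timeline's own layers plus the layers of all recursive ancestors.
// Assumption: an ancestor layer is only relevant if it starts below the branch point.
fn layers_with_ancestors(timeline: &Timeline) -> Vec<Layer> {
    let mut all = timeline.layers.clone();
    let mut current = timeline;
    let mut visible_below = u64::MAX;
    while let Some((parent, branch_lsn)) = &current.ancestor {
        // The limit can only shrink as we walk further up the ancestor chain.
        visible_below = visible_below.min(*branch_lsn);
        all.extend(
            parent
                .layers
                .iter()
                .filter(|l| l.lsn_start < visible_below)
                .cloned(),
        );
        current = &**parent;
    }
    all
}
```

The combined list could then be fed into the same analysis that runs on a single timeline's layers.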

pageserver/src/http/routes.rs (outdated, resolved)
pageserver/src/http/routes.rs (outdated, resolved)
@problame
Contributor

problame commented Jun 4, 2024

Hm, and one more thought: The max_num_of_deltas_above_image is only for @latest LSN, but what we (also) care about is worst-case max_num_of_deltas_above_image at random LSN.
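
To make the difference between the @latest value and the worst case at an arbitrary LSN concrete, here is a minimal sketch for a single key range. It is an illustration under assumptions, not the follow-up implementation: the `Layer` enum and both functions are hypothetical, an image at LSN `i` is assumed visible to reads at `read_lsn >= i`, a delta layer is visible once `lsn_start <= read_lsn`, and it counts as above the image when `lsn_end > i`.

```rust
// Hypothetical layer model for one key range; the real pageserver types differ.
#[derive(Debug, Clone, Copy)]
enum Layer {
    Image { lsn: u64 },
    Delta { lsn_start: u64, lsn_end: u64 },
}

// Number of delta layers a read at `read_lsn` visits before reaching an image layer.
fn deltas_above_image_at(layers: &[Layer], read_lsn: u64) -> usize {
    // Newest image layer that is visible at `read_lsn`, if any.
    let image = layers
        .iter()
        .filter_map(|l| match l {
            Layer::Image { lsn } if *lsn <= read_lsn => Some(*lsn),
            _ => None,
        })
        .max();
    layers
        .iter()
        .filter(|l| match l {
            Layer::Delta { lsn_start, lsn_end } => {
                *lsn_start <= read_lsn && image.map_or(true, |img| *lsn_end > img)
            }
            _ => false,
        })
        .count()
}

// Worst case over all read LSNs. In this model the count only changes at a
// delta layer's start LSN or at an image layer's LSN, so probing those suffices.
fn worst_case_deltas_above_image(layers: &[Layer]) -> usize {
    layers
        .iter()
        .map(|l| match l {
            Layer::Image { lsn } => *lsn,
            Layer::Delta { lsn_start, .. } => *lsn_start,
        })
        .map(|probe| deltas_above_image_at(layers, probe))
        .max()
        .unwrap_or(0)
}
```

With a recent image layer on top of a long run of older delta layers, the value at the latest LSN can be small while a read just below that image still has to walk the whole run.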

skyzh and others added 2 commits June 4, 2024 15:30
Co-authored-by: Christian Schwarz <christian@neon.tech>
Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh
Member Author

skyzh commented Jun 5, 2024

> but what we (also) care about is worst-case max_num_of_deltas_above_image at random LSN.

Yep, that makes sense. I will submit a separate pull request for that.

@skyzh skyzh requested a review from problame June 5, 2024 18:06
@skyzh
Member Author

skyzh commented Jun 5, 2024

Ready for review again :) Hopefully I've resolved all the concerns. There is still quite a bit of future work before this analysis code becomes really useful.

@problame problame merged commit 3e63d0f into main Jun 10, 2024
68 checks passed
@problame problame deleted the skyzh/compaction-estimation-tool branch June 10, 2024 08:42
@problame
Contributor

> > but what we (also) care about is worst-case max_num_of_deltas_above_image at random LSN.
>
> Yep, that makes sense. I will submit a separate pull request for that.

I'll continue to work on the quantification efforts, implementing my asks above, while @skyzh will work on #7948 and follow-ups.

Labels
run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label
2 participants