pageserver: add time based image layer creation check #8247

VladLazar · 2024-07-03T14:28:41Z

Problem

Assume a timeline with the following workload: very slow ingest of updates to a small number of keys
that fit within the same partition (as decided by KeySpace::partition). These tenants will create small
L0 layers due to time-based rolling and, consequently, the L1 layers will also be small.

Currently, by default, we need to ingest 512 MiB of WAL before checking if an image layer is required.
This scheme works fine under the assumption that L1s are roughly of checkpoint distance size, but
as the first paragraph explained, that's not the case for all workloads.

Summary of changes

Check if new image layers are required at least once every checkpoint timeout interval.
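
To illustrate the idea, here is a minimal sketch of a check that fires on either a WAL-distance threshold or an elapsed-time threshold, whichever comes first. It is not the code from this PR: the type names, field names, and the use of a plain `u64` for the LSN are all hypothetical, and the thresholds are passed in rather than read from tenant config.

```rust
use std::time::{Duration, Instant};

/// Hypothetical thresholds; in the pageserver these would come from tenant config.
struct CheckConfig {
    /// WAL bytes to ingest before a size-based check is due (e.g. 512 MiB).
    wal_distance_threshold: u64,
    /// Upper bound on the time between two checks (the checkpoint timeout).
    checkpoint_timeout: Duration,
}

/// Hypothetical per-timeline state remembering when the last check happened.
struct ImageLayerCheckState {
    last_check_lsn: u64,
    last_check_at: Instant,
}

impl ImageLayerCheckState {
    /// Returns true if an image layer creation check is due, either because
    /// enough WAL was ingested or because enough time has passed since the
    /// previous check. Records the new check position when it returns true.
    fn should_check(&mut self, current_lsn: u64, now: Instant, cfg: &CheckConfig) -> bool {
        let distance = current_lsn.saturating_sub(self.last_check_lsn);
        let elapsed = now.saturating_duration_since(self.last_check_at);

        let due = distance >= cfg.wal_distance_threshold
            || elapsed >= cfg.checkpoint_timeout;

        if due {
            self.last_check_lsn = current_lsn;
            self.last_check_at = now;
        }
        due
    }
}
```

With a size-only rule, a timeline that ingests a few kilobytes per hour would never reach the distance threshold. The time-based arm guarantees the check still runs at least once per timeout interval.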

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

koivunej

This should work. Sadly I cannot see a way to test this, except in staging.

VladLazar · 2024-07-03T14:42:21Z

Sadly I cannot see a way to test this, except in staging.

That was my plan: get it to staging asap and validate that the metric goes down

pageserver/src/tenant/timeline.rs

skyzh

LGTM and I probably also need to modify the metadata image layer generation trigger.

github-actions · 2024-07-03T16:18:10Z

3111 tests run: 2984 passed, 0 failed, 127 skipped (full report)

Flaky tests (1)

Postgres 14

test_basebackup_with_high_slru_count[github-actions-selfhosted-sequential-10-13-30]: release

Code coverage* (full report)

functions: 32.6% (6931 of 21275 functions)
lines: 50.0% (54495 of 108968 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results:
5c3cdff at 2024-07-05T12:28:31.364Z

Dismissing because I am unsure about the time+size based

jcsp

The major change here is that tiny tenants can generate image layers up to every 48 hours, whereas previously they'd never generate any at all: that's okay, but let's monitor it carefully to see how much extra work we're doing.

VladLazar · 2024-07-05T09:16:52Z

The major change here is that tiny tenants can generate image layers up to every 48 hours, whereas previously they'd never generate any at all: that's okay, but let's monitor it carefully to see how much extra work we're doing.

Will keep an eye on it

VladLazar requested a review from a team as a code owner July 3, 2024 14:28

VladLazar requested review from petuhovskiy, koivunej and skyzh and removed request for petuhovskiy July 3, 2024 14:28

VladLazar added the run-benchmarks label (indicates to the CI that benchmarks should be run for PRs marked with this label) Jul 3, 2024

koivunej previously approved these changes Jul 3, 2024

jcsp reviewed Jul 3, 2024

skyzh approved these changes Jul 3, 2024

VladLazar requested review from jcsp, skyzh and koivunej July 4, 2024 09:44

VladLazar changed the title ~~pageserver: add time based imake layer creation check~~ pageserver: add time based image layer creation check Jul 4, 2024

jcsp approved these changes Jul 5, 2024

VladLazar added 3 commits July 5, 2024 11:21

pageserver: add time based imake layer creation check

537650f

Format and switch to >=

610c884

review: special case for big/small tenants

5c3cdff

VladLazar force-pushed the vlad/time-based-img-layer-check branch from 2c1884d to 5c3cdff July 5, 2024 10:22

VladLazar merged commit 7dd2e44 into main Jul 5, 2024
69 checks passed

VladLazar deleted the vlad/time-based-img-layer-check branch July 5, 2024 13:02
