
storage scrubber: GC ancestor shard layers #8196

Merged: 21 commits into main, Jul 19, 2024

Conversation

@jcsp (Contributor) commented Jun 28, 2024

Problem

After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space.

Summary of changes

  • Extend the physical GC command with --mode=full, which includes cleaning up unreferenced ancestor shard layers (a rough sketch follows this list)
  • Add test test_scrubber_physical_gc_ancestors
  • Remove colored log output: in testing it is irritating ANSI code spam in the logs, and in interactive use it doesn't add much
  • Refactor storage controller API client code out of storcon_client into a storage_controller/client crate
  • During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant and that no shard split is in progress
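
As a rough sketch of where the new mode fits (the enum and function names below are illustrative assumptions, not the actual types in storage_scrubber/src/pageserver_physical_gc.rs), ancestor-layer cleanup is gated behind the full mode:

```rust
// Illustrative only: the real GC entry point lives in
// storage_scrubber/src/pageserver_physical_gc.rs and may use different names.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum GcMode {
    /// Pre-existing GC behavior (no ancestor layer cleanup).
    IndicesOnly,
    /// Also clean up unreferenced ancestor shard layers (--mode=full).
    Full,
}

fn should_gc_ancestor_layers(mode: GcMode) -> bool {
    // Ancestor layer deletion is opt-in via --mode=full.
    matches!(mode, GcMode::Full)
}

fn main() {
    assert!(should_gc_ancestor_layers(GcMode::Full));
    assert!(!should_gc_ancestor_layers(GcMode::IndicesOnly));
}
```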

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist

@jcsp jcsp added t/feature Issue type: feature, for new features or requests c/storage/pageserver Component: storage: pageserver c/storage/scrubber Component: s3_scrubber labels Jun 28, 2024
@jcsp jcsp changed the title Jcsp/issue 7043 ancestral gc storage scrubber: GC ancestor shard layers Jun 28, 2024

github-actions bot commented Jun 28, 2024

3169 tests run: 3048 passed, 0 failed, 121 skipped (full report)


Flaky tests (1)

Postgres 16

  • test_timeline_deletion_with_files_stuck_in_upload_queue: debug

Code coverage* (full report)

  • functions: 32.7% (6992 of 21411 functions)
  • lines: 50.0% (55243 of 110439 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
Last updated: fde87d2 at 2024-07-19T16:01:14.259Z

@jcsp jcsp force-pushed the jcsp/issue-7043-ancestral-gc branch from 824449d to 88dfeaa Compare July 1, 2024 10:40
@jcsp jcsp requested a review from problame July 1, 2024 18:35
@jcsp jcsp marked this pull request as ready for review July 1, 2024 18:35
@problame (Contributor) left a comment

We should clean up the variable names when speaking about tenants (tenant_shard_id vs tenant_id).


I would prefer we call it parent shard, so at least there is a minimal distinction from the word ancestor which we already use for Timeline.


What if the remote storage lists some but not all shards, e.g. because we're in the middle of Tenant::split_prepare?

Edit: ah, nice, you handled that.

Kind of an important thing to regress-test. Shouldn't be too hard to test this with a pausable failpoint, should it?

Edit 2: but, did you also consider the same condition for the stream_tenant_timelines call made by gc_ancestor? I think it can observe a tenant shard in S3 but some of its timelines' IndexParts haven't been uploaded by Tenant::split_prepare yet.

Let's make sure we understand which races can happen and why they are fine / how they are handled. And: where will we document this? A module-level comment doesn't seem like the worst choice, but it's hard to discover.


About AncestorRefs: why is it not a more physical refcount map, like on the absolute_key? Sure, a bit less memory efficient, but much harder to screw up. Similarly, I don't think we should go through the trouble of only tracking the cross-ancestor references.

I'm feeling kind of strongly about this, I think the code could be much simpler and lower risk.

Review threads (resolved): test_runner/regress/test_storage_scrubber.py, storage_scrubber/src/pageserver_physical_gc.rs

@jcsp (Contributor, Author) commented Jul 8, 2024

What if the remote storage lists some but not all shards, e.g. because we're in the middle of Tenant::split_prepare?

I need to think about this more. The hardest case is if we list objects to plan GC and then a new split happens while we are GC'ing that references layers we are about to delete. This is only an issue when doing multiple splits.

One possible soft approach is to always upload the parent index_part during a split, and then apply the min_age threshold to the parent index's mtime (a sketch of that check follows).
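
A minimal sketch of that mtime guard, assuming the scrubber already has the index object's last-modified time from the S3 listing; the function name and the way min_age is supplied are illustrative assumptions, not the scrubber's actual API:

```rust
use std::time::{Duration, SystemTime};

// Illustrative sketch: only treat a parent shard's layers as GC candidates if
// its index_part.json has been unmodified for at least `min_age`. A split that
// is still in flight re-uploads the parent index, keeping it "young", so its
// layers are skipped.
fn parent_index_is_stale(index_last_modified: SystemTime, min_age: Duration) -> bool {
    match SystemTime::now().duration_since(index_last_modified) {
        Ok(age) => age >= min_age,
        // mtime in the future (clock skew): treat it as too recent to touch.
        Err(_) => false,
    }
}

fn main() {
    let an_hour_ago = SystemTime::now() - Duration::from_secs(3600);
    assert!(parent_index_is_stale(an_hour_ago, Duration::from_secs(60)));
    assert!(!parent_index_is_stale(SystemTime::now(), Duration::from_secs(60)));
}
```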

@jcsp (Contributor, Author) commented Jul 9, 2024

The hardest case is if we list objects to plan GC and then a new split happens while we are GC'ing that references layers we are about to delete. This is only an issue when doing multiple splits.

Thought more about this. When we GC an ancestor layer, it means that the children's indices no longer refer to it, so any further splits cannot refer to that ancestral layer either.

Importantly, the split process shuts down the remote timeline client for the parent shard before it uploads child shards. That means that once we upload the child shards, the parents may be considered "frozen" and do not risk dereferencing any ancestor layers (as long as the pageserver is still running). If the pageserver dies halfway through a split, the storage controller will retry the split, which will eventually still write child shard indices that do not reference any ancestral layers their parent doesn't reference.

@jcsp (Contributor, Author) commented Jul 9, 2024

What if the remote storage lists some but not all shards, e.g. because we're in the middle of Tenant::split_prepare?

Considering the "hard case" of multiple shard splits, 0000 -> 0002,0102 -> 0004,0104,0204,0304:

  • A) If we see an incomplete set of the **04 shards, we'll refuse to do any ancestor GC, because the code checks that for the highest shard count we've seen, we have all the shards at that count (a sketch of this completeness check follows below).
  • B) If we do not see any of the **04 shards, then we'll treat the **02 shards as the latest ones. That is safe because any ancestor referenced by a **04 shard must also be referenced by a **02 shard: when we split, we stop updates to the parent index before writing the child index, and in failure cases we always re-run the split, freezing the parent again and writing the child again.
  • C) If we see the **04 shards but not the **02 shards (e.g. because we did two splits in really fast succession while a scrub was running), then that's usually okay, because the integrity of the **02 shards doesn't matter once the **04 shards exist.
    • Problem case: the split wrote the **04 shards and then the pageserver crashed, such that on next restart it's going to fire up the **02 shards again.
    • Problem case: the split has written the **04 shards, and those child shards have done some GC/compaction such that they no longer reference the parent, but the pageserver's response didn't make it to the storage controller; if the storage controller restarts, it will try to attach the **02 shards again.

The most robust approach would be to have some persistent "split complete" marker in tenant shards: this would tell us that it's safe to assume that only higher shardcounts matter, and it's safe to drop things that are referenced by this shardcount. It is only safe to set the "split complete" marker once the storage controller has persisted the completion of the split, i.e. nothing will ever try to attach the ancestral shards.
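
As a rough illustration of the completeness check described in (A), the grouping could look something like this; ShardIndex and the helper below are assumptions for the sketch, not the scrubber's actual types:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-in for the shard identity parsed out of S3 keys: a shard
// number within a particular shard count (e.g. 0104 = shard 1 of count 4).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ShardIndex {
    number: u8,
    count: u8,
}

/// Return the shards of the highest shard count we observed, or None if that
/// generation is incomplete (case A above: refuse to do any ancestor GC).
fn latest_complete_shards(observed: &[ShardIndex]) -> Option<Vec<ShardIndex>> {
    let mut by_count: BTreeMap<u8, BTreeSet<u8>> = BTreeMap::new();
    for shard in observed {
        by_count.entry(shard.count).or_default().insert(shard.number);
    }
    // Highest shard count seen in the listing.
    let (&count, numbers) = by_count.iter().next_back()?;
    if numbers.len() == count as usize {
        Some(
            numbers
                .iter()
                .map(|&number| ShardIndex { number, count })
                .collect(),
        )
    } else {
        // A split may be mid-flight: bail out rather than GC against partial state.
        None
    }
}

fn main() {
    // 0002 and 0102 are present, but only one of the four **04 shards: no GC.
    let seen = [
        ShardIndex { number: 0, count: 2 },
        ShardIndex { number: 1, count: 2 },
        ShardIndex { number: 1, count: 4 },
    ];
    assert!(latest_complete_shards(&seen).is_none());
}
```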

@jcsp jcsp force-pushed the jcsp/issue-7043-ancestral-gc branch from 8c63696 to fd7037f Compare July 11, 2024 10:34
@jcsp jcsp requested a review from a team as a code owner July 11, 2024 10:34
@jcsp jcsp requested review from petuhovskiy and arssher July 11, 2024 10:34
@jcsp (Contributor, Author) commented Jul 12, 2024

About AncestorRefs: why is it not a more physical refcount map, like on the absolute_key? Sure, a bit less memory efficient but much harder to screw up.

Memory efficiency is relatively important here, because we build the AncestorRefs for all timelines in an entire region.

Similarly, I don't think we should go through the trouble of only tracking the cross-ancestor references.

Same as above: because we're building this map globally, it's important that we limit the scope of what we store in memory.

Could we rework this to avoid building the monolithic ancestor refs map? Yes, but at the cost of having to first build a global list of tenant shards and then go back and do the per-tenant work (with a smaller AncestorRefs scoped to one tenant). It's a tradeoff: for this PR, I prefer to take the memory cost of listing all ancestor refs over the memory cost of storing a total list of all tenants (100ks of them).

I wrote this avoiding the assumption that listings are reliably returned in order: S3 kinda-sorta promises this, but I have limited trust in third-party implementations, which might offer a more readdir-like lack of ordering.
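
To make the tradeoff concrete, here's a rough sketch of the two shapes under discussion; the type and field names are illustrative assumptions, not the actual AncestorRefs definition:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical identifiers standing in for the real pageserver types.
type TenantShardId = String; // e.g. "<tenant id>-0102"
type LayerName = String; // layer file name as it appears in an index

// Reviewer's suggestion: a refcount on every referenced object's absolute key.
// Simple to reason about, but it holds one entry per referenced layer of every
// shard in the region, most of which are ordinary self-references.
#[allow(dead_code)]
type AbsoluteKeyRefcounts = HashMap<String, usize>;

// What the PR argues for: only remember cross-shard references, i.e. layers
// that live under an ancestor shard's prefix but are still referenced by some
// child shard's index. Everything else never needs to be held in memory.
#[derive(Default)]
struct AncestorRefsSketch {
    referenced: HashSet<(TenantShardId, LayerName)>,
}

impl AncestorRefsSketch {
    fn note_reference(&mut self, ancestor: TenantShardId, layer: LayerName) {
        self.referenced.insert((ancestor, layer));
    }

    fn is_referenced(&self, ancestor: &str, layer: &str) -> bool {
        self.referenced
            .contains(&(ancestor.to_string(), layer.to_string()))
    }
}

fn main() {
    let mut refs = AncestorRefsSketch::default();
    refs.note_reference("tenant-0002".to_string(), "layer-A".to_string());
    assert!(refs.is_referenced("tenant-0002", "layer-A"));
    assert!(!refs.is_referenced("tenant-0102", "layer-A"));
}
```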

@jcsp (Contributor, Author) commented Jul 12, 2024

I would prefer we call it parent shard, so at least there is a minimal distinction from the word ancestor which we already use for Timeline.

The use of "ancestor" is intentional, because a shard can reference layers that are from it's parent's parents. It's annoying that we have an "ancestor shard" and "ancestor timeline" concept, but I think it's the right word. We could call them "forebears" or some other synonym but I think that would be worse :-)

@problame (Contributor) commented

Could we rework this to avoid building the monolithic ancestor refs map? Yes, but at the cost of having to first build global list of tenant shards, then go back and do the per-tenant work (with a smaller AncestorRefs scoped to one tenant)

If we're going to integrate the scrubber with storcon anyway, then we might as well avoid all of the "discover tenants/shards through ListObjectsV2" and instead simply ask storcon for a list of tenants to scrub. That list would include a consistent view of which shards exist.

@petuhovskiy petuhovskiy removed their request for review July 18, 2024 11:34
@problame (Contributor) left a comment

LGTM, sorry for the delay

Review threads (resolved): storage_scrubber/src/pageserver_physical_gc.rs
@jcsp jcsp enabled auto-merge (squash) July 19, 2024 14:43
@jcsp jcsp merged commit 4478151 into main Jul 19, 2024
66 checks passed
@jcsp jcsp deleted the jcsp/issue-7043-ancestral-gc branch July 19, 2024 16:08
problame pushed a commit that referenced this pull request Jul 22, 2024
yliang412 added a commit that referenced this pull request Jul 30, 2024
Part of #8128, followed by #8502.

## Problem

Currently we lack a mechanism to alert on an unhealthy `scan_metadata` status
if we start running this scrubber command as part of a cronjob. With the
storage controller client introduced to the storage scrubber in #8196, it is
viable to set up alerting by storing health status in the storage controller
database.

We intentionally do not store the full output in the database, as the JSON
blobs could make the table really huge. Instead, we store only a health status
and a timestamp recording the last time a metadata health status was posted
for a tenant shard.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
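
As a rough illustration of the record described above (the struct and field names are assumptions for the sketch, not the actual storage controller schema), the per-tenant-shard entry could be as small as:

```rust
// Hypothetical sketch of the per-tenant-shard metadata health record the
// commit message describes: a status flag and a "last posted" timestamp,
// rather than the full scan_metadata JSON output.
use std::time::SystemTime;

#[derive(Debug, Clone)]
struct MetadataHealthRecord {
    tenant_shard_id: String,      // e.g. "<tenant id>-0004"
    healthy: bool,                // outcome of the most recent scan_metadata run
    last_scrubbed_at: SystemTime, // when that status was last posted
}

fn main() {
    let record = MetadataHealthRecord {
        tenant_shard_id: "example-tenant-0004".to_string(),
        healthy: true,
        last_scrubbed_at: SystemTime::now(),
    };
    println!("{record:?}");
}
```
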
arpad-m pushed a commit that referenced this pull request Aug 5, 2024