pageserver: post-shard-split layer rewrites (2/2) #7531
Conversation
3132 tests run: 3005 passed, 0 failed, 127 skipped (full report)

Code coverage* (full report)

\* collected from Rust tests only. The comment gets automatically updated with the latest test results.

a289914 at 2024-05-24T08:41:07.985Z
## Problem

After a shard split of a large existing tenant, child tenants can end up with oversized historic layers indefinitely, if those layers are prevented from being GC'd by branchpoints.

This PR is followed by #7531. Related issue: #7504

## Summary of changes

- Add a new compaction phase `compact_shard_ancestors`, which identifies layers that are no longer needed after a shard split.
- Add a Timeline->LayerMap code path called `rewrite_layers`, which is currently only used to drop layers, but will later be used to rewrite them as well in #7531.
- Add a new test that compacts after a split, and checks that something is deleted.

Note that this doesn't have much impact on a tenant's resident size (since unused layers would end up evicted anyway), but it:

- Makes index_part.json much smaller.
- Makes the system easier to reason about: instead of having tenants that in effect say "my physical size is 4TiB, but don't worry, I'll never actually download it", tenants report the real physical size of what they might download.

Why do we remove these layers in compaction rather than during the split? Because we have existing split tenants that need cleaning up. We can add it to the split operation in future as an optimization.
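The core decision in `compact_shard_ancestors` is whether a child shard still needs any of the keys in a parent-era layer. A minimal sketch of that idea, assuming a simplified key-to-shard mapping (the real pageserver maps striped key ranges, and all names here are illustrative, not its actual API):

```rust
// Hypothetical stand-in for the real key-to-shard mapping: assign each key
// to a shard by simple modulo. The actual pageserver hashes striped key
// ranges, but the ownership question is the same.
fn key_owner(key: u64, shard_count: u64) -> u64 {
    key % shard_count
}

/// Returns true if `shard_number` owns none of `layer_keys`, i.e. the child
/// shard can safely drop this parent-era layer during ancestor compaction.
fn layer_droppable(layer_keys: &[u64], shard_number: u64, shard_count: u64) -> bool {
    layer_keys
        .iter()
        .all(|&k| key_owner(k, shard_count) != shard_number)
}
```

In this toy model, after a 2-way split, shard 0 could drop a layer containing only odd keys, since every key in it belongs to shard 1.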
## Problem

In #7531, we would like to be able to rewrite layers safely. One option is to make `Layer` able to rewrite files in place safely (e.g. by blocking evictions/deletions for an old Layer while a new one is created), but that's relatively fragile. It's more robust in general if we simply never overwrite the same local file: we can do that by putting the generation number in the filename.

## Summary of changes

- Add `local_layer_path` (counterpart to `remote_layer_path`) and convert all locations that manually constructed a local layer path by joining LayerFileName to the timeline path.
- In the layer upload path, construct remote paths with `remote_layer_path` rather than trying to build them out of a local path.
- During startup, carry the full path to layer files through `init::reconcile`, and pass it into `Layer::for_resident`.
- Add a test to make sure we handle upgrades properly.
- Comment out the generation part of `local_layer_path`, since we need to maintain forward compatibility for one release. A tiny followup PR will enable it afterwards.

We could make this a bit simpler if we bulk-renamed existing layers on startup instead of carrying literal paths through init, but that is operationally risky on existing servers with millions of layer files. We can always do a renaming change in future if it becomes annoying; for the moment it's nice to have a structure that lets us change local path names again quite easily.

We should rename `LayerFileName` to `LayerName` or some such, to make it more obvious that it's not a literal filename: this was already a bit confusing where that type is used in remote paths. That will be a followup, to avoid polluting this PR's diff.
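The "generation in the filename" idea can be sketched as follows. This is a simplified illustration, not the pageserver's actual `local_layer_path` signature or suffix format, which are assumptions here:

```rust
use std::path::PathBuf;

// Hypothetical sketch: a layer's local path is its timeline directory joined
// with the layer name plus a generation suffix. A rewrite under a newer
// generation therefore produces a different filename and never overwrites
// the old file, so no in-place rewrite locking is needed.
fn local_layer_path(timeline_dir: &str, layer_name: &str, generation: u32) -> PathBuf {
    // Zero-padded hex keeps filenames fixed-width and lexically sortable.
    PathBuf::from(timeline_dir).join(format!("{layer_name}-{generation:08x}"))
}
```

With this scheme, rewriting a layer under generation `0x1b` leaves the generation-`0x1a` file untouched on disk until it is explicitly deleted.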
## Problem

In #7531, a test was flaky because the GC API endpoint fails if the tenant happens not to be active yet.

## Summary of changes

While adding a wait for the tenant to become active, I noticed that this endpoint is kind of strange (it spawns a TaskManager task) and has a comment `// TODO: spawning is redundant now, need to hold the gate`, so this PR cleans it up to just run the GC inline while holding a gate. The GC code is updated to avoid assuming it runs inside a task manager task. Skipping the task_mgr cancellation token check is safe, because timeline shutdown always cancels `Timeline::cancel`.
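The "run inline while holding a gate" pattern can be sketched with a toy gate. This is a minimal stand-in, not the pageserver's actual gate type; the names and the pretend GC body are assumptions:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

// Minimal gate sketch: each in-flight operation holds a guard; shutdown can
// observe whether any work is still inside the gate before tearing down.
struct Gate {
    active: Arc<AtomicUsize>,
}

struct GateGuard {
    active: Arc<AtomicUsize>,
}

impl Gate {
    fn new() -> Self {
        Gate { active: Arc::new(AtomicUsize::new(0)) }
    }
    fn enter(&self) -> GateGuard {
        self.active.fetch_add(1, Ordering::SeqCst);
        GateGuard { active: Arc::clone(&self.active) }
    }
    fn is_idle(&self) -> bool {
        self.active.load(Ordering::SeqCst) == 0
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}

// The request handler enters the gate and runs GC inline, instead of
// spawning a detached task: the work cannot outlive the gate holder.
fn run_gc_inline(gate: &Gate) -> usize {
    let _guard = gate.enter(); // held for the duration of GC
    2 // pretend GC removed two layers (stand-in for the real work)
}
```

The point of the pattern is that when the handler returns, the guard has dropped, so shutdown sees the gate idle and never races a half-finished GC.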
Right: where we have two layers on disk that have the same LayerName but different generation suffixes.

I don't know about any more racy behaviour. The init.rs changes are too much for me to get into without a serious time investment to read all of this week's PRs.
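The duplicate-file case mentioned above amounts to a tie-break at startup: among on-disk files sharing one LayerName, keep the highest generation and treat the rest as stale. A hypothetical sketch (not init.rs's actual types or logic, which this thread notes are more involved):

```rust
// Hypothetical reconcile step: `candidates` lists (file path, parsed
// generation suffix) pairs for a single LayerName. The newest generation
// wins; any others are leftovers from an interrupted rewrite.
fn pick_latest(candidates: &[(String, u32)]) -> Option<&(String, u32)> {
    candidates.iter().max_by_key(|(_, g)| *g)
}
```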
## Problem

This PR follows #7531, and adds rewriting of layers that contain a mixture of needed & un-needed contents, in addition to dropping un-needed layers.

Closes: #7504

## Summary of changes

- Extend `compact_shard_ancestors` to rewrite layer files that contain a mixture of keys that we want and keys we do not, if unwanted keys are the majority of those in the file.
- Tolerate leftover `.old` files, since those are now expected on restarts during rewrites.
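The drop-vs-rewrite decision for a parent-era layer can be sketched as a simple three-way outcome. The threshold ("unwanted keys are the majority") comes from the description above; the type and function names are illustrative, not the pageserver's API:

```rust
// Hypothetical per-layer plan during shard-ancestor compaction.
enum ShardCompactionOutcome {
    Keep,    // mostly wanted keys: leave the layer as-is
    Rewrite, // unwanted keys are the majority: rewrite keeping only ours
    Drop,    // no wanted keys at all: the child shard can delete it
}

fn plan_layer(wanted: usize, unwanted: usize) -> ShardCompactionOutcome {
    if wanted == 0 {
        ShardCompactionOutcome::Drop
    } else if unwanted > wanted {
        ShardCompactionOutcome::Rewrite
    } else {
        ShardCompactionOutcome::Keep
    }
}
```

Rewriting only when unwanted keys dominate keeps the amount of rewrite I/O proportional to the space actually reclaimed.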