storage scrubber: GC ancestor shard layers #8196
Conversation
3169 tests run: 3048 passed, 0 failed, 121 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
This comment is automatically updated with the latest test results.
fde87d2 at 2024-07-19T16:01:14.259Z :recycle:
Force-pushed from 824449d to 88dfeaa.
We should clean up the variable names when speaking about tenants (`tenant_shard_id` vs `tenant_id`).
I would prefer we call it parent shard, so at least there is a minimal distinction from the word ancestor, which we already use for `Timeline`.
What if the remote storage lists some but not all shards, e.g. because we're in the middle of `Tenant::split_prepare`?
Edit: ah, nice, you handled that. Kind of an important thing to regress-test. Shouldn't be too hard to test this with a pausable failpoint, should it?
Edit 2: but did you also consider the same condition for the `stream_tenant_timelines` call made by `gc_ancestor`? I think it can observe a tenant shard in S3 while some of its timelines' `IndexPart`s haven't been uploaded by `Tenant::split_prepare` yet.
Let's make sure we understand which races can happen, why they are fine, and how they are handled. And: where will we document this? A module-level comment doesn't seem like the worst choice, but it's hard to discover.
About `AncestorRefs`: why is it not a more physical refcount map, e.g. keyed on the `absolute_key`? Sure, a bit less memory efficient, but much harder to screw up. Similarly, I don't think we should go through the trouble of only tracking the cross-ancestor references.
I'm feeling kind of strongly about this; I think the code could be much simpler and lower risk.
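To make the suggestion concrete, here is a minimal sketch of the kind of "physical" refcount map the comment above describes: counting every reference by absolute layer key, rather than only cross-ancestor references. The names (`LayerRefCounts`, `add_ref`, `is_unreferenced`) are illustrative, not the actual scrubber types.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: a plain refcount map keyed by the absolute layer key
/// (e.g. the full remote path). Every index that mentions a layer bumps the
/// count; a layer with no entry is a GC candidate.
#[derive(Default)]
struct LayerRefCounts {
    counts: HashMap<String, usize>,
}

impl LayerRefCounts {
    /// Record one reference to a layer from some shard's index.
    fn add_ref(&mut self, absolute_key: &str) {
        *self.counts.entry(absolute_key.to_string()).or_insert(0) += 1;
    }

    /// A layer is garbage only if no index references it at all.
    fn is_unreferenced(&self, absolute_key: &str) -> bool {
        !self.counts.contains_key(absolute_key)
    }
}
```

The cost is one map entry per referenced layer in the region, which is what the memory-efficiency reply below pushes back on.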
I need to think about this more. The hardest case is if we list things to do GC, then a new split happens while we are GC'ing, which references layers we will delete. This is only an issue when doing multiple splits. One possible soft approach is to always upload the parent index_part during split, and then apply the min_age to the parent index mtime.
Thought more about this. When we GC an ancestor layer, it means that the indices for children no longer refer to it. So any further splits cannot refer to the ancestral layer. Importantly, the split process shuts down the remote timeline client for the parent shard before it uploads child shards. That means that once we upload the child shards, the parents may be considered "frozen" and do not risk dereferencing any ancestor layers (as long as the pageserver is still running). If the pageserver dies halfway through a split, the storage controller will retry the split, which will eventually still write child shard indices which do not reference any ancestral layers that their parent doesn't reference.
Considering the "hard case" of multiple shard splits: 0000 -> 0002,0102 -> 0004,0104,0204,0304.
The most robust approach would be to have some persistent "split complete" marker in tenant shards: this would tell us that it's safe to assume that only higher shard counts matter, and it's safe to drop things that are referenced by this shard count. It is only safe to set the "split complete" marker once the storage controller has persisted the completion of the split, i.e. nothing will ever try to attach the ancestral shards.
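For readers unfamiliar with the NNCC shard notation used above (two hex digits of shard number followed by two of shard count), here is a small illustrative sketch of how the child shards in that split sequence relate to their parent. It assumes a child's shard number is congruent to its parent's modulo the old shard count; this is a simplification for illustration, not the actual pageserver split code.

```rust
/// Hypothetical helper: enumerate the display names of the child shards a
/// parent shard (number, count) maps to when the tenant splits to
/// `new_count` shards. An unsharded tenant (count 0) behaves like count 1.
fn child_shards(parent_number: u32, parent_count: u32, new_count: u32) -> Vec<String> {
    let old = parent_count.max(1);
    (0..new_count)
        .filter(|j| j % old == parent_number)
        // NNCC form: two hex digits of shard number, two of shard count.
        .map(|j| format!("{:02x}{:02x}", j, new_count))
        .collect()
}
```

Under this model, 0000 splits into 0002 and 0102, and a second split sends 0002 to 0004/0204 and 0102 to 0104/0304, matching the sequence in the comment above.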
Force-pushed from 8c63696 to fd7037f.
Memory efficiency is relatively important here, because we build the `AncestorRefs` for all timelines in an entire region.
Same as above: because we're building this map globally, it's important that we limit the scope of what we store in memory. Could we rework this to avoid building the monolithic ancestor refs map? Yes, but at the cost of having to first build a global list of tenant shards, then go back and do the per-tenant work (with a smaller `AncestorRefs` scoped to one tenant). It's a tradeoff: for this PR, I'm preferring to take the memory cost of listing all ancestor refs over the memory cost of storing a total list of all tenants (100ks). I wrote this avoiding the assumption that listings are reliably returned in order: S3 kinda-sorta promises this, but I have limited trust in third-party implementations, which might offer a more readdir-like lack of ordering.
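The alternative described in the comment above (per-tenant scoping at the cost of an extra pass) can be sketched as follows. All names here are hypothetical; this only illustrates the shape of the tradeoff, not the scrubber's actual code.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical sketch: bucket ancestor references by tenant first, so each
/// tenant can later be GC'd with a refcount map scoped to it alone, instead
/// of holding one monolithic region-wide map. The price is materializing the
/// full (tenant, layer) list up front.
fn scope_refs_per_tenant(
    refs: Vec<(String, String)>, // (tenant_id, referenced layer key)
) -> BTreeMap<String, HashMap<String, usize>> {
    let mut per_tenant: BTreeMap<String, HashMap<String, usize>> = BTreeMap::new();
    for (tenant, layer_key) in refs {
        *per_tenant
            .entry(tenant)
            .or_default()
            .entry(layer_key)
            .or_insert(0) += 1;
    }
    per_tenant
}
```

Note this grouping does not rely on the listing being ordered, matching the comment's caveat about not trusting S3-compatible implementations to return keys in order.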
The use of "ancestor" is intentional, because a shard can reference layers that are from its parent's parents. It's annoying that we have an "ancestor shard" and "ancestor timeline" concept, but I think it's the right word. We could call them "forebears" or some other synonym, but I think that would be worse :-)
If we're going to integrate the scrubber with storcon anyway, then we might as well avoid all the "discover tenants/shards through ListObjectsV2" work and instead simply ask storcon for a list of tenants to scrub. And that list would include a consistent list of which shards exist.
LGTM, sorry for the delay
## Problem

After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space.

## Summary of changes

- Extend the physical GC command with `--mode=full`, which includes cleaning up unreferenced ancestor shard layers
- Add test `test_scrubber_physical_gc_ancestors`
- Remove colored log output: in testing this is irritating ANSI code spam in logs, and in interactive use doesn't add much
- Refactor storage controller API client code out of storcon_client into a `storage_controller/client` crate
- During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant, and there is no shard split in progress
Part of #8128, followed by #8502.

## Problem

Currently we lack a mechanism to alert on an unhealthy `scan_metadata` status if we start running this scrubber command as part of a cron job. With the storage controller client introduced to the storage scrubber in #8196, it is viable to set up alerting by storing health status in the storage controller database. We intentionally do not store the full output in the database, as the JSON blobs could potentially make the table really huge. Instead, we store only a health status and a timestamp recording the last time a metadata health status was posted for a tenant shard.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>