
storage scrubber: GC ancestor shard layers #8196

Merged: 21 commits into main, Jul 19, 2024

Conversation

@jcsp (Contributor) commented Jun 28, 2024

Problem

After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space.

Summary of changes

  • Extend the physical GC command with --mode=full, which includes cleaning up unreferenced ancestor shard layers (a rough sketch follows this list)
  • Add test test_scrubber_physical_gc_ancestors
  • Remove colored log output: in testing it is irritating ANSI code spam in the logs, and in interactive use it doesn't add much
  • Refactor storage controller API client code out of storcon_client into a storage_controller/client crate
  • During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant and that no shard split is in progress
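
As a rough sketch of where the new mode fits (the enum and function names below are illustrative assumptions, not the actual types in storage_scrubber/src/pageserver_physical_gc.rs), ancestor-layer cleanup is gated behind the full mode:

```rust
// Illustrative only: the real GC entry point lives in
// storage_scrubber/src/pageserver_physical_gc.rs and may use different names.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum GcMode {
    /// Pre-existing GC behavior (no ancestor layer cleanup).
    IndicesOnly,
    /// Also clean up unreferenced ancestor shard layers (--mode=full).
    Full,
}

fn should_gc_ancestor_layers(mode: GcMode) -> bool {
    // Ancestor layer deletion is opt-in via --mode=full.
    matches!(mode, GcMode::Full)
}

fn main() {
    assert!(should_gc_ancestor_layers(GcMode::Full));
    assert!(!should_gc_ancestor_layers(GcMode::IndicesOnly));
}
```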

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist

@jcsp jcsp added t/feature Issue type: feature, for new features or requests c/storage/pageserver Component: storage: pageserver c/storage/scrubber Component: s3_scrubber labels Jun 28, 2024
@jcsp jcsp changed the title Jcsp/issue 7043 ancestral gc storage scrubber: GC ancestor shard layers Jun 28, 2024

github-actions bot commented Jun 28, 2024

3169 tests run: 3048 passed, 0 failed, 121 skipped (full report)


Flaky tests (1)

Postgres 16

  • test_timeline_deletion_with_files_stuck_in_upload_queue: debug

Code coverage* (full report)

  • functions: 32.7% (6992 of 21411 functions)
  • lines: 50.0% (55243 of 110439 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
Last updated: fde87d2 at 2024-07-19T16:01:14.259Z

@jcsp jcsp force-pushed the jcsp/issue-7043-ancestral-gc branch from 824449d to 88dfeaa Compare July 1, 2024 10:40
@jcsp jcsp requested a review from problame July 1, 2024 18:35
@jcsp jcsp marked this pull request as ready for review July 1, 2024 18:35
@problame (Contributor) left a comment

We should clean up the variable names when speaking about tenants (tenant_shard_id vs tenant_id).


I would prefer we call it parent shard, so at least there is a minimal distinction from the word ancestor which we already use for Timeline.


What if the remote storage lists some but not all shards, e.g. because we're in the middle of Tenant::split_prepare?

Edit: ah, nice, you handled that.

Kind of an important thing to regress-test. Shouldn't be too hard to test this with a pausable failpoint, should it?

Edit 2: but, did you also consider the same condition for the stream_tenant_timelines call made by gc_ancestor? I think it can observe a tenant shard in S3 but some of its timelines' IndexParts haven't been uploaded by Tenant::split_prepare yet.

Let's make sure we understand which races can happen and why they are fine / how they are handled. And: where will we document this? A module-level comment doesn't seem like the worst choice, but it's hard to discover.


About AncestorRefs: why is it not a more physical refcount map, like on the absolute_key? Sure, a bit less memory efficient, but much harder to screw up. Similarly, I don't think we should go through the trouble of only tracking the cross-ancestor references.

I'm feeling kind of strongly about this, I think the code could be much simpler and lower risk.

Review threads (resolved): test_runner/regress/test_storage_scrubber.py, storage_scrubber/src/pageserver_physical_gc.rs

@jcsp (Contributor, Author) commented Jul 8, 2024

What if the remote storage lists some but not all shards, e.g. because we're in the middle of Tenant::split_prepare?

I need to think about this more. The hardest case is if we list objects to plan GC and then a new split happens while we are GC'ing that references layers we are about to delete. This is only an issue when doing multiple splits.

One possible soft approach is to always upload the parent index_part during a split, and then apply the min_age threshold to the parent index's mtime (a sketch of that check follows).
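
A minimal sketch of that mtime guard, assuming the scrubber already has the index object's last-modified time from the S3 listing; the function name and the way min_age is supplied are illustrative assumptions, not the scrubber's actual API:

```rust
use std::time::{Duration, SystemTime};

// Illustrative sketch: only treat a parent shard's layers as GC candidates if
// its index_part.json has been unmodified for at least `min_age`. A split that
// is still in flight re-uploads the parent index, keeping it "young", so its
// layers are skipped.
fn parent_index_is_stale(index_last_modified: SystemTime, min_age: Duration) -> bool {
    match SystemTime::now().duration_since(index_last_modified) {
        Ok(age) => age >= min_age,
        // mtime in the future (clock skew): treat it as too recent to touch.
        Err(_) => false,
    }
}

fn main() {
    let an_hour_ago = SystemTime::now() - Duration::from_secs(3600);
    assert!(parent_index_is_stale(an_hour_ago, Duration::from_secs(60)));
    assert!(!parent_index_is_stale(SystemTime::now(), Duration::from_secs(60)));
}
```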

@jcsp (Contributor, Author) commented Jul 9, 2024

The hardest case is if we list objects to plan GC and then a new split happens while we are GC'ing that references layers we are about to delete. This is only an issue when doing multiple splits.

Thought more about this. When we GC an ancestor layer, it means that the children's indices no longer refer to it, so any further splits cannot refer to that ancestral layer either.

Importantly, the split process shuts down the remote timeline client for the parent shard before it uploads child shards. That means that once we upload the child shards, the parents may be considered "frozen" and do not risk dereferencing any ancestor layers (as long as the pageserver is still running). If the pageserver dies halfway through a split, the storage controller will retry the split, which will eventually still write child shard indices that do not reference any ancestral layers their parent doesn't reference.

@jcsp (Contributor, Author) commented Jul 9, 2024

What if the remote storage lists some but not all shards, e.g. because we're in the middle of Tenant::split_prepare?

Considering the "hard case" of multiple shard splits, 0000 -> 0002,0102 -> 0004,0104,0204,0304:

  • A) If we see an incomplete set of the **04 shards, we'll refuse to do any ancestor GC, because the code checks that for the highest shard count we've seen, we have all the shards at that count (a sketch of this completeness check follows below).
  • B) If we do not see any of the **04 shards, then we'll treat the **02 shards as the latest ones. That is safe because any ancestor referenced by a **04 shard must also be referenced by a **02 shard: when we split, we stop updates to the parent index before writing the child index, and in failure cases we always re-run the split, freezing the parent again and writing the child again.
  • C) If we see the **04 shards but not the **02 shards (e.g. because we did two splits in really fast succession while a scrub was running), then that's usually okay, because the integrity of the **02 shards doesn't matter once the **04 shards exist.
    • Problem case: the split wrote the **04 shards and then the pageserver crashed, such that on next restart it's going to fire up the **02 shards again.
    • Problem case: the split has written the **04 shards, and those child shards have done some GC/compaction such that they no longer reference the parent, but the pageserver's response didn't make it to the storage controller; if the storage controller restarts, it will try to attach the **02 shards again.

The most robust approach would be to have some persistent "split complete" marker in tenant shards: this would tell us that it's safe to assume that only higher shardcounts matter, and it's safe to drop things that are referenced by this shardcount. It is only safe to set the "split complete" marker once the storage controller has persisted the completion of the split, i.e. nothing will ever try to attach the ancestral shards.
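
As a rough illustration of the completeness check described in (A), the grouping could look something like this; ShardIndex and the helper below are assumptions for the sketch, not the scrubber's actual types:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-in for the shard identity parsed out of S3 keys: a shard
// number within a particular shard count (e.g. 0104 = shard 1 of count 4).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ShardIndex {
    number: u8,
    count: u8,
}

/// Return the shards of the highest shard count we observed, or None if that
/// generation is incomplete (case A above: refuse to do any ancestor GC).
fn latest_complete_shards(observed: &[ShardIndex]) -> Option<Vec<ShardIndex>> {
    let mut by_count: BTreeMap<u8, BTreeSet<u8>> = BTreeMap::new();
    for shard in observed {
        by_count.entry(shard.count).or_default().insert(shard.number);
    }
    // Highest shard count seen in the listing.
    let (&count, numbers) = by_count.iter().next_back()?;
    if numbers.len() == count as usize {
        Some(
            numbers
                .iter()
                .map(|&number| ShardIndex { number, count })
                .collect(),
        )
    } else {
        // A split may be mid-flight: bail out rather than GC against partial state.
        None
    }
}

fn main() {
    // 0002 and 0102 are present, but only one of the four **04 shards: no GC.
    let seen = [
        ShardIndex { number: 0, count: 2 },
        ShardIndex { number: 1, count: 2 },
        ShardIndex { number: 1, count: 4 },
    ];
    assert!(latest_complete_shards(&seen).is_none());
}
```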

@jcsp jcsp force-pushed the jcsp/issue-7043-ancestral-gc branch from 8c63696 to fd7037f Compare July 11, 2024 10:34
@jcsp jcsp requested a review from a team as a code owner July 11, 2024 10:34
@jcsp jcsp requested review from petuhovskiy and arssher July 11, 2024 10:34
@jcsp (Contributor, Author) commented Jul 12, 2024

About AncestorRefs: why is it not a more physical refcount map, like on the absolute_key? Sure, a bit less memory efficient but much harder to screw up.

Memory efficiency is relatively important here, because we build the AncestorRefs for all timelines in an entire region.

Similarly, I don't think we should go through the trouble of only tracking the cross-ancestor references.

Same as above: because we're building this map globally, it's important that we limit the scope of what we store in memory.

Could we rework this to avoid building the monolithic ancestor refs map? Yes, but at the cost of having to first build a global list of tenant shards and then go back and do the per-tenant work (with a smaller AncestorRefs scoped to one tenant). It's a tradeoff: for this PR, I prefer to take the memory cost of listing all ancestor refs over the memory cost of storing a total list of all tenants (100ks of them).

I wrote this avoiding the assumption that listings are reliably returned in order: S3 kinda-sorta promises this, but I have limited trust in third-party implementations, which might offer a more readdir-like lack of ordering.
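
To make the tradeoff concrete, here's a rough sketch of the two shapes under discussion; the type and field names are illustrative assumptions, not the actual AncestorRefs definition:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical identifiers standing in for the real pageserver types.
type TenantShardId = String; // e.g. "<tenant id>-0102"
type LayerName = String; // layer file name as it appears in an index

// Reviewer's suggestion: a refcount on every referenced object's absolute key.
// Simple to reason about, but it holds one entry per referenced layer of every
// shard in the region, most of which are ordinary self-references.
#[allow(dead_code)]
type AbsoluteKeyRefcounts = HashMap<String, usize>;

// What the PR argues for: only remember cross-shard references, i.e. layers
// that live under an ancestor shard's prefix but are still referenced by some
// child shard's index. Everything else never needs to be held in memory.
#[derive(Default)]
struct AncestorRefsSketch {
    referenced: HashSet<(TenantShardId, LayerName)>,
}

impl AncestorRefsSketch {
    fn note_reference(&mut self, ancestor: TenantShardId, layer: LayerName) {
        self.referenced.insert((ancestor, layer));
    }

    fn is_referenced(&self, ancestor: &str, layer: &str) -> bool {
        self.referenced
            .contains(&(ancestor.to_string(), layer.to_string()))
    }
}

fn main() {
    let mut refs = AncestorRefsSketch::default();
    refs.note_reference("tenant-0002".to_string(), "layer-A".to_string());
    assert!(refs.is_referenced("tenant-0002", "layer-A"));
    assert!(!refs.is_referenced("tenant-0102", "layer-A"));
}
```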

@jcsp (Contributor, Author) commented Jul 12, 2024

I would prefer we call it parent shard, so at least there is a minimal distinction from the word ancestor which we already use for Timeline.

The use of "ancestor" is intentional, because a shard can reference layers that are from it's parent's parents. It's annoying that we have an "ancestor shard" and "ancestor timeline" concept, but I think it's the right word. We could call them "forebears" or some other synonym but I think that would be worse :-)

@problame (Contributor) commented

Could we rework this to avoid building the monolithic ancestor refs map? Yes, but at the cost of having to first build global list of tenant shards, then go back and do the per-tenant work (with a smaller AncestorRefs scoped to one tenant)

If we're going to integrate the scrubber with storcon anyway, then we might as well avoid all of the "discover tenants/shards through ListObjectsV2" and instead simply ask storcon for a list of tenants to scrub. That list would include a consistent view of which shards exist.

@petuhovskiy petuhovskiy removed their request for review July 18, 2024 11:34
@problame (Contributor) left a comment

LGTM, sorry for the delay

Review threads (resolved): storage_scrubber/src/pageserver_physical_gc.rs
@jcsp jcsp enabled auto-merge (squash) July 19, 2024 14:43
@jcsp jcsp merged commit 4478151 into main Jul 19, 2024
66 checks passed
@jcsp jcsp deleted the jcsp/issue-7043-ancestral-gc branch July 19, 2024 16:08
problame pushed a commit that referenced this pull request Jul 22, 2024
yliang412 added a commit that referenced this pull request Jul 30, 2024
Part of #8128, followed by #8502.

## Problem

Currently we lack a mechanism to alert on an unhealthy `scan_metadata` status
if we start running this scrubber command as part of a cronjob. With the
storage controller client introduced to the storage scrubber in #8196, it is
viable to set up alerting by storing health status in the storage controller
database.

We intentionally do not store the full output in the database, as the JSON
blobs could make the table really huge. Instead, we store only a health status
and a timestamp recording the last time a metadata health status was posted
for a tenant shard.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
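
As a rough illustration of the record described above (the struct and field names are assumptions for the sketch, not the actual storage controller schema), the per-tenant-shard entry could be as small as:

```rust
// Hypothetical sketch of the per-tenant-shard metadata health record the
// commit message describes: a status flag and a "last posted" timestamp,
// rather than the full scan_metadata JSON output.
use std::time::SystemTime;

#[derive(Debug, Clone)]
struct MetadataHealthRecord {
    tenant_shard_id: String,      // e.g. "<tenant id>-0004"
    healthy: bool,                // outcome of the most recent scan_metadata run
    last_scrubbed_at: SystemTime, // when that status was last posted
}

fn main() {
    let record = MetadataHealthRecord {
        tenant_shard_id: "example-tenant-0004".to_string(),
        healthy: true,
        last_scrubbed_at: SystemTime::now(),
    };
    println!("{record:?}");
}
```
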
arpad-m pushed a commit that referenced this pull request Aug 5, 2024