pageserver: improved synthetic size & find_gc_cutoff error handling #8051

Merged
jcsp merged 3 commits into main from jcsp/synthetic-size-errors
Jun 14, 2024


@jcsp jcsp commented Jun 13, 2024

Problem

This PR refactors some error handling to avoid log spam on tenant/timeline shutdown.

Closes: #8012

Summary of changes

  • Refactor: Add a PageReconstructError variant to GcError: this is the only kind of error that find_gc_cutoffs can emit.
  • Functional change: ignore only the shutdown variant of PageReconstructError; treat every other variant as a real error.
  • Refactor: add a structured CalculateSyntheticSizeError type and use it instead of anyhow::Error in synthetic size calculations
  • Functional change: while iterating through timelines gathering logical sizes, only drop out if the whole tenant is cancelled: individual timeline cancellations indicate deletion in progress and we can just ignore those.
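The two functional changes above can be sketched roughly as follows. This is a minimal illustration, not the pageserver code: the variant names (`Cancelled`, `Other`, `PageReconstruct`, `Fatal`), the `CutoffOutcome` and `TimelineSizeError` types, and both helper functions are hypothetical stand-ins, and LSNs and sizes are reduced to plain `u64` values.

```rust
// Hypothetical, simplified stand-ins for the types named in this PR.
#[derive(Debug, PartialEq)]
enum PageReconstructError {
    /// The timeline or tenant is shutting down.
    Cancelled,
    /// Any other failure (illustrative catch-all).
    Other(String),
}

#[derive(Debug, PartialEq)]
enum GcError {
    /// Assumed variant name: wraps the error emitted by the cutoff lookup.
    PageReconstruct(PageReconstructError),
}

#[derive(Debug, PartialEq)]
enum CutoffOutcome {
    Found(u64),
    /// Shutdown in progress: skipped quietly rather than reported as an error.
    SkippedShutdown,
}

/// Only the shutdown variant is ignored; every other variant is a real error.
fn handle_cutoff_result(
    result: Result<u64, PageReconstructError>,
) -> Result<CutoffOutcome, GcError> {
    match result {
        Ok(cutoff) => Ok(CutoffOutcome::Found(cutoff)),
        Err(PageReconstructError::Cancelled) => Ok(CutoffOutcome::SkippedShutdown),
        Err(other) => Err(GcError::PageReconstruct(other)),
    }
}

/// Hypothetical structured error replacing a catch-all error type.
#[derive(Debug, PartialEq)]
enum CalculateSyntheticSizeError {
    Cancelled,
    Fatal(String),
}

/// Hypothetical per-timeline failure while gathering a logical size.
#[derive(Debug, PartialEq)]
enum TimelineSizeError {
    Cancelled,
    Other(String),
}

/// Drop out only when the whole tenant is cancelled; a cancelled individual
/// timeline is assumed to be mid-deletion and is skipped.
fn gather_logical_sizes(
    tenant_cancelled: bool,
    per_timeline: Vec<Result<u64, TimelineSizeError>>,
) -> Result<Vec<u64>, CalculateSyntheticSizeError> {
    let mut sizes = Vec::new();
    for result in per_timeline {
        match result {
            Ok(size) => sizes.push(size),
            Err(TimelineSizeError::Cancelled) if tenant_cancelled => {
                return Err(CalculateSyntheticSizeError::Cancelled);
            }
            Err(TimelineSizeError::Cancelled) => continue,
            Err(TimelineSizeError::Other(msg)) => {
                return Err(CalculateSyntheticSizeError::Fatal(msg));
            }
        }
    }
    Ok(sizes)
}
```

Matching on specific variants, instead of treating every failure as shutdown noise, is what lets unexpected errors surface while shutdown stays quiet.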

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Jun 13, 2024
@jcsp jcsp changed the title Jcsp/synthetic size errors pageserver: improved synthetic size & find_gc_cutoff error handling Jun 13, 2024
3216 tests run: 3074 passed, 0 failed, 142 skipped (full report)


Code coverage* (full report)

  • functions: 31.5% (6636 of 21065 functions)
  • lines: 48.6% (51637 of 106275 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
12d5128 at 2024-06-13T19:22:14.109Z :recycle:

@jcsp jcsp marked this pull request as ready for review June 13, 2024 19:58
@jcsp jcsp requested a review from a team as a code owner June 13, 2024 19:58
@jcsp jcsp requested a review from arpad-m June 13, 2024 19:58
@jcsp jcsp merged commit eb0ca9b into main Jun 14, 2024
68 checks passed
@jcsp jcsp deleted the jcsp/synthetic-size-errors branch June 14, 2024 10:08
jcsp added a commit that referenced this pull request Jun 18, 2024
…_metric_collection` flake) (#8065)

## Problem

```
ERROR synthetic_size_worker: failed to calculate synthetic size for tenant ae449af30216ac56d2c1173f894b1122: Could not find size at 0/218CA70 in timeline d8da32b5e3e0bf18cfdb560f9de29638
```

e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9518948590/index.html#/testresult/30a6d1e2471d2775

This test had allow lists but was disrupted by
#8051. In that PR, I had kept
an error path in fill_logical_sizes that covered the case where we
couldn't find sizes for some of the segments, but that path could only
be hit in the case that some Timeline was shut down concurrently with a
synthetic size calculation, so it makes sense to just leave the
segment's size None in this case: the subsequent size calculations do
not assume it is Some.

## Summary of changes

- Remove `CalculateSyntheticSizeError::LsnNotFound` and just proceed in
the case where we used to return it
- Remove defunct allow list entries in `test_metric_collection`
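
As a rough sketch of the idea in this commit (not the actual `fill_logical_sizes` code), a missing size can be represented as `None` instead of an error. The `Segment` struct, the `fill_sizes` and `total_known_size` helpers, and the use of plain `u64` values for LSNs are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified segment: a point on a timeline whose size may be unknown.
#[derive(Debug, PartialEq)]
struct Segment {
    timeline: &'static str,
    lsn: u64,
    size: Option<u64>,
}

/// Fill in the sizes that were found. A missing entry (for example because the
/// timeline was shut down concurrently) leaves `size` as `None` instead of failing.
fn fill_sizes(segments: &mut [Segment], known: &HashMap<(&'static str, u64), u64>) {
    for seg in segments.iter_mut() {
        seg.size = known.get(&(seg.timeline, seg.lsn)).copied();
    }
}

/// Illustrative downstream step that does not assume `size` is `Some`:
/// segments with an unknown size are skipped.
fn total_known_size(segments: &[Segment]) -> u64 {
    segments.iter().filter_map(|seg| seg.size).sum()
}
```

Because later steps already tolerate `None`, no dedicated error variant is needed for the missing-size case.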
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver
Development

Successfully merging this pull request may close these issues.

ignoring failure to find gc cutoffs: timeline shutting down should be info! level
2 participants