Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: circuit breaker on compaction #8359

Merged
merged 3 commits into from
Jul 12, 2024

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Jul 11, 2024

Problem

We already back off on compaction retries, but the impact of a failing compaction can be so great that backing off up to 300s isn't enough. The impact is consuming a lot of I/O+CPU in the case of image layer generation for large tenants, and potentially also leaking disk space.

Compaction failures are extremely rare and almost always indicate a bug, frequently a bug that will not let compaction to proceed until it is fixed.

Related: #6738

Summary of changes

  • Introduce a CircuitBreaker type
  • Add a circuit breaker for compaction, with a policy that after 5 failures, compaction will not be attempted again for 24 hours.
  • Add metrics that we can alert on: any >0 value for pageserver_circuit_breaker_broken_total should generate an alert.
  • Add a test that checks this works as intended.

Couple notes to reviewers:

  • Circuit breakers are intrinsically a defense-in-depth measure: this is not the solution to any underlying issues, it is just a general mitigation for "unknown unknowns" that might be encountered in future.
  • This PR isn't primarily about writing a perfect CircuitBreaker type: the one in this PR is meant to be just enough to mitigate issues in compaction, and make it easy to monitor/alert on these failures. We can refine this type in future as/when we want to use it elsewhere.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Jul 11, 2024
@jcsp jcsp force-pushed the jcsp/pageserver-compaction-circuit-breaker branch from 653a806 to 981a288 Compare July 11, 2024 16:25
Copy link

3054 tests run: 2939 passed, 0 failed, 115 skipped (full report)


Code coverage* (full report)

  • functions: 32.7% (6972 of 21322 functions)
  • lines: 50.1% (54804 of 109498 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
981a288 at 2024-07-11T17:16:29.814Z :recycle:

@jcsp jcsp requested a review from skyzh July 11, 2024 18:25
@jcsp jcsp marked this pull request as ready for review July 11, 2024 18:25
@jcsp jcsp requested a review from a team as a code owner July 11, 2024 18:25
Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this prevents failing compaction corrupting/lagging the system, it does not solve the fundamental issue that compaction should clean up resources after failed attempts (#8362). I will come up with more fixes.

Meanwhile, I assume we have set up alerts for pageserver error logs?

pageserver/src/tenant.rs Show resolved Hide resolved
@jcsp jcsp merged commit 0645ae3 into main Jul 12, 2024
66 checks passed
@jcsp jcsp deleted the jcsp/pageserver-compaction-circuit-breaker branch July 12, 2024 11:04
skyzh pushed a commit that referenced this pull request Jul 15, 2024
## Problem

We already back off on compaction retries, but the impact of a failing
compaction can be so great that backing off up to 300s isn't enough. The
impact is consuming a lot of I/O+CPU in the case of image layer
generation for large tenants, and potentially also leaking disk space.

Compaction failures are extremely rare and almost always indicate a bug,
frequently a bug that will not let compaction to proceed until it is
fixed.

Related: #6738

## Summary of changes

- Introduce a CircuitBreaker type
- Add a circuit breaker for compaction, with a policy that after 5
failures, compaction will not be attempted again for 24 hours.
- Add metrics that we can alert on: any >0 value for
`pageserver_circuit_breaker_broken_total` should generate an alert.
- Add a test that checks this works as intended.

Couple notes to reviewers:
- Circuit breakers are intrinsically a defense-in-depth measure: this is
not the solution to any underlying issues, it is just a general
mitigation for "unknown unknowns" that might be encountered in future.
- This PR isn't primarily about writing a perfect CircuitBreaker type:
the one in this PR is meant to be just enough to mitigate issues in
compaction, and make it easy to monitor/alert on these failures. We can
refine this type in future as/when we want to use it elsewhere.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants