Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: add GcError type #7917

Merged
merged 3 commits into from
May 31, 2024
Merged

pageserver: add GcError type #7917

merged 3 commits into from
May 31, 2024

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented May 30, 2024

Problem

Gc failed 1 times, retrying in 2s: shutting down

GC really doesn't do a lot of complicated IO: it doesn't benefit from the backtrace capabilities of anyhow::Error, and can be expressed more robustly as an enum.

Summary of changes

  • Add GcError type and use it instead of anyhow::Error in GC functions
  • In gc_iteration_internal, return GcError::Cancelled on shutdown rather than Ok(()) (we only used Ok before because we didn't have a clear cancellation error variant to use).
  • In gc_iteration_internal, skip past timelines that are shutting down, to avoid having to go through another GC iteration if we happen to see a deleting timeline during a GC run.
  • In refresh_gc_info_internal, avoid an error case where a timeline might not be found after being looked up, by carrying an Arc instead of a TimelineId between the first loop and second loop in the function.
  • In HTTP request handler, handle Cancelled variants as 503 instead of turning all GC errors into 500s.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

Copy link

github-actions bot commented May 30, 2024

3156 tests run: 3017 passed, 0 failed, 139 skipped (full report)


Code coverage* (full report)

  • functions: 31.4% (6505 of 20711 functions)
  • lines: 48.3% (50226 of 103898 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
950e5d2 at 2024-05-31T17:05:26.139Z :recycle:

@jcsp jcsp marked this pull request as ready for review May 31, 2024 15:49
@jcsp jcsp requested a review from a team as a code owner May 31, 2024 15:49
@jcsp jcsp requested a review from arpad-m May 31, 2024 15:49
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
@jcsp jcsp requested a review from arpad-m May 31, 2024 16:18
@jcsp jcsp merged commit 7e60563 into main May 31, 2024
64 checks passed
@jcsp jcsp deleted the jcsp/typed-gc-error branch May 31, 2024 21:20
a-masterov pushed a commit that referenced this pull request Jun 3, 2024
## Problem

- Because GC exposes all errors as an anyhow::Error, we have
intermittent issues with spurious log errors during shutdown, e.g. in
this failure of a performance test
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9300804302/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/214a2154f6f0217a/

```
Gc failed 1 times, retrying in 2s: shutting down
```

GC really doesn't do a lot of complicated IO: it doesn't benefit from
the backtrace capabilities of anyhow::Error, and can be expressed more
robustly as an enum.

## Summary of changes

- Add GcError type and use it instead of anyhow::Error in GC functions
- In `gc_iteration_internal`, return GcError::Cancelled on shutdown
rather than Ok(()) (we only used Ok before because we didn't have a
clear cancellation error variant to use).
- In `gc_iteration_internal`, skip past timelines that are shutting
down, to avoid having to go through another GC iteration if we happen to
see a deleting timeline during a GC run.
- In `refresh_gc_info_internal`, avoid an error case where a timeline
might not be found after being looked up, by carrying an Arc<Timeline>
instead of a TimelineId between the first loop and second loop in the
function.
- In HTTP request handler, handle Cancelled variants as 503 instead of
turning all GC errors into 500s.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants