docs: rolling storage controller restarts RFC #8310

Open
wants to merge 4 commits into
base: main


VladLazar
Contributor

@VladLazar VladLazar commented Jul 8, 2024

Problem

Storage controller upgrades (restarts, more generally) can cause multi-second availability gaps.
While the storage controller does not sit on the main data path, it's generally not acceptable
to block management requests for extended periods of time (e.g. #8034).

Summary of changes

This RFC describes the issues with the current storage controller restart procedure
and proposes an implementation that reduces downtime to a few milliseconds on the happy path.

Related #7797

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires a public announcement, mark it with the /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@VladLazar VladLazar requested review from jcsp and yliang412 July 8, 2024 12:08

github-actions bot commented Jul 8, 2024

No tests were run or test report is not available

Test coverage report is not available

The comment gets automatically updated with the latest test results
91b4533 at 2024-07-24T13:29:33.042Z :recycle:

@VladLazar VladLazar requested a review from jcsp July 10, 2024 08:05
@jcsp
Contributor

jcsp commented Jul 12, 2024

Thinking of extra safety measures: we might in future like to carry an HTTP header on controller requests to pageservers, which would change for new leaders, so that pageservers could refuse requests from stale leaders. Might be worth embedding some counter in the leader table for that purpose.
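
To make the idea above concrete, here is a minimal sketch of how a pageserver might fence off stale leaders using a counter carried in a request header. Everything named here is an assumption for illustration: the header name `X-Storcon-Leader-Term`, the `PageserverFence` class, and the choice to refuse requests that carry no counter at all. None of it comes from the RFC or the existing code.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical header name; the comment above does not specify one.
LEADER_TERM_HEADER = "X-Storcon-Leader-Term"


@dataclass
class PageserverFence:
    """Tracks the highest leader counter this pageserver has seen."""

    highest_term_seen: int = 0

    def admit(self, headers: dict[str, str]) -> bool:
        """Return True if the request comes from the current or a newer leader."""
        raw = headers.get(LEADER_TERM_HEADER)
        if raw is None:
            # Assumption: requests without a counter are refused.
            return False
        try:
            term = int(raw)
        except ValueError:
            return False
        if term < self.highest_term_seen:
            # A newer leader has already been observed: this sender is stale.
            return False
        self.highest_term_seen = term
        return True


fence = PageserverFence()
old_leader = {LEADER_TERM_HEADER: "1"}
new_leader = {LEADER_TERM_HEADER: "2"}

# The old leader is admitted until the new leader is seen, then refused.
results = [fence.admit(old_leader), fence.admit(new_leader), fence.admit(old_leader)]
```

In this sketch the counter would be the value stored in the leader table and bumped on every leadership change, so a pageserver only needs to remember a single integer.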

VladLazar added a commit that referenced this pull request Jul 26, 2024
## Problem
We are missing the step-down primitive required to implement rolling
restarts of the storage controller.

## Summary of changes
Add a `/control/v1/step_down` endpoint which puts the storage controller
into a state where it rejects all API requests apart from
`/control/v1/step_down`, `/status` and `/metrics`. On receiving the
request, the storage controller cancels all pending reconciles and waits
for them to exit gracefully. The response contains a snapshot of the
in-memory observed state.
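
The behaviour described above can be sketched as a toy model. This is not the storage controller's implementation: the `Controller` class, the `/v1/tenant` path, the 503 status for rejected requests, and the shape of the observed state are all assumptions made for illustration. Only the three allowed paths are taken from the commit message.

```python
import asyncio

# Paths that stay reachable after step-down, per the commit message.
ALLOWED_AFTER_STEP_DOWN = {"/control/v1/step_down", "/status", "/metrics"}


class Controller:
    """Toy model of the step-down behaviour; not the real storage controller."""

    def __init__(self):
        self.stepped_down = False
        self.reconciles = set()  # in-flight reconcile tasks
        self.observed = {}       # hypothetical in-memory observed state

    def spawn_reconcile(self, tenant_shard, seconds):
        async def reconcile():
            try:
                await asyncio.sleep(seconds)  # stand-in for real work
                self.observed[tenant_shard] = "reconciled"
            except asyncio.CancelledError:
                # Graceful exit: record the outcome instead of propagating.
                self.observed[tenant_shard] = "cancelled"

        self.reconciles.add(asyncio.create_task(reconcile()))

    async def step_down(self):
        self.stepped_down = True
        for task in self.reconciles:
            task.cancel()
        # Wait for every reconcile to finish unwinding before responding.
        await asyncio.gather(*self.reconciles, return_exceptions=True)
        self.reconciles.clear()
        return dict(self.observed)  # snapshot of the observed state

    async def handle(self, path):
        if self.stepped_down and path not in ALLOWED_AFTER_STEP_DOWN:
            return 503, None  # status code is an assumption
        if path == "/control/v1/step_down":
            return 200, await self.step_down()
        return 200, None


async def demo():
    controller = Controller()
    controller.spawn_reconcile("tenant-a", 60)
    await asyncio.sleep(0)  # let the reconcile start running
    before = await controller.handle("/v1/tenant")  # hypothetical API path
    stepped = await controller.handle("/control/v1/step_down")
    after = await controller.handle("/v1/tenant")
    status = await controller.handle("/status")
    return before, stepped, after, status


before, stepped, after, status = asyncio.run(demo())
```

Here the ordinary request succeeds before step-down and is rejected afterwards, while `/status` keeps working, which is what would let an incoming instance or an operator still probe the old one.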

Related:
* neondatabase/cloud#14701
* #7797
* #8310

3 participants