docs: rolling storage controller restarts RFC #8310

VladLazar · 2024-07-08T12:03:49Z

Problem

Storage controller upgrades (restarts, more generally) can cause multi-second availability gaps.
While the storage controller does not sit on the main data path, it's generally not acceptable
to block management requests for extended periods of time (e.g. #8034).

Summary of changes

This RFC describes the issues with the current storage controller restart procedure
and proposes an implementation that reduces downtime to a few milliseconds on the happy path.

Related #7797

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
If this PR requires a public announcement, mark it with the /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat the commit message so that it does not include the above checklist.

github-actions · 2024-07-08T12:57:34Z

No tests were run or test report is not available

Test coverage report is not available

The comment gets automatically updated with the latest test results.
91b4533 at 2024-07-24T13:29:33.042Z :recycle:

docs/rfcs/034-storage-controller-restarts.md

jcsp · 2024-07-12T15:20:43Z

Thinking of extra safety measures: we might in future like to carry an HTTP header on controller requests to pageservers, which would change for new leaders, so that pageservers could refuse requests from stale leaders. Might be worth embedding some counter in the leader table for that purpose.
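
A minimal sketch of the fencing idea in the comment above, assuming the counter in the leader table is a monotonically increasing integer "term" and that it travels in a header. The header name `X-Storcon-Leader-Term`, the `PageserverFence` class, and the choice to reject requests that lack the header are all hypothetical, since none of them are specified in this thread.

```python
from dataclasses import dataclass

# Hypothetical header name; the discussion does not specify one.
LEADER_TERM_HEADER = "X-Storcon-Leader-Term"


@dataclass
class PageserverFence:
    """Tracks the highest leader term a pageserver has seen (illustrative only)."""

    highest_term_seen: int = 0

    def admit(self, headers: dict) -> bool:
        """Return True if the request may proceed, False if the sender looks stale."""
        raw = headers.get(LEADER_TERM_HEADER)
        if raw is None:
            # Assumption: once fencing is enabled, requests without a term are refused.
            return False
        try:
            term = int(raw)
        except ValueError:
            return False
        if term < self.highest_term_seen:
            # A newer leader has already contacted this pageserver.
            return False
        self.highest_term_seen = term
        return True
```

With this shape, a new leader only needs to bump the counter when it takes over; any request still in flight from the previous leader then carries a lower term and is refused.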

## Problem

We are missing the step-down primitive required to implement rolling restarts of the storage controller.

## Summary of changes

Add `/control/v1/step_down` endpoint which puts the storage controller into a state where it rejects all API requests apart from `/control/v1/step_down`, `/status` and `/metrics`. When receiving the request, storage controller cancels all pending reconciles and waits for them to exit gracefully. The response contains a snapshot of the in-memory observed state.

Related:

* neondatabase/cloud#14701
* #7797
* #8310
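
The step-down behaviour described above can be modelled roughly as follows. This is a toy sketch, not the storage controller's code: the `Controller` class, the use of `asyncio` tasks to stand in for reconciles, the dictionary shape of the observed state, and the 503 status for rejected requests are all assumptions made for illustration.

```python
import asyncio

# Paths that remain reachable after stepping down, as listed in the description.
ALLOWED_AFTER_STEP_DOWN = {"/control/v1/step_down", "/status", "/metrics"}


class Controller:
    """Toy model of the step-down primitive (hypothetical names throughout)."""

    def __init__(self):
        self.stepped_down = False
        self.reconciles = []      # in-flight asyncio tasks standing in for reconciles
        self.observed_state = {}  # assumed shape: tenant shard id -> location

    def check_request(self, path):
        """Return an HTTP-like status code for a request to `path`."""
        if self.stepped_down and path not in ALLOWED_AFTER_STEP_DOWN:
            return 503  # assumed status code for a stepped-down controller
        return 200

    async def step_down(self):
        """Stop serving, cancel reconciles, and return a snapshot of observed state."""
        self.stepped_down = True
        for task in self.reconciles:
            task.cancel()
        # Wait for every cancelled reconcile to finish unwinding before replying.
        await asyncio.gather(*self.reconciles, return_exceptions=True)
        self.reconciles.clear()
        return dict(self.observed_state)
```

Setting the flag before cancelling means no new work is admitted while the existing reconciles are winding down, so the returned snapshot is not invalidated by a request that slipped in afterwards.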

docs: rolling storage controller restarts RFC

33f00b2

VladLazar requested review from jcsp and yliang412 July 8, 2024 12:08

jcsp reviewed Jul 8, 2024

yliang412 reviewed Jul 9, 2024

yliang412 reviewed Jul 9, 2024

review: clarify leader table UPDATEs

9a353ed

VladLazar requested a review from jcsp July 10, 2024 08:05

jcsp approved these changes Jul 10, 2024

review: more clarity on proposed start-up sequence

503692f

VladLazar commented Jul 12, 2024

review: safety ideas section

91b4533

VladLazar mentioned this pull request Jul 25, 2024

storcon: introduce step down primitive #8512

Merged

VladLazar mentioned this pull request Aug 2, 2024

storcon: implement graceful leadership transfer #8588

Merged
