
docs: Graceful storage controller cluster restarts RFC #7704

Merged
6 commits merged into main from vlad/graceful-storcon-cluster-restarts-rfc on Jul 1, 2024

Conversation

VladLazar (Contributor) commented on May 10, 2024

RFC for "Graceful Restarts of Storage Controller Managed Clusters".

POC which implements nearly everything mentioned here apart from the optimizations
and some of the failure handling: #7682

Related #7387

github-actions bot commented on May 10, 2024

2946 tests run: 2829 passed, 0 failed, 117 skipped (full report)


Code coverage* (full report)

  • functions: 32.7% (6910 of 21129 functions)
  • lines: 50.1% (54195 of 108189 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
8faa437 at 2024-07-01T09:46:16.764Z

VladLazar requested review from jcsp and arpad-m on May 13, 2024 at 09:28
VladLazar marked this pull request as ready for review on May 13, 2024 at 09:29
VladLazar requested a review from jcsp on May 13, 2024 at 17:48
arpad-m (Member) left a comment


Great and thoughtful RFC.

It might also make sense to mention that draining is a no-op for tenants that lack secondaries, i.e. non-HA ones. It makes sense to keep this as a TODO in the implementation but ideally the RFC would mention it as well.

Also, I wonder what to do if the secondary of a to-be-drained tenant is on an unresponsive pageserver. Do we want to drain to that secondary? Maybe the answer is still yes, since generations will take care of it, but we should be robust to that in the implementation, so that the draining process doesn't stall indefinitely while waiting for an okay from an unresponsive pageserver.

VladLazar (Contributor, Author) commented

> It might also make sense to mention that draining is a no-op for tenants that lack secondaries, i.e. non-HA ones. It makes sense to keep this as a TODO in the implementation but ideally the RFC would mention it as well.

Right, in theory we can cater to non-HA tenants as well by changing their attached location. Depending on tenant size, this might be more disruptive than the restart itself, since the pageserver we've moved to will need to on-demand download the entire working set for the tenant. I can add this as a non-goal to the RFC.

> Also, I wonder what to do if the secondary of a to-be-drained tenant is on an unresponsive pageserver. Do we want to drain to that secondary? Maybe the answer is still yes, since generations will take care of it, but we should be robust to that in the implementation, so that the draining process doesn't stall indefinitely while waiting for an okay from an unresponsive pageserver.

This is basically the same scenario as previously: if the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node. It's a good point though; I think it's a good idea to make sure the node is online before explicitly setting the attached location. That saves us a reconcile.
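
A minimal sketch of what that pre-flight check (plus the non-HA no-op) could look like in a single drain step, using illustrative stand-in types rather than the storage controller's real ones:

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the storage controller's real types.
type NodeId = u64;
struct Node { online: bool }
struct TenantShard { attached: NodeId, secondary: Option<NodeId> }

enum DrainOutcome {
    Skipped,          // non-HA tenant: no secondary to cut over to
    Deferred(NodeId), // destination pageserver unavailable; don't stall on it
    CutOver(NodeId),  // attached location moved to the secondary
}

// One drain step for a single shard: a no-op for tenants without a
// secondary, and a deferral (rather than an indefinite wait) when the
// destination pageserver is not known to be online.
fn drain_one(shard: &mut TenantShard, nodes: &HashMap<NodeId, Node>) -> DrainOutcome {
    let Some(secondary) = shard.secondary else {
        return DrainOutcome::Skipped;
    };
    match nodes.get(&secondary) {
        Some(node) if node.online => {
            shard.attached = secondary;
            DrainOutcome::CutOver(secondary)
        }
        _ => DrainOutcome::Deferred(secondary),
    }
}
```

Checking availability up front saves the doomed reconcile mentioned above, while still letting generations resolve any races if the check turns out to be stale.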

arpad-m (Member) commented on May 15, 2024

> If the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node.

Yeah, my point was mainly about that: there is a difference between retrying indefinitely and retrying but giving up at some point. We need to implement the latter because of the scenario I mentioned above.
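
Concretely, that means a bounded retry around the reconcile rather than an unbounded one; a minimal sketch with placeholder names (the real reconcile machinery is async and more involved):

```rust
use std::{thread, time::Duration};

// Retry a fallible reconcile-like operation a bounded number of times,
// then give up so the drain can move on to the next tenant instead of
// stalling indefinitely on an unresponsive pageserver.
fn retry_with_limit<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
    backoff: Duration,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                attempt += 1;
                thread::sleep(backoff);
            }
        }
    }
}
```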

VladLazar added a commit that referenced this pull request Jun 19, 2024
…r restarts (#8014)

## Problem
Pageserver restarts cause read availability downtime for tenants. See
`Motivation` section in the
[RFC](#7704).

## Summary of changes
* Introduce a new `NodeSchedulingPolicy`: `PauseForRestart`
* Implement the first take of drain and fill algorithms
* Add a node status endpoint which can be polled to figure out when an
operation is done

The implementation follows the RFC, so it might be useful to peek at it
as you're reviewing.
Since the PR is rather chunky, I've made sure all commits build (with
warnings), so you can
review by commit if you prefer that.

RFC: #7704
Related #7387
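
For orientation, a rough sketch of the shape of those changes; everything here other than the `PauseForRestart` variant name is an assumption rather than the exact types or endpoint from the PR:

```rust
// Scheduling states a pageserver node can be in from the storage
// controller's point of view; `PauseForRestart` is the new variant for
// graceful restarts. The surrounding variants are illustrative and may
// not match the PR exactly.
#[derive(PartialEq)]
enum NodeSchedulingPolicy {
    Active,
    Draining,
    Filling,
    PauseForRestart,
}

// Illustrative shape of what the polled node status endpoint might
// return; the real response fields are an assumption here.
struct NodeStatus {
    policy: NodeSchedulingPolicy,
}

// A restart orchestrator could poll the status endpoint until the node
// reports it has finished draining (here signalled by reaching
// `PauseForRestart`), with a cap so it never waits forever.
fn wait_for_drain(poll: impl Fn() -> NodeStatus, max_polls: u32) -> bool {
    for _ in 0..max_polls {
        if poll().policy == NodeSchedulingPolicy::PauseForRestart {
            return true;
        }
        std::thread::sleep(std::time::Duration::from_secs(1));
    }
    false
}
```

An external restart orchestrator would set the node to drain, poll the status endpoint as above, restart the pageserver, and then let the fill run before moving on to the next node.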
VladLazar merged commit 9882ac8 into main on Jul 1, 2024
64 checks passed
VladLazar deleted the vlad/graceful-storcon-cluster-restarts-rfc branch on July 1, 2024 at 17:44
VladLazar added 6 commits that referenced this pull request on Jul 8, 2024:
RFC for "Graceful Restarts of Storage Controller Managed Clusters".
Related #7387