
docs: Graceful storage controller cluster restarts RFC #7704

Merged
6 commits merged into main from vlad/graceful-storcon-cluster-restarts-rfc on Jul 1, 2024

Conversation

VladLazar (Contributor) commented on May 10, 2024

RFC for "Graceful Restarts of Storage Controller Managed Clusters".

POC which implements nearly everything mentioned here apart from the optimizations
and some of the failure handling: #7682

Related #7387

github-actions bot commented on May 10, 2024

2946 tests run: 2829 passed, 0 failed, 117 skipped (full report)


Code coverage* (full report)

  • functions: 32.7% (6910 of 21129 functions)
  • lines: 50.1% (54195 of 108189 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
8faa437 at 2024-07-01T09:46:16.764Z

VladLazar requested review from jcsp and arpad-m on May 13, 2024 at 09:28
VladLazar marked this pull request as ready for review on May 13, 2024 at 09:29
VladLazar requested a review from jcsp on May 13, 2024 at 17:48
arpad-m (Member) left a comment


Great and thoughtful RFC.

It might also make sense to mention that draining is a no-op for tenants that lack secondaries, i.e. non-HA ones. It makes sense to keep this as a TODO in the implementation but ideally the RFC would mention it as well.

Also, I wonder what to do if the secondary of a to-be-drained tenant is on an unresponsive pageserver. Do we want to drain to that secondary? Maybe the answer is still yes, since generations will take care of it, but we should be robust to that in the implementation, so that the draining process doesn't stall indefinitely while waiting for an okay from an unresponsive pageserver.

VladLazar (Contributor, Author) commented

> It might also make sense to mention that draining is a no-op for tenants that lack secondaries, i.e. non-HA ones. It makes sense to keep this as a TODO in the implementation but ideally the RFC would mention it as well.

Right, in theory we can cater to non-HA tenants as well by changing their attached location. Depending on tenant size, this might be more disruptive than the restart itself, since the pageserver we've moved to will need to on-demand download the entire working set for the tenant. I can add this as a non-goal to the RFC.

> Also, I wonder what to do if the secondary of a to-be-drained tenant is on an unresponsive pageserver. Do we want to drain to that secondary? Maybe the answer is still yes, since generations will take care of it, but we should be robust to that in the implementation, so that the draining process doesn't stall indefinitely while waiting for an okay from an unresponsive pageserver.

This is basically the same scenario as previously: if the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node. It's a good point though; I think it's a good idea to make sure the node is online before explicitly setting the attached location. That saves us a reconcile.
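
A minimal sketch of what that pre-flight check (plus the non-HA no-op) could look like in a single drain step, using illustrative stand-in types rather than the storage controller's real ones:

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the storage controller's real types.
type NodeId = u64;
struct Node { online: bool }
struct TenantShard { attached: NodeId, secondary: Option<NodeId> }

enum DrainOutcome {
    Skipped,          // non-HA tenant: no secondary to cut over to
    Deferred(NodeId), // destination pageserver unavailable; don't stall on it
    CutOver(NodeId),  // attached location moved to the secondary
}

// One drain step for a single shard: a no-op for tenants without a
// secondary, and a deferral (rather than an indefinite wait) when the
// destination pageserver is not known to be online.
fn drain_one(shard: &mut TenantShard, nodes: &HashMap<NodeId, Node>) -> DrainOutcome {
    let Some(secondary) = shard.secondary else {
        return DrainOutcome::Skipped;
    };
    match nodes.get(&secondary) {
        Some(node) if node.online => {
            shard.attached = secondary;
            DrainOutcome::CutOver(secondary)
        }
        _ => DrainOutcome::Deferred(secondary),
    }
}
```

Checking availability up front saves the doomed reconcile mentioned above, while still letting generations resolve any races if the check turns out to be stale.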

arpad-m (Member) commented on May 15, 2024

> If the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node.

Yeah, my point was mainly about that: there is a difference between retrying indefinitely and retrying but giving up at some point. We need to implement the latter because of the scenario I mentioned above.
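
Concretely, that means a bounded retry around the reconcile rather than an unbounded one; a minimal sketch with placeholder names (the real reconcile machinery is async and more involved):

```rust
use std::{thread, time::Duration};

// Retry a fallible reconcile-like operation a bounded number of times,
// then give up so the drain can move on to the next tenant instead of
// stalling indefinitely on an unresponsive pageserver.
fn retry_with_limit<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
    backoff: Duration,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                attempt += 1;
                thread::sleep(backoff);
            }
        }
    }
}
```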

VladLazar added a commit that referenced this pull request Jun 19, 2024
…r restarts (#8014)

## Problem
Pageserver restarts cause read availability downtime for tenants. See
`Motivation` section in the
[RFC](#7704).

## Summary of changes
* Introduce a new `NodeSchedulingPolicy`: `PauseForRestart`
* Implement the first take of drain and fill algorithms
* Add a node status endpoint which can be polled to figure out when an
operation is done

The implementation follows the RFC, so it might be useful to peek at it
as you're reviewing.
Since the PR is rather chunky, I've made sure all commits build (with
warnings), so you can
review by commit if you prefer that.

RFC: #7704
Related #7387
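
For orientation, a rough sketch of the shape of those changes; everything here other than the `PauseForRestart` variant name is an assumption rather than the exact types or endpoint from the PR:

```rust
// Scheduling states a pageserver node can be in from the storage
// controller's point of view; `PauseForRestart` is the new variant for
// graceful restarts. The surrounding variants are illustrative and may
// not match the PR exactly.
#[derive(PartialEq)]
enum NodeSchedulingPolicy {
    Active,
    Draining,
    Filling,
    PauseForRestart,
}

// Illustrative shape of what the polled node status endpoint might
// return; the real response fields are an assumption here.
struct NodeStatus {
    policy: NodeSchedulingPolicy,
}

// A restart orchestrator could poll the status endpoint until the node
// reports it has finished draining (here signalled by reaching
// `PauseForRestart`), with a cap so it never waits forever.
fn wait_for_drain(poll: impl Fn() -> NodeStatus, max_polls: u32) -> bool {
    for _ in 0..max_polls {
        if poll().policy == NodeSchedulingPolicy::PauseForRestart {
            return true;
        }
        std::thread::sleep(std::time::Duration::from_secs(1));
    }
    false
}
```

An external restart orchestrator would set the node to drain, poll the status endpoint as above, restart the pageserver, and then let the fill run before moving on to the next node.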
VladLazar merged commit 9882ac8 into main on Jul 1, 2024
64 checks passed
VladLazar deleted the vlad/graceful-storcon-cluster-restarts-rfc branch on July 1, 2024 at 17:44
VladLazar added 6 commits that referenced this pull request on Jul 8, 2024:
RFC for "Graceful Restarts of Storage Controller Managed Clusters".
Related #7387