storage controller: rolling restart hooks #7387

Closed
jcsp opened this issue Apr 15, 2024 · 10 comments

@jcsp
Collaborator

jcsp commented Apr 15, 2024

Hooks to enable some external orchestrator to ensure that attachments are drained from a node before restarting it, and hint the controller to move attachments back to the node after the restart.

@jcsp jcsp added t/feature Issue type: feature, for new features or requests c/storage/controller Component: Storage Controller labels Apr 15, 2024
@VladLazar
Contributor

This week:

  • work on POC and RFC

@VladLazar
Contributor

Last week:

This week:

  • Get feedback
  • Work on POC to make it prod ready

VladLazar added a commit that referenced this issue Jun 11, 2024
## Problem
The storage controller does not track the number of shards attached to a
given pageserver. This is a requirement for various scheduling
operations (e.g. draining and filling will use this to figure out if the
cluster is balanced)

## Summary of Changes
Track the number of shards attached to each node.

Related #7387
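
As an illustration of the per-node shard tracking described in this commit message, here is a minimal sketch in Python. It is not the storage controller's actual code: the class `AttachmentTracker`, its methods, and the `tolerance` parameter are all hypothetical names chosen for the example, and "balanced" is assumed to mean that per-node counts differ by at most `tolerance`.

```python
from collections import Counter


class AttachmentTracker:
    """Hypothetical bookkeeping: how many shards are attached to each node."""

    def __init__(self):
        self._attached_to = {}         # shard id -> node id
        self.shard_counts = Counter()  # node id -> number of attached shards

    def attach(self, shard_id, node_id):
        """Record that shard_id is now attached to node_id."""
        previous = self._attached_to.get(shard_id)
        if previous == node_id:
            return  # already attached there, nothing to update
        if previous is not None:
            self._decrement(previous)
        self._attached_to[shard_id] = node_id
        self.shard_counts[node_id] += 1

    def detach(self, shard_id):
        """Record that shard_id is no longer attached anywhere."""
        previous = self._attached_to.pop(shard_id, None)
        if previous is not None:
            self._decrement(previous)

    def _decrement(self, node_id):
        self.shard_counts[node_id] -= 1
        if self.shard_counts[node_id] == 0:
            del self.shard_counts[node_id]

    def is_balanced(self, nodes, tolerance=1):
        """Assumed definition: counts across nodes differ by at most tolerance."""
        counts = [self.shard_counts.get(node, 0) for node in nodes]
        if not counts:
            return True
        return max(counts) - min(counts) <= tolerance
```

A drain or fill routine could consult `is_balanced` to decide whether more attachments need to move; the real balance criterion is the one defined in the RFC, not this sketch.
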
@VladLazar
Contributor

VladLazar commented Jun 17, 2024

Last week (2024-06-10):

This week (2024-06-17):

VladLazar added a commit that referenced this issue Jun 19, 2024
…r restarts (#8014)

## Problem
Pageserver restarts cause read availability downtime for tenants. See
`Motivation` section in the
[RFC](#7704).

## Summary of changes
* Introduce a new `NodeSchedulingPolicy`: `PauseForRestart`
* Implement the first take of drain and fill algorithms
* Add a node status endpoint which can be polled to figure out when an
operation is done

The implementation follows the RFC, so it might be useful to peek at it
as you're reviewing.
Since the PR is rather chunky, I've made sure all commits build (with
warnings), so you can
review by commit if you prefer that.

RFC: #7704
Related #7387
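
To make the drain, restart, fill sequence concrete, here is a rough sketch of how an external orchestrator might drive it by polling node status. Everything in it is an assumption made for illustration: the `Controller` interface, `FakeController`, `wait_for_policy`, `rolling_restart`, and the `Draining`/`Filling`/`Active` policy strings are hypothetical, and the sketch assumes a drain is done once the node reports `PauseForRestart` and a fill is done once it reports `Active`. Only `PauseForRestart` and the idea of a pollable status endpoint come from the commit message above.

```python
import time
from typing import Callable, Iterable, Protocol


class Controller(Protocol):
    """Hypothetical client interface; not the real storage controller API."""

    def start_drain(self, node_id: int) -> None: ...
    def start_fill(self, node_id: int) -> None: ...
    def node_policy(self, node_id: int) -> str: ...


def wait_for_policy(controller, node_id, wanted, timeout_s=600.0,
                    poll_interval_s=1.0, sleep=time.sleep):
    """Poll the node's scheduling policy until it equals `wanted` or we time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if controller.node_policy(node_id) == wanted:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"node {node_id} did not reach {wanted}")
        sleep(poll_interval_s)


def rolling_restart(controller: Controller, node_ids: Iterable[int],
                    restart_node: Callable[[int], None], **wait_kwargs):
    """Drain, restart, and refill one node at a time."""
    for node_id in node_ids:
        controller.start_drain(node_id)
        wait_for_policy(controller, node_id, "PauseForRestart", **wait_kwargs)
        restart_node(node_id)
        controller.start_fill(node_id)
        wait_for_policy(controller, node_id, "Active", **wait_kwargs)


class FakeController:
    """In-memory stand-in: each operation finishes after a few status polls."""

    def __init__(self, polls_until_done=2):
        self.policies = {}
        self.log = []
        self._pending = {}  # node id -> (polls remaining, policy when done)
        self._polls = polls_until_done

    def start_drain(self, node_id):
        self.log.append(("drain", node_id))
        self.policies[node_id] = "Draining"
        self._pending[node_id] = (self._polls, "PauseForRestart")

    def start_fill(self, node_id):
        self.log.append(("fill", node_id))
        self.policies[node_id] = "Filling"
        self._pending[node_id] = (self._polls, "Active")

    def node_policy(self, node_id):
        if node_id in self._pending:
            remaining, target = self._pending[node_id]
            if remaining <= 0:
                self.policies[node_id] = target
                del self._pending[node_id]
            else:
                self._pending[node_id] = (remaining - 1, target)
        return self.policies.get(node_id, "Active")
```

Calling `rolling_restart(FakeController(), [1, 2], restart_node=print, poll_interval_s=0)` walks both nodes through the sequence in order. A real orchestrator would replace `FakeController` with HTTP calls and would need to decide what to do on timeout, which this sketch leaves as a raised `TimeoutError`.
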
@VladLazar
Contributor

VladLazar commented Jun 20, 2024

There are a number of small PRs currently up for review. See the topological dependency sort below:

https://github.com/neondatabase/neon/pull/8029

https://github.com/neondatabase/neon/pull/8061 ---------------
                                                              |-------> https://github.com/neondatabase/neon/pull/8119
https://github.com/neondatabase/neon/pull/8099 ----------------
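
For readers who want the ordering in machine-checkable form, the diagram can be expressed as a small dependency graph. This is only a sketch of my reading of the diagram: it assumes #8029 is independent and that #8119 depends on both #8061 and #8099. The variable names are made up for the example.

```python
from graphlib import TopologicalSorter

# Map each PR number to the set of PRs it depends on (as read from the diagram).
pr_dependencies = {
    8029: set(),
    8061: set(),
    8099: set(),
    8119: {8061, 8099},
}

# static_order() yields every PR after all of its dependencies.
review_order = list(TopologicalSorter(pr_dependencies).static_order())
print(review_order)
```

Any order it prints places #8119 after #8061 and #8099; the relative order of the independent PRs is not significant.
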

@VladLazar
Contributor

Last week (2024-06-17):

This week (2024-06-24):

@VladLazar
Contributor

VladLazar commented Jul 1, 2024

Last week (2024-06-24)

This week (2024-07-01)

  • Do graceful storcon cluster restart in ap-southeast-1 as part of the platform release

VladLazar added a commit that referenced this issue Jul 1, 2024
RFC for "Graceful Restarts of Storage Controller Managed Clusters". 
Related #7387
@VladLazar
Contributor

Last week (2024-07-01)

  • Do graceful storcon cluster restart in ap-southeast-1 as part of the platform release - all went fine
  • Do a bigger region this week. I'm storage on call, so I'll pick one and do it

VladLazar added a commit that referenced this issue Jul 8, 2024
RFC for "Graceful Restarts of Storage Controller Managed Clusters". 
Related #7387
@VladLazar
Contributor

VladLazar commented Jul 12, 2024

Last week (2024-07-08)

  • Did graceful storcon cluster restart up to and including eu-central-1
  • eu-central-1:
    • Storcon went into a busy loop and was killed by k8s for not responding to /status. Fixed here
    • Restart proceeded without fill and the background optimizations gradually balanced the cluster

Next week (2024-07-15)

@VladLazar
Contributor

Last week (2024-07-22):

  • Storage release did graceful storcon cluster restart in all regions (except ap-southeast-2)

This week (2024-07-29):

  • Set up an alert for ERROR logs coming from the Python script
  • Enable by default?

@VladLazar
Contributor

2024-08-06:

  • Added alerts: Storage Controller Cluster Restart Logs (Prod|Staging)
  • Will monitor during deployments for the next couple of weeks, but closing the issue as complete
