storage controller: rolling restart hooks #7387

jcsp · 2024-04-15T13:46:44Z

Hooks to enable some external orchestrator to ensure that attachments are drained from a node before restarting it, and hint the controller to move attachments back to the node after the restart.
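The request above implies a per-node sequence on the orchestrator side: drain the node, restart it, then hint the controller to fill it again. A minimal sketch of that loop follows. The `drain`, `restart`, and `fill` callables are hypothetical placeholders, not storage controller API, and each is assumed to block until its step has finished.

```python
from typing import Callable, Iterable, List, Tuple


def rolling_restart(
    nodes: Iterable[str],
    drain: Callable[[str], None],
    restart: Callable[[str], None],
    fill: Callable[[str], None],
) -> List[Tuple[str, str]]:
    """Restart nodes one at a time, draining before and filling after.

    All three callables are hypothetical hooks supplied by the caller and
    are assumed to return only once their step has completed.
    Returns the ordered list of (step, node) pairs that were executed.
    """
    steps: List[Tuple[str, str]] = []
    for node in nodes:
        drain(node)  # move attachments off the node before touching it
        steps.append(("drain", node))
        restart(node)  # node is now empty, so the restart should not hurt reads
        steps.append(("restart", node))
        fill(node)  # hint that attachments may move back to the node
        steps.append(("fill", node))
    return steps


if __name__ == "__main__":
    noop = lambda node: None
    for step, node in rolling_restart(["ps-1", "ps-2"], noop, noop, noop):
        print(step, node)
```

Handling one node at a time keeps the rest of the cluster available to absorb the drained attachments.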

VladLazar · 2024-05-09T08:23:27Z

This week:

work on POC and RFC

VladLazar · 2024-05-13T09:36:55Z

Last week:

RFC: docs: Graceful storage controller cluster restarts RFC #7704
POC (implements 90% of the RFC): Vlad/storcon drain fill poc #7682

This week:

Get feedback
Work on POC to make it prod ready

## Problem

The storage controller does not track the number of shards attached to a given pageserver. This is a requirement for various scheduling operations (e.g. draining and filling will use this to figure out if the cluster is balanced).

## Summary of Changes

Track the number of shards attached to each node.

Related #7387
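As an illustration of why a per-node attachment count helps, here is a small sketch of such a counter with a balance check. The class name, method names, and the `tolerance` parameter are assumptions made for this example; the actual change lives in the Rust storage controller and may define balance differently.

```python
from typing import Dict, Iterable


class AttachedShardCounts:
    """Hypothetical per-node counter of attached shards."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self._counts: Dict[str, int] = {node: 0 for node in nodes}

    def on_attach(self, node: str) -> None:
        self._counts[node] += 1

    def on_detach(self, node: str) -> None:
        if self._counts[node] == 0:
            raise ValueError(f"no shards attached to {node}")
        self._counts[node] -= 1

    def count(self, node: str) -> int:
        return self._counts[node]

    def is_balanced(self, tolerance: int = 1) -> bool:
        """Balanced here means max and min counts differ by at most `tolerance`."""
        if not self._counts:
            return True
        counts = self._counts.values()
        return max(counts) - min(counts) <= tolerance


if __name__ == "__main__":
    tracker = AttachedShardCounts(["ps-1", "ps-2"])
    for _ in range(3):
        tracker.on_attach("ps-1")
    print(tracker.is_balanced())  # ps-1 has 3 shards, ps-2 has 0
```

A fill operation could, under this assumption, stop moving shards back once `is_balanced()` holds.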

VladLazar · 2024-06-17T13:08:31Z

Last week (2024-06-10):

Merged a prerequisite PR
Tidied up storcon: add drain and fill background operations for graceful cluster restarts #8014
Worked on stabilising impl for scale test - passing now

This week (2024-06-17):

Merge storcon: add drain and fill background operations for graceful cluster restarts #8014
Work on Ansible side

…r restarts (#8014)

## Problem

Pageserver restarts cause read availability downtime for tenants. See the `Motivation` section in the [RFC](#7704).

## Summary of changes

* Introduce a new `NodeSchedulingPolicy`: `PauseForRestart`
* Implement the first take of drain and fill algorithms
* Add a node status endpoint which can be polled to figure out when an operation is done

The implementation follows the RFC, so it might be useful to peek at it as you're reviewing. Since the PR is rather chunky, I've made sure all commits build (with warnings), so you can review by commit if you prefer that.

RFC: #7704

Related #7387
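The polling pattern mentioned in the summary can be sketched as below. This is an illustration only: `is_done` is a hypothetical callable standing in for a request to the node status endpoint, whose path and response shape are not given here, and the interval and timeout values are arbitrary.

```python
import time
from typing import Callable


def wait_for_operation(
    is_done: Callable[[], bool],
    interval_s: float = 1.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll `is_done` until it returns True and return the number of polls.

    `is_done` is a hypothetical wrapper around the status endpoint.
    Raises TimeoutError if the operation has not finished within `timeout_s`.
    """
    deadline = clock() + timeout_s
    polls = 0
    while True:
        polls += 1
        if is_done():
            return polls
        if clock() >= deadline:
            raise TimeoutError(f"operation not done after {polls} polls")
        sleep(interval_s)


if __name__ == "__main__":
    answers = iter([False, False, True])
    print(wait_for_operation(lambda: next(answers), interval_s=0.0))
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real delays.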

VladLazar · 2024-06-20T15:19:42Z

There are a number of small PRs currently up for review. See the topological dependency sort below:

https://github.com/neondatabase/neon/pull/8029

https://github.com/neondatabase/neon/pull/8061 ---------------
                                                              |-------> https://github.com/neondatabase/neon/pull/8119
https://github.com/neondatabase/neon/pull/8099 ----------------

VladLazar · 2024-06-24T13:28:20Z

Last week (2024-06-17):

Merged the main PR with drain and fill support
Spent a lot of time on fine-tuning shard placement and scale testing
Implemented Ansible side: https://github.com/neondatabase/aws/pull/1511

This week (2024-06-24):

Test in staging
Merge https://github.com/neondatabase/aws/pull/1511
Open PR for scale test

VladLazar · 2024-07-01T11:36:00Z

Last week (2024-06-24)

Investigated storcon segfaults => downgraded openssl
Multiple staging tests
Merged ansible side of things https://github.com/neondatabase/aws/pull/1511

This week (2024-07-01)

Do graceful storcon cluster restart in ap-southeast-1 as part of the platform release

RFC for "Graceful Restarts of Storage Controller Managed Clusters". Related #7387

VladLazar · 2024-07-08T13:20:52Z

Last week (2024-07-01)

Did a graceful storcon cluster restart in ap-southeast-1 as part of the platform release; all went fine
Plan for this week: do a bigger region. I'm storage on call, so I'll pick one and do it

VladLazar · 2024-07-12T16:43:54Z

Last week (2024-07-08)

Did graceful storcon cluster restart up to and including eu-central-1
eu-central-1:
- Storcon went into a busy loop and was killed by k8s for not responding to /status. Fixed here
- Restart proceeded without fill and the background optimizations gradually balanced the cluster

Next week (2024-07-15)

Verify that cplane fix https://github.com/neondatabase/cloud/pull/15403 is released first. These rolling restarts issue many reconciles, which makes the bug more likely.
Do graceful restarts in ap-* and one us region.

VladLazar · 2024-07-29T09:08:03Z

Last week (2024-07-22):

Storage release did a graceful storcon cluster restart in all regions (except ap-southeast-2)

This week (2024-07-29):

Set up an alert for ERROR logs coming from the Python script
Enable by default?

VladLazar · 2024-08-06T09:42:33Z

2024-08-06:

Added alerts: Storage Controller Cluster Restart Logs (Prod|Staging)
Will monitor during deployments for the next couple of weeks, but closing the issue as complete

jcsp added t/feature Issue type: feature, for new features or requests c/storage/controller Component: Storage Controller labels Apr 15, 2024

jcsp assigned VladLazar Apr 15, 2024

jcsp mentioned this issue Apr 16, 2024

storage controller: optimize placement of tenant shards when a node comes back online #7139

Open

VladLazar mentioned this issue May 13, 2024

docs: Graceful storage controller cluster restarts RFC #7704

Merged

VladLazar mentioned this issue Jun 11, 2024

storcon: track number of attached shards for each node #8011

Merged

5 tasks

VladLazar mentioned this issue Jun 11, 2024

storcon: add drain and fill background operations for graceful cluster restarts #8014

Merged

5 tasks

VladLazar added a commit that referenced this issue Jul 1, 2024

docs: Graceful storage controller cluster restarts RFC (#7704)

9882ac8

RFC for "Graceful Restarts of Storage Controller Managed Clusters". Related #7387

VladLazar closed this as completed Aug 6, 2024
