storage controller: add node deletion API #8226
Conversation
3054 tests run: 2939 passed, 0 failed, 115 skipped (full report)

Code coverage* (full report)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.

20b69e7 at 2024-07-11T12:54:08.057Z :recycle:
Force-pushed from c8e6a4c to ee09248 (compare)
Have you considered implementing this as a background operation where the caller has to poll for the absence of the node? It would look a lot like the drain code, and I think it would be easier on the operator (i.e. us 😄).
Yes, we should add something like that. This command is meant as a stepping stone that gives us a clean way to remove a node, without necessarily being totally graceful about draining it (yet).
I created #8333 to track the remaining work to make node removal really slick.
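For context, here is a rough sketch of what that polling flow could look like from the operator's side. The endpoint path, port, and node id are made up for illustration; this is not the storage controller's actual API, just the shape of "delete, then poll until the node is gone":

```rust
use std::{thread, time::Duration};

use reqwest::{blocking::Client, StatusCode};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical endpoint: the real storage controller path and port may differ.
    let node_url = "http://storage-controller:1234/control/v1/node/42";
    let client = Client::new();

    // Ask the controller to remove the node...
    client.delete(node_url).send()?.error_for_status()?;

    // ...then poll until it no longer exists. In a background-operation design,
    // the DELETE returns immediately and this loop is how the caller confirms
    // that the removal has actually completed.
    loop {
        let status = client.get(node_url).send()?.status();
        if status == StatusCode::NOT_FOUND {
            println!("node fully removed");
            break;
        }
        thread::sleep(Duration::from_secs(5));
    }
    Ok(())
}
```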
## Problem

In anticipation of later adding a really nice drain+delete API, I initially only added an intentionally basic `/drop` API that is just about usable for deleting nodes in a pinch, but requires some ugly storage controller restarts to persuade it to restart secondaries.

## Summary of changes

I started making a few tiny fixes, and ended up writing the delete API...

- Quality of life nit: ordering of node + tenant listings in storcon_cli
- Papercut: fix the attach_hook using the wrong operation type for reporting slow locks
- Make Service::spawn tolerate `generation_pageserver` columns that point to nonexistent node IDs. I started out thinking of this as a general resilience thing, but when implementing the delete API I realized it was actually a legitimate end state after the delete API is called (as that API doesn't wait for all reconciles to succeed).
- Add a `DELETE` API for nodes, which does not gracefully drain, but does reschedule everything. This is safe to use whatever state the system is in, but will incur availability gaps for any tenants that weren't already live-migrated away. If tenants have already been drained, it is a totally clean + safe way to decommission a node.
- Add a test and a storcon_cli wrapper for it (see the sketch after this list for the rescheduling idea)

This is meant to be a robust initial API that lets us remove nodes without doing ugly things like restarting the storage controller -- it's not quite a totally graceful node-draining routine yet. There's more work in #8333 to get to our end state.
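As a minimal sketch of the "reschedule everything" behaviour described above: the types, names, and the `reschedule` hook below are simplified stand-ins invented for illustration, not the storage controller's real data structures.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the controller's real types (assumptions, for illustration only).
type NodeId = u64;
type TenantShardId = u32;

#[derive(Debug)]
struct TenantShard {
    attached: Option<NodeId>,
    secondaries: Vec<NodeId>,
}

/// Non-graceful delete: drop the node, then reschedule anything that referenced it.
/// The real API additionally persists the change and kicks off reconciles without
/// waiting for them, which is why a `generation_pageserver` column can legitimately
/// point at a node that no longer exists.
fn delete_node(
    nodes: &mut HashMap<NodeId, String>,
    shards: &mut HashMap<TenantShardId, TenantShard>,
    deleted: NodeId,
    reschedule: impl Fn(&TenantShard) -> Option<NodeId>,
) {
    nodes.remove(&deleted);
    for shard in shards.values_mut() {
        shard.secondaries.retain(|n| *n != deleted);
        if shard.attached == Some(deleted) {
            // The availability gap: the shard loses its attached location and
            // must be re-attached somewhere else.
            shard.attached = reschedule(shard);
        }
    }
}

fn main() {
    let mut nodes = HashMap::from([(1, "ps-1".to_string()), (2, "ps-2".to_string())]);
    let mut shards = HashMap::from([(
        100,
        TenantShard { attached: Some(1), secondaries: vec![2] },
    )]);
    // Reschedule onto any surviving node (a real scheduler is much smarter).
    delete_node(&mut nodes, &mut shards, 1, |_| Some(2));
    println!("{shards:?}");
}
```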