storage controller: add node deletion API #8226

Merged: 10 commits from jcsp/storcon-papercuts into main on Jul 11, 2024

Conversation

@jcsp (Contributor) commented on Jul 1, 2024

Problem

In anticipation of eventually adding a proper drain+delete API, I initially added only an intentionally basic /drop API: just about usable for deleting nodes in a pinch, but requiring some ugly storage controller restarts to persuade it to restart secondaries.

Summary of changes

I started making a few tiny fixes, and ended up writing the delete API...

  • Quality-of-life fix: order the node and tenant listings in storcon_cli.
  • Papercut: fix the attach_hook using the wrong operation type when reporting slow locks.
  • Make Service::spawn tolerate generation_pageserver columns that point to nonexistent node IDs. I started out thinking of this as general resilience, but while implementing the delete API I realized it is a legitimate end state after that API is called, since the API does not wait for all reconciles to succeed.
  • Add a DELETE API for nodes, which does not gracefully drain, but does reschedule everything (see the sketch after this list). It is safe to use whatever state the system is in, but will incur availability gaps for any tenants that were not already live-migrated away. If tenants have already been drained, it is a totally clean and safe way to decommission a node.
  • Add a test and a storcon_cli wrapper for it.
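
For illustration only, here is a minimal, self-contained sketch of the rescheduling behaviour described in the DELETE bullet above. Every name in it (`Service`, `Shard`, `delete_node`) is a hypothetical stand-in; the real logic lives in storage_controller/src/service.rs and uses the actual scheduler rather than picking the first surviving node.

```rust
use std::collections::HashMap;

type NodeId = u64;

#[derive(Debug)]
struct Shard {
    attached: NodeId,
}

struct Service {
    nodes: HashMap<NodeId, String>, // node id -> hostname
    shards: Vec<Shard>,
}

impl Service {
    /// Non-graceful delete: drop the node, reschedule everything that
    /// referenced it, and return without waiting for reconciliation.
    fn delete_node(&mut self, victim: NodeId) {
        self.nodes.remove(&victim);
        for shard in self.shards.iter_mut().filter(|s| s.attached == victim) {
            // Pick any surviving node; the real scheduler weighs load and
            // affinity rather than taking the first key.
            if let Some(&target) = self.nodes.keys().next() {
                shard.attached = target;
                // A reconcile would be spawned here asynchronously; the
                // API deliberately does not wait for it to succeed.
            }
        }
    }
}

fn main() {
    let mut svc = Service {
        nodes: HashMap::from([(1, "ps-1".into()), (2, "ps-2".into())]),
        shards: vec![Shard { attached: 1 }, Shard { attached: 2 }],
    };
    svc.delete_node(1);
    assert!(svc.shards.iter().all(|s| s.attached == 2));
}
```

Returning before reconciles complete is also why the Service::spawn change above is needed: a generation_pageserver column may still point at the deleted node until its reconcile lands.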

This is meant to be a robust initial API that lets us remove nodes without doing ugly things like restarting the storage controller; it is not yet a fully graceful node-draining routine. #8333 tracks the remaining work to reach our end state.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist.

@jcsp added the t/feature (Issue type: feature, for new features or requests) and c/storage/controller (Component: Storage Controller) labels on Jul 1, 2024
@jcsp requested a review from VladLazar on July 1, 2024 at 20:41
github-actions bot commented on Jul 1, 2024

3054 tests run: 2939 passed, 0 failed, 115 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_isolation[None]: debug
  • test_tenant_creation_fails: debug

Code coverage* (full report)

  • functions: 32.6% (6971 of 21351 functions)
  • lines: 50.0% (54789 of 109536 lines)

* collected from Rust tests only


This comment is automatically updated with the latest test results. Last updated for commit 20b69e7 at 2024-07-11T12:54:08.057Z.

@jcsp marked this pull request as ready for review on July 2, 2024 at 13:56
@jcsp requested a review from a team as a code owner on July 2, 2024 at 13:56
@VladLazar (Contributor) left a comment

Have you considered implementing this as a background operation where the caller has to poll for the absence of the node? It would look a lot like the drain code, and I think it would be easier on the operator (i.e. us 😄).
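
For what that could look like, here is a minimal sketch of the suggested polling pattern, assuming a `reqwest` blocking client and a hypothetical GET endpoint that returns 404 once the node is gone; neither the path nor the status semantics are taken from the storage controller's actual API.

```rust
// Sketch only: assumes reqwest = { version = "0.12", features = ["blocking"] }.
use std::{thread, time::Duration};

fn wait_for_node_gone(base_url: &str, node_id: u64) -> Result<(), reqwest::Error> {
    let client = reqwest::blocking::Client::new();
    loop {
        // Poll a hypothetical node-status endpoint until the node is absent.
        let resp = client.get(format!("{base_url}/nodes/{node_id}")).send()?;
        if resp.status() == reqwest::StatusCode::NOT_FOUND {
            return Ok(()); // deletion has completed
        }
        thread::sleep(Duration::from_secs(1));
    }
}
```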

Review thread on storage_controller/src/service.rs (resolved)
@jcsp (Contributor, Author) commented on Jul 10, 2024

> Have you considered implementing this as a background operation where the caller has to poll for the absence of the node? It would look a lot like the drain code, and I think it would be easier on the operator (i.e. us 😄).

Yes, we should add something like that. This command is meant as a stepping stone that gives us a clean way to remove a node, without necessarily being totally graceful about draining it (yet).

@jcsp (Contributor, Author) commented on Jul 10, 2024

I created #8333 to track the remaining work to make node removal really slick.

@jcsp requested a review from VladLazar on July 11, 2024 at 07:30
Review thread on storage_controller/src/service.rs (resolved)
Review thread on test_runner/fixtures/neon_fixtures.py (outdated, resolved)
@jcsp enabled auto-merge (squash) on July 11, 2024 at 12:05
@jcsp merged commit 814c8e8 into main on July 11, 2024
65 checks passed
@jcsp deleted the jcsp/storcon-papercuts branch on July 11, 2024 at 16:05
skyzh pushed a commit that referenced this pull request on Jul 15, 2024