Error handling for shard splitting #6676

Closed · Tracked by #6278

jcsp opened this issue Feb 8, 2024 · 0 comments

jcsp commented Feb 8, 2024

Non-exhaustive list of cases to handle:

  • Concurrent requests to another endpoint (e.g. deleting the tenant while splitting) or to the same endpoint (e.g. retries) should be excluded.
  • A crash during a split should be recovered (check split status in Service::spawn).
  • An error during a split should be rolled back (wrap Service::tenant_shard_split so that errors enter a rollback routine; see the sketch after this list).
  • Figure out what to do if a rollback fails, e.g. if we can't send a /location_config request to a pageserver to clean up a child shard: either retry forever, or ensure that we run a cleanup routine on nodes when they go Offline->Available that removes any stray child shards.

See comments in code linking this ticket.
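
A minimal sketch of what the rollback wrapper could look like, in Rust. The names here (`do_split`, `rollback_split`, `SplitError`) are hypothetical stand-ins, not the actual API; the real change would wrap `Service::tenant_shard_split`:

```rust
/// Outcome of a failed split, distinguishing "failed but rolled back"
/// from "failed and the rollback also failed".
#[derive(Debug)]
enum SplitError {
    /// The split failed but was rolled back cleanly.
    Split(String),
    /// The split failed AND the rollback could not complete; stray child
    /// shards may remain until a later node cleanup pass runs.
    RollbackFailed(String),
}

async fn do_split(tenant_id: u64) -> Result<(), String> {
    // Stub: in the real code this would issue /location_config calls to
    // pageservers and persist the new shard count.
    let _ = tenant_id;
    Ok(())
}

async fn rollback_split(tenant_id: u64) -> Result<(), String> {
    // Stub: detach any child shards and re-attach the parent shards.
    let _ = tenant_id;
    Ok(())
}

async fn split_with_rollback(tenant_id: u64) -> Result<(), SplitError> {
    match do_split(tenant_id).await {
        Ok(()) => Ok(()),
        // Any error enters the rollback routine before being surfaced.
        Err(split_err) => match rollback_split(tenant_id).await {
            Ok(()) => Err(SplitError::Split(split_err)),
            Err(rb_err) => Err(SplitError::RollbackFailed(format!(
                "split failed ({split_err}); rollback also failed ({rb_err})"
            ))),
        },
    }
}
```

Keeping the two failure modes distinct matters because `RollbackFailed` is exactly the case that would need the Offline->Available cleanup routine described in the last bullet.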

@jcsp added the a/tech_debt (Area: related to tech debt) and c/storage (Component: storage) labels Feb 8, 2024
@jcsp self-assigned this Feb 8, 2024
@jcsp changed the title to Error handling for shard splitting Feb 26, 2024
jcsp added a commit that referenced this issue Mar 12, 2024
…#7087)

Not a user-facing change, but it can break any existing `.neon` directories created by neon_local, as the name of the database used by the storage controller changes.

This PR changes all the locations apart from the path of `control_plane/attachment_service` (waiting for an opportune moment to do that one, because it is the most likely to conflict with ongoing PRs like #6676).
jcsp added a commit that referenced this issue Mar 14, 2024
## Problem

Shard splits worked, but were not yet safe against failures (e.g. a node crash during a split).

Related: #6676 

## Summary of changes

- Introduce async rwlocks at the scope of Tenant and Node (see the sketch after this list):
  - the exclusive tenant lock is used to protect splits;
  - the exclusive node lock is used to protect the new reconciliation process that happens when setting a node active;
  - exclusive locks are used in both cases when doing persistent updates (e.g. node scheduling conf) where the update to the DB and in-memory state needs to be atomic.
- Add failpoints to shard splitting in control plane and pageserver code.
- Implement error handling in the control plane for shard splits: this detaches child shards and ensures parent shards are re-attached.
- Crash-safety over storage controller restarts requires little effort: we already reconcile with nodes across a restart, so as long as we reset any incomplete splits in the DB on restart (added in this PR), things are implicitly cleaned up.
- Implement reconciliation with offline nodes before they transition to active:
  - (in this context, reconciliation means something like startup_reconcile, not literally the Reconciler)
  - This covers cases where a split abort cannot reach a node to clean it up: the cleanup will eventually happen when the node is marked active, as part of reconciliation.
  - This also covers the case where a node was unavailable when the storage controller started but becomes available later: previously this allowed it to skip the startup reconcile.
- The storage controller now terminates on panics. We only use panics for true "should never happen" assertions, and these cases can leave us in an unusable state if we keep running (e.g. panicking partway through a shard split). In the unlikely event that we get into a crash loop as a result, we'll rely on Kubernetes to back us off.
- Add `test_sharding_split_failures`, which exercises a variety of failure cases during shard split.
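
A minimal sketch, in Rust, of the per-tenant exclusive lock pattern from the first bullet. The structure and names (`TenantLocks`, `for_tenant`) are hypothetical, not the storage controller's actual types; the point is that a split takes the write half of a per-tenant `tokio::sync::RwLock`, so a concurrent delete or a retried split on the same tenant is excluded:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use tokio::sync::RwLock;

type TenantId = u64;

/// One async rwlock per tenant: operations that mutate the tenant
/// exclusively (split, delete) take the write half; everything else
/// takes the read half.
#[derive(Default)]
struct TenantLocks {
    locks: Mutex<HashMap<TenantId, Arc<RwLock<()>>>>,
}

impl TenantLocks {
    fn for_tenant(&self, tenant_id: TenantId) -> Arc<RwLock<()>> {
        self.locks
            .lock()
            .unwrap()
            .entry(tenant_id)
            .or_default()
            .clone()
    }
}

async fn tenant_shard_split(locks: &TenantLocks, tenant_id: TenantId) {
    let lock = locks.for_tenant(tenant_id);
    // Held for the whole split: concurrent deletes or retried splits of
    // the same tenant block here instead of interleaving with us.
    let _guard = lock.write().await;
    // ... perform the split; failpoints would sit inside this critical
    // section to exercise error/crash handling at each step ...
}
```

The same pattern applies at Node scope to make the activate-time reconciliation exclusive.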
@jcsp closed this as completed Mar 18, 2024