Error handling for shard splitting #6676

Closed · Tracked by #6278

jcsp opened this issue Feb 8, 2024 · 0 comments

jcsp commented Feb 8, 2024

Non-exhaustive list of cases to handle:

  • Concurrent requests to another endpoint (e.g. deleting the tenant while splitting) or to the same endpoint (e.g. retries) should be excluded.
  • A crash during a split should be recovered (check split status in Service::spawn).
  • An error during a split should be rolled back (wrap Service::tenant_shard_split so that errors enter a rollback routine; see the sketch after this list).
  • Figure out what to do if a rollback fails, e.g. if we can't send a /location_config request to a pageserver to clean up a child shard: either retry forever, or ensure that we run a cleanup routine on nodes when they go Offline->Available that removes any stray child shards.

See comments in code linking this ticket.
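
A minimal sketch of what the rollback wrapper could look like, in Rust. The names here (`do_split`, `rollback_split`, `SplitError`) are hypothetical stand-ins, not the actual API; the real change would wrap `Service::tenant_shard_split`:

```rust
/// Outcome of a failed split, distinguishing "failed but rolled back"
/// from "failed and the rollback also failed".
#[derive(Debug)]
enum SplitError {
    /// The split failed but was rolled back cleanly.
    Split(String),
    /// The split failed AND the rollback could not complete; stray child
    /// shards may remain until a later node cleanup pass runs.
    RollbackFailed(String),
}

async fn do_split(tenant_id: u64) -> Result<(), String> {
    // Stub: in the real code this would issue /location_config calls to
    // pageservers and persist the new shard count.
    let _ = tenant_id;
    Ok(())
}

async fn rollback_split(tenant_id: u64) -> Result<(), String> {
    // Stub: detach any child shards and re-attach the parent shards.
    let _ = tenant_id;
    Ok(())
}

async fn split_with_rollback(tenant_id: u64) -> Result<(), SplitError> {
    match do_split(tenant_id).await {
        Ok(()) => Ok(()),
        // Any error enters the rollback routine before being surfaced.
        Err(split_err) => match rollback_split(tenant_id).await {
            Ok(()) => Err(SplitError::Split(split_err)),
            Err(rb_err) => Err(SplitError::RollbackFailed(format!(
                "split failed ({split_err}); rollback also failed ({rb_err})"
            ))),
        },
    }
}
```

Keeping the two failure modes distinct matters because `RollbackFailed` is exactly the case that would need the Offline->Available cleanup routine described in the last bullet.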

@jcsp added the a/tech_debt (Area: related to tech debt) and c/storage (Component: storage) labels Feb 8, 2024
@jcsp self-assigned this Feb 8, 2024
@jcsp changed the title to Error handling for shard splitting Feb 26, 2024
jcsp added a commit that referenced this issue Mar 12, 2024
…#7087)

Not a user-facing change, but it can break any existing `.neon` directories created by neon_local, as the name of the database used by the storage controller changes.

This PR changes all the locations apart from the path of `control_plane/attachment_service` (waiting for an opportune moment to do that one, because it is the most likely to conflict with ongoing PRs like #6676).
jcsp added a commit that referenced this issue Mar 14, 2024
## Problem

Shard splits worked, but were not yet safe against failures (e.g. a node crash during a split).

Related: #6676 

## Summary of changes

- Introduce async rwlocks at the scope of Tenant and Node (see the sketch after this list):
  - the exclusive tenant lock is used to protect splits;
  - the exclusive node lock is used to protect the new reconciliation process that happens when setting a node active;
  - exclusive locks are used in both cases when doing persistent updates (e.g. node scheduling conf) where the update to the DB and in-memory state needs to be atomic.
- Add failpoints to shard splitting in control plane and pageserver code.
- Implement error handling in the control plane for shard splits: this detaches child shards and ensures parent shards are re-attached.
- Crash-safety over storage controller restarts requires little effort: we already reconcile with nodes across a restart, so as long as we reset any incomplete splits in the DB on restart (added in this PR), things are implicitly cleaned up.
- Implement reconciliation with offline nodes before they transition to active:
  - (in this context, reconciliation means something like startup_reconcile, not literally the Reconciler)
  - This covers cases where a split abort cannot reach a node to clean it up: the cleanup will eventually happen when the node is marked active, as part of reconciliation.
  - This also covers the case where a node was unavailable when the storage controller started but becomes available later: previously this allowed it to skip the startup reconcile.
- The storage controller now terminates on panics. We only use panics for true "should never happen" assertions, and these cases can leave us in an unusable state if we keep running (e.g. panicking partway through a shard split). In the unlikely event that we get into a crash loop as a result, we'll rely on Kubernetes to back us off.
- Add `test_sharding_split_failures`, which exercises a variety of failure cases during shard split.
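
A minimal sketch, in Rust, of the per-tenant exclusive lock pattern from the first bullet. The structure and names (`TenantLocks`, `for_tenant`) are hypothetical, not the storage controller's actual types; the point is that a split takes the write half of a per-tenant `tokio::sync::RwLock`, so a concurrent delete or a retried split on the same tenant is excluded:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use tokio::sync::RwLock;

type TenantId = u64;

/// One async rwlock per tenant: operations that mutate the tenant
/// exclusively (split, delete) take the write half; everything else
/// takes the read half.
#[derive(Default)]
struct TenantLocks {
    locks: Mutex<HashMap<TenantId, Arc<RwLock<()>>>>,
}

impl TenantLocks {
    fn for_tenant(&self, tenant_id: TenantId) -> Arc<RwLock<()>> {
        self.locks
            .lock()
            .unwrap()
            .entry(tenant_id)
            .or_default()
            .clone()
    }
}

async fn tenant_shard_split(locks: &TenantLocks, tenant_id: TenantId) {
    let lock = locks.for_tenant(tenant_id);
    // Held for the whole split: concurrent deletes or retried splits of
    // the same tenant block here instead of interleaving with us.
    let _guard = lock.write().await;
    // ... perform the split; failpoints would sit inside this critical
    // section to exercise error/crash handling at each step ...
}
```

The same pattern applies at Node scope to make the activate-time reconciliation exclusive.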
@jcsp closed this as completed Mar 18, 2024