
Epic: simplified tenant deletion #5080

Closed · 4 tasks done
jcsp opened this issue Aug 23, 2023 · 6 comments
Labels: a/tech_debt (Area: related to tech debt), c/storage/pageserver (Component: storage: pageserver)


jcsp commented Aug 23, 2023

The purpose of this refactor is to:

  • simplify the code: avoid Tenant having a special deleting state
  • relieve the pageserver of the responsibility to reliably resume deletions, by exploiting the control plane's ability to remember an operation in progress and retry it.

Current behavior

Currently:

  • A tenant must be attached to a pageserver to do a deletion
  • Deletion calls from the control plane get a 202 Accepted, deletion starts in the background
  • The pageserver persists deletion markers and has logic to recover these on startup and continue deletions that had been accepted.

Proposed change

Since the control plane already has a persistent record of its intent to delete something (an Operation record), we may simplify this a bit:

  • Tenant deletion may be done whether the tenant is attached or not: it is just deleting files/objects. Deleting an attached tenant/timeline will implicitly detach it before proceeding.
  • Deletion API calls may complete synchronously, or return a 408 Timeout to indicate that the deletion is still in progress. The control plane keeps retrying the request until it gets a 200 -- deletion is guaranteed to eventually succeed as long as remote storage is accessible. Usually deletions will complete in a single call: a few thousand layers is just a few DeleteObjects S3 calls. (See the retry-loop sketch after this list.)
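
For illustration, here is a minimal sketch of such a retry loop on the control plane side, assuming a hypothetical `DELETE /v1/tenant/{tenant_id}` endpoint and the `reqwest` crate (blocking feature); this is not the actual control plane code:

```rust
use std::{thread, time::Duration};

fn delete_tenant_blocking(base_url: &str, tenant_id: &str) -> anyhow::Result<()> {
    let client = reqwest::blocking::Client::new();
    let url = format!("{base_url}/v1/tenant/{tenant_id}");
    loop {
        let status = client.delete(&url).send()?.status();
        match status.as_u16() {
            // Deletion finished; the control plane can mark its Operation
            // record complete.
            200 => return Ok(()),
            // Deletion still in progress on the pageserver: wait and retry.
            408 => thread::sleep(Duration::from_secs(1)),
            other => anyhow::bail!("unexpected status {other} deleting tenant"),
        }
    }
}
```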

Tasks

  1. c/storage/controller c/storage/pageserver t/bug
  2. a/tech_debt c/storage/pageserver
  3. a/tech_debt c/storage/pageserver
jcsp added the c/storage/pageserver label on Aug 23, 2023
arpad-m commented Aug 23, 2023

Related to #5043

jcsp added a commit that referenced this issue Oct 6, 2023
## Problem

Spun off from #5449

Timeline deletion does the following:
1. Delete layers referenced in the index
2. Delete everything else in the timeline prefix, except the index
3. Delete the index.

When generations were added, the filter in step 2 became outdated, so the
index objects were deleted along with everything else at step 2. That
didn't really break anything, but it makes an automated test unhappy and
violates the original intent of the code, which presumably intends to
uphold an invariant that as long as any objects for a timeline exist, the
index exists.

(Eventually, this index-object-last complexity can go away: when we do
#5080, there is no need to keep the index_part around, as deletions can
always be retried at any time.)

## Summary of changes

After object listing, split the listed objects into layers and index
objects. Delete the layers first, then the index objects.
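
As a rough illustration of that ordering, here is a sketch using a simplified substring check (the real code parses remote paths rather than matching key substrings):

```rust
// Split listed object keys so that index objects are deleted only after
// all layer objects are gone, preserving the index-last invariant.
fn split_for_deletion(keys: Vec<String>) -> (Vec<String>, Vec<String>) {
    // `partition` puts keys matching the predicate (layers) first.
    let (layers, indexes): (Vec<String>, Vec<String>) = keys
        .into_iter()
        .partition(|key| !key.contains("index_part.json"));
    (layers, indexes)
}
```
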
jcsp commented Oct 27, 2023

Note on local delete markers for tenants: currently we rely on the presence of a local directory to cause the pageserver to include the tenant in its re-attach request to the control plane. So:

  • if we delete the local directory first, then we might forget to finish the remote deletion on a restart.
  • if we delete the remote prefix first, then at startup we would see the tenant as just an empty tenant with no timelines, and not finish the deletion.

So we can't get rid of those markers until we make the larger change: commit to removing the pageserver's retry-deletions-until-done behavior, make deletion requests run synchronously, and make the control plane understand 408 and keep retrying the deletion until it's done.

jcsp added a commit that referenced this issue Nov 27, 2023
## Problem

Historically, we treated the presence of a timeline on local disk as
evidence that it logically exists. Since #5580 that is no longer the
case, so we can always rely on remote storage. If we restart and the
timeline is gone in remote storage, we will also purge it from local
disk: no need for a marker.
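
A minimal sketch of that startup rule, with hypothetical helper names (the actual pageserver code operates on tenant/timeline types, not strings):

```rust
use std::collections::HashSet;

// Remote storage is the source of truth: any local timeline directory
// without a remote counterpart was deleted (or never fully created) and
// can be purged from disk without needing a marker.
fn timelines_to_purge(
    local_timelines: Vec<String>,
    remote_timelines: &HashSet<String>,
) -> Vec<String> {
    local_timelines
        .into_iter()
        .filter(|id| !remote_timelines.contains(id))
        .collect()
}
```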

Reference on why this PR is for timeline markers and not tenant markers:
#5080 (comment)

## Summary of changes

Remove code paths that read + write deletion marker for timelines.

Leave the code path that deletes these markers, just in case we deploy
while some still exist. This can be cleaned up later (#5718).
jcsp commented Feb 5, 2024

Today we had an issue where the control plane was sending deletions to the wrong pageserver, getting 404s, and handling that as success. This adds weight to the need to improve this API, to avoid a situation where success is indistinguishable from talking to entirely the wrong node.

arpad-m commented Feb 5, 2024

In order to answer differently for "never seen this tenant" vs "tenant successfully deleted", we'd have to keep some state around after deletion; right now we don't.

We'd need to keep at least some data on disk if we want to deliver "tenant successfully deleted"-style messages even after a crash that happens right after the last traces have been deleted, while we are getting ready to send the response. Instead of a crash it could also be a network issue, etc.

The question then becomes how to delete those tombstones. Maybe the control plane could, at pageserver startup or at some other regular interval, assemble a list of tenants that it expects the pageserver to have, and if there are tombstones not in that list, the pageserver is allowed to remove them. If we kept the "deleted_at" timestamp, we could even make it robust against race conditions by only considering tenants whose tombstones are older than the date in the control plane database. Or, idk, something like that. (A sketch of this idea follows below.)
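
To make the idea concrete, here is a hypothetical sketch of that tombstone GC; every name here is illustrative, not existing pageserver code:

```rust
use std::collections::HashSet;
use std::time::SystemTime;

struct Tombstone {
    tenant_id: String,
    deleted_at: SystemTime,
}

// The control plane supplies the set of tenants it expects this pageserver
// to hold, plus a cutoff timestamp from its database. A tombstone may be
// removed only if it is both unexpected and older than the cutoff, which
// guards against races with in-flight deletions.
fn tombstones_to_remove(
    tombstones: Vec<Tombstone>,
    expected: &HashSet<String>,
    cutoff: SystemTime,
) -> Vec<Tombstone> {
    tombstones
        .into_iter()
        .filter(|t| !expected.contains(&t.tenant_id) && t.deleted_at < cutoff)
        .collect()
}
```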

arpad-m commented Feb 5, 2024

Hmm, thinking about it, we could also just delete them after 24 hours or so, and then have logic in the control plane that goes back to accepting "never seen this tenant" responses once it is 24 hours past the first recorded deletion attempt.

jcsp mentioned this issue on Mar 7, 2024
jcsp added a commit that referenced this issue Apr 16, 2024
## Problem

test_sharding_smoke recently gained a section that checks deletion of a
sharded tenant. The storage controller does a retry loop for deletion,
waiting for a 404 response. When deletion is a bit slow (debug builds),
the retry of the deletion was getting a 500 response, which made the test
flaky (example failure:
https://neon-github-public-dev.s3.amazonaws.com/reports/release-proxy/8659801445/index.html#testresult/b4cbf5b58190f60e/retries)

There was an incorrect comment in the code:
```
         match tenant.current_state() {
             TenantState::Broken { .. } | TenantState::Stopping { .. } => {
-                // If a tenant is broken or stopping, DeleteTenantFlow can
-                // handle it: broken tenants proceed to delete, stopping tenants
-                // are checked for deletion already in progress.
```

If the tenant is stopping, DeleteTenantFlow does not in fact handle it,
but returns a 500-yielding error.

## Summary of changes

Before calling into DeleteTenantFlow, if the tenant is in a stopping or
broken state, return 202 if a deletion is already in progress. This makes
the API friendlier for retries.

The historic AlreadyInProgress (409) response still exists for the case
where we enter DeleteTenantFlow and unexpectedly see the tenant stopping.
That should go away when we implement #5080. For the moment, callers that
handle 409s should continue to do so. (A sketch of the pre-check follows
below.)
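
A minimal sketch of that pre-check, using simplified stand-ins for the pageserver's TenantState and HTTP status handling (not the actual handler code):

```rust
enum TenantState {
    Active,
    Broken,
    Stopping,
}

/// Returns Some(status) if the handler can answer without entering
/// DeleteTenantFlow, or None to proceed into the flow.
fn precheck(state: &TenantState, deletion_in_progress: bool) -> Option<u16> {
    match state {
        // A deletion is already underway: 202 Accepted tells the caller's
        // retry loop to simply poll again.
        TenantState::Broken | TenantState::Stopping if deletion_in_progress => Some(202),
        // Otherwise fall through to DeleteTenantFlow, which may still
        // answer 409 AlreadyInProgress in unexpected races.
        _ => None,
    }
}
```
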
jcsp changed the title from "Epic: simplified timeline deletion, enable deletion of detached tenants" to "Epic: simplified tenant deletion" on Jun 4, 2024
jcsp self-assigned this on Jun 4, 2024
jcsp added the a/tech_debt label on Jun 5, 2024
jcsp added a commit that referenced this issue Jun 5, 2024
## Problem

As described in #7952, the controller's attempt to reconcile a tenant
before finally deleting it can get hung up waiting for the compute
notification hook to accept updates.

The fact that we try to reconcile a tenant at all during deletion is part
of a more general design issue (#5080): deletion was implemented as an
operation on an attached tenant, requiring the tenant to be attached in
order to delete it, which is not in principle necessary.

Closes: #7952

## Summary of changes

- In the pageserver deletion API, only do the traditional deletion path
if the tenant is attached. If it's secondary, then tear down the
secondary location, and then do a remote delete. If it's not attached at
all, just do the remote delete.
- In the storage controller, instead of ensuring a tenant is attached
before deletion, do a best-effort detach of the tenant, and then call
into some arbitrary pageserver to issue a deletion of remote content.

The pageserver retains its existing delete behavior when invoked on
attached locations. We can remove this later, when all users of the API
are updated to do a detach-before-delete. That will enable removing the
"weird" code paths during startup that sometimes load a tenant and then
immediately delete it, and removing the deletion markers on tenants. (A
sketch of the new dispatch follows below.)
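
As a rough sketch of that location-aware dispatch, with hypothetical types (the real API operates on the pageserver's location configuration, not a plain enum):

```rust
enum LocationMode {
    Attached,
    Secondary,
    NotPresent,
}

// Choose the deletion path based on what the pageserver holds locally.
fn deletion_steps(mode: &LocationMode) -> Vec<&'static str> {
    match mode {
        // Traditional path, kept for callers that still delete while attached.
        LocationMode::Attached => vec!["run legacy attached deletion flow"],
        // Secondary location: tear down local state, then delete remotely.
        LocationMode::Secondary => vec!["tear down secondary location", "delete remote prefix"],
        // Nothing local: deletion is purely a remote-storage operation.
        LocationMode::NotPresent => vec!["delete remote prefix"],
    }
}
```
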
jcsp closed this as completed on Jun 25, 2024
koivunej commented Aug 5, 2024

A number of TODOs remain in the codebase.
