Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: finer-grained handling of migration states #5397

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Sep 27, 2023

Problem

In #5299 the tenant states for migration are added, but respected only in a coarse-grained way: when hinted not to do deletions, tenants will just avoid doing all GC or compaction. Skipping compaction is not necessary for AttachedMulti, as we will soon become the primary attached location, and it is not a waste of resources to proceed with compaction. Instead, per the RFC (https://github.com/neondatabase/neon/pull/5029/files), deletions are queued up in this state, and executed later when we switch to AttachedSingle.

TODO: tests per #5396

Closes: #5396

Summary of changes

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added t/feature Issue type: feature, for new features or requests c/storage/pageserver Component: storage: pageserver labels Sep 27, 2023
@github-actions
Copy link

github-actions bot commented Sep 27, 2023

3174 tests run: 3030 passed, 6 failed, 138 skipped (full report)


Failures on Postgres 16

  • test_live_reconfig_get_evictions_low_residence_duration_metric_threshold: release, debug

Failures on Postgres 15

  • test_live_reconfig_get_evictions_low_residence_duration_metric_threshold: release, debug

Failures on Postgres 14

  • test_live_reconfig_get_evictions_low_residence_duration_metric_threshold: release, debug
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_live_reconfig_get_evictions_low_residence_duration_metric_threshold[release-pg14] or test_live_reconfig_get_evictions_low_residence_duration_metric_threshold[debug-pg14] or test_live_reconfig_get_evictions_low_residence_duration_metric_threshold[release-pg15] or test_live_reconfig_get_evictions_low_residence_duration_metric_threshold[debug-pg15] or test_live_reconfig_get_evictions_low_residence_duration_metric_threshold[release-pg16] or test_live_reconfig_get_evictions_low_residence_duration_metric_threshold[debug-pg16]"
Flaky tests (2)

Postgres 16

  • test_timeline_deletion_with_files_stuck_in_upload_queue: debug
  • test_subscriber_restart: release

Test coverage report is not available

The comment gets automatically updated with the latest test results
801d15f at 2024-06-03T17:33:23.126Z :recycle:

jcsp added a commit that referenced this pull request Oct 5, 2023
…ocations (#5299)

## Problem

These changes are part of building seamless tenant migration, as
described in the RFC:
- #5029

## Summary of changes

- A new configuration type `LocationConf` supersedes `TenantConfOpt` for
storing a tenant's configuration in the pageserver repo dir. It contains
`TenantConfOpt`, as well as a new `mode` attribute that describes what
kind of location this is (secondary, attached, attachment mode etc). It
is written to a file called `config-v1` instead of `config` -- this
prepares us for neatly making any other profound changes to the format
of the file in future. Forward compat for existing pageserver code is
achieved by writing out both old and new style files. Backward compat is
achieved by checking for the old-style file if the new one isn't found.
- The `TenantMap` type changes, to hold `TenantSlot` instead of just
`Tenant`. The `Tenant` type continues to be used for attached tenants
only. Tenants in other states (such as secondaries) are represented by a
different variant of `TenantSlot`.
- Where `Tenant` & `Timeline` used to hold an Arc<Mutex<TenantConfOpt>>,
they now hold a reference to a AttachedTenantConf, which includes the
extra information from LocationConf. This enables them to know the
current attachment mode.
- The attachment mode is used as an advisory input to decide whether to
do compaction and GC (AttachedStale is meant to avoid doing uploads,
AttachedMulti is meant to avoid doing deletions).
- A new HTTP API is added at `PUT /tenants/<tenant_id>/location_config`
to drive new location configuration. This provides a superset of the
functionality of attach/detach/load/ignore:
  - Attaching a tenant is just configuring it in an attached state
  - Detaching a tenant is configuring it to a detached state
  - Loading a tenant is just the same as attaching it
- Ignoring a tenant is the same as configuring it into Secondary with
warm=false (i.e. retain the files on disk but do nothing else).

Caveats:
- AttachedMulti tenants don't do compaction in this PR, but they do in
the follow on #5397
- Concurrent updates to the `location_config` API are not handled
elegantly in this PR, a better mechanism is added in the follow on
#5367
- Secondary mode is just a placeholder in this PR: the code to upload
heatmaps and do downloads on secondary locations will be added in a
later PR (but that shouldn't change any external interfaces)

Closes: #5379

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
jcsp added a commit that referenced this pull request Oct 10, 2023
## Problem

If a caller detaches a tenant and then attaches it again, pending
deletions from the old attachment might not have happened yet. This is
not a correctness problem, but it causes:
- Risk of leaking some objects in S3
- Some warnings from the deletion queue when pending LSN updates and
pending deletions don't pass validation.

## Summary of changes

- Deletion queue now uses UnboundedChannel so that the push interfaces
don't have to be async.
- This was pulled out of #5397,
where it is also useful to be able to drive the queue from non-async
contexts.
- Why is it okay for this to be unbounded? The only way the
unbounded-ness of the channel can become a problem is if writing out
deletion lists can't keep up, but if the system were that overloaded
then the code generating deletions (GC, compaction) would also be
impacted.
- DeletionQueueClient gets a new `flush_advisory` function, which is
like flush_execute, but doesn't wait for completion: this is appropriate
for use in contexts where we would like to encourage the deletion queue
to flush, but don't need to block on it.
- This function is also expected to be useful in next steps for seamless
migration, where the option to flush to S3 while transitioning into
AttachedStale will also include flushing deletion queue, but we wouldn't
want to block on that flush.
- The tenant_detach code in mgr.rs invokes flush_advisory after stopping
the `Tenant` object.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
@jcsp jcsp force-pushed the jcsp/secondary-locations-pt1.5 branch from 87d7bdc to 0b1a2a7 Compare November 2, 2023 14:01
@jcsp jcsp force-pushed the jcsp/secondary-locations-pt1.5 branch from 0b1a2a7 to ecc0368 Compare December 11, 2023 10:10
@jcsp jcsp force-pushed the jcsp/secondary-locations-pt1.5 branch from ecc0368 to 801d15f Compare June 3, 2024 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/feature Issue type: feature, for new features or requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pageserver: additional live migration regression tests
1 participant