Epic: Pageserver should support 200K tenants #4816

Closed
shanyp opened this issue Jul 26, 2023 · 2 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic

Comments

@shanyp
Contributor

shanyp commented Jul 26, 2023

Motivation

To reduce costs and support higher tenant density per pageserver (PS).

DoD

Pageserver P90 get_page_request latency remains the same with 200K tenants.

Implementation ideas

Detaching inactive tenants
Reducing metrics

Tasks

  • [ ]

Other related tasks and Epics

@shanyp shanyp added c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic labels Jul 26, 2023
@LizardWizzard
Contributor

Let's state our assumption about how many of those 200k tenants are active. I'd say it is ~1-3%. Taking the midpoint of 2% gives us 4k active tenants. Is that accurate?
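
For reference, a quick sketch of the arithmetic behind that estimate. The 1-3% range is the assumption stated above, not a measured value, and the function name is only illustrative.

```python
TOTAL_TENANTS = 200_000

# Assumed share of active tenants, in percent (an estimate, not a measurement).
ACTIVE_PERCENT_LOW = 1
ACTIVE_PERCENT_HIGH = 3


def active_tenants(total: int, percent: int) -> int:
    """Number of tenants expected to be active for a given percentage."""
    return total * percent // 100


low = active_tenants(TOTAL_TENANTS, ACTIVE_PERCENT_LOW)
high = active_tenants(TOTAL_TENANTS, ACTIVE_PERCENT_HIGH)
# Midpoint of the assumed range: (1 + 3) // 2 = 2 percent.
mid = active_tenants(TOTAL_TENANTS, (ACTIVE_PERCENT_LOW + ACTIVE_PERCENT_HIGH) // 2)

print(low, mid, high)  # 2000 4000 6000
```

So the pageserver would need to handle somewhere between 2k and 6k active tenants if the assumption holds.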

Current problems are:

@shanyp shanyp assigned jcsp and unassigned shanyp Aug 14, 2023
@shanyp shanyp changed the title from "Epic: PageServer should support 200K tenants" to "Epic: Pageserver should support 200K tenants" Aug 14, 2023
@jcsp
Contributor

jcsp commented Oct 2, 2023

Our latest thinking on this topic is that it does not make sense to attach idle tenants and then layer a low-energy mechanism on top of that. Instead, low-energy tenants should not be attached at all.

The Secondary mode in #5299 is also suitable for idle tenants: whereas a normal active tenant would have one attached location and one secondary location, a long-idle tenant can just have one secondary location with warm=false. Its layers will gradually fall off disk due to disk pressure. Eventually, it will make sense to not even have secondary locations (these are cheaper than attachments but still have some cost), especially for small tenants whose layers can rapidly be downloaded from S3 on-demand.

The responsibility for orchestrating this gets pushed up to the control plane, or to some new intermediate service that manages pageservers.
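
To make the placement idea concrete, here is a minimal sketch of the decision an orchestrating service might make per tenant. Everything in it is hypothetical: `Mode`, `Location`, `plan_locations`, and the `small_and_long_idle` flag are illustrative names, not the pageserver or control plane API. It also assumes that an active tenant's secondary location is warm, which is implied above but not stated.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    ATTACHED = "attached"
    SECONDARY = "secondary"


@dataclass(frozen=True)
class Location:
    mode: Mode
    # Only meaningful for secondary locations; None for attached ones.
    warm: Optional[bool] = None


def plan_locations(is_idle: bool, small_and_long_idle: bool = False) -> List[Location]:
    """Hypothetical policy: which locations should a tenant have?"""
    if not is_idle:
        # Normal active tenant: one attached location plus one secondary
        # (assumed warm here).
        return [Location(Mode.ATTACHED), Location(Mode.SECONDARY, warm=True)]
    if small_and_long_idle:
        # Possible future state: no location at all, layers are fetched
        # from S3 on demand when the tenant wakes up.
        return []
    # Long-idle tenant: a single cold secondary location whose layers
    # may fall off disk under disk pressure.
    return [Location(Mode.SECONDARY, warm=False)]
```

For example, `plan_locations(is_idle=True)` yields a single secondary location with `warm=False`, matching the long-idle case described above.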

To make this complete, it is also necessary to handle:

  • Consumption metrics: either the billing infrastructure needs a "last received value still holds" behavior, or something will have to send redundant repeats of consumption metrics for these idle tenants.
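
As an illustration of the first option, a "last received value still holds" lookup could look roughly like the sketch below. The sample layout (timestamp, value) and the name `value_at` are assumptions made for the example, not the billing system's actual interface.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def value_at(samples: List[Tuple[int, int]], ts: int) -> Optional[int]:
    """Return the most recent sample value at or before `ts`.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None if no sample has been received by `ts`.
    """
    timestamps = [t for t, _ in samples]
    # Index of the first sample strictly after `ts`.
    i = bisect_right(timestamps, ts)
    if i == 0:
        return None
    return samples[i - 1][1]


# An idle tenant reported at t=100 and t=200, then went quiet.
samples = [(100, 5), (200, 7)]
print(value_at(samples, 150))   # 5
print(value_at(samples, 1000))  # 7, the last value still holds
```

With this behavior on the billing side, an unattached tenant would not need anything to resend its metrics.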

Nice to have:

  • If a tenant goes idle while it has consumption/gc work pending, then wake it up at some point to do that work. In practice, it might be sufficient to wake idle tenants after 1 week before putting them to sleep indefinitely, since we use a default 1-week PITR target.
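
A rough sketch of that wake-up rule, assuming the orchestrator tracks when a tenant went idle and whether it has already been woken since. The names `should_wake` and `already_woken` are hypothetical, and the 1-week interval is the default PITR target mentioned above.

```python
from datetime import datetime, timedelta

# Default PITR target from the discussion above.
PITR_INTERVAL = timedelta(weeks=1)


def should_wake(idle_since: datetime, already_woken: bool, now: datetime) -> bool:
    """Wake an idle tenant once, one PITR interval after it went idle."""
    if already_woken:
        # It already had its wake-up; let it sleep indefinitely.
        return False
    return now - idle_since >= PITR_INTERVAL


idle_since = datetime(2023, 10, 2)
print(should_wake(idle_since, False, datetime(2023, 10, 5)))  # False
print(should_wake(idle_since, False, datetime(2023, 10, 9)))  # True
print(should_wake(idle_since, True, datetime(2023, 10, 20)))  # False
```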

@jcsp jcsp closed this as completed Mar 11, 2024