per-TenantShard read throttling #6706

Merged: 28 commits into main from problame/get-page-throttling/wip, Feb 16, 2024

Conversation

@problame (Contributor) commented Feb 9, 2024

Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1707499237721679

Problem

part of epic #5899
Specifically, RFC #5648

tl;dr: we currently serve GetPage@LSN requests as fast as possible.
This sets the wrong incentives and poses operational and economic problems.

Summary of changes

Mechanism:

  • Add a throttling mechanism based on the leaky-bucket algorithm
  • Make the throttle reconfigurable via tenant conf (=> durable & runtime-changeable)
  • The throttle is implemented at the top of Timeline::get
  • The throttle applies to a configurable set of TaskKinds:
    • To limit only Timeline::get requests from clients, specify only PageRequestHandler
    • This will include Timeline::get requests issued by basebackup, so it may slow down basebackup as well

The rationale for placing the throttle at the top of Timeline::get, and for the TaskKind selector, is that it enables experimentation. For example, we could also use it to rate-limit compaction.
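
To make the mechanism concrete, here is a minimal sketch of a leaky-bucket throttle gated on TaskKind, roughly approximating the check at the top of Timeline::get. The actual implementation forwards to the leaky_bucket crate; the names and structure below are illustrative, not the real pageserver code.

// Illustrative sketch only: a leaky-bucket throttle keyed on TaskKind.
use std::collections::HashSet;
use std::sync::Mutex;
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum TaskKind {
    PageRequestHandler,
    Compaction,
}

struct Throttle {
    task_kinds: HashSet<TaskKind>, // which TaskKinds are subject to the throttle
    refill_interval: Duration,
    refill_amount: u64,
    max: u64, // bucket capacity, i.e., maximum burst
    state: Mutex<BucketState>,
}

struct BucketState {
    tokens: u64,
    last_refill: Instant,
}

impl Throttle {
    /// Returns how long the caller should wait before proceeding
    /// (Duration::ZERO if the call is not throttled).
    fn acquire(&self, task_kind: TaskKind) -> Duration {
        if !self.task_kinds.contains(&task_kind) {
            return Duration::ZERO; // this TaskKind is exempt
        }
        let mut st = self.state.lock().unwrap();
        // Refill the bucket once per full refill_interval that has elapsed.
        let intervals =
            (st.last_refill.elapsed().as_nanos() / self.refill_interval.as_nanos()) as u32;
        if intervals > 0 {
            st.tokens = (st.tokens + intervals as u64 * self.refill_amount).min(self.max);
            st.last_refill += self.refill_interval * intervals;
        }
        if st.tokens > 0 {
            st.tokens -= 1;
            Duration::ZERO
        } else {
            // Out of tokens: wait until the next refill.
            self.refill_interval.saturating_sub(st.last_refill.elapsed())
        }
    }
}

In this sketch, Timeline::get would call acquire() with the current task kind and sleep for the returned duration before doing any work.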

Observability:

  • Global metrics that count accounted & throttled Timeline::get calls, as well as the sum of wait time.
  • Per-tenant reporting through the logs if a tenant was throttled within the last compaction_period seconds. A better-defined reporting period would be preferable, but that would require a separate background task. If we add one in the future, we should move the walredo quiescing there, too.
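
For illustration, the global counters could look roughly like the following. This is a minimal sketch; the metric names, fields, and plumbing are assumptions, not the actual pageserver metrics.

// Illustrative sketch only: global counters for throttle observability.
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

#[derive(Default)]
struct ThrottleMetrics {
    count_accounted: AtomicU64, // Timeline::get calls subject to the throttle
    count_throttled: AtomicU64, // calls that actually had to wait
    wait_usecs_sum: AtomicU64,  // total wait time across all throttled calls
}

impl ThrottleMetrics {
    fn observe(&self, wait: Duration) {
        self.count_accounted.fetch_add(1, Ordering::Relaxed);
        if !wait.is_zero() {
            self.count_throttled.fetch_add(1, Ordering::Relaxed);
            self.wait_usecs_sum
                .fetch_add(wait.as_micros() as u64, Ordering::Relaxed);
        }
    }
}

The throttle would call observe(wait) once per accounted Timeline::get call, passing the time the caller had to wait (zero if it was not throttled).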

Example

Live reconfigure a tenant like so:

curl -X PUT --data @override.json localhost:9898/v1/tenant/config
# override.json
{
    "tenant_id": "f76997305d46b20964d2a5636aa302c3",
    "timeline_get_throttle": {
        "fair": true,
        "initial": 0,
        "refill_interval": "1s",
        "refill_amount": 1000,
        "max": 1000,
        "task_kinds": [ "PageRequestHandler" ]
    }
}
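
Assuming standard leaky-bucket semantics (each accounted Timeline::get call consumes one token, and refill_amount tokens are added every refill_interval up to max), the config above allows a sustained rate of roughly 1000 Timeline::get calls per second, with max bounding the burst size and initial setting the bucket's starting fill.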

Testing & Performance

I made changes to pagebench so it issues requests at a defined rate.
If a request takes too long, the client waits until the point in time at which the next request would have been issued at the regular frequency, and only then sends the next request.
The number of requests that were skipped this way is printed as MISSED in the live stats.
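
A rough sketch of that client-side pacing follows. This is not pagebench's actual code; the function and variable names are illustrative.

// Illustrative sketch only: issue requests at a fixed per-client rate;
// slots that could not be used because a request overran them are counted
// as MISSED rather than sent late in a burst.
use std::time::{Duration, Instant};

fn paced_requests(rate_per_sec: u32, total: u64, mut send_request: impl FnMut()) {
    let tick = Duration::from_secs(1) / rate_per_sec;
    let start = Instant::now();
    let mut slot: u64 = 0; // index of the next slot we intend to use
    let mut sent: u64 = 0;
    let mut missed: u64 = 0;
    while sent < total {
        let next_at = start + tick * slot as u32;
        let now = Instant::now();
        if now < next_at {
            // Early: wait for the slot to arrive.
            std::thread::sleep(next_at - now);
        } else {
            // Late: skip the slots that have already passed and count them as missed.
            let late = ((now - next_at).as_nanos() / tick.as_nanos()) as u64;
            missed += late;
            slot += late;
        }
        send_request();
        sent += 1;
        slot += 1;
    }
    println!("sent: {sent}  MISSED: {missed}");
}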

I then used pagebench to benchmark the throttling code as follows:

  • empty neon_local release build, on i3en.3xlarge instance store
  • create 50 empty tenants
  • configure page cache size & fd limits so everything stays within PS process
  • configure a rate limit of 5000 IOPS:
page_cache_size = 1638400
max_file_descriptors = 500000

[tenant_config]

[tenant_config.timeline_get_throttle]
fair = true
initial = 0
refill_interval = '100ms'
refill_amount = 500
max = 5000
task_kinds = [ "PageRequestHandler" ]

Measure the CPU contribution of throttling using a process-scoped flame graph, and find its share by searching the flame graph for throttle:

sudo perf record -p "$(pgrep pageserver)" -F666  --call-graph=dwarf,65528 -a sleep 4  && sudo perf script | ./stackcollapse-perf.pl| ./flamegraph.pl > flamegraph.svg

Workload 0: 500 connections against one tenant, throttling disabled. 0.3% CPU time

./target/release/pagebench get-page-latest-lsn  --limit-to-first-n-targets 1 --per-client-rate 100 --num-clients 500
2024-02-16T15:19:24.703789Z  INFO RPS: 49944   MISSED: 0
Flame graph (2024-02-16, no throttle configured): https://github.com/neondatabase/neon/assets/956573/d7fa463d-f04d-48ba-82aa-619f0fa6c140

Workload 1: 500 connections against one tenant, target rate 10x above the throttle: 1.3% CPU time
TODO: debate this workload given the skipping behavior of pagebench. Should the client instead go as fast as possible to determine CPU overhead?

./target/release/pagebench get-page-latest-lsn  --limit-to-first-n-targets 1 --per-client-rate 100 --num-clients 500
2024-02-14T13:30:12.505501Z  INFO RPS: 4995   MISSED: 45453
Flame graph (10x): https://github.com/neondatabase/neon/assets/956573/695148fc-7cbd-489d-9a30-0c5c053baa23

Workload 2: single client per tenant, request rate below the limit, run in parallel against each tenant to increase sampling coverage (all well below 100% usage, not CPU bound). 1.1% CPU time.

./target/release/pagebench get-page-latest-lsn  --limit-to-first-n-targets 50 --per-client-rate 400  --num-clients 1
2024-02-16T15:23:23.263295Z  INFO RPS: 19516   MISSED: 503
Flame graph (en par): https://github.com/neondatabase/neon/assets/956573/b42f22f0-34f6-4783-a3a4-8c39140cb03c

TODOs

  • figure out what the page_service.rs metrics should mean on a rate limited tenant
    • currently, they will observe the rate-limited performance; that means our 4 Golden Signals dashboard & SLO will be affected by the rate limit
    • ideas:
      • A: duplicate these metrics, one rate-limited version, one not-rate-limited version
      • B: move the rate limit upwards into page_service.rs
        • if we don't want the experimentation capability that we gain by placing it at the top of Timeline::get

Future Work

  • Regression test using pagebench: assert that the rate limit is enforced by driving at the limit rate with one client, then verifying that adding clients increases latency roughly linearly.

Non-Goals

As pointed out in the RFC, this is a mechanism to prevent noisy-neighbor problems in pageserver. It's not a product feature / something we'll sell. The mechanism will be used to put an artificial ceiling on how many operations a single TenantShard can serve.

@problame problame changed the title getpage@lsn throttling per-TenantShard read throttling Feb 9, 2024
@problame problame requested a review from jcsp February 9, 2024 17:20

github-actions bot commented Feb 9, 2024

2442 tests run: 2325 passed, 0 failed, 117 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_create_snapshot: debug
  • test_empty_branch_remote_storage_upload: debug

Postgres 14

Code coverage (full report)

  • functions: 55.8% (12921 of 23162 functions)
  • lines: 82.5% (69917 of 84722 lines)

The comment gets automatically updated with the latest test results
23baada at 2024-02-16T19:29:12.761Z

empty neon_local release build, on i3en.3xlarge instance store

create 50 empty tenants

configure page cache size & fd limits so everything stays within the PS process,
and configure a rate limit of 5000 IOPS:

page_cache_size = 1638400
max_file_descriptors = 500000

[tenant_config.timeline_get_rate_limit]
fair = true
initial = 0
interval_millis = 100
interval_refill = 500
max = 5000
task_kinds = [ "PageRequestHandler" ]

Overhead with throttle kicking in:
---------------------------------

./target/release/pagebench get-page-latest-lsn  --limit-to-first-n-targets 1 --per-client-rate 100 --num-clients 500

shows, as expected, that ~90% of requests get dropped because they exceed the throttle:

2024-02-14T13:30:12.505501Z  INFO RPS: 4995   MISSED: 45453

CPU overhead:

sudo perf record -p "$(pgrep pageserver)" -F666  --call-graph=dwarf,65528 -a sleep 4  && sudo perf script | ./stackcollapse-perf.pl| ./flamegraph.pl > flamegraph.svg

=> search result for "throttle"; 1.6% of CPU time.

Overhead while below throttle:
-----------------------------

./target/release/pagebench get-page-latest-lsn  --limit-to-first-n-targets 50 --per-client-rate 400  --num-clients 1

2024-02-14T13:41:34.811237Z  INFO RPS: 19588   MISSED: 434

CPU overhead:

sudo perf record -p "$(pgrep pageserver)" -F666  --call-graph=dwarf,65528 -a sleep 4  && sudo perf script | ./stackcollapse-perf.pl| ./flamegraph.pl > flamegraph.svg

=> search result for "throttle"; 0.9% of CPU time
(The comment above by @problame was marked as outdated.)

@problame problame marked this pull request as ready for review February 14, 2024 17:56
@problame problame requested review from a team as code owners February 14, 2024 17:56
@problame problame requested review from koivunej and removed request for knizhnik February 14, 2024 18:04
libs/utils/src/lib.rs: review comment (outdated, resolved)
pageserver/src/tenant.rs: review comment (outdated, resolved)
@koivunej (Member) left a comment

Looks like it should work. The vast set of configuration options feels a bit like extra, but we can hopefully throttle those down in the future. As far as I understood, the perf is looking good.

A few nits for your consideration; maybe it's better to document on the Throttle that it will be shared between timelines, instead of documenting that everywhere else. At least that would have clarified for me why the arc-swap is used.

The Throttle has a nice testable design; sadly, I don't think it should get any tests, because it just forwards to leaky_bucket.

@problame problame removed the request for review from jcsp February 16, 2024 13:59
@problame problame merged commit ca07fa5 into main Feb 16, 2024
49 checks passed
@problame problame deleted the problame/get-page-throttling/wip branch February 16, 2024 20:27
problame added a commit that referenced this pull request Mar 5, 2024
… metrics + regression test (#6953)

part of #5899

Problem
-------

Before this PR, the time spent waiting on the throttle was charged
towards the higher-level page_service metrics, i.e.,
`pageserver_smgr_query_seconds`.
The metrics are the foundation of internal SLIs / SLOs.
A throttled tenant would cause the SLI to degrade / SLO alerts to fire.

Changes
-------


- don't charge time spent in the throttle towards the page_service metrics
- record time spent in the throttle in RequestContext and subtract it from
  the elapsed time (see the sketch after this list)
  - this works because the page_service path doesn't create child contexts,
    so all the throttle time is recorded in the parent
  - it's quite brittle and will break if we ever decide to spawn child
    tasks that need child RequestContexts, which would have separate
    instances of the `micros_spent_throttled` counter
  - however, let's punt that to a more general refactoring of
    RequestContext
- add a test case that ensures that
  - throttling happens for getpage requests; this aspect of the test
    passed before this PR
  - throttling delays aren't charged towards the page_service metrics;
    this aspect of the test only passes with this PR
- drive-by: make the throttle log message `info!`, as it's an expected
  condition
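
A minimal sketch of that throttle-time accounting (the field and method names below are illustrative assumptions, not the actual pageserver code):

// Illustrative sketch only: accumulate throttle wait time on the request
// context and subtract it from the elapsed time before observing the
// page_service latency metric.
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

#[derive(Default)]
struct RequestContext {
    micros_spent_throttled: AtomicU64,
}

impl RequestContext {
    fn record_throttled(&self, wait: Duration) {
        self.micros_spent_throttled
            .fetch_add(wait.as_micros() as u64, Ordering::Relaxed);
    }

    /// Wall-clock elapsed time minus the time this request spent waiting in the throttle.
    fn charged_duration(&self, elapsed: Duration) -> Duration {
        let throttled =
            Duration::from_micros(self.micros_spent_throttled.load(Ordering::Relaxed));
        elapsed.saturating_sub(throttled)
    }
}

page_service would then observe charged_duration(elapsed) in `pageserver_smgr_query_seconds` instead of the raw elapsed time.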

Performance
-----------

I took the same measurements as in #6706 , no meaningful change in CPU
overhead.

Future Work
-----------

This PR enables us to experiment with the throttle for select tenants
without affecting the SLI metrics / triggering SLO alerts.

Before declaring this feature done, we need more work to happen,
specifically:

- decide on whether we want to retain the flexibility of throttling any
`Timeline::get` call, filtered by TaskKind
- versus: separate throttles for each page_service endpoint, potentially
with separate config options
- the trouble here is that this decision implies changes to the
  TenantConfig, so if we start using the current config style now, then
  decide to switch to a different config, it'll be a breaking change

Nice-to-haves but probably not worth the time right now:

- Equivalent tests to ensure the throttle applies to all other
page_service handlers.