Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: scalable async disk IO (tokio-epoll-uring) #4744

Closed
22 tasks done
problame opened this issue Jul 18, 2023 · 10 comments
Closed
22 tasks done

Epic: scalable async disk IO (tokio-epoll-uring) #4744

problame opened this issue Jul 18, 2023 · 10 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic

Comments

@problame
Copy link
Contributor

problame commented Jul 18, 2023

The async get_value_reconstruct_data project (GH project, final commit with good summary) converted more of the pageserver code base to async fns.

We had to revert that final commit due to performance considerations.
The current hypothesis is that most of the spawn_blocking'ed calls were completely CPU-bound and too short-lived (single-digit microsecond range).

Why were they short-lived? The current hypothesis is that

  1. pageserver-internal-page-cache hit rate is not as terrible as we thought and
  2. the kernel page cache held most of the remaining data

Under that hypothesis, spawn_blocking has too much overhead (CPU time => latency) for work that takes single-digit nanoseconds. In fact, microbenchmarks suggest that the breakeven point is at ca 25us of work on our systems.

(More details: https://www.notion.so/neondatabase/Why-we-needed-to-revert-my-async-get_value_reconstruct_data-patch-or-what-I-learned-about-spawn_b-91f28c48b7314765bdeed6e8cb38fdce?pvs=4 )

We considered switching to a design with

  • large pageserver-internal page cache
  • O_DIRECT for reads, using spawn_blocking
  • hence, no kernel-page-cache hits

The problem is that we'd

  • lose pageserver-internal page cache contents on each pageserver restart
  • likely need to invest more time into the pageserver page cache (scalability issues).

So, this epic sets out to explore how we can continue to mostly rely on the kernel-page-cache for our read IO on the Timeline::get code path, in a way that is more scalable than spawn_blocking.


This epic is the sister-epic to

High-Level

  1. 18 of 21
    problame
  2. problame
  3. 4 of 4
  4. 2 of 2

Impl

  1. run-extra-build-macos
  2. run-benchmarks
  3. 1 of 1
    c/storage/pageserver
    koivunej problame
  4. c/storage/pageserver t/bug
    problame

Tasks

  1. a/test

Follow-Ups not part of this epic

@problame problame self-assigned this Jul 18, 2023
@problame problame added the t/Epic Issue type: Epic label Jul 18, 2023
@problame problame added the c/storage/pageserver Component: storage: pageserver label Jul 18, 2023
@problame
Copy link
Contributor Author

Development happens in https://github.com/neondatabase/tokio-epoll-uring

problame added a commit that referenced this issue Dec 4, 2023
…6000)

Problem
-------

Before this PR, there was no concurrency limit on initial logical size
computations.

While logical size computations are lazy in theory, in practice
(production), they happen in a short timeframe after restart.

This means that on a PS with 20k tenants, we'd have up to 20k concurrent
initial logical size calculation requests.

This is self-inflicted needless overload.

This hasn't been a problem so far because the `.await` points on the
logical size calculation path never return `Pending`, hence we have a
natural concurrency limit of the number of executor threads.
But, as soon as we return `Pending` somewhere in the logical size
calculation path, other concurrent tasks get scheduled by tokio.
If these other tasks are also logical size calculations, they eventually
pound on the same bottleneck.

For example, in #5479, we want to switch the VirtualFile descriptor
cache to a `tokio::sync::RwLock`, which makes us return `Pending`, and
without measures like this patch, after PS restart, VirtualFile
descriptor cache thrashes heavily for 2 hours until all the logical size
calculations have been computed and the degree of concurrency /
concurrent VirtualFile operations is down to regular levels.
See the *Experiment* section below for details.

<!-- Experiments (see below) show that plain #5479 causes heavy
thrashing of the VirtualFile descriptor cache.
The high degree of concurrency is too much for 
In the case of #5479 the VirtualFile descriptor cache size starts
thrashing heavily.


-->

Background
----------

Before this PR, initial logical size calculation was spawned lazily on
first call to `Timeline::get_current_logical_size()`.

In practice (prod), the lazy calculation is triggered by
`WalReceiverConnectionHandler` if the timeline is active according to
storage broker, or by the first iteration of consumption metrics worker
after restart (`MetricsCollection`).

The spawns by walreceiver are high-priority because logical size is
needed by Safekeepers (via walreceiver `PageserverFeedback`) to enforce
the project logical size limit.
The spawns by metrics collection are not on the user-critical path and
hence low-priority. [^consumption_metrics_slo]

[^consumption_metrics_slo]: We can't delay metrics collection
indefintely because there are TBD internal SLOs tied to metrics
collection happening in a timeline manner
(neondatabase/cloud#7408). But let's ignore
that in this issue.

The ratio of walreceiver-initiated spawns vs
consumption-metrics-initiated spawns can be reconstructed from logs
(`spawning logical size computation from context of task kind {:?}"`).
PR #5995 and #6018 adds metrics for this.

First investigation of the ratio lead to the discovery that walreceiver
spawns 75% of init logical size computations.
That's because of two bugs:
- In Safekeepers: #5993
- In interaction between Pageservers and Safekeepers:
#5962

The safekeeper bug is likely primarily responsible but we don't have the
data yet. The metrics will hopefully provide some insights.

When assessing production-readiness of this PR, please assume that
neither of these bugs are fixed yet.


Changes In This PR
------------------

With this PR, initial logical size calculation is reworked as follows:

First, all initial logical size calculation task_mgr tasks are started
early, as part of timeline activation, and run a retry loop with long
back-off until success. This removes the lazy computation; it was
needless complexity because in practice, we compute all logical sizes
anyways, because consumption metrics collects it.

Second, within the initial logical size calculation task, each attempt
queues behind the background loop concurrency limiter semaphore. This
fixes the performance issue that we pointed out in the "Problem" section
earlier.

Third, there is a twist to queuing behind the background loop
concurrency limiter semaphore. Logical size is needed by Safekeepers
(via walreceiver `PageserverFeedback`) to enforce the project logical
size limit. However, we currently do open walreceiver connections even
before we have an exact logical size. That's bad, and I'll build on top
of this PR to fix that
(#5963). But, for the
purposes of this PR, we don't want to introduce a regression, i.e., we
don't want to provide an exact value later than before this PR. The
solution is to introduce a priority-boosting mechanism
(`GetLogicalSizePriority`), allowing callers of
`Timeline::get_current_logical_size` to specify how urgently they need
an exact value. The effect of specifying high urgency is that the
initial logical size calculation task for the timeline will skip the
concurrency limiting semaphore. This should yield effectively the same
behavior as we had before this PR with lazy spawning.

Last, the priority-boosting mechanism obsoletes the `init_order`'s grace
period for initial logical size calculations. It's a separate commit to
reduce the churn during review. We can drop that commit if people think
it's too much churn, and commit it later once we know this PR here
worked as intended.

Experiment With #5479 
---------------------

I validated this PR combined with #5479 to assess whether we're making
forward progress towards asyncification.

The setup is an `i3en.3xlarge` instance with 20k tenants, each with one
timeline that has 9 layers.
All tenants are inactive, i.e., not known to SKs nor storage broker.
This means all initial logical size calculations are spawned by
consumption metrics `MetricsCollection` task kind.
The consumption metrics worker starts requesting logical sizes at low
priority immediately after restart. This is achieved by deleting the
consumption metrics cache file on disk before starting
PS.[^consumption_metrics_cache_file]

[^consumption_metrics_cache_file] Consumption metrics worker persists
its interval across restarts to achieve persistent reporting intervals
across PS restarts; delete the state file on disk to get predictable
(and I believe worst-case in terms of concurrency during PS restart)
behavior.

Before this patch, all of these timelines would all do their initial
logical size calculation in parallel, leading to extreme thrashing in
page cache and virtual file cache.

With this patch, the virtual file cache thrashing is reduced
significantly (from 80k `open`-system-calls/second to ~500
`open`-system-calls/second during loading).


### Critique

The obvious critique with above experiment is that there's no skipping
of the semaphore, i.e., the priority-boosting aspect of this PR is not
exercised.

If even just 1% of our 20k tenants in the setup were active in
SK/storage_broker, then 200 logical size calculations would skip the
limiting semaphore immediately after restart and run concurrently.

Further critique: given the two bugs wrt timeline inactive vs active
state that were mentioned in the Background section, we could have 75%
of our 20k tenants being (falsely) active on restart.

So... (next section)

This Doesn't Make Us Ready For Async VirtualFile
------------------------------------------------

This PR is a step towards asynchronous `VirtualFile`, aka, #5479 or even
#4744.

But it doesn't yet enable us to ship #5479.

The reason is that this PR doesn't limit the amount of high-priority
logical size computations.
If there are many high-priority logical size calculations requested,
we'll fall over like we did if #5479 is applied without this PR.
And currently, at very least due to the bugs mentioned in the Background
section, we run thousands of high-priority logical size calculations on
PS startup in prod.

So, at a minimum, we need to fix these bugs.

Then we can ship #5479 and #4744, and things will likely be fine under
normal operation.

But in high-traffic situations, overload problems will still be more
likely to happen, e.g., VirtualFile cache descriptor thrashing.
The solution candidates for that are orthogonal to this PR though:
* global concurrency limiting
* per-tenant rate limiting => #5899
* load shedding
* scaling bottleneck resources (fd cache size (neondatabase/cloud#8351),
page cache size(neondatabase/cloud#8351), spread load across more PSes,
etc)

Conclusion
----------

Even with the remarks from in the previous section, we should merge this
PR because:
1. it's an improvement over the status quo (esp. if the aforementioned
bugs wrt timeline active / inactive are fixed)
2. it prepares the way for
#6010
3. it gets us close to shipping #5479 and #4744
problame added a commit that referenced this issue Jan 11, 2024
)

This reverts commit ab1f37e.
Thereby
fixes #5479

Updated Analysis
================

The problem with the original patch was that it, for the first time,
exposed the `VirtualFile` code to tokio task concurrency instead of just
thread-based concurrency. That caused the VirtualFile file descriptor
cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in
the pageserver.
The symptom that prompted the revert (now being reverted in this PR) is
that our production systems fell into a valley of zero goodput, high
CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent
initial logical size calculations.
Initial size calculation was initiated for all timelines within the
first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the `VirtualFile` code never returned
`Poll::Pending`.
That meant that once we entered it, the calling tokio task would not
yield to the tokio executor until we were done performing the
VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's
synchronization primitives to those from `tokio::sync`.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we have tokio executor threads.
So, these primitives never actually needed to return `Poll::Pending`.
But, the tokio scheduler makes tokio sync primitives return `Pending`
*artificially*, as a mechanism for the scheduler to get back into
control more often
([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).

So, the new reality was that VirtualFile calls could now yield to the
tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing,
most notably the VirtualFile file descriptor cache: each time a task
would be rescheduled, it would want to do its next VirtualFile
operation. For that, it would first need to evict another (task's)
VirtualFile fd from the cache to make room for its own fd. It would then
do one VirtualFile operation before hitting an await point and yielding
to the executor again. The executor would run the other 19999 tasks for
fairness before circling back to the first task, which would find its fd
evicted.

The other cache that would theoretically be impacted in a similar way is
the pageserver's `PageCache`.
However, for initial logical size calculation, it seems much less
relevant in experiments, likely because of the random access nature of
initial logical size calculation.

Fixes
=====

We fixed the above problems by
- raising VirtualFile cache sizes
  - neondatabase/cloud#8351
- changing code to ensure forward-progress once cache slots have been
acquired
  - #5480
  - #5482
  - tbd: #6065
- reducing the amount of runnable tokio tasks
  - #5578
  - #6000
- fix bugs that caused unnecessary concurrency induced by connection
handlers
  - #5993

I manually verified that this PR doesn't negatively affect startup
performance as follows:
create a pageserver in production configuration, with 20k
tenants/timelines, 9 tiny L0 layer files each; Start it, and observe

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s
getpage benchmark at as-fast-as-possible request rate against 5k of the
20k tenants, the achieved throughput is identical. The VirtualFile cache
isn't thrashing in that case.

Future Work
===========

We will still exposed to the cache thrashing risk from outside factors,
e.g., request concurrency is unbounded, and initial size calculation
skips the concurrency limiter when we establish a walreceiver
connection.

Once we start thrashing, we will degrade non-gracefully, i.e., encounter
a valley as was seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:
1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by
manually shedding load through tenant relocation

Potential systematic solutions are future work:
* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous
disk IO ([Epic](#4744)).
@jcsp jcsp changed the title Epic: develop solution for scalable async maybe-kernel-buffered IO Epic: develop solution for scalable async maybe-kernel-buffered IO (tokio-epoll-uring) Jan 29, 2024
@jcsp
Copy link
Collaborator

jcsp commented Jan 29, 2024

  • Last week: enabled in CI. No major issues. 1-2 flaky tests being investigated to confirm they're pre-existing issues.
  • This week: benchmarking
  • This week: enable in staging.

@problame
Copy link
Contributor Author

Deployed to staging: done: https://github.com/neondatabase/aws/pull/932
Benchmarking: done: #6509 (comment)

Will PR the changes from my benchmarking baseline case over the next couple of days.

@problame problame changed the title Epic: develop solution for scalable async maybe-kernel-buffered IO (tokio-epoll-uring) Epic: scalable async disk IO (tokio-epoll-uring) Jan 30, 2024
@problame
Copy link
Contributor Author

problame commented Feb 5, 2024

Moved from #2975 (comment) by @jcsp


This week:

  • Async write operations
  • Avoid using spawn_blocking(block_on) because of the overhead of churning io_uring runtimes (while keeping it for prod until we cut over io_uring)
  • Cut over at least one prod region/pageserver to io_uring

@jcsp
Copy link
Collaborator

jcsp commented Feb 12, 2024

Monitoring staging with new metrics through this week, to get more insight on whether our limits on locked memory are going to be a problem in prod. Maybe prod next week.

@jcsp
Copy link
Collaborator

jcsp commented Feb 19, 2024

This week:

  1. Merging change to get rid of spurious metadata reads (unblock change to remove spurious writes). Once we're sure we aren't reverting release, we can proceed to 2.
  2. Merge changes that do virtualfile spawn_blockings, thereby remove the spawning of lots of io_urings. Observe this change in staging this week.
  3. Enable on one pageserver in ap-southeast-1 (not blocked by 1,2)

@jcsp
Copy link
Collaborator

jcsp commented Feb 26, 2024

This week:

  • Do a mid-week deploy to enable tokio-epoll-uring on remaining prod fleet.
  • PR to change linux default to tokio-epoll-uring + conditionally use it in dev if kernel is new enough

@problame
Copy link
Contributor Author

problame commented Mar 4, 2024

Do a mid-week deploy to enable tokio-epoll-uring on remaining prod fleet.

Couldn't happen last week, will happen mid this week.

PR to change linux default to tokio-epoll-uring + conditionally use it in dev if kernel is new enough

Do it after we've switched prod over.


write path

  • get layer file creation patch stack merged (work happened last week)

on-demand downloads

  • get PR in reviewable shape
  • measure / prevent regressions using a new pagebench benchmark to churn on-demand downloads
  • maybe experiment with different buffer sizes

@jcsp
Copy link
Collaborator

jcsp commented Mar 11, 2024

This week:

  • on-demand download should use epoll-uring

@problame
Copy link
Contributor Author

problame commented Mar 18, 2024

on-demand download should use epoll-uring

Implemented & benchmarked last week, shipping this week.

This week

  • observe perf impact & resource usage in prod
    • ? automate benchmark

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic
Projects
None yet
Development

No branches or pull requests

2 participants