tokio-epoll-uring: retry on launch failures due to locked memory #7141

Merged: 5 commits merged from problame/tokio-epoll-uring-retry-launch into main on Mar 15, 2024

Conversation

@problame (Contributor) commented Mar 15, 2024

refs #7136

Problem
-------

Before this PR, we were using `tokio_epoll_uring::thread_local_system()`,
which panics on `tokio_epoll_uring::System::launch()` failure.

As we've learned in the past (#6373),
some older Linux kernels account io_uring instances as locked memory.

And while we've raised the limit in prod considerably, we did hit it
once on 2024-03-11 16:30 UTC.
That was after we enabled tokio-epoll-uring fleet-wide, but before
we had shipped release-5090 (c6ed86d)
which did away with the last mass-creation of tokio-epoll-uring
instances as per

    commit 3da410c8fee05b0cd65a5c0b83fffa3d5680cd77
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Tue Mar 5 10:03:54 2024 +0100

        tokio-epoll-uring: use it on the layer-creating code paths (#6378)

Nonetheless, it highlighted that panicking in this situation is probably
not ideal, as it can leave the pageserver process in a semi-broken state.

Further, due to the low sampling rate of Prometheus metrics, we don't know
much about the circumstances of this failure instance.

Solution
--------

This PR implements a custom thread_local_system() that is pageserver-aware
and will do the following on failure:

  • dump relevant stats to tracing!; hopefully they will be useful to
    understand the circumstances better
  • add metric counters for launch failures so we can create an alert
  • if it's ENOMEM, retry with exponential back-off, capped at 3s
  • otherwise, assume it's a permanent failure and abort() the process

This makes sense in the production environment, where we know that
_usually_ there's ample locked memory allowance available, and we know
that such failures are rare.
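
A minimal sketch of the retry policy described above, in plain std Rust. The names `try_launch_system` and `launch_with_retries` are hypothetical stand-ins for the pageserver's actual launch path, and the logging/metrics hooks are simplified to `eprintln!`; this is not the real implementation:

```rust
use std::io;
use std::process::abort;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for `tokio_epoll_uring::System::launch()`:
/// creating an io_uring-backed system can fail with ENOMEM on kernels
/// that account io_uring instances as locked memory.
fn try_launch_system() -> io::Result<()> {
    Ok(())
}

fn launch_with_retries() {
    let mut backoff = Duration::from_millis(100);
    let max_backoff = Duration::from_secs(3); // cap from the PR description

    loop {
        match try_launch_system() {
            Ok(()) => return,
            // ENOMEM: the locked-memory allowance may be temporarily
            // exhausted. Log the failure (the real code dumps stats via
            // `tracing!` and bumps a metric counter), then retry with
            // exponential back-off.
            Err(e) if e.kind() == io::ErrorKind::OutOfMemory => {
                eprintln!("launch failed with ENOMEM, retrying in {backoff:?}: {e}");
                sleep(backoff);
                backoff = (backoff * 2).min(max_backoff);
            }
            // Anything else is assumed to be a permanent failure: abort()
            // rather than leave the process in a semi-broken state.
            Err(e) => {
                eprintln!("permanent launch failure, aborting: {e}");
                abort();
            }
        }
    }
}

fn main() {
    launch_with_retries();
    println!("tokio-epoll-uring system launched");
}
```

Under the production assumption stated above, the ENOMEM branch is the one expected to clear after a short wait, since locked-memory allowance is usually ample; everything else is treated as unrecoverable.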

@koivunej (Member) left a comment

These are very verbose errors, but that aligns with how often we hope to see them in the future.

@problame enabled auto-merge (squash) March 15, 2024 13:09

github-actions bot commented Mar 15, 2024

2712 tests run: 2586 passed, 0 failed, 126 skipped (full report)


Flaky tests (5)

Postgres 16

Postgres 14

  • test_pageserver_getpage_throttle: debug
  • test_secondary_downloads: debug
  • test_pageserver_recovery: debug
  • test_long_timeline_create_cancelled_by_tenant_delete: debug

Code coverage* (full report)

  • functions: 28.5% (7084 of 24879 functions)
  • lines: 47.0% (43472 of 92562 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
f0f7930 at 2024-03-15T20:01:05.119Z

@problame merged commit 0694ee9 into main Mar 15, 2024
53 checks passed
@problame deleted the problame/tokio-epoll-uring-retry-launch branch March 15, 2024 19:46
problame added a commit that referenced this pull request Mar 18, 2024
arpad-m pushed a commit that referenced this pull request Mar 18, 2024
arpad-m pushed a commit that referenced this pull request Mar 18, 2024
The PR #7141 added a log message

```
ThreadLocalState is being dropped and id might be re-used in the future
```

which was supposed to be emitted when the thread-local is destroyed.
Instead, it was emitted on _each_ call to `thread_local_system()`,
i.e., on each tokio-epoll-uring operation.

Testing
-------

Reproduced the issue locally and verified that this PR fixes the issue.
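
For illustration only, a minimal sketch of the distinction the follow-up fix restores, with hypothetical names rather than the actual pageserver types: the message belongs in the thread-local's `Drop` impl, which runs once when the thread exits, not in the accessor, which runs on every operation:

```rust
struct ThreadLocalState {
    id: u64,
}

impl Drop for ThreadLocalState {
    fn drop(&mut self) {
        // Correct place for the message: emitted once, when the
        // thread-local is destroyed on thread exit.
        eprintln!(
            "ThreadLocalState is being dropped and id might be re-used in the future: id={}",
            self.id
        );
    }
}

thread_local! {
    static STATE: ThreadLocalState = ThreadLocalState { id: 0 };
}

fn thread_local_system() {
    // The bug: logging here fires on every call, i.e. on every
    // tokio-epoll-uring operation, instead of once per thread.
    STATE.with(|_state| {
        // hand out the per-thread system here
    });
}

fn main() {
    std::thread::spawn(|| {
        thread_local_system();
        thread_local_system();
        // When this thread exits, the thread-local is destroyed and the
        // drop message is emitted exactly once.
    })
    .join()
    .unwrap();
}
```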