tokio-epoll-uring: retry on launch failures due to locked memory #7141

Merged: 5 commits merged from problame/tokio-epoll-uring-retry-launch into main on Mar 15, 2024

Conversation

@problame (Contributor) commented Mar 15, 2024

refs #7136

Problem
-------

Before this PR, we were using `tokio_epoll_uring::thread_local_system()`,
which panics on `tokio_epoll_uring::System::launch()` failure.

As we've learned in the past (#6373),
some older Linux kernels account io_uring instances as locked memory.

And while we've raised the limit in prod considerably, we did hit it
once on 2024-03-11 16:30 UTC.
That was after we enabled tokio-epoll-uring fleet-wide, but before
we had shipped release-5090 (c6ed86d)
which did away with the last mass-creation of tokio-epoll-uring
instances as per

    commit 3da410c8fee05b0cd65a5c0b83fffa3d5680cd77
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Tue Mar 5 10:03:54 2024 +0100

        tokio-epoll-uring: use it on the layer-creating code paths (#6378)

Nonetheless, it highlighted that panicking in this situation is probably
not ideal, as it can leave the pageserver process in a semi-broken state.

Further, due to the low sampling rate of Prometheus metrics, we don't know
much about the circumstances of this failure instance.

Solution
--------

This PR implements a custom thread_local_system() that is pageserver-aware
and will do the following on failure:

  • dump relevant stats to tracing!; hopefully they will be useful to
    understand the circumstances better
  • add metric counters for launch failures so we can create an alert
  • if it's ENOMEM, retry with exponential back-off, capped at 3s
  • otherwise, assume it's a permanent failure and abort() the process

This makes sense in the production environment, where we know that
_usually_ there's ample locked memory allowance available, and we know
that such failures are rare.
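
A minimal sketch of the retry policy described above, in plain std Rust. The names `try_launch_system` and `launch_with_retries` are hypothetical stand-ins for the pageserver's actual launch path, and the logging/metrics hooks are simplified to `eprintln!`; this is not the real implementation:

```rust
use std::io;
use std::process::abort;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for `tokio_epoll_uring::System::launch()`:
/// creating an io_uring-backed system can fail with ENOMEM on kernels
/// that account io_uring instances as locked memory.
fn try_launch_system() -> io::Result<()> {
    Ok(())
}

fn launch_with_retries() {
    let mut backoff = Duration::from_millis(100);
    let max_backoff = Duration::from_secs(3); // cap from the PR description

    loop {
        match try_launch_system() {
            Ok(()) => return,
            // ENOMEM: the locked-memory allowance may be temporarily
            // exhausted. Log the failure (the real code dumps stats via
            // `tracing!` and bumps a metric counter), then retry with
            // exponential back-off.
            Err(e) if e.kind() == io::ErrorKind::OutOfMemory => {
                eprintln!("launch failed with ENOMEM, retrying in {backoff:?}: {e}");
                sleep(backoff);
                backoff = (backoff * 2).min(max_backoff);
            }
            // Anything else is assumed to be a permanent failure: abort()
            // rather than leave the process in a semi-broken state.
            Err(e) => {
                eprintln!("permanent launch failure, aborting: {e}");
                abort();
            }
        }
    }
}

fn main() {
    launch_with_retries();
    println!("tokio-epoll-uring system launched");
}
```

Under the production assumption stated above, the ENOMEM branch is the one expected to clear after a short wait, since locked-memory allowance is usually ample; everything else is treated as unrecoverable.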

@koivunej (Member) left a comment

These are very verbose errors, but that aligns with how often we hope to see them in the future.

@problame enabled auto-merge (squash) March 15, 2024 13:09

github-actions bot commented Mar 15, 2024

2712 tests run: 2586 passed, 0 failed, 126 skipped (full report)


Flaky tests (5)

Postgres 16

Postgres 14

  • test_pageserver_getpage_throttle: debug
  • test_secondary_downloads: debug
  • test_pageserver_recovery: debug
  • test_long_timeline_create_cancelled_by_tenant_delete: debug

Code coverage* (full report)

  • functions: 28.5% (7084 of 24879 functions)
  • lines: 47.0% (43472 of 92562 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
f0f7930 at 2024-03-15T20:01:05.119Z

@problame merged commit 0694ee9 into main Mar 15, 2024
53 checks passed
@problame deleted the problame/tokio-epoll-uring-retry-launch branch March 15, 2024 19:46
problame added a commit that referenced this pull request Mar 18, 2024
arpad-m pushed a commit that referenced this pull request Mar 18, 2024
arpad-m pushed a commit that referenced this pull request Mar 18, 2024
The PR #7141 added a log message

```
ThreadLocalState is being dropped and id might be re-used in the future
```

which was supposed to be emitted when the thread-local is destroyed.
Instead, it was emitted on _each_ call to `thread_local_system()`,
i.e., on each tokio-epoll-uring operation.

Testing
-------

Reproduced the issue locally and verified that this PR fixes the issue.
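
For illustration only, a minimal sketch of the distinction the follow-up fix restores, with hypothetical names rather than the actual pageserver types: the message belongs in the thread-local's `Drop` impl, which runs once when the thread exits, not in the accessor, which runs on every operation:

```rust
struct ThreadLocalState {
    id: u64,
}

impl Drop for ThreadLocalState {
    fn drop(&mut self) {
        // Correct place for the message: emitted once, when the
        // thread-local is destroyed on thread exit.
        eprintln!(
            "ThreadLocalState is being dropped and id might be re-used in the future: id={}",
            self.id
        );
    }
}

thread_local! {
    static STATE: ThreadLocalState = ThreadLocalState { id: 0 };
}

fn thread_local_system() {
    // The bug: logging here fires on every call, i.e. on every
    // tokio-epoll-uring operation, instead of once per thread.
    STATE.with(|_state| {
        // hand out the per-thread system here
    });
}

fn main() {
    std::thread::spawn(|| {
        thread_local_system();
        thread_local_system();
        // When this thread exits, the thread-local is destroyed and the
        // drop message is emitted exactly once.
    })
    .join()
    .unwrap();
}
```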