-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test release PR creation & notifications #5295
Closed
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
## Problem - https://github.com/neondatabase/neon/security/dependabot/28 ## Summary of changes Remove atty, and remove the `with_ansi` arg to scrubber's stdout logger.
The sequence that can lead to a deadlock: 1. DELETE request gets all the way to `tenant.shutdown(progress, false).await.is_err() ` , while holding TENANTS.read() 2. POST request for tenant creation comes in, calls `tenant_map_insert`, it does `let mut guard = TENANTS.write().await;` 3. Something that `tenant.shutdown()` needs to wait for needs a `TENANTS.read().await`. The only case identified in exhaustive manual scanning of the code base is this one: Imitate size access does `get_tenant().await`, which does `TENANTS.read().await` under the hood. In the above case (1) waits for (3), (3)'s read-lock request is queued behind (2)'s write-lock, and (2) waits for (1). Deadlock. I made a reproducer/proof-that-above-hypothesis-holds in #5281 , but, it's not ready for merge yet and we want the fix _now_. fixes #5284
This adds PostgreSQL 16 as a vendored postgresql version, and adapts the code to support this version. The important changes to PostgreSQL 16 compared to the PostgreSQL 15 changeset include the addition of a neon_rmgr instead of altering Postgres's original WAL format. Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Motivation ========== We observed two "indigestion" events on staging, each shortly after restarting `pageserver-0.eu-west-1.aws.neon.build`. It has ~8k tenants. The indigestion manifests as `Timeline::get` calls failing with `exceeded evict iter limit` . The error is from `page_cache.rs`; it was unable to find a free page and hence failed with the error. The indigestion events started occuring after we started deploying builds that contained the following commits: ``` [~/src/neon]: git log --oneline c0ed362..15eaf78 091da1a1c8b4f60ebf8 15eaf78 Disallow block_in_place and Handle::block_on (#5101) a18d6d9 Make File opening in VirtualFile async-compatible (#5280) 76cc873 Use tokio locks in VirtualFile and turn with_file into macro (#5247) ``` The second and third commit are interesting. They add .await points to the VirtualFile code. Background ========== On the read path, which is the dominant user of page cache & VirtualFile during pageserver restart, `Timeline::get` `page_cache` and VirtualFile interact as follows: 1. Timeline::get tries to read from a layer 2. This read goes through the page cache. 3. If we have a page miss (which is known to be common after restart), page_cache uses `find_victim` to find an empty slot, and once it has found a slot, it gives exclusive ownership of it to the caller through a `PageWriteGuard`. 4. The caller is supposed to fill the write guard with data from the underlying backing store, i.e., the layer `VirtualFile`. 5. So, we call into `VirtualFile::read_at`` to fill the write guard. The `find_victim` method finds an empty slot using a basic implementation of clock page replacement algorithm. Slots that are currently in use (`PageReadGuard` / `PageWriteGuard`) cannot become victims. If there have been too many iterations, `find_victim` gives up with error `exceeded evict iter limit`. Root Cause For Indigestion ========================== The second and third commit quoted in the "Motivation" section introduced `.await` points in the VirtualFile code. These enable tokio to preempt us and schedule another future __while__ we hold the `PageWriteGuard` and are calling `VirtualFile::read_at`. This was not possible before these commits, because there simply were no await points that weren't Poll::Ready immediately. With the offending commits, there is now actual usage of `tokio::sync::RwLock` to protect the VirtualFile file descriptor cache. And we __know__ from other experiments that, during the post-restart "rush", the VirtualFile fd cache __is__ too small, i.e., all slots are taken by _ongoing_ VirtualFile operations and cannot be victims. So, assume that VirtualFile's `find_victim_slot`'s `RwLock::write().await` calls _will_ yield control to the executor. The above can lead to the pathological situation if we have N runnable tokio tasks, each wanting to do `Timeline::get`, but only M slots, N >> M. Suppose M of the N tasks win a PageWriteGuard and get preempted at some .await point inside `VirtualFile::read_at`. Now suppose tokio schedules the remaining N-M tasks for fairness, then schedules the first M tasks again. Each of the N-M tasks will run `find_victim()` until it hits the `exceeded evict iter limit`. Why? Because the first M tasks took all the slots and are still holding them tight through their `PageWriteGuard`. The result is massive wastage of CPU time in `find_victim()`. The effort to find a page is futile, but each of the N-M tasks still attempts it. This delays the time when tokio gets around to schedule the first M tasks again. Eventually, tokio will schedule them, they will make progress, fill the `PageWriteGuard`, release it. But in the meantime, the N-M tasks have already bailed with error `exceeded evict iter limit`. Eventually, higher level mechanisms will retry for the N-M tasks, and this time, there won't be as many concurrent tasks wanting to do `Timeline::get`. So, it will shake out. But, it's a massive indigestion until then. This PR ======= This PR reverts the offending commits until we find a proper fix. ``` Revert "Use tokio locks in VirtualFile and turn with_file into macro (#5247)" This reverts commit 76cc873. Revert "Make File opening in VirtualFile async-compatible (#5280)" This reverts commit a18d6d9. ```
## Problem We don't have this instruction written anywhere but in internal Slack ## Summary of changes - Add `How to run a CI pipeline on Pull Requests from external contributors` section to `CONTRIBUTING.md` --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem `ci-run/pr-*` branches (and attached PRs) should be deleted automatically when their parent PRs get closed. But there are not ## Summary of changes - Fix if-condition
## Problem If @github-actions creates release PR, the CI pipeline is not triggered (but we have `release-notify.yml` workflow that we expect to run on this event). I suspect this happened because @github-actions is not a repository member. Ref #5283 (comment) ## Summary of changes - Use `CI_ACCESS_TOKEN` to create a PR - Use `gh` instead of `thomaseizinger/create-pull-request` - Restrict permissions for GITHUB_TOKEN to `contents: write` only (required for `git push`)
- pagestore_smgr.c had unnecessary WALSync() (see #5287 ) - Compute node dockerfile didn't build the neon_rmgr extension - Add PostgreSQL 16 image to docker-compose tests - Fix issue with high CPU usage in Safekeeper due to a bug in WALSender Co-authored-by: Alexander Bayandin <alexander@neon.tech>
xfail test reproducing issue #4698
2472 tests run: 2351 passed, 2 failed, 119 skipped (full report)Failures on Postgres 16
Test coverage report is not availableThe comment gets automatically updated with the latest test results
8556d94 at 2023-09-13T02:23:25.788Z :recycle: |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Release 2023-09-13
Please merge this PR using 'Create a merge commit'!