Release 2024-03-11 #7081

vipvap · 2024-03-11T06:03:59Z

Release 2024-03-11

Please merge this Pull Request using 'Create a merge commit' button

## Problem - Walredo errors, e.g. during image creation, mention the LSN affected but not the key. ## Summary of changes - Add key to "error applying ... WAL records" log message

## Problem - #6966 - Existing logs aren't pointing to a cause: it looks like heatmap upload and download are happening, but for some reason the evicted layer isn't removed on the secondary location. ## Summary of changes - Assert evicted layer is gone from heatmap before checking its gone from local disk: this will give clarity on whether the issue is with the uploads or downloads. - On assertion failures, log the contents of heatmap.

## Problem Actually it's good idea to distinguish between cases when it's a cold start, but we took the compute from the pool ## Summary of changes Updated to enum.

The `writer.finish()` methods already fsync the inode, using `VirtualFile::sync_all()`. All that the callers need to do is fsync their directory, i.e., the timeline directory. Note that there's a call in the new compaction code that is apparently dead-at-runtime, so, I couldn't fix up any fsyncs there [Link](https://github.com/neondatabase/neon/blob/502b69b33bbd4ad1b0647e921a9c665249a2cd62/pageserver/src/tenant/timeline/compaction.rs#L204-L211). Note that layer durability still matters somewhat, even after #5198 which made remote storage authoritative. We do have the layer file length as an indicator, but no checksums on the layer file contents. So, a series of overwrites without fsyncs in the middle, plus a subsequent crash, could cause us to end up in a state where the file length matches but the contents are garbage. part of #6663

Usually RFC documents are not modified, but the vast mentions of "zenith" in early RFC documents make it desirable to update the product name to today's name, to avoid confusion. ## Problem Early RFC documents use the old "zenith" product name a lot, which is not something everyone is aware of after the product was renamed. ## Summary of changes Replace occurrences of "zenith" with "neon". Images are excluded. --------- Co-authored-by: Andreas Scherbaum <andreas@neon.tech>

## Problem The current implementation of `deploy-prod` workflow doesn't allow to run parallel deploys on Storage and Proxy. ## Summary of changes - Call `deploy-proxy-prod` workflow that deploys only Proxy components, and that can be run in parallel with `deploy-prod` for Storage.

As pointed out in the comments added in this PR: the in-memory state of the filesystem already has the layer file in its final place. If the fsync fails, but pageserver continues to execute, it's quite easy for subsequent pageserver code to observe the file being there and assume it's durable, when it really isn't. It can happen that we get ENOSPC during the fsync. However, 1. the timeline dir is small (remember, the big layer _file_ has already been synced). Small data means ENOSPC due to delayed allocation races etc are less likely. 2. what else are we going to do in that case? If we decide to bubble up the error, the file remains on disk. We could try to unlink it and fsync after the unlink. If that fails, we would _definitely_ need to error out. Is it worth the trouble though? Side note: all this logic about not carrying on after fsync failure implies that we `sync` the filesystem successfully before we restart the pageserver. We don't do that right now, but should (=> #6989) part of #6663

## Problem Typo ## Summary of changes Fix

…ync_all()` (#6986) Except for the involvement of the VirtualFile fd cache, this is equivalent to what happened before at runtime. Future PR #6378 will implement `VirtualFile::sync_all()` using tokio-epoll-uring if that's configured as the io engine. This PR is preliminary work for that. part of #6663

The template does not parse on GitHub

…6960 (#6999) This PR increases the `wait_until` timeout. These are where things became more flaky as of #6960. Most likely because it doubles the work in the `churn_while_failpoints_active_thread`. Slack context: https://neondb.slack.com/archives/C033RQ5SPDH/p1709554455962959?thread_ts=1709286362.850549&cid=C033RQ5SPDH

@bayandin

Fixes #6996 Thanks to @bayandin

pgbouncer 1.22.1 has been released > This release fixes issues caused by some clients using COPY FROM STDIN queries. Such queries could introduce memory leaks, performance regressions and prepared statement misbehavior. - NEWS: https://www.pgbouncer.org/2024/03/pgbouncer-1-22-1 - CHANGES: pgbouncer/pgbouncer@pgbouncer_1_22_0...pgbouncer_1_22_1 ## Summary of changes - vm-image: update pgbouncer from 1.22.0 to 1.22.1

tokio 1.36 has been out for a month. Release notes don't indicate major changes. Skimming through their issue tracker, I can't find open `C-bug` issues that would affect us. (My personal motivation for this is `JoinSet::try_join_next`.)

## Problem `cargo deny` fails - https://rustsec.org/advisories/RUSTSEC-2024-0019 - GHSA-r8w9-5wcg-vfj7 > The vulnerability is Windows-specific, and can only happen if you are using named pipes. Other IO resources are not affected. ## Summary of changes - Upgrade `mio` from 0.8.10 to 0.8.11 (`cargo update -p mio`)

## Problem Fix #6498 ## Summary of changes Only re-authenticate with zenith_admin if authentication fails. Otherwise, directly return the error message. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

part of #6663 See that epic for more context & related commits. Problem ------- Before this PR, the layer-file-creating code paths were using VirtualFile, but under the hood these were still blocking system calls. Generally this meant we'd stall the executor thread, unless the caller "knew" and used the following pattern instead: ``` spawn_blocking(|| { Handle::block_on(async { VirtualFile::....().await; }) }).await ``` Solution -------- This PR adopts `tokio-epoll-uring` on the layer-file-creating code paths in pageserver. Note that on-demand downloads still use `tokio::fs`, these will be converted in a future PR. Design: Avoiding Regressions With `std-fs` ------------------------------------------ If we make the VirtualFile write path truly async using `tokio-epoll-uring`, should we then remove the `spawn_blocking` + `Handle::block_on` usage upstack in the same commit? No, because if we’re still using the `std-fs` io engine, we’d then block the executor in those places where previously we were protecting us from that through the `spawn_blocking` . So, if we want to see benefits from `tokio-epoll-uring` on the write path while also preserving the ability to switch between `tokio-epoll-uring` and `std-fs` , where `std-fs` will behave identical to what we have now, we need to ***conditionally* use `spawn_blocking + Handle::block_on`** . I.e., in the places where we use that know, we’ll need to make that conditional based on the currently configured io engine. It boils down to investigating all the places where we do `spawn_blocking(... block_on(... VirtualFile::...))`. Detailed [write-up of that investigation in Notion](https://neondatabase.notion.site/Surveying-VirtualFile-write-path-usage-wrt-tokio-epoll-uring-integration-spawn_blocking-Handle-bl-5dc2270dbb764db7b2e60803f375e015?pvs=4 ), made publicly accessible. tl;dr: Preceding PRs addressed the relevant call sites: - `metadata` file: turns out we could simply remove it (#6777, #6769, #6775) - `create_delta_layer()`: made sensitive to `virtual_file_io_engine` in #6986 NB: once we are switched over to `tokio-epoll-uring` everywhere in production, we can deprecate `std-fs`; to keep macOS support, we can use `tokio::fs` instead. That will remove this whole headache. Code Changes In This PR ----------------------- - VirtualFile API changes - `VirtualFile::write_at` - implement an `ioengine` operation and switch `VirtualFile::write_at` to it - `VirtualFile::metadata()` - curiously, we only use it from the layer writers' `finish()` methods - introduce a wrapper `Metadata` enum because `std::fs::Metadata` cannot be constructed by code outside rust std - `VirtualFile::sync_all()` and for completeness sake, add `VirtualFile::sync_data()` Testing & Rollout ----------------- Before merging this PR, we ran the CI with both io engines. Additionally, the changes will soak in staging. We could have a feature gate / add a new io engine `tokio-epoll-uring-write-path` to do a gradual rollout. However, that's not part of this PR. Future Work ----------- There's still some use of `std::fs` and/or `tokio::fs` for directory namespace operations, e.g. `std::fs::rename`. We're not addressing those in this PR, as we'll need to add the support in tokio-epoll-uring first. Note that rename itself is usually fast if the directory is in the kernel dentry cache, and only the fsync after rename is slow. These fsyncs are using tokio-epoll-uring, so, the impact should be small.

`std` has had `pin!` macro for some time, there is no need for us to use the older alternatives. Cannot disallow `tokio::pin` because tokio macros use that.

Before this PR, the layer file download code would fsync the inode after rename instead of the timeline directory. That is not in line with what a comment further up says we're doing, and it's obviously not achieving the goal of making the rename durable. part of #6663

## Problem The value reconstruct of AUX_FILES_KEY from records is not deterministic since it uses a hash map under the hood. This caused vectored get validation failures when enabled in staging. ## Summary of changes Deserialise AUX_FILES_KEY blobs comparing. All other keys should reconstruct deterministically, so we simply compare the blobs.

## Problem Last weeks enablement of vectored get generated a number of panics. From them, I diagnosed two issues in the delta layer index traversal logic 1. The `key >= range.start && lsn >= lsn_range.start` was too aggressive. Lsns are not monotonically increasing in the delta layer index (keys are though), so we cannot assert on them. 2. Lsns greater or equal to `lsn_range.end` were not skipped. This caused the query to consider records newer than the request Lsn. ## Summary of changes * Fix the issues mentioned above inline * Refactor the layer traversal logic to make it unit testable * Add unit test which reproduces the failure modes listed above.

… metrics + regression test (#6953) part of #5899 Problem ------- Before this PR, the time spent waiting on the throttle was charged towards the higher-level page_service metrics, i.e., `pageserver_smgr_query_seconds`. The metrics are the foundation of internal SLIs / SLOs. A throttled tenant would cause the SLI to degrade / SLO alerts to fire. Changes ------- - don't charge time spent in throttle towards the page_service metrics - record time spent in throttle in RequestContext and subtract it from the elapsed time - this works because the page_service path doesn't create child context, so, all the throttle time is recorded in the parent - it's quite brittle and will break if we ever decide to spawn child tasks that need child RequestContexts, which would have separate instances of the `micros_spent_throttled` counter. - however, let's punt that to a more general refactoring of RequestContext - add a test case that ensures that - throttling happens for getpage requests; this aspect of the test passed before this PR - throttling delays aren't charged towards the page_service metrics; this aspect of the test only passes with this PR - drive-by: make the throttle log message `info!`, it's an expected condition Performance ----------- I took the same measurements as in #6706 , no meaningful change in CPU overhead. Future Work ----------- This PR enables us to experiment with the throttle for select tenants without affecting the SLI metrics / triggering SLO alerts. Before declaring this feature done, we need more work to happen, specifically: - decide on whether we want to retain the flexibility of throttling any `Timeline::get` call, filtered by TaskKind - versus: separate throttles for each page_service endpoint, potentially with separate config options - the trouble here is that this decision implies changes to the TenantConfig, so, if we start using the current config style now, then decide to switch to a different config, it'll be a breaking change Nice-to-haves but probably not worth the time right now: - Equivalent tests to ensure the throttle applies to all other page_service handlers.

## Problem Not really a problem. Improving visibility around redis communication. ## Summary of changes Added metric on the number of broken messages.

## Problem ref #6188 ## Summary of changes This pull request fixes `-Wmissing-prototypes` for the neon extension. Note that (1) the gcc version in CI and macOS is different, therefore some of the warning does not get reported when developing the neon extension locally. (2) the CI env variable `COPT = -Werror` does not get passed into the docker build process, therefore warnings are not treated as errors on CI. https://github.com/neondatabase/neon/blob/e62baa97041e10ce45772b3724e24e679a650d69/.github/workflows/build_and_test.yml#L22 There will be follow-up pull requests on solving other warnings. By the way, I did not figure out the default compile parameters in the CI env, and therefore this pull request is tested by manually adding `-Wmissing-prototypes` into the `COPT`. Signed-off-by: Alex Chi Z <chi@neon.tech>

Minor non-functional improvements to tiered compaction, mostly consisting of comment fixes. Followup of #6830, part of #6768

## Problem Branch/project and coldStart were not populated to data events. ## Summary of changes Populate it. Also added logging for the coldstart info.

The test is flaky due to #7006.

Moves some of the (legacy) compaction code to compaction.rs. No functional changes, just moves of code. Before, compaction.rs was only for the new tiered compaction mechanism, now it's for both the old and new mechanisms. Part of #6768

## Problem If large numbers of shards are attached to a pageserver concurrently, for example after another node fails, it can cause excessive I/O queue depths due to all the newly attached shards trying to calculate logical sizes concurrently. #6907 added the `lazy` flag to handle this. ## Summary of changes - Use `lazy=true` from all /location_config calls in the storage controller Reconciler.

Gets upstream PR nical/rust_debug#3 , removes trailing "s from output.

## Problem It seems that even though we have a retry on basebackup, it still sometimes fails to fetch it with the failpoint enabled, resulting in a test error. ## Summary of changes If we fail to get the basebackup, disable the failpoint and try again.

## Summary of changes Update rustls from 0.21 to 0.22. reqwest/tonic/aws-smithy still use rustls 0.21. no upgrade route available yet.

## Problem We reverted #6661 a few days ago. The change led to OOMs in benchmarks followed by large WAL reingests. The issue was that we removed [this code](https://github.com/neondatabase/neon/blob/d04af08567cc3ff94ff19a2f6b3f7a2a1e3c55d1/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs#L409-L417). That call may trigger a roll of the open layer due to the keepalive messages received from the safekeeper. Removing it meant that enforcing of checkpoint timeout became even more lax and led to using up large amounts of memory for the in memory layer indices. ## Summary of changes Piggyback on keep alive messages to enforce checkpoint timeout. This is a hack, but it's exactly what the current code is doing. ## Alternatives Christhian, Joonas and myself sketched out a timer based approach [here](#6940). While discussing it further, it became obvious that's also a bit of a hack and not the desired end state. I chose not to take that further since it's not what we ultimately want and it'll be harder to rip out. Right now it's unclear what the ideal system behaviour is: * early flushing on memory pressure, or ... * detaching tenants on memory pressure

## Problem For the ephemeral endpoint feature, it's not really too helpful to keep them around in the connection pool. This isn't really pressing but I think it's still a bit better this way. ## Summary of changes Add `is_ephemeral` function to `NeonOptions`. Allow `serverless::ConnInfo::endpoint_cache_key()` to return an `Option`. Handle that option appropriately

…#7037) ## Problem Tenants created via the storage controller have a `PlacementPolicy` that defines their HA/secondary/detach intent. For backward compat we can just set it to Single, for onboarding tenants using /location_conf it is automatically set to Double(1) if there are at least two pageservers, but for freshly created tenants we didn't have a way to specify it. This unblocks writing tests that create HA tenants on the storage controller and do failure injection testing. ## Summary of changes - Add optional fields to TenantCreateRequest for specifying PlacementPolicy. This request structure is used both on pageserver API and storage controller API, but this method is only meaningful for the storage controller (same as existing `shard_parameters` attribute). - Use the value from the creation request in tenant creation, if provided.

## Problem When we start compute with newer version of extension (i.e. 1.2) and then rollback the release, downgrading the compute version, next compute start will try to update extension to the latest version available in neon.control (i.e. 1.1). Thus we need to provide downgrade scripts like neon--1.2--1.1.sql These scripts must revert the changes made by the upgrade scripts in the reverse order. This is necessary to ensure that the next upgrade will work correctly. In general, we need to write upgrade and downgrade scripts to be more robust and add IF EXISTS / CREATE OR REPLACE clauses to all statements (where applicable). ## Summary of changes Adds downgrade scripts. Adds test cases for extension downgrade/upgrade. fixes #7066 This is a follow-up for https://app.incident.io/neondb/incidents/167?tab=follow-ups Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Alex Chi Z <iskyzh@gmail.com> Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>

## Problem Currently users can cause problems with replication ## Summary of changes Don't let them replicate

…7072) PR #6953 only excluded throttled time from the handle_pagerequests (aka smgr metrics). This PR implements the deduction for `basebackup ` queries. The other page_service methods either don't use Timeline::get or they aren't used in production. Found by manually inspecting in [staging logs](https://neonprod.grafana.net/explore?schemaVersion=1&panes=%7B%22wx8%22:%7B%22datasource%22:%22xHHYY0dVz%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bhostname%3D%5C%22pageserver-0.eu-west-1.aws.neon.build%5C%22%7D%20%7C~%20%60git-env%7CERR%7CWARN%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22xHHYY0dVz%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22to%22:%221709919114642%22,%22from%22:%221709904430898%22%7D%7D%7D).

## Problem Before this PR, it was possible that on-demand downloads were started after `Timeline::shutdown()`. For example, we have observed a walreceiver-connection-handler-initiated on-demand download that was started after `Timeline::shutdown()`s final `task_mgr::shutdown_tasks()` call. The underlying issue is that `task_mgr::shutdown_tasks()` isn't sticky, i.e., new tasks can be spawned during or after `task_mgr::shutdown_tasks()`. Cc: #4175 in lieu of a more specific issue for task_mgr. We already decided we want to get rid of it anyways. Original investigation: https://neondb.slack.com/archives/C033RQ5SPDH/p1709824952465949 ## Changes - enter gate while downloading - use timeline cancellation token for cancelling download thereby, fixes #7054 Entering the gate might also remove recent "kept the gate from closing" in staging.

github-actions · 2024-03-11T06:43:38Z

2496 tests run: 2374 passed, 0 failed, 122 skipped (full report)

Flaky tests (1)

Postgres 16

test_compute_pageserver_connection_stress: release

Code coverage* (full report)

functions: 28.8% (7033 of 24442 functions)
lines: 47.6% (43451 of 91358 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
f0a9017 at 2024-03-11T12:37:23.709Z :recycle:}

## Problem We want to report metrics for the oldest user database.

jcsp and others added 30 commits March 4, 2024 08:56

pageserver: mention key in walredo errors (#6988)

fad9be4

## Problem - Walredo errors, e.g. during image creation, mention the LSN affected but not the key. ## Summary of changes - Add key to "error applying ... WAL records" log message

proxy: change is cold start to enum (#6948)

3114be0

## Problem Actually it's good idea to distinguish between cases when it's a cold start, but we took the compute from the pool ## Summary of changes Updated to enum.

Fix type (#6998)

e1c032f

## Problem Typo ## Summary of changes Fix

fix epic issue template (#6920)

e938bb8

The template does not parse on GitHub

Update postgres-exporter to v0.12.1 (#7004)

0d2395f

Fixes #6996 Thanks to @bayandin

upgrade tokio 1.34 => 1.36 (#7008)

e62baa9

tokio 1.36 has been out for a month. Release notes don't indicate major changes. Skimming through their issue tracker, I can't find open `C-bug` issues that would affect us. (My personal motivation for this is `JoinSet::try_join_next`.)

compute_ctl: only try zenith_admin if could not authenticate (#6955)

b7db912

## Problem Fix #6498 ## Summary of changes Only re-authenticate with zenith_admin if authentication fails. Otherwise, directly return the error message. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

build: clippy disallow futures::pin_mut macro (#7016)

752bf5a

`std` has had `pin!` macro for some time, there is no need for us to use the older alternatives. Cannot disallow `tokio::pin` because tokio macros use that.

proxy: report redis broken message metric (#7021)

bdbb2f4

## Problem Not really a problem. Improving visibility around redis communication. ## Summary of changes Added metric on the number of broken messages.

Minor improvements to tiered compaction (#7020)

e69a255

Minor non-functional improvements to tiered compaction, mostly consisting of comment fixes. Followup of #6830, part of #6768

proxy: fix bug with populating the data (#7023)

15b3665

## Problem Branch/project and coldStart were not populated to data events. ## Summary of changes Populate it. Also added logging for the coldstart info.

test: disable large slru basebackup bench in ci (#7025)

2daa2f1

The test is flaky due to #7006.

fixup(#6991): it broke the macOS build (#7024)

eacdc17

Move compaction code to compaction.rs (#7026)

2f88e7a

Moves some of the (legacy) compaction code to compaction.rs. No functional changes, just moves of code. Before, compaction.rs was only for the new tiered compaction mechanism, now it's for both the old and new mechanisms. Part of #6768

arpad-m and others added 10 commits March 7, 2024 17:32

Update svg_fmt (#7049)

ce7a82d

Gets upstream PR nical/rust_debug#3 , removes trailing "s from output.

update rustls (#7048)

02358b2

## Summary of changes Update rustls from 0.21 to 0.22. reqwest/tonic/aws-smithy still use rustls 0.21. no upgrade route available yet.

Revoke REPLICATION (#7052)

4834d22

## Problem Currently users can cause problems with replication ## Summary of changes Don't let them replicate

vipvap requested review from a team as code owners March 11, 2024 06:04

vipvap requested review from arssher, conradludgate, problame and antonyc and removed request for a team March 11, 2024 06:04

koivunej approved these changes Mar 11, 2024

View reviewed changes

vadim2404 mentioned this pull request Mar 11, 2024

Export db size, deadlocks and changed row metrics #7050

Merged

5 tasks

arssher approved these changes Mar 11, 2024

View reviewed changes

Export db size, deadlocks and changed row metrics (#7050)

f0a9017

## Problem We want to report metrics for the oldest user database.

koivunej merged commit c6ed86d into release Mar 11, 2024
54 checks passed

koivunej deleted the rc/2024-03-11 branch March 11, 2024 12:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release 2024-03-11 #7081

Release 2024-03-11 #7081

vipvap commented Mar 11, 2024

github-actions bot commented Mar 11, 2024 •

edited

Loading

Postgres 16

Release 2024-03-11 #7081

Release 2024-03-11 #7081

Conversation

vipvap commented Mar 11, 2024