Release 2024-04-15 #7378

vipvap · 2024-04-15T06:03:58Z

Release 2024-04-15

Please merge this Pull Request using 'Create a merge commit' button

It's just unnecessary to use `spawn_blocking` there, and with #7331, it will result in really just one executor thread when enabling one-runtime with the `current_thread` executor.

Removes usage of async_trait from the `CompactionDeltaLayer` trait. Split off from #7301 Related earlier work: #6305, #6464, #7303

This PR is an off-by-default revision v2 of the (since-reverted) PR #6555 / commit `3220f830b7fbb785d6db8a93775f46314f10a99b`. See that PR for details on why running with a single runtime is desirable and why we should be ready.

We reverted #6555 because it showed regressions in prodlike cloudbench; see the revert commit message `ad072de4209193fd21314cf7f03f14df4fa55eb1` for more context.

This PR makes it an opt-in choice via an env var. The default is to use the 4 separate runtimes that we have today, so there shouldn't be any performance change.

I tested manually that the env var and the added metric work.

```
# undefined env var => no change to before this PR, uses 4 runtimes
./target/debug/neon_local start
# defining the env var enables one-runtime mode, value defines that one runtime's configuration
NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:1 ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:2 ./target/debug/neon_local start
NEON_PAGESERVER_USE_ONE_RUNTIME=multi_thread:default ./target/debug/neon_local start
```

I want to use this change to do more manual testing and potentially testing in staging.

Future Work
-----------

Testing / deployment ergonomics would be better if this were a variable in `pageserver.toml`. It can be done, but I don't need it right now, so let's stick with the env var.
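To illustrate the value grammar accepted by `NEON_PAGESERVER_USE_ONE_RUNTIME` (`current_thread`, `multi_thread:<n>`, `multi_thread:default`), here is a minimal sketch of a parser. It is not the pageserver's actual code: the `OneRuntimeMode` enum, the `parse_one_runtime` function, and the choice to reject a worker count of zero are all assumptions made for this example.

```rust
use std::num::NonZeroUsize;

/// Hypothetical representation of the one-runtime configuration.
#[derive(Debug, PartialEq, Eq)]
enum OneRuntimeMode {
    CurrentThread,
    /// `None` means "use the runtime's default number of worker threads".
    MultiThread(Option<NonZeroUsize>),
}

/// Parse the env var value (hypothetical helper).
/// `None` as input models an unset variable: keep the separate runtimes.
fn parse_one_runtime(value: Option<&str>) -> Result<Option<OneRuntimeMode>, String> {
    let Some(value) = value else { return Ok(None) };
    match value {
        "current_thread" => Ok(Some(OneRuntimeMode::CurrentThread)),
        "multi_thread:default" => Ok(Some(OneRuntimeMode::MultiThread(None))),
        other => {
            // Anything else must look like `multi_thread:<positive integer>`.
            let count = other
                .strip_prefix("multi_thread:")
                .ok_or_else(|| format!("unrecognized runtime mode: {other:?}"))?;
            let count: NonZeroUsize = count
                .parse()
                .map_err(|e| format!("invalid worker count {count:?}: {e}"))?;
            Ok(Some(OneRuntimeMode::MultiThread(Some(count))))
        }
    }
}

fn main() {
    let raw = std::env::var("NEON_PAGESERVER_USE_ONE_RUNTIME").ok();
    match parse_one_runtime(raw.as_deref()) {
        Ok(None) => println!("env var unset: keeping the separate runtimes"),
        Ok(Some(mode)) => println!("one-runtime mode: {mode:?}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

Modelling the unset variable as `Ok(None)` keeps the "no change to before this PR" default explicit instead of folding it into one of the modes.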

…7203)

## Problem

We have two places that use a helper (`ser_rfc3339_millis`) to get serde to stringify SystemTimes into the desired format.

## Summary of changes

Created a new module `utils::serde_system_time` and inside it a wrapper type `SystemTime` for `std::time::SystemTime` that serializes/deserializes to the RFC3339 format. This new type is then used in the two places that were previously using the helper for serialization, thereby eliminating the need to decorate structs.

Closes #7151.

## Problem

Some awkwardness in the measured API. Missing process metrics.

## Summary of changes

Update measured to use the new convenience setup features. Added measured-process lib. Added measured support for libmetrics.

## Problem

After switching the default pageserver io-engine to `tokio-epoll-uring` on CI, we tuned a query that finds flaky tests (in #7077). It has been almost a month since then, and the additional query tuning is no longer required.

## Summary of changes

- Remove the extra condition from the flaky tests query
- Also restore parameterisation of the query

## Problem

```
Could not resolve host: console.stage.neon.tech
```

## Summary of changes

- replace `console.stage.neon.tech` with `console-stage.neon.build`

## Problem

Proxy doesn't know about existing endpoints.

## Summary of changes

* Added caching of all available endpoints.
* Under high load, consult the cache before going to cplane.
* Report metrics for the outcome.
* For the rate limiter and credentials caching, don't distinguish between `-pooler` and non-pooler endpoints.

TODOs:

* Make metrics more meaningful
* Consider integrating it with the endpoint rate limiter
* Test it together with cplane in preview
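The idea of treating pooled and non-pooled names as the same endpoint for cache lookups can be sketched as follows. This is only an illustration, not the proxy's implementation: `EndpointCache`, `normalize_endpoint`, and the endpoint name used in `main` are hypothetical, and the `-pooler` suffix is assumed from the `-pooler` connections mentioned in the credentials cache fix.

```rust
use std::collections::HashSet;

/// Assumed suffix that marks a pooled connection to an endpoint.
const POOLER_SUFFIX: &str = "-pooler";

/// Hypothetical helper: map pooled and non-pooled names to one cache key.
fn normalize_endpoint(endpoint: &str) -> &str {
    endpoint.strip_suffix(POOLER_SUFFIX).unwrap_or(endpoint)
}

/// Hypothetical in-memory cache of all known endpoints.
#[derive(Default)]
struct EndpointCache {
    known: HashSet<String>,
}

impl EndpointCache {
    /// Remember an endpoint under its normalized name.
    fn insert(&mut self, endpoint: &str) {
        self.known.insert(normalize_endpoint(endpoint).to_owned());
    }

    /// Check the cache first; only on a miss would the caller ask cplane.
    fn is_known(&self, endpoint: &str) -> bool {
        self.known.contains(normalize_endpoint(endpoint))
    }
}

fn main() {
    let mut cache = EndpointCache::default();
    cache.insert("ep-example-123456");
    for name in ["ep-example-123456", "ep-example-123456-pooler", "ep-unknown-000000"] {
        println!("{name}: known = {}", cache.is_known(name));
    }
}
```

Normalizing on both insert and lookup means the cache never needs two entries for the same endpoint.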

## Problem

Currently, we base our time-based layer rolling decision on the last time we froze a layer. This means that if we roll a layer and then go idle for longer than the checkpoint timeout, the next layer will be rolled after the first write. This is of course not desirable.

## Summary of changes

Record the timepoint of the first write to an open layer and use that for time-based layer rolling decisions. Note that I had to keep `Timeline::last_freeze_ts` for the sharded tenant disk consistent lsn skip hack.

Fixes #7241
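The difference between rolling based on the last freeze and rolling based on the first write can be sketched like this. It is a simplified model, not the pageserver code: `OpenLayer`, `record_write`, and `should_roll` are hypothetical names, and timestamps are passed in explicitly so the behaviour is easy to follow.

```rust
use std::time::{Duration, Instant};

/// Hypothetical model of an open (not yet frozen) layer.
#[derive(Default)]
struct OpenLayer {
    /// Time of the first write into this layer, if any write has happened.
    first_write_at: Option<Instant>,
}

impl OpenLayer {
    /// Record a write; only the first one sets the timestamp.
    fn record_write(&mut self, now: Instant) {
        self.first_write_at.get_or_insert(now);
    }

    /// Roll only when the layer has held data for at least the timeout.
    /// An idle layer without writes is never rolled on time alone.
    fn should_roll(&self, now: Instant, checkpoint_timeout: Duration) -> bool {
        match self.first_write_at {
            Some(first) => now.duration_since(first) >= checkpoint_timeout,
            None => false,
        }
    }
}

fn main() {
    let timeout = Duration::from_secs(600);
    let start = Instant::now();
    let mut layer = OpenLayer::default();

    // Long idle period, then the first write arrives.
    let first_write = start + Duration::from_secs(3600);
    layer.record_write(first_write);
    println!("roll right after first write: {}", layer.should_roll(first_write, timeout));
    println!("roll after timeout elapsed: {}", layer.should_roll(first_write + timeout, timeout));
}
```

With a last-freeze timestamp, the idle hour would already exceed the timeout and the layer would roll on its very first write, which is the behaviour the change avoids.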

Adds another tool to the DR toolbox: the ability in pagectl to recover arbitrary prefixes in remote storage. Requires the remote storage config, the prefix, and the travel-to timestamp parameter to be specified as CLI args. The done-if-after parameter is also supported.

Example invocation (after `aws login --profile dev`):

```
RUST_LOG=remote_storage=debug AWS_PROFILE=dev cargo run -p pagectl time-travel-remote-prefix 'remote_storage = { bucket_name = "neon-test-bucket-name", bucket_region = "us-east-2" }' wal/3aa8fcc61f6d357410b7de754b1d9001/641e5342083b2235ee3deb8066819683/ 2024-04-05T17:00:00Z
```

This has been written to resolve a customer recovery case: https://neondb.slack.com/archives/C033RQ5SPDH/p1712256888468009

There is validation of the prefix to prevent accidentally specifying overly generic prefixes, which can cause corruption and data loss if used wrongly. Still, the validation is not perfect, and it is important that the command is used with caution. If possible, `time_travel_remote_storage` should be used instead, as it has additional checks in place.

## Problem

hyper1 offers control over the HTTP connection that hyper0_14 does not. We're blocked on switching all services to hyper1 because of how we use tonic, but there is no reason we can't switch proxy over.

## Summary of changes

1. hyper0.14 -> hyper1
    1. self managed server
    2. Remove the `WithConnectionGuard` wrapper from `protocol2`
2. Remove TLS listener as it's no longer necessary
3. include first session ID in connection startup logs

## Problem

Incorrect processing of `-pooler` connections.

## Summary of changes

Fix

TODO: add e2e tests for caching

This reverts commit dbac2d2.

## Problem

Proxy pods fail to install in k8s clusters, which blocks the cplane release.

## Summary of changes

Revert

Part of neondatabase/cloud#12047. The basic idea is that for our VMs, we want to enable swap and disable Linux memory overcommit. Alongside these, we should set postgres' dynamic_shared_memory_type to mmap, but we want to avoid setting it to mmap if swap is not enabled. Implementing this in the control plane would be fiddly, but it's relatively straightforward to add to compute_ctl.

## Problem

See https://neondb.slack.com/archives/C03QLRH7PPD/p1712529369520409

For statements such as `CREATE TABLE AS SELECT...` or `INSERT FROM SELECT...` we fetch data from the source table and store it in the destination table. This causes problems with prefetch when the last-written LSN is not known for the pages of the source table (which happens, for example, after a compute restart). In that case we get the global value of the last-written LSN, which changes frequently because we are writing pages of the destination table. As a result, the request LSN used for the prefetch and the request LSN used when the page is actually needed differ, and the prefetched result cannot be used. This effectively disarms prefetch.

## Summary of changes

The proposed simple patch stores the last-written LSN for a page when it is not found. The next time we request the last-written LSN for this page, we get the same value (provided the page has not changed).

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

It was disabled due to #6530 breaking forward compatibility. Now that we have deployed it to production, we can reenable the test.

## Problem

We are seeing some mysterious long waits when sending requests.

## Summary of changes

- To eliminate the risk that we are incurring unreasonable overheads from setup (e.g. DNS), use a single Client (internally a pool) instead of repeatedly constructing a fresh one.
- To make it clearer where a timeout is occurring, apply a 10 second timeout to requests as we send them.

## Problem

The `build-build-tools-image` workflow is designed to run as only one instance at a time for the whole repository. Currently, the job gets cancelled if a newer one is scheduled. Here's an example: https://github.com/neondatabase/neon/actions/runs/8419610607

## Summary of changes

- Explicitly set `cancel-in-progress: false` for all jobs that aren't supposed to be cancelled

## Problem

My benchmarks show that prometheus is not very good. https://github.com/conradludgate/measured

We're already using it in storage_controller and it seems to be working well.

## Summary of changes

Replace prometheus with my new measured crate in proxy only.

Apologies for the large diff. I tried to keep it as minimal as I could. The label types add a bit of boilerplate (but reduce the chance we mistype the labels), and some of our custom metrics like CounterPair and HLL needed to be rewritten.

## Problem

Actually read redis events.

## Summary of changes

This is a revert of #7350 plus fixes.

* Fixed events parsing
* Added timeout after connection failure
* Separated regional and global redis clients.

## Problem

It is possible for the database connections to not close in time.

## Summary of changes

Force the closing of connections if the client has hung up.

## Problem

The `create-test-report` job takes more than 8 minutes; the longest step is uploading the Allure report to S3.

Before:

```
+ aws s3 cp --recursive --only-show-errors /tmp/pr-7362-1712847045/report s3://neon-github-public-dev/reports/pr-7362/8647730612

real    6m10.572s
user    6m37.717s
sys     1m9.429s
```

After:

```
+ s5cmd --log error cp '/tmp/pr-7362-1712858221/report/*' s3://neon-github-public-dev/reports/pr-7362/8650636861/

real    0m9.698s
user    1m9.438s
sys     0m6.419s
```

## Summary of changes

- Add `s5cmd` (https://github.com/peak/s5cmd) to the build-tools image
- Use `s5cmd` instead of `aws s3` for uploading Allure reports

The allowed modes as of Postgres 17 are: smart, fast, and immediate.

```
$ cargo neon stop
    Finished dev [unoptimized + debuginfo] target(s) in 0.24s
     Running `target/debug/neon_local stop`
postgres stop failed: pg_ctl failed, exit code: exit status: 1, stdout: , stderr: pg_ctl: unrecognized shutdown mode "fast "
Try "pg_ctl --help" for more information.
```
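The error above comes from a trailing space in the mode string handed to `pg_ctl`. One way to rule out this class of bug is to parse the mode into an enum and only ever pass the enum's canonical string on. The sketch below is an assumption about how that could look, not the actual fix in `neon_local`; `ShutdownMode` and `pg_ctl_stop_args` are hypothetical names.

```rust
/// The three shutdown modes accepted by `pg_ctl stop -m`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShutdownMode {
    Smart,
    Fast,
    Immediate,
}

impl ShutdownMode {
    /// Parse user input, tolerating surrounding whitespace.
    fn parse(raw: &str) -> Result<Self, String> {
        match raw.trim() {
            "smart" => Ok(Self::Smart),
            "fast" => Ok(Self::Fast),
            "immediate" => Ok(Self::Immediate),
            other => Err(format!("unrecognized shutdown mode {other:?}")),
        }
    }

    /// Canonical spelling, guaranteed to contain no stray whitespace.
    fn as_str(self) -> &'static str {
        match self {
            Self::Smart => "smart",
            Self::Fast => "fast",
            Self::Immediate => "immediate",
        }
    }
}

/// Hypothetical helper building the argument list for `pg_ctl`.
fn pg_ctl_stop_args(mode: ShutdownMode) -> Vec<&'static str> {
    vec!["stop", "-m", mode.as_str()]
}

fn main() {
    // The trailing space mirrors the input that triggered the failure above.
    match ShutdownMode::parse("fast ") {
        Ok(mode) => println!("pg_ctl {}", pg_ctl_stop_args(mode).join(" ")),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

Because the arguments are built from `as_str`, whatever whitespace the input carried cannot reach the `pg_ctl` command line.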

github-actions · 2024-04-15T06:47:44Z

2748 tests run: 2630 passed, 0 failed, 118 skipped (full report)

Code coverage* (full report)

functions: 28.0% (6430 of 22962 functions)
lines: 46.6% (45025 of 96565 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: 5288f96 at 2024-04-15T06:47:44.037Z

danieltprice · 2024-04-18T16:11:39Z

reviewed for changelog

problame and others added 24 commits April 8, 2024 09:35

refactor(pageserver): use tokio::signal instead of spawn_blocking (#7332)

2d3c9f0

Remove async_trait from CompactionDeltaLayer (#7342)

47b705c


update measured with some more convenient features (#7334)

f212630


Update staging hostname (#7347)

4f4f787


proxy: fix credentials cache lookup (#7349)

5efe95a


Revert "Proxy read ids from redis (#7205)" (#7350)

0bb04eb


Reenable test_forward_compatibility (#7358)

db72543


Read cplane events from regional redis (#7352)

40f15c3


proxy: fix overloaded db connection closure (#7364)

e92fb94


build(deps): bump idna from 3.3 to 3.7 (#7367)

5288f96

vipvap requested review from a team as code owners April 15, 2024 06:03

vipvap requested review from save-buffer and removed request for a team April 15, 2024 06:04

vipvap requested review from conradludgate, koivunej and piercypixel and removed request for a team April 15, 2024 06:04

problame self-assigned this Apr 15, 2024

problame requested review from a team, petuhovskiy and problame and removed request for save-buffer, conradludgate, koivunej, piercypixel and a team April 15, 2024 12:45

problame approved these changes Apr 15, 2024


problame merged commit c213373 into release Apr 15, 2024
153 of 157 checks passed

problame deleted the rc/2024-04-15 branch April 15, 2024 12:48
