Release 2024-05-27 #7888

vipvap · 2024-05-27T06:05:03Z

Storage & Compute release 2024-05-27

Please merge this Pull Request using 'Create a merge commit' button

"taking a fullbackup" is an ugly multi-liner copypasted in multiple places, most recently with timeline ancestor detach tests. move it under `PgBin` which is not a great place, but better than yet another utility function. Additionally: - cleanup `psql_env` repetition (PgBin already configures that) - move the backup tar comparison as a yet another free utility function - use backup tar comparison in `test_import.py` where a size check was done previously - cleanup extra timeline creation from test Cc: #7715

## Problem

`report-benchmarks-failures` got skipped if a dependent job fails.

## Summary of changes

- Fix the if-condition by adding `&& failure()` to it; it'll make the job run if the dependent job fails.

The openapi description with the error descriptions:

- 200 is used for "detached or has been detached previously"
- 400 is used for "cannot be detached right now" -- it's an odd thing, but good enough
- 404 is used for tenant or timeline not found
- 409 is used for "can never be detached" (root timeline)
- 500 is used for transient errors (basically ill-defined shutdown errors)
- 503 is used for busy (other tenant ancestor detach underway, pageserver shutdown)

Cc: #6994
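The status mapping for ancestor detach could be captured in a small lookup table. This is only a sketch: the `DetachOutcome` names and the `status_for` helper are hypothetical and are not pageserver identifiers; only the numeric codes come from the description.

```python
from enum import Enum
from http import HTTPStatus


class DetachOutcome(Enum):
    """Hypothetical outcome names; only the status codes come from the text."""
    DETACHED = "detached or has been detached previously"
    NOT_NOW = "cannot be detached right now"
    NOT_FOUND = "tenant or timeline not found"
    NEVER = "can never be detached (root timeline)"
    TRANSIENT = "transient error"
    BUSY = "busy"


STATUS_BY_OUTCOME = {
    DetachOutcome.DETACHED: HTTPStatus.OK,                      # 200
    DetachOutcome.NOT_NOW: HTTPStatus.BAD_REQUEST,              # 400
    DetachOutcome.NOT_FOUND: HTTPStatus.NOT_FOUND,              # 404
    DetachOutcome.NEVER: HTTPStatus.CONFLICT,                   # 409
    DetachOutcome.TRANSIENT: HTTPStatus.INTERNAL_SERVER_ERROR,  # 500
    DetachOutcome.BUSY: HTTPStatus.SERVICE_UNAVAILABLE,         # 503
}


def status_for(outcome):
    """Return the numeric HTTP status for a detach outcome."""
    return int(STATUS_BY_OUTCOME[outcome])
```

A client would typically treat 409 as final, while the other error codes describe conditions that may change later.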

This is "required" by GitHub Actions, though they must do some coercion on their side.

## Problem

Currently, the `latest` tag is added to the images in several cases:

```
github.ref_name == 'main' || github.ref_name == 'release' || github.ref_name == 'release-proxy'
```

This leads to a race; the `latest` tag jumps back and forth depending on the branch that has built images.

## Summary of changes

- Do not push `latest` images to prod ECR (we don't use it)
- Use `docker buildx imagetools` instead of `crane` for tagging images
- Unify `vm-compute-node-image` job with others and use dockerhub as a first source for images (sync images with ECR)
- Tag images with `latest` only for commits in `main`
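The tagging rule after this change can be illustrated with a short sketch. The function name `tags_for_build` and the `build_tag` parameter are hypothetical; the actual logic lives in the CI workflow, not in Python.

```python
from typing import List


def tags_for_build(ref_name: str, build_tag: str) -> List[str]:
    """Return the image tags to push for a build of the given branch.

    Only builds from `main` receive `latest`, so release branches can no
    longer move the tag back and forth.
    """
    tags = [build_tag]
    if ref_name == "main":
        tags.append("latest")
    return tags
```

With the old condition, `release` and `release-proxy` builds would also have appended `latest`, which is what caused the race.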

## Problem

We don't build our docker images for ARM arch, and that makes it harder to run images on ARM (on MacBooks with Apple Silicon, for example).

## Summary of changes

- Build `neondatabase/neon` for ARM and create a multi-arch image
- Build `neondatabase/compute-node-vXX` for ARM and create a multi-arch image
- Run `test-images` job on ARM as well

## Problem

Safekeeper Timeline uses a channel for cancellation, but we have a dedicated type for that.

## Summary of changes

- Use CancellationToken in Timeline

The gate was accidentally being dropped before the final blocking phase, possibly explaining the resident physical size global problems during deletions. It could have caused more harm as well, but the path is not actively being tested because cplane no longer puts locationconfigs with higher generation number during normal operation, which prompted the last wave of fixes. Cc: #7341.

The code was working correctly, but was incorrectly using Buffer for a 0-based index into the BufferDesc array.

Don't set last-written LSN of a page when the record is replayed, only when the page is evicted from cache. For comparison, we don't update the last-written LSN on every page modification on the primary either, only when the page is evicted. Do update the last-written LSN when the page update is skipped in WAL redo, however.

In neon_get_request_lsns(), don't be surprised if the last-written LSN is equal to the record being replayed. Use the LSN of the record being replayed as the request LSN in that case. Add a long comment explaining how that can happen.

In neon_wallog_page, update last-written LSN also when Shutdown has been requested. We might still fetch and evict pages for a while, after shutdown has been requested, so we better continue to do that correctly.

Enable the check that we don't evict a page with zero LSN also in standby, but make it a LOG message instead of PANIC.

Fixes issue #7791

Once all the computes in production have restarted, we can remove protocol version 1 altogether. See issue #6211. This was done earlier already in commit 0115fe6, but reverted before it was released to production in commit bbe730d because of issue #7692. That issue was fixed in commit 22afaea, so we are ready to change the default again.

## Problem

If an existing user already has some aux v1 files, we don't want to switch them to the global tenant-level config.

Part of #7462

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem

Despite making password hashing async, it can still take time away from the network code.

## Summary of changes

Introduce a custom threadpool, inspired by rayon. Features:

### Fairness

Each task is tagged with its endpoint ID. The more times we have seen the endpoint, the more likely we are to skip the task if it comes up in the queue. This uses a count-min sketch estimator for the number of times we have seen the endpoint, resetting it every 1000+ steps.

Since tasks are immediately rescheduled if they do not complete, the worker could get stuck in an "always work available" loop. To combat this, we check the global queue every 61 steps to ensure all tasks quickly get a worker assigned to them.

### Balanced

Using crossbeam_deque, like rayon does, we have work stealing out of the box. I've tested it a fair amount and it seems to balance the workload accordingly.
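The fairness idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the proxy's Rust implementation: the class names, the sketch dimensions, the exact skip-probability formula, and the fixed reset interval of 1000 are all hypothetical; only the count-based skipping, the periodic reset, and the 61-step global queue check come from the description.

```python
import hashlib
import random


class CountMinSketch:
    """Approximate per-key counter; estimates never undercount."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One independent-looking hash per row, derived by salting with the row.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        """Count one sighting of key and return the new estimate."""
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += 1
        return self.estimate(key)

    def estimate(self, key):
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))

    def reset(self):
        self.rows = [[0] * self.width for _ in range(self.depth)]


class FairPicker:
    RESET_EVERY = 1000       # simplification of "every 1000+ steps"
    GLOBAL_CHECK_EVERY = 61  # from the description

    def __init__(self, rng=None):
        self.sketch = CountMinSketch()
        self.steps = 0
        self.rng = rng or random.Random()

    def should_skip(self, endpoint_id):
        """Decide whether to skip a task for this endpoint on this step."""
        self.steps += 1
        if self.steps % self.RESET_EVERY == 0:
            self.sketch.reset()
        seen = self.sketch.add(endpoint_id)
        # Assumed formula: 0 on the first sighting, approaching 1 as seen grows.
        return self.rng.random() < 1.0 - 1.0 / seen

    def should_check_global_queue(self):
        return self.steps > 0 and self.steps % self.GLOBAL_CHECK_EVERY == 0
```

A count-min sketch keeps memory bounded regardless of how many distinct endpoints appear, at the cost of occasionally overestimating a count when keys collide.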

part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem

This test relied on some sleeps, and was failing ~5% of the time.

## Summary of changes

Use log-watching rather than straight waits, and make timeouts more generous for the CI environment.
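The difference between a fixed sleep and log-watching can be shown with a minimal polling helper. This is a sketch only: `wait_for_log_line` and its parameters are hypothetical names, not the helpers used in the Neon test suite.

```python
import re
import time


def wait_for_log_line(read_log, pattern, timeout=20.0, interval=0.1):
    """Poll the log until a line matches `pattern` or `timeout` seconds pass.

    `read_log` is any callable returning the current log contents as a string.
    Returns the first matching line, or raises TimeoutError.
    """
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    while True:
        for line in read_log().splitlines():
            if regex.search(line):
                return line
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no log line matching {pattern!r} within {timeout}s")
        time.sleep(interval)
```

The test proceeds as soon as the event is logged, so a generous timeout costs nothing on fast machines and still tolerates slow CI runners.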

## Problem

Failures on some of our uglier shutdown log messages: https://neon-github-public-dev.s3.amazonaws.com/reports/main/9192662995/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/51b365408678c66f/

## Summary of changes

- Allow-list these errors.

For existing users, we want to allow doing a force switch for their aux file policy. Part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

Signed-off-by: Alex Chi Z <chi@neon.tech>

One change: runner: allow coredump collection (#931)

…h background downloads (#7848)

## Problem

We've seen some strange behaviors when doing lots of migrations involving secondary locations. One of these was where a tenant was apparently stuck in the `Scheduler::running` list, but didn't appear to be making any progress. Another was a shutdown hang (neondatabase/cloud#13576).

## Summary of changes

- Fix one issue (probably not the only one) where a tenant in the `pending` list could proceed to `spawn` even if the same tenant already had a running task via `handle_command` (this could have resulted in a weird value of SecondaryProgress)
- Add various extra logging:
  - log before as well as after layer downloads so that it would be obvious if we were stuck in remote storage code (we shouldn't be, it has built in timeouts)
  - log the number of running + pending jobs from the scheduler every time it wakes up to do a scheduling iteration (~10s) -- this is quite chatty, but not compared with the volume of logs on a busy pageserver. It should give us confidence that the scheduler loop is still alive, and visibility of how many tasks the scheduler thinks are running.

[evidence] of a quite rare flaky failure. The detach can cause this with the right timing.

[evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7650/9191613501/index.html#suites/7745dadbd815ab87f5798aa881796f47/2190222925001078

I looked at the metrics from #7768 on staging and it seems that the manager does too many iterations. This is probably caused by the background job `remove_wal.rs`, which iterates over all timelines and tries to remove WAL and persist the control file. This causes shared state updates and wakes up the manager. The fix is to skip notifying about the updates if nothing was updated.
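The fix amounts to comparing before notifying. A minimal sketch, with a hypothetical `SharedState` class and a counter standing in for waking the manager (the safekeeper itself is written in Rust and is structured differently):

```python
class SharedState:
    """Holds a value and counts how often watchers would be woken up."""

    def __init__(self, value):
        self._value = value
        self.notifications = 0  # stands in for waking the manager task

    def update(self, new_value):
        """Store new_value; notify only if it differs from the current one."""
        if new_value == self._value:
            return False  # nothing changed, so do not wake anyone
        self._value = new_value
        self.notifications += 1
        return True
```

A background job that repeatedly writes the same value then causes no manager iterations at all.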

## Problem

If the parquet upload was unsuccessful, it will panic.

## Summary of changes

Write error in logs instead.

#7844 typo'd one of the expressions: https://neon-github-public-dev.s3.amazonaws.com/reports/main/9196993886/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/e420fbfdb193bf80/

## Problem

By default, pgvector compiles with `-march=native` on some platforms for best performance. However, this can lead to `Illegal instruction` errors if trying to run the compiled extension on a different machine.

I had this problem when trying to run the Neon compute docker image on macOS with Apple Silicon under Rosetta.

See https://github.com/pgvector/pgvector/blob/ff9b22977e3ef19866d23a54332c8717f258e8db/README.md?plain=1#L1021

## Summary of changes

Pass `OPTFLAGS=""` to make.

With #7828 and proper fullbackup testing the test became flaky ([evidence]).

- produce better assertion messages in `assert_pageserver_backups_equal`
- use read only endpoint to confirm the row count

[evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7839/9192447962/index.html#suites/89cfa994d71769e01e3fc4f475a1f3fa/49009214d0f8b8ce

We'd like to get some bits reserved in the length field of image layers for future usage (compression). This PR is based on the assumption that we don't have any blobs that require more than 28 bits (3 bytes + 4 bits) to store the length. As a preparation, before erroring, we first want to emit warnings: if the assumption is wrong, such warnings are less disruptive than errors. A metric would be even less disruptive (log messages are slower; if we have a LOT of such large blobs, it would take a lot of time to print them). At the same time, such 256 MiB blobs will likely occupy an entire layer file, as they are larger than our target size. For layer files we already log something, so there shouldn't be a large increase in overhead.

Part of #5431
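The arithmetic behind the 28-bit limit, and the "warn first, error later" approach, can be sketched as follows. The names `LEN_BITS`, `MAX_BLOB_LEN`, and `check_blob_len` are hypothetical and do not come from the pageserver code.

```python
import logging

LEN_BITS = 28                        # 3 bytes + 4 bits, per the description
MAX_BLOB_LEN = (1 << LEN_BITS) - 1   # largest length that fits in 28 bits


def check_blob_len(length, log=logging.getLogger(__name__)):
    """Return True if `length` fits in 28 bits; otherwise warn and return False.

    Warning instead of raising mirrors the preparation step: the assumption
    is checked in production before it is enforced.
    """
    if length > MAX_BLOB_LEN:
        log.warning("blob of %d bytes needs more than %d bits for its length",
                    length, LEN_BITS)
        return False
    return True
```

Since `1 << 28` is 268,435,456 bytes, the first length that no longer fits is exactly 256 MiB.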

The list timeline API gives something like `"wal_source_connstr":"PgConnectionConfig { host: Domain(\"safekeeper-5.us-east-2.aws.neon.build\"), port: 6500, password: Some(REDACTED-STRING) }"`, which is weird. This pull request makes it look somewhat like a connection string. This field is not used, at least in the neon database, so I assume no one is reading or parsing it.

Signed-off-by: Alex Chi Z <chi@neon.tech>

…7850) Reduces duplication between tiered and legacy compaction by using the `Timeline::create_image_layer_for_rel_blocks` function. This way, we also use vectored get in tiered compaction, so the change has two benefits in one. fixes #7659 --------- Co-authored-by: Alex Chi Z. <iskyzh@gmail.com>

* Reduce the logging level for create image layers of metadata keys. (question: is it possible to adjust logging levels at runtime?)
* Do info logging of image layers only after the layer is created. Now there are a lot of cases where we create the image layer writer but then discard that image layer because it does not contain any key. Therefore, I changed the new image layer logging to trace, and create image layer logging to info.

Signed-off-by: Alex Chi Z <chi@neon.tech>

Once upon a time, we used to have duplicated types for runtime IndexPart and whatever we stored. Because of the serde fixes in #5335 we have no need for duplicated IndexPart type anymore, but the `IndexLayerMetadata` stayed.

- remove the type
- remove LayerFileMetadata::file_size() in favor of direct field access

Split off from #7833. Cc: #3072.

* Make PS connection startup use async APIs

  This allows for improved query cancellation when we start connections.

* Make PS connections have per-shard connection retry state.

  Previously they shared global backoff state, which is bad for quickly getting all connections started and/or back online.

* Make sure we clean up most connection state on failed connections.

  Previously, we could technically leak some resources that we'd otherwise clean up. Now, the resources are correctly cleaned up.

* pagestore_smgr.c now PANICs on unexpected response message types.

  Unexpected responses are likely a symptom of having a desynchronized view of the connection state. As a desynchronized connection state can cause corruption, we PANIC, as we don't know what data may have been written to buffers: the only solution is to fail fast & hope we didn't write wrong data.

* Catch errors in sync pagestream request handling.

  Previously, if a query was cancelled after a message was sent to the pageserver, but before the data was received, the backend could forget that it sent the synchronous request, and let others deal with the repercussions. This could then lead to incorrect responses, or errors such as "unexpected response from page server with tag 0x68"
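The benefit of per-shard retry state can be seen in a small sketch. Everything here is illustrative: the extension is written in C, and the `ShardBackoff` class, the exponential doubling, and the delay values are assumptions rather than details taken from the change.

```python
class ShardBackoff:
    """Retry state for one shard's connection (assumed exponential backoff)."""

    def __init__(self, base_delay=1.0, max_delay=8.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failures = 0

    def next_delay(self):
        """Return the delay before the next attempt and record the failure."""
        delay = min(self.max_delay, self.base_delay * (2 ** self.failures))
        self.failures += 1
        return delay

    def reset(self):
        """Call after a successful connection."""
        self.failures = 0


def make_backoffs(shard_count):
    """One independent backoff object per shard instead of a global one."""
    return {shard: ShardBackoff() for shard in range(shard_count)}
```

With a single shared object, repeated failures on one shard would also lengthen the wait for every other shard; keeping the state per shard lets healthy shards reconnect at the base delay.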

## Problem

One database is too limiting. We have agreed to raise this limit to 10.

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above checklist

## Problem

- After a shard split of a large existing tenant, child tenants can end up with oversized historic layers indefinitely, if those layers are prevented from being GC'd by branchpoints.

This PR follows #7531, and adds rewriting of layers that contain a mixture of needed & un-needed contents, in addition to dropping un-needed layers.

Closes: #7504

## Summary of changes

- Add methods to ImageLayer for reading back existing layers
- Extend `compact_shard_ancestors` to rewrite layer files that contain a mixture of keys that we want and keys we do not, if unwanted keys are the majority of those in the file.
- Amend initialization code to handle multiple layers with the same LayerName properly
- Get rid of renaming bad layer files to `.old` since that's now expected on restarts during rewrites.
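The per-layer decision described for `compact_shard_ancestors` can be summarized as a three-way classification. This is a sketch of the rule as stated in the text; `LayerAction` and `classify_layer` are hypothetical names, and the real code works on key ranges and shard identity rather than plain counts.

```python
from enum import Enum


class LayerAction(Enum):
    KEEP = "keep"
    DROP = "drop"
    REWRITE = "rewrite"


def classify_layer(wanted_keys, unwanted_keys):
    """Decide what to do with a layer after a shard split.

    - no wanted keys: the layer is un-needed and can be dropped
    - unwanted keys are the majority: rewrite the layer without them
    - otherwise: keep the layer as it is
    """
    if wanted_keys == 0:
        return LayerAction.DROP
    if unwanted_keys > wanted_keys:
        return LayerAction.REWRITE
    return LayerAction.KEEP
```

Rewriting only when unwanted keys dominate avoids spending I/O on layers where little space would be reclaimed.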

…m always yield Err after cancel (#7866)

## Problem

Ongoing hunt for secondary location shutdown hang issues.

## Summary of changes

- Revert the functional changes from #7675
- Tweak a log in secondary downloads to make it more apparent when we drop out on cancellation
- Modify DownloadStream's behavior to always return an Err after it has been cancelled. This _should_ not impact anything, but it makes the behavior simpler to reason about (e.g. even if the poll function somehow got called again, it could never end up in an un-cancellable state)

Related #neondatabase/cloud#13576
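The "always Err after cancel" behavior is a latch: once cancellation has been observed, every later poll fails, regardless of what the cancellation source reports afterwards. A Python analogue, with a hypothetical iterator class and exception type standing in for the Rust stream:

```python
class DownloadCancelled(Exception):
    """Raised on every read after cancellation has been observed."""


class LatchingDownloadStream:
    """Iterator over chunks that fails permanently once cancelled."""

    def __init__(self, chunks, is_cancelled):
        self._chunks = iter(chunks)
        self._is_cancelled = is_cancelled  # callable returning bool
        self._cancelled = False

    def __iter__(self):
        return self

    def __next__(self):
        if self._cancelled or self._is_cancelled():
            self._cancelled = True  # latch: never leave the cancelled state
            raise DownloadCancelled()
        return next(self._chunks)
```

Because the flag is stored on the stream itself, a caller that polls again after the first error gets the same error instead of more data.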

We have some 1001ms cases, which do not yield gate guard context.

## Problem

A title for automatic proxy release PRs is `Proxy release`, and for storage & compute, it's just `Release`

## Summary of changes

- Amend PR title for Storage & Compute releases to "Storage & Compute release"

## Problem

Seems the websocket buffering was broken for large query responses only.

## Summary of changes

Move buffering until after the underlying stream is ready. Local testing confirms this fixes the bug.

Also fixes the pg-sni-router missing metrics bug.

Do pull_timeline while WAL is being removed. To this end:

- extract pausable_failpoint to utils, sprinkle pull_timeline with it
- add 'checkpoint' sk http endpoint to force WAL removal.

After fixing the check of the pull file status code, the test fails so far, which is expected.

github-actions · 2024-05-27T06:52:16Z

3138 tests run: 3006 passed, 0 failed, 132 skipped (full report)

Flaky tests (3)

Postgres 15

test_download_remote_layers_api: debug
test_timeline_ancestor_errors: release

Postgres 14

test_pageserver_restarts_under_worload: release

Code coverage* (full report)

functions: 31.4% (6448 of 20543 functions)
lines: 48.3% (49953 of 103355 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results (068c158 at 2024-05-27T14:00:04.371Z).

koivunej · 2024-05-27T12:35:35Z

Will most likely cherry-pick #7885 still into this, if it doesn't work at all right now.

## Problem

After [0e4f182], which introduced async connect, Neon is not able to connect to the page server.

## Summary of changes

Perform sync commit on macOS.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

danieltprice · 2024-05-30T22:52:32Z

Reviewed for changelog.

koivunej and others added 30 commits May 22, 2024 15:43

CI(report-benchmarks-failures): fix condition (#7820)

d1d55bb

## Problem

`report-benchmarks-failures` got skipped if a dependent job fails.

## Summary of changes

- Fix the if-condition by adding `&& failure()` to it; it'll make the job run if the dependent job fails.

Fix typos in action definitions

8901ce9

Make postgres_version action input default to a string

900f391

This is "required" by GitHub Actions, though they must do some coercion on their side.

safekeeper: use CancellationToken instead of watch channel (#7836)

e015b2b

## Problem Safekeeper Timeline uses a channel for cancellation, but we have a dedicated type for that. ## Summary of changes - Use CancellationToken in Timeline

Fix confusion between 1-based Buffer and 0-based index (#7825)

3404e76

The code was working correctly, but was incorrectly using Buffer for a 0-based index into the BufferDesc array.

feat(pageserver): auto-detect previous aux file policy (#7841)

64577cf

## Problem If an existing user already has some aux v1 files, we don't want to switch them to the global tenant-level config. Part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

chore(pageserver): use kebab case for aux file flag (#7840)

ddd8ebd

part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

tests: refine test_secondary_background_downloads (#7829)

014f822

## Problem This test relied on some sleeps, and was failing ~5% of the time. ## Summary of changes Use log-watching rather than straight waits, and make timeouts more generous for the CI environment.

chore(pageserver): add force aux file policy switch handler (#7842)

4a278cc

For existing users, we want to allow doing a force switch for their aux file policy. Part of #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

chore(pageserver): use kebab case for compaction algorithms (#7845)

ff560a1

Signed-off-by: Alex Chi Z <chi@neon.tech>

Bump vm-builder v0.28.1 -> v0.29.3 (#7849)

eb0c026

One change: runner: allow coredump collection (#931)

[proxy] Do not fail after parquet upload error (#7858)

cd6d811

## Problem If the parquet upload was unsuccessful, it will panic. ## Summary of changes Write error in logs instead.

tests: fix an allow list entry (#7856)

545f7e8

#7844 typo'd one of the expressions: https://neon-github-public-dev.s3.amazonaws.com/reports/main/9196993886/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/e420fbfdb193bf80/

skyzh and others added 12 commits May 23, 2024 15:30

chore: lower gate guard drop logging threshold to 100ms (#7862)

a3f5b83

We have some 1001ms cases, which do not yield gate guard context.

CI(release): tune Storage & Compute release PR title (#7870)

71a7fd9

## Problem A title for automatic proxy release PRs is `Proxy release`, and for storage & compute, it's just `Release` ## Summary of changes - Amend PR title for Storage & Compute releases to "Storage & Compute release"

Make python Safekeeper datadir Path instead of str.

b2d34a8

vipvap requested review from a team as code owners May 27, 2024 06:05

vipvap requested review from petuhovskiy, conradludgate and hlinnaka and removed request for a team May 27, 2024 06:05

koivunej approved these changes May 27, 2024

koivunej merged commit 2e1fe71 into release May 27, 2024
65 checks passed

koivunej deleted the rc/2024-05-27 branch May 27, 2024 17:30
