Release 2024-06-17 #8069

vipvap · 2024-06-17T06:04:16Z

Storage & Compute release 2024-06-17

Please merge this Pull Request using 'Create a merge commit' button

A simple API to collect some statistics after compaction, so that the result is easy to understand. The tool reads the layer map and analyzes it range by range instead of doing single-key operations, which is more efficient than running a benchmark to collect the result. It currently computes two key metrics:

* Latest data access efficiency: how many delta layers / image layers the system needs to iterate before returning any key in a key range.
* (Approximate) PiTR efficiency, as in #7770, which is simply the number of delta files in the range. The reasoning is that, assuming no image layer is created, PiTR efficiency is simply the cost of collecting records from the delta layers plus the replay time. The number of delta files (or, in the future, the estimated size of reads) is a simple yet efficient way of estimating how much effort the pageserver needs to reconstruct a page.

Signed-off-by: Alex Chi Z <chi@neon.tech>
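
The two metrics can be illustrated with a small model. This is only a sketch under stated assumptions: the `Layer` class, the integer keys, and the `range_stats` function are hypothetical and are not the pageserver's API. It splits a key range at layer boundaries and, for each piece, counts how many layers are visited (newest first) until an image layer is hit, and how many delta layers cover the piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    # Hypothetical, simplified layer descriptor (not the real layer map entry).
    key_start: int   # inclusive
    key_end: int     # exclusive
    lsn_end: int     # newest LSN covered; used only to order layers
    is_delta: bool   # False means image layer


def range_stats(layers, lo, hi):
    # Split [lo, hi) at every layer boundary so that each piece is
    # covered by a fixed set of layers.
    points = {lo, hi}
    for layer in layers:
        for k in (layer.key_start, layer.key_end):
            if lo < k < hi:
                points.add(k)
    points = sorted(points)

    stats = []
    for a, b in zip(points, points[1:]):
        covering = [l for l in layers if l.key_start <= a and b <= l.key_end]
        covering.sort(key=lambda l: l.lsn_end, reverse=True)  # newest first
        # Latest data access: layers visited until the first image layer.
        read_depth = 0
        for layer in covering:
            read_depth += 1
            if not layer.is_delta:
                break
        # Approximate PiTR cost: number of delta layers over this piece.
        delta_count = sum(1 for l in covering if l.is_delta)
        stats.append(((a, b), read_depth, delta_count))
    return stats


layers = [
    Layer(0, 10, 10, False),  # image layer
    Layer(0, 5, 20, True),    # delta layer over the lower half
    Layer(0, 10, 30, True),   # delta layer over the full range
]
result = range_stats(layers, 0, 10)
```

Working range by range means the cost is proportional to the number of layer boundaries rather than to the number of keys.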

@hlinnaka

Reverts #7956

Rationale: compute incompatibilities

Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1718011276665839?thread_ts=1718008160.431869&cid=C033RQ5SPDH

Relevant quotes from @hlinnaka

> If we go through with the current release candidate, but the compute is pinned, people who create new projects will get that warning, which is silly. To them, it looks like the ICU version was downgraded, because initdb was run with newer version.

> We should upgrade the ICU version eventually. And when we do that, users with old projects that use ICU will start to see that warning. I think that's acceptable, as long as we do homework, notify users, and communicate that properly.

> When do that, we should to try to upgrade the storage and compute versions at roughly the same time.

Implement LogUtils in the Endpoint fixture class, so that the "log_contains" function can be used on compute logs too. Per discussion at: #7288 (comment)

As seen with the pgvector 0.7.0 index builds, we can receive large batches of images, leading to very large L0 layers in the range of 1GB. These large layers are produced because we are only able to roll the layer after we have witnessed two different Lsns in a single `DataDirModification::commit`. As the single Lsn batches of images can span over multiple `DataDirModification` lifespans, we will rarely get to write two different Lsns in a single `put_batch` currently. The solution is to remember the TimelineWriterState instead of eagerly forgetting it until we really open the next layer or someone else flushes (while holding the write_guard). Additional changes are test fixes to avoid "initdb image layer optimization" or ignoring initdb layers for assertion. Cc: #7197 because small `checkpoint_distance` will now trigger the "initdb image layer optimization"
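
A toy model of the rolling rule may make the problem easier to see. It is a sketch only: the `Writer` class, its `remember_state` flag, and the sizes are hypothetical and greatly simplified compared with the real `TimelineWriterState`. A layer may only be rolled when the incoming LSN differs from the previously written one, so forgetting the previous LSN between batches prevents rolling when each batch carries a single LSN.

```python
class Writer:
    # Hypothetical model: rolls to a new layer when the open layer is at
    # least `checkpoint_distance` big AND the LSN changes.
    def __init__(self, checkpoint_distance, remember_state):
        self.checkpoint_distance = checkpoint_distance
        self.remember_state = remember_state
        self.layers = [0]      # sizes of layers written so far
        self.prev_lsn = None   # last LSN written into the open layer

    def put_batch(self, batch):
        if not self.remember_state:
            # Old behaviour in this model: state is forgotten between batches.
            self.prev_lsn = None
        for lsn, size in batch:
            lsn_changed = self.prev_lsn is not None and lsn != self.prev_lsn
            if lsn_changed and self.layers[-1] >= self.checkpoint_distance:
                self.layers.append(0)  # roll: open the next layer
            self.layers[-1] += size
            self.prev_lsn = lsn


# Three batches, each carrying a single LSN.
batches = [[(1, 60)], [(2, 60)], [(3, 60)]]

forgetful = Writer(checkpoint_distance=100, remember_state=False)
remembering = Writer(checkpoint_distance=100, remember_state=True)
for b in batches:
    forgetful.put_batch(b)
    remembering.put_batch(b)
```

In this model the forgetful writer never sees two LSNs at once and keeps growing a single layer, while the remembering writer rolls once the size threshold is passed and the LSN changes.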

Quite a few existing test cases create their own timelines instead of using the default one. This pull request highlights that and hopefully people can write simpler tests in the future. Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Yuchen Liang <70461588+yliang412@users.noreply.github.com>

The new features have deteriorated layer flushing, most recently with #7927. Changes:

- inline `Timeline::freeze_inmem_layer` to the only caller
- carry the TimelineWriterState guard to the actual point of freezing the layer
- this allows us to `#[cfg(feature = "testing")]` the assertion added in #7927
- remove duplicate `flush_frozen_layer` in favor of splitting the `flush_frozen_layers_and_wait`
- this requires starting the flush loop earlier for `checkpoint_distance < initdb size` tests

## Problem

We need automated tests of extensions shipped with Neon to detect possible problems.

## Summary of changes

A new image, neon-test-extensions, is added. Workflow changes to test the shipped extensions are added as well. Currently, the regression tests shipped with the extensions are used. Some extensions, namely rum, timescaledb, rdkit, postgis, pgx_ulid, pgtap, pg_tiktoken, pg_jsonschema, pg_graphql, kq_imcx, and wal2json_2_5, are excluded due to problems or the absence of internal tests.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

We've stored metadata as bytes within the `index_part.json` for long-fixed reasons. #7693 added support for reading out normal json serialization of the `TimelineMetadata`.

Change the serialization to only write `TimelineMetadata` as json going forward, keeping backward compatibility for reading the metadata as bytes. Because of the failure to include `alias = "metadata"` in #7693, one more follow-up is required to make the switch from the old name to `"metadata": <json>`, but that affects only the field name in the serialized format.

In documentation and naming, an effort is made to add enough warning signs around TimelineMetadata so that it will receive no changes in the future. We can add those fields to `IndexPart` directly instead.

Additionally, the path to cleaning up `metadata.rs` is documented in the `metadata.rs` module comment. If we must extend `TimelineMetadata` before that, the duplication suggested in [review comment] is the way to go.

[review comment]: #7699 (review)
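
The "write only json, still read bytes" pattern can be sketched as follows. Everything here is an assumption made for illustration: the two-field legacy layout, the `field_a`/`field_b` names, and the helper functions are hypothetical and do not reflect the actual `TimelineMetadata` format.

```python
import json
import struct


def decode_legacy(raw: bytes) -> dict:
    # Hypothetical legacy layout: two big-endian u64 values.
    field_a, field_b = struct.unpack(">QQ", raw)
    return {"field_a": field_a, "field_b": field_b}


def read_metadata(value):
    # New format: a plain json object.
    if isinstance(value, dict):
        return value
    # Old format: the metadata stored as an array of byte values.
    if isinstance(value, list):
        return decode_legacy(bytes(value))
    raise ValueError(f"unsupported metadata encoding: {type(value).__name__}")


def write_metadata(meta: dict) -> str:
    # Going forward, only the json form is ever written.
    return json.dumps(meta, sort_keys=True)


legacy_value = list(struct.pack(">QQ", 1, 2))
from_legacy = read_metadata(legacy_value)
round_trip = read_metadata(json.loads(write_metadata(from_legacy)))
```

Keeping the reader tolerant while narrowing the writer lets old files remain loadable until they are rewritten in the new form.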

We need unique tenant harness names in case you want to inspect the results of the last failing run. We are not using any proc macros to get the test name as there is no stable way of doing that, and there will not be one in the future, so we need to fix these duplicates. Also, clean up the duplicated tests to not mix `?` and `unwrap/assert`.

…paction at gc_horizon (#7948)

A demo of a building block for compaction. The GC-compaction operation iterates over all layers below or intersecting the GC horizon and does a full layer rewrite of all of them. The end result will be an image layer covering the full keyspace at the GC horizon, and a bunch of delta layers above the GC horizon. This helps us collect the garbage in the test_gc_feedback test case to reduce space amplification.

This operation can be triggered manually using an HTTP API or triggered based on some metrics. The actual method is TBD.

The test is very basic and it's very likely that most of the algorithm will be rewritten. I would like to get this merged so that I can have a basic skeleton for the algorithm and then make incremental changes.

<img width="924" alt="image" src="https://github.com/neondatabase/neon/assets/4198311/f3d49f4e-634f-4f56-986d-bfefc6ae6ee2">

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
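
The intended end state can be sketched with a toy key history. This is an illustration under simplifying assumptions, not the algorithm in the PR: the `gc_compact` function is hypothetical, and every version is stored as a full value, whereas real delta layers hold WAL records that need replay.

```python
def gc_compact(history, gc_horizon):
    # history: key -> list of (lsn, value) versions, in any order.
    # Returns (image, deltas): one value per key as of the horizon, plus
    # the versions strictly above the horizon, which must be kept.
    image = {}
    deltas = {}
    for key, versions in history.items():
        at_or_below = [(lsn, v) for lsn, v in versions if lsn <= gc_horizon]
        above = sorted(
            ((lsn, v) for lsn, v in versions if lsn > gc_horizon),
            key=lambda item: item[0],
        )
        if at_or_below:
            # Only the newest version at or below the horizon survives.
            image[key] = max(at_or_below, key=lambda item: item[0])[1]
        if above:
            deltas[key] = above
    return image, deltas


history = {
    "a": [(10, "a10"), (20, "a20"), (30, "a30")],
    "b": [(5, "b5")],
    "c": [(40, "c40")],
}
image, deltas = gc_compact(history, gc_horizon=25)
```

Versions older than the newest one at or below the horizon (here `a10`) are the garbage that gets dropped, which is where the reduction in space amplification comes from.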

## Problem

The storage controller does not track the number of shards attached to a given pageserver. This is a requirement for various scheduling operations (e.g. draining and filling will use this to figure out if the cluster is balanced).

## Summary of Changes

Track the number of shards attached to each node.

Related #7387

## Problem

We need the ability to prepare a subset of storage controller managed pageservers for decommissioning. The storage controller cannot currently express this in terms of scheduling constraints (it's a pretty special case, so I'm not sure it even should).

## Summary of Changes

A new `drain` command is added to `storcon_cli`. It takes a set of nodes to drain and migrates primary attachments outside of said set. Simple round-robin assignment is used, under the assumption that nodes outside of the draining set are evenly balanced.

Note that secondary locations are not migrated. This is fine for staging, but the migration API will have to be extended for prod in order to allow migration of secondaries as well.

I've tested this out against a neon local cluster. The immediate use for this command will be to migrate staging to ARM (AArch64) pageservers.

Related neondatabase/cloud#14029
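
The round-robin assignment described above can be sketched as follows. The `plan_drain` function and its inputs are hypothetical and only model the idea; they are not the `storcon_cli` implementation.

```python
from itertools import cycle


def plan_drain(attachments, drain_set, all_nodes):
    # attachments: shard id -> node currently holding the primary attachment.
    # Returns shard id -> destination node for every shard on a draining node.
    targets = [n for n in all_nodes if n not in drain_set]
    if not targets:
        raise ValueError("no nodes left outside of the draining set")
    next_target = cycle(targets)  # simple round robin over remaining nodes
    plan = {}
    for shard, node in attachments.items():
        if node in drain_set:
            plan[shard] = next(next_target)
    return plan


attachments = {"s1": "n1", "s2": "n1", "s3": "n2", "s4": "n1"}
plan = plan_drain(attachments, drain_set={"n1"}, all_nodes=["n1", "n2", "n3"])
```

Round robin spreads the migrated shards evenly, which only keeps the cluster balanced if the target nodes were evenly loaded to begin with, as the description notes.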

This makes IDEs and github diff format the code the same way as PostgreSQL sources, which is the style we try to maintain.

The S3 scrubber contains "S3" in its name, but we want to make it generic in terms of which storage is used (#7547). Therefore, rename it to "storage scrubber", following the naming scheme of already existing components "storage broker" and "storage controller". Part of #7547

We implemented on-demand WAL download for walsender, but other things that may want to read the WAL from safekeepers don't do that yet. This PR makes it do that by adding the same set of hooks to logicalfuncs. Addresses #7959 Also relies on: neondatabase/postgres#438 neondatabase/postgres#437 neondatabase/postgres#436

- Split the first and second parts of the test into two separate tests

- In the first test, disable the aggressive GC, compaction, and autovacuum. They are only needed by the second test. I'd like to get the first test to a point that the VM page is never all-zeros. Disabling autovacuum in the first test is hopefully enough to accomplish that.

- Compare the full page images, don't skip page header. After fixing the previous point, there should be no discrepancy. LSN still won't match, though, because of commit 387a368.

Fixes issue #7984

Let's be modern.

- Fix the dockerhub URLs

- `neondatabase/compute-node` image has been replaced with Postgres version specific images like `neondatabase/compute-node-v16`

- Use TAG=latest in the example, rather than some old tag. That's a sensible default for people to copy-paste

- For convenience, use a Postgres connection URL in the `psql` example that also includes the password. That way, there's no need to set up .pgpass

- Update the image names in `docker ps` example to match what you get when you follow the example

…#8024)

## Problem

The merging of #7818 caused a problem with the docker-compose file. Running docker compose is now impossible due to the unavailability of the neon-test-extensions:latest image.

## Summary of changes

Fix the problem: add the latest tag to the neon-test-extensions image and use the profiles feature of the docker-compose file to avoid loading the neon-test-extensions container if it is not needed.

## Problem

The previous code would attempt to drain to unavailable or unschedulable nodes.

## Summary of Changes

Remove such nodes from the list of nodes to fill.
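
A minimal sketch of the filtering step, assuming a hypothetical `Node` record with `available` and `schedulable` flags (the storage controller's real node state is richer than two booleans):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical node descriptor for illustration only.
    node_id: int
    available: bool
    schedulable: bool


def eligible_fill_targets(nodes, drain_ids):
    # Keep only nodes that are outside the draining set and can
    # actually accept new attachments.
    return [
        n.node_id
        for n in nodes
        if n.node_id not in drain_ids and n.available and n.schedulable
    ]


nodes = [
    Node(1, available=True, schedulable=True),    # being drained
    Node(2, available=False, schedulable=True),   # offline
    Node(3, available=True, schedulable=False),   # scheduling paused
    Node(4, available=True, schedulable=True),    # valid target
]
fill_targets = eligible_fill_targets(nodes, drain_ids={1})
```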

…eserver (#8023)

## Problem

The page bench test case test_pageserver_max_throughput_getpage_at_latest_lsn had been deactivated because it was flaky.

We now ignore copy fail error messages like in https://github.com/neondatabase/neon/blob/270d3be507643f068120b52838c497f6c1b45b61/test_runner/regress/test_pageserver_getpage_throttle.py#L17-L20 and want to reactivate it to see if it is still flaky.

## Summary of changes

- reactivate the test in CI
- ignore CopyFail error message during page bench test cases

## Problem

The version was missing in the image name, causing an error during the workflow.

## Summary of changes

Added the version to the image name.

Some test cases add random keys into the timeline that are not part of the `collect_keyspace`, which causes compaction to remove the keys. The pull request adds a field to supply extra keyspaces during unit tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

This query causes metrics exporter to complain about missing data because it can't find the correct column. Issue was introduced with #7761

…ry (#8031) If a standby is started right after switching to a new WAL segment, the request in the SLRU download request would point to the beginning of the segment (e.g. 0/5000000), while the not-modified-since LSN would point to just after the page header (e.g. 0/5000028). It's effectively the same position, as there cannot be any WAL records in between, but the pageserver rightly errors out on any request where the request LSN < not-modified since LSN. To fix, round down the not-modified since LSN to the beginning of the page like the request LSN. Fixes issue #8030
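
The rounding can be checked with the LSN values from the description. This is a sketch in Python, not the code from the fix; it assumes the default WAL page size `XLOG_BLCKSZ` of 8192 bytes, and the helper names are made up for illustration.

```python
XLOG_BLCKSZ = 8192  # default PostgreSQL WAL page size, assumed here


def parse_lsn(text: str) -> int:
    # "0/5000028" -> 64-bit integer (high 32 bits / low 32 bits, both hex)
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn: int) -> str:
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def round_down_to_page(lsn: int) -> int:
    # Drop the offset within the WAL page.
    return lsn - (lsn % XLOG_BLCKSZ)


request_lsn = parse_lsn("0/5000000")
not_modified_since = parse_lsn("0/5000028")  # just after the page header
rounded = round_down_to_page(not_modified_since)
```

After rounding, the not-modified-since LSN equals the request LSN, so the `request LSN < not-modified since LSN` check no longer trips.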

## Problem

Respect errors classification from cplane

#8002

We need a mock WAL record to make it easier to write unit tests. This pull request adds such a record. It has a `clear` flag and an `append` field. The tests for legacy-enhanced compaction are not modified yet and will be part of the next pull request.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
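
How such a record might be replayed can be sketched as below. The `TestRecord` class and `apply_records` function are hypothetical Python stand-ins, and the semantics shown (clear the page first if `clear` is set, then append) are an assumption based on the field names in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestRecord:
    # Hypothetical mock record: optionally reset the page, then append text.
    clear: bool
    append: str


def apply_records(base: str, records) -> str:
    page = base
    for record in records:
        if record.clear:
            page = ""          # discard everything written so far
        page += record.append  # then append this record's payload
    return page


page = apply_records(
    "base",
    [TestRecord(False, "a"), TestRecord(True, "b"), TestRecord(False, "c")],
)
```

A record with `clear` set behaves like a fresh starting point, so history before it does not affect the reconstructed value.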

…ge cache hit (#8050)

# Problem

Suppose our vectored get starts with an inexact materialized page cache hit ("cached lsn") that is shadowed by a newer image layer. Like so:

```
<inmemory layers>

    +-+ < delta layer
    | |
   -|-|----- < image layer
    | |
    | |
   -|-|----- < cached lsn for requested key
    +_+
```

The correct visitation order is

1. inmemory layers
2. delta layer records in LSN range `[image_layer.lsn, oldest_inmemory_layer.lsn_range.start)`
3. image layer

However, when the vectored get code visits the delta layer, it (incorrectly!) returns with state `Complete`.

The reason why it returns is that it calls `on_lsn_advanced` with `self.lsn_range.start`, i.e., the start of the layer's LSN range. Instead, it should use `lsn_range.start`, i.e., the start of the LSN range from the correct visitation order listed above.

# Solution

Use `lsn_range.start` instead of `self.lsn_range.start`.

# Refs

discovered by & fixes #6967

Co-authored-by: Vlad Lazar <vlad@neon.tech>
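
The difference between the two calls can be shown with a simplified model. This is a sketch, not the pageserver code: it assumes a key counts as complete once the read has advanced down to or below its cached LSN, and the function names and LSN numbers are invented for illustration.

```python
def key_is_complete(advanced_to, cached_lsn):
    # Assumed rule: once reading has advanced down to the cached LSN,
    # the cached page plus the collected records are sufficient.
    return cached_lsn is not None and cached_lsn >= advanced_to


def visit_delta_layer(layer_lsn_range, read_lsn_range, cached_lsn, use_layer_start):
    # layer_lsn_range: the full LSN range of the delta layer.
    # read_lsn_range: the part that should actually be read; its start is
    # clipped to the image layer that shadows the cached page.
    start = layer_lsn_range[0] if use_layer_start else read_lsn_range[0]
    return "Complete" if key_is_complete(start, cached_lsn) else "NotComplete"


delta_layer = (5, 50)     # spans from below the cached LSN to above the image
image_lsn = 20
cached_lsn = 10
read_range = (image_lsn, 50)

buggy = visit_delta_layer(delta_layer, read_range, cached_lsn, use_layer_start=True)
fixed = visit_delta_layer(delta_layer, read_range, cached_lsn, use_layer_start=False)
```

With the layer's own start the key is declared complete before the image layer at LSN 20 is ever visited; with the clipped start the traversal continues to the image layer.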

This will help when analyzing the origins of connections to a compute like in [0]. [0]: neondatabase/cloud#14247

Update pgvector to 0.7.2 Purely mechanical update to pgvector.patch, just as a place to start from

## Problem

Some code paths during secondary mode download are returning Ok() rather than UpdateError::Cancelled. This is functionally okay, but it means that the end of TenantDownloader::download has a sanity check that the progress is 100% on success, and prints a "Correcting drift..." warning if not. This warning can be emitted in a test, e.g. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8049/9503642976/index.html#/testresult/fff1624ba6adae9e.

## Summary of changes

- In secondary download cancellation paths, use Err(UpdateError::Cancelled) rather than Ok(), so that we drop out of the download function and do not reach the progress sanity check.

This failed once with `relation "test" does not exist` when trying to run the query on the standby. It's possible that the standby is started before the CREATE TABLE is processed in the pageserver, and the standby opens up for queries before it has received the CREATE TABLE transaction from the primary. To fix, wait for the standby to catch up to the primary before starting to run the queries. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8025/9483658488/index.html

## Problem

We have a number of outdated actions in the CI pipeline, and GitHub complains about some of them.

## Summary of changes

- Update `actions/checkout@1` (a really old one) in `vm-compute-node-image`
- Update `actions/checkout@3` in `build-build-tools-image`
- Update `docker/setup-buildx-action` in all workflows / jobs. It was downgraded in #7445, but it seems to work fine now

## Problem

This test could fail with a timeout waiting for tenant deletions.

Tenant deletions could get tripped up on nodes transitioning from offline to online at the moment of the deletion. In a previous reconciliation, the reconciler would skip detaching a particular location because the node was offline, but then when we do the delete the node is marked as online and can be picked as the node to use for issuing a deletion request. This hits the "Unexpectedly still attached" path, which would still work if the caller kept calling DELETE, but if a caller does a DELETE, GET, GET, GET poll, then it doesn't work because the GET calls fail after we've marked the tenant as detached.

## Summary of changes

Fix the undesirable storage controller behavior highlighted by this test failure:

- Change tenant deletion flow to _always_ wait for reconciliation to succeed: it was unsound to proceed and return 202 if something was still attached, because after the 202 callers can no longer GET the tenant.

Stabilize the test:

- Add a reconcile_until_idle to the test, so that it will not have reconciliations running in the background while we mark a node online. This test is not meant to be a chaos test: we should test that kind of complexity elsewhere.
- This reconcile_until_idle also fixes another failure mode where the test might see a None for a tenant location because a reconcile was mutating it (https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7288/9500177581/index.html#suites/8fc5d1648d2225380766afde7c428d81/4acece42ae00c442/)

It remains the case that a motivated tester could produce a situation where a DELETE gives a 500, when precisely the wrong node transitions from offline to available at the precise moment of a deletion (but the 500 is better than returning 202 and then failing all subsequent GETs). Note that nodes don't go through the offline state during normal restarts, so this is super rare. We should eventually fix this by making DELETE to the pageserver implicitly detach the tenant if it's attached, but that should wait until nobody is using the legacy-style deletes (the ones that use 202 + polling).

…8051)

## Problem

This PR refactors some error handling to avoid log spam on tenant/timeline shutdown.

- "ignoring failure to find gc cutoffs: timeline shutting down." logs (#8012)
- "synthetic_size_worker: failed to calculate synthetic size for tenant ...: Failed to refresh gc_info before gathering inputs: tenant shutting down", for example here: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8049/9502988669/index.html#suites/3fc871d9ee8127d8501d607e03205abb/1a074a66548bbcea

Closes: #8012

## Summary of changes

- Refactor: Add a PageReconstructError variant to GcError: this is the only kind of error that find_gc_cutoffs can emit.
- Functional change: only ignore shutdown PageReconstructError variant: for other variants, treat it as a real error
- Refactor: add a structured CalculateSyntheticSizeError type and use it instead of anyhow::Error in synthetic size calculations
- Functional change: while iterating through timelines gathering logical sizes, only drop out if the whole tenant is cancelled: individual timeline cancellations indicate deletion in progress and we can just ignore those.

## Problem

Rust 1.79 enables new lints by default.

## Summary of changes

* update to rust 1.79
* `s/default_features/default-features/`
* fix proxy dead code.
* fix pageserver dead code.

Graceful shutdown broke it.

## Problem

I've bumped `docker/setup-buildx-action` in #8042 because I wasn't able to reproduce the issue from #7445. But now the issue appears again in https://github.com/neondatabase/neon/actions/runs/9514373620/job/26226626923?pr=8059

The steps to reproduce aren't clear; it probably required `docker/setup-buildx-action@v3` and rebuilding the image without cache.

## Summary of changes

- Downgrade `docker/setup-buildx-action@v3` to `docker/setup-buildx-action@v2`

follow up on #7904 avoid a layer of indirection introduced by `Vec<Range<Key>>` Signed-off-by: Alex Chi Z <chi@neon.tech>

…ts (#8057)

## Problem

The halfvec data type was introduced in pgvector 0.7.0 and is popular because it allows smaller vectors, smaller indexes and potentially better performance.

So far we have not tested halfvec in our periodic performance tests. This PR adds halfvec indexing and halfvec queries to the test.

cargo test (or nextest) might rebuild the binaries with different features/flags, so do install immediately after the build. Triggered by the particular case of nextest invocations missing $CARGO_FEATURES, which recompiled safekeeper without 'testing' feature which made python tests needing it (failpoints) not run in the CI. Also add CARGO_FEATURES to the nextest runs anyway because there doesn't seem to be an important reason not to.

github-actions · 2024-06-17T06:59:31Z

3216 tests run: 3087 passed, 0 failed, 129 skipped (full report)

Flaky tests (1)

Postgres 14

test_metric_collection: debug

Code coverage* (full report)

functions: 32.4% (6831 of 21064 functions)
lines: 50.0% (53148 of 106301 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
2ba4145 at 2024-06-17T06:59:31.236Z :recycle:}

danieltprice · 2024-06-21T01:03:37Z

Reviewed for changelog

skyzh and others added 30 commits June 10, 2024 10:42

Simplify scanning compute logs in tests (#7997)

5a7e285

Implement LogUtils in the Endpoint fixture class, so that the "log_contains" function can be used on compute logs too. Per discussion at: #7288 (comment)

Copy editor config for the neon extension from PostgreSQL (#8009)

78a59b9

This makes IDEs and github diff format the code the same way as PostgreSQL sources, which is the style we try to maintain.

Update default Postgres version in docker-compose.yml (#8019)

69aa1ac

Let's be modern.

storcon_cli: do not drain to undesirable nodes (#8027)

3099e1a

## Problem

The previous code would attempt to drain to unavailable or unschedulable nodes.

## Summary of Changes

Remove such nodes from the list of nodes to fill.

Add the image version to the neon-test-extensions image (#8032)

9dda13e

## Problem

The version was missing in the image name, causing an error during the workflow.

## Summary of changes

Added the version to the image name.

Fix query error in vm-image-spec.yaml (#8028)

ad0ab3b

This query causes metrics exporter to complain about missing data because it can't find the correct column. Issue was introduced with #7761

Proxy process updated errors (#8026)

fbccd1e

## Problem

Respect errors classification from cplane

Set application_name for internal connections to computes

0c3e3a8

This will help when analyzing the origins of connections to a compute like in [0]. [0]: neondatabase/cloud#14247

extensions: pgvector-0.7.2 (#8037)

f670101

Update pgvector to 0.7.2 Purely mechanical update to pgvector.patch, just as a place to start from

jcsp and others added 11 commits June 14, 2024 09:39

update rust to 1.79.0 (#8048)

e6eb002

## Problem

Rust 1.79 enables new lints by default.

## Summary of changes

* update to rust 1.79
* `s/default_features/default-features/`
* fix proxy dead code.
* fix pageserver dead code.

Fix test_segment_init_failure.

a71f58e

Graceful shutdown broke it.

chore(pageserver): vectored get target_keyspace directly accums (#8055)

8189219

follow up on #7904 avoid a layer of indirection introduced by `Vec<Range<Key>>` Signed-off-by: Alex Chi Z <chi@neon.tech>

vipvap requested review from a team as code owners June 17, 2024 06:04

vipvap requested review from save-buffer, conradludgate, jcsp and NanoBjorn and removed request for a team June 17, 2024 06:04

lubennikovaav approved these changes Jun 17, 2024

arpad-m merged commit 371020f into release Jun 17, 2024
131 checks passed

arpad-m deleted the rc/2024-06-17 branch June 17, 2024 13:29
