Proxy release 2024-08-08 #8647

vipvap · 2024-08-08T06:02:03Z

Proxy release 2024-08-08

Please merge this Pull Request using the 'Create a merge commit' button.

Before this PR:

1. The circuit breaker would trip on `CompactionError::Shutdown`. That's wrong; we want to ignore those cases.
2. Remote timeline client shutdown would not be mapped to `CompactionError::Shutdown` in all circumstances.

We observed this in staging, see https://neondb.slack.com/archives/C033RQ5SPDH/p1721829745384449

This PR fixes (1) with a simple `match` statement, and (2) by switching a bunch of `anyhow` usage over to distinguished errors that ultimately get mapped to `CompactionError::Shutdown`.

I removed the implicit `#[from]` conversion from `anyhow::Error` to `CompactionError::Other` to discover all the places that were mapping remote timeline client shutdown to `anyhow::Error`.

In my opinion `#[from]` is an antipattern and we should avoid it, especially for `anyhow::Error`. If some callee is going to return anyhow, the very least the caller should do is to acknowledge, through a `map_err(MyError::Other)`, that they're conflating different failure reasons.
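A minimal sketch of the two ideas in this commit message, using only the standard library. Everything here is a hypothetical stand-in: `CircuitBreaker`, `upload_index` and `compact_once` are illustrative names, and a plain `String` takes the place of `anyhow::Error`. It is not the pageserver's actual code.

```rust
/// Hypothetical stand-in for the pageserver's compaction error type.
#[derive(Debug)]
enum CompactionError {
    /// The timeline or remote client is shutting down; not a real failure.
    Shutdown,
    /// Any other failure. A `String` stands in for `anyhow::Error` here.
    Other(String),
}

/// Hypothetical circuit breaker that counts only genuine failures.
#[derive(Default)]
struct CircuitBreaker {
    failures: u32,
}

impl CircuitBreaker {
    /// Record the outcome of one compaction attempt.
    fn record(&mut self, result: &Result<(), CompactionError>) {
        match result {
            // Success and shutdown both leave the failure count untouched.
            Ok(()) | Err(CompactionError::Shutdown) => {}
            Err(CompactionError::Other(_)) => self.failures += 1,
        }
    }

    fn is_tripped(&self, threshold: u32) -> bool {
        self.failures >= threshold
    }
}

/// Hypothetical callee that reports failures as an opaque error.
fn upload_index(fail: bool) -> Result<(), String> {
    if fail {
        Err("upload failed".to_string())
    } else {
        Ok(())
    }
}

/// Without a `From` impl, `?` alone would not compile here. The caller has to
/// spell out `map_err`, which makes the conflation of failure reasons visible.
fn compact_once(fail: bool) -> Result<(), CompactionError> {
    upload_index(fail).map_err(CompactionError::Other)?;
    Ok(())
}
```

With this shape, a shutdown during compaction never moves the breaker towards tripping, while an explicitly mapped `Other` error does.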

…benchmarking runs (#8493)

## Problem

The rds-aurora endpoint connection cannot be reached from GitHub action runners. Temporarily remove this DBMS from the pgbench comparison runs.

## Summary of changes

On Saturday we normally run Neon in comparison with AWS RDS-Postgres and AWS RDS-Aurora. Remove Aurora until we have a working setup.

## Problem

Re-attach blocks the pageserver http server from starting up. Hence, it can't reply to heartbeats until that's done. This makes the storage controller mark the node offline (not good). We worked around this by setting the interval after which nodes are marked offline to 5 minutes. This isn't a long-term solution.

## Summary of changes

* Introduce a new `NodeAvailability` state: `WarmingUp`. This state models the following time interval:
  * From receiving the re-attach request until the pageserver replies to the first heartbeat post re-attach
* The heartbeat delta generator becomes aware of this state and uses a separate, longer interval
* Flag `max-warming-up-interval` now models the longer timeout and `max-offline-interval` the shorter one, to match the names of the states

Closes #7552
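The state change described above could be sketched roughly as follows. This is a simplified illustration, not the storage controller's implementation: the real `NodeAvailability` may carry extra data per variant, and `HeartbeatConfig`, `next_availability` and the `silent_for` parameter are hypothetical names introduced here.

```rust
use std::time::Duration;

/// Simplified availability states; `WarmingUp` is the state this change adds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeAvailability {
    Active,
    WarmingUp,
    Offline,
}

/// Hypothetical holder for the two flags named in the commit message.
struct HeartbeatConfig {
    max_offline_interval: Duration,
    max_warming_up_interval: Duration,
}

/// Decide the next state from how long the node has been silent.
/// A warming-up node is given the longer interval before it is marked offline.
fn next_availability(
    current: NodeAvailability,
    silent_for: Duration,
    cfg: &HeartbeatConfig,
) -> NodeAvailability {
    let limit = match current {
        NodeAvailability::Active => cfg.max_offline_interval,
        NodeAvailability::WarmingUp => cfg.max_warming_up_interval,
        NodeAvailability::Offline => return NodeAvailability::Offline,
    };
    if silent_for > limit {
        NodeAvailability::Offline
    } else {
        current
    }
}
```

For example, with an assumed 30 second offline interval and a 300 second warming-up interval, a node that has been silent for 60 seconds stays `WarmingUp` if it is re-attaching, but would be marked `Offline` if it had been `Active`.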

## Problem

Currently, tests may have a scrub during teardown if they ask for it, but most tests don't request it. To detect "unknown unknowns", let's run it at the end of every test where possible. This is similar to asserting that there are no errors in the log at the end of tests.

## Summary of changes

- Remove explicit `enable_scrub_on_exit`
- Always scrub if remote storage is an S3Storage.

## Problem

This test was destabilized by #8431. The threshold is arbitrary & failures are still quite close to it. At a high level the test is asserting "eviction was approximately fair to these tenants", which appears to still be the case when the abs diff between ratios is slightly higher at ~0.6-0.7.

## Summary of changes

- Change threshold from 0.06 to 0.065. Based on the last ~10 failures that should be sufficient.

## Problem

Storcon shutdown did not produce a clean observed state. This is not a problem at the moment, but we will need to stop all reconciles with clean observed state for rolling restarts.

I tried to test this by collecting the observed state during shutdown and comparing it with the in-memory observed state, but it doesn't work because a lot of tests use the cursed attach hook to create tenants directly through the ps.

## Summary of Changes

Rework storcon shutdown as follows:

* Reconcilers get a separate cancellation token which is a child token of the global `Service::cancel`.
* Reconcilers get a separate gate
* Add a mechanism to drain the reconciler result queue before
* Put all of this together into a clean shutdown sequence

Related neondatabase/cloud#14701

This pull request (should) fix the failure of test_gc_feedback. See the explanation in the newly-added test case. Part of #8002 Allow incomplete history for the compaction algorithm. Signed-off-by: Alex Chi Z <chi@neon.tech>

* update pg_jsonschema extension to v0.3.1
* update pg_graphql extension to v1.5.7
* update pgx_ulid extension to v0.1.5
* update pg_tiktoken extension, patch Cargo.toml to use new pgrx

There is a race condition between timeline shutdown and the split task. Timeline shutdown first shuts down the upload queue, and only then fires the cancellation token. A timeline split operation running in parallel might thus encounter a cancelled upload queue before the cancellation token is fired, and print a noisy error.

Fix this by mapping `anyhow::Error{ NotInitialized::ShuttingDown }` to `FlushLayerError::Cancelled` instead of `FlushLayerError::Other(_)`.

Fixes #8496

## Problem

Follow-up for #8475

## Summary of changes

Use our own private docker registry in the `cache-from` and `cache-to` settings in docker build-push actions

## Problem

The scrubber would like to check the highest mtime in a tenant's objects as a safety check during purges. It recently switched to use GenericRemoteStorage, so we need to expose that in the listing methods.

## Summary of changes

- In Listing.keys, return a ListingObject{} including a last_modified field, instead of a RemotePath

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>

We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Release notes](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-180-2024-07-25). Prior update was in #8048

#8473)

## Problem

This test relies on writing image layers before the split. It can fail to do so durably if the image layers are written ahead of the remote consistent LSN, so we should have been doing a checkpoint rather than just a compaction.

## Problem

This is an experiment to see if 16x concurrency is actually helping, or if it's just giving us very noisy results. If the total runtime with a lower concurrency is similar, then a lower concurrency is preferable to reduce the impact of resource-hungry tests running concurrently.

#8384)

## Problem

Vectored get is already enabled in all prod regions without validation. The pageserver defaults are out of sync, however.

## Summary of changes

Update the pageserver defaults to match the prod config. This also means that when running tests locally, people don't have to use the env vars to get the prod config.

## Problem

- The intersection of "safekeepers" and "storage" is just one person

## Problem

We are missing the step-down primitive required to implement rolling restarts of the storage controller.

## Summary of changes

Add `/control/v1/step_down` endpoint which puts the storage controller into a state where it rejects all API requests apart from `/control/v1/step_down`, `/status` and `/metrics`. When receiving the request, storage controller cancels all pending reconciles and waits for them to exit gracefully. The response contains a snapshot of the in-memory observed state.

Related:

* neondatabase/cloud#14701
* #7797
* #8310
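The request filtering described above can be illustrated with a small sketch. The `Controller` type, its `accepts` method and the `/v1/tenant` path in the usage note are hypothetical; only the three allowed paths come from the commit message. The real endpoint also cancels reconciles and returns the observed state, which is left out here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Paths that remain served after step-down, per the commit message.
const ALLOWED_AFTER_STEP_DOWN: [&str; 3] = ["/control/v1/step_down", "/status", "/metrics"];

/// Hypothetical controller holding only the stepped-down flag.
#[derive(Default)]
struct Controller {
    stepped_down: AtomicBool,
}

impl Controller {
    /// Enter the stepped-down state. Calling this again has no further effect.
    fn step_down(&self) {
        self.stepped_down.store(true, Ordering::SeqCst);
    }

    /// Whether a request for `path` should be served in the current state.
    fn accepts(&self, path: &str) -> bool {
        !self.stepped_down.load(Ordering::SeqCst) || ALLOWED_AFTER_STEP_DOWN.contains(&path)
    }
}
```

Before `step_down()` is called, a request to an assumed path such as `/v1/tenant` is accepted. Afterwards it is rejected, while `/status`, `/metrics` and the step-down endpoint itself keep working.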

The lfc_approximate_working_set_size_windows query was failing on pg14 and pg15 with `pq: subquery in FROM must have an alias`, because aliases in that position became optional only in pg16.

Some context here: https://neondb.slack.com/archives/C04DGM6SMTM/p1721970322601679?thread_ts=1721921122.528849

I am not aware of a case of "transient" VirtualFile errors as mentioned in #5880 Private DM with Joonas discussing this: https://neondb.slack.com/archives/D049K7HJ9JM/p1721836424615799

Problem
-------

wait_lsn timeouts result in user-facing errors like

```
$ /tmp/neon/pg_install/v16/bin/pgbench -s3424 -i -I dtGvp user=neondb_owner dbname=neondb host=ep-tiny-wave-w23owa37.eastus2.azure.neon.build sslmode=require options='-cstatement_timeout=0 '
dropping old tables...
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (server-side)...
vacuuming...
pgbench: error: query failed: ERROR: [NEON_SMGR] [shard 0] could not read block 214338 in rel 1663/16389/16839.0 from page server at lsn C/E1C12828
DETAIL: page server returned error: LSN timeout: Timed out while waiting for WAL record at LSN C/E1418528 to arrive, last_record_lsn 6/999D9CA8 disk consistent LSN=6/999D9CA8, WalReceiver status: (update 2024-07-25 08:30:07): connecting to node 25, safekeeper candidates (id|update_time|commit_lsn): [(21|08:30:16|C/E1C129E0), (23|08:30:16|C/E1C129E0), (25|08:30:17|C/E1C129E0)]
CONTEXT: while scanning block 214338 of relation "public.pgbench_accounts"
pgbench: detail: Query was: vacuum analyze pgbench_accounts
```

Solution
--------

It's better to be slow than to fail the queries. If the app has a deadline, it can use `statement_timeout`.

In the long term, we want to eliminate the wait_lsn timeout.

In the short term (this PR), we bump the wait_lsn timeout to a larger value to reduce the frequency at which these wait_lsn timeouts occur.

We will observe SLOs and specifically `pageserver_wait_lsn_seconds_bucket` before we eliminate the timeout completely.

…ion (#8443)

close #8435

## Summary of changes

If L0 compaction did not include all L0 layers, skip image generation.

There are multiple possible solutions to the original issue; for example, an alternative is to wrap the partial L0 compaction in a loop until it compacts all L0 layers. However, considering that we should weight all tenants equally, the current solution ensures everyone gets a chance to run compaction, and those who write too much won't get a chance to create image layers. This creates natural backpressure: their reads get slower because no image layers are created, which slows down their writes, and eventually compaction can keep up with their writes and generate image layers.

Regarding deployment, we should add an alert on "skipping image layer generation", so that we don't run into incidents caused by image layers not being generated again.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
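The rule stated above is small enough to sketch directly. The types and field names below (`L0CompactionOutcome`, `ImageGeneration`, `decide_image_generation`) are hypothetical and chosen for illustration; the real compaction code tracks this differently.

```rust
/// Hypothetical summary of one L0 compaction pass.
struct L0CompactionOutcome {
    /// L0 layers present when the pass started.
    total_l0_layers: usize,
    /// L0 layers the pass actually included.
    compacted_l0_layers: usize,
}

impl L0CompactionOutcome {
    fn fully_compacted(&self) -> bool {
        self.compacted_l0_layers >= self.total_l0_layers
    }
}

#[derive(Debug, PartialEq, Eq)]
enum ImageGeneration {
    Run,
    /// Worth logging or alerting on, as the commit message suggests.
    Skipped,
}

/// Image layers are generated only when the L0 pass covered every L0 layer.
fn decide_image_generation(outcome: &L0CompactionOutcome) -> ImageGeneration {
    if outcome.fully_compacted() {
        ImageGeneration::Run
    } else {
        ImageGeneration::Skipped
    }
}
```

A tenant whose pass compacted 10 of 25 L0 layers would get `Skipped`, so the next iteration goes back to L0 compaction rather than spending time on image layers.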

Persists whether a timeline is archived or not in `index_part.json`. We only return success if the upload has actually succeeded. Also introduces a new `index_part.json` version number.

Fixes #8459

Part of #8088

… is enabled (#7990)

## Problem

There's a `NeonEnvBuilder#preserve_database_files` parameter that allows you to keep database files for debugging purposes (by default, files get cleaned up), but there's no way to get these files from a CI run. This PR adds handling of `NeonEnvBuilder#preserve_database_files` and adds the compressed test output directory to Allure reports (for tests with this parameter enabled).

Ref #6967

## Summary of changes

- Compress and add the whole test output directory to Allure reports
- Currently works only with `neon_env_builder` fixture
- Remove `preserve_database_files = True` from sharding tests as unneeded

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>

For IDENTIFY_SYSTEM in particular, application name gives away whether the client is

* walproposer => Some(wal_proposer_recovery)
* safekeeper => Some(safekeeper)
* pageserver => Some(pageserver)

Context: https://neondb.slack.com/archives/C06SJG60FRB/p1721987794673429?thread_ts=1721981056.451599&cid=C06SJG60FRB

## Problem

After #7990 the `regress_test` job started to fail with an error:

```
...
  File "/__w/neon/neon/test_runner/fixtures/benchmark_fixture.py", line 485, in pytest_terminal_summary
    terminalreporter.write(f"{test_report.head_line}.{recorded_property['name']}: ")
TypeError: 'bool' object is not subscriptable
```

https://github.com/neondatabase/neon/actions/runs/10125750938/job/28002582582

It happens because the current implementation doesn't expect that pytest's `user_properties` can be used for anything other than benchmarks (and #7990 started to use it for tracking the `preserve_database_files` parameter).

## Summary of changes

- Make NeonBenchmarker use only records with the `neon_benchmarker_` prefix

Uses the Stream-based `list_streaming` function added by #8457 in tenant deletion, as suggested in #7932 (comment).

We don't have to worry about retries, as the function is wrapped inside an outer retry block. If there is a retryable error either during the listing or during deletion, we just do a fresh start.

Also adds `+ Send` bounds as they are required by the `delete_tenant_remote` function.

## Problem

We need to test logical replication with 3rd-party tools regularly.

## Summary of changes

Added a test using ClickHouse as a client

Co-authored-by: Alexander Bayandin <alexander@neon.tech>

By including comparison of `remote_consistent_lsn_visible` we risk flakiness coming from outside of timeline creation. Mask out the `remote_consistent_lsn_visible` for the comparison.

Evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8489/10142336315/index.html#suites/ffbb7f9930a77115316b58ff32b7c719/89ff0270bf58577a

Timeline cancellation running in parallel with gc yields error log lines like:

```
Gc failed 1 times, retrying in 2s: TimelineCancelled
```

They are completely harmless though and normal to occur. Therefore, only print those messages at an info level. Still print them at all so that we know what is going on if we focus on a single timeline.

Add a missing colon to the API specification of `ArchivalConfigRequest`. The `state` field is required. Pointed out by Gleb.

## Problem

1. Hard to correlate startup parameters with the endpoint that provided them.
2. Some configurations are not needed in the `ProxyConfig` struct.

## Summary of changes

Because of some borrow checker fun, I needed to switch to an interior-mutability implementation of our `RequestMonitoring` context system. Using https://docs.rs/try-lock/latest/try_lock/ as a cheap lock for such a use-case (needed to be thread safe).

Removed the lock of each startup message, instead just logging only the startup params in a successful handshake.

Also removed some values from `ProxyConfig` and kept them as arguments. (needed for local-proxy config)

#8229)

Ephemeral files clean up on drop but did not delay shutdown, leading to problems with restarting the tenant.

The solution is as proposed:

- make ephemeral files carry the gate guard to delay `Timeline::gate` closing
- flush in-memory layers and strong references to those on `Timeline::shutdown`

The above are realized by making LayerManager an `enum` with `Open` and `Closed` variants, and failing requests to modify `LayerMap`.

Additionally:

- fix too eager anyhow conversions in compaction
- unify how we freeze layers and handle errors
- optimize likely_resident_layers to read LayerFileManager hashmap values instead of bouncing through LayerMap

Fixes: #7830
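The `Open`/`Closed` idea can be sketched as below. This is a heavily simplified illustration: the `LayerMap` here is just a list of names, and the `LayerManagerClosed` error, `add_layer` and `shutdown` are hypothetical names, not the pageserver's API.

```rust
/// Hypothetical error returned when the layer map can no longer be modified.
#[derive(Debug, PartialEq, Eq)]
struct LayerManagerClosed;

/// Toy layer map holding only layer names.
#[derive(Debug, Default)]
struct LayerMap {
    layers: Vec<String>,
}

/// Once closed, the in-memory state is gone and modifications are refused.
#[derive(Debug)]
enum LayerManager {
    Open(LayerMap),
    Closed,
}

impl LayerManager {
    /// Add a layer, failing if shutdown has already happened.
    fn add_layer(&mut self, name: &str) -> Result<(), LayerManagerClosed> {
        match self {
            LayerManager::Open(map) => {
                map.layers.push(name.to_string());
                Ok(())
            }
            LayerManager::Closed => Err(LayerManagerClosed),
        }
    }

    /// Drop the layer map (and with it any references it held) and refuse
    /// further modifications. Calling it again is a no-op.
    fn shutdown(&mut self) {
        *self = LayerManager::Closed;
    }
}
```

Encoding the closed state in the type means every code path that wants to modify the map has to handle the `Closed` case explicitly, instead of relying on a separate flag being checked.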

We see an assertion error in staging. Dump the key to guess where it was from, and then we can fix it. Signed-off-by: Alex Chi Z <chi@neon.tech>

… hashset (#8629) Earlier I was thinking we'd need a (ancestor_lsn, timeline_id) ordered list of reparented. Turns out we did not need it at all. Replace it with an unordered hashset. Additionally refactor the reparented direct children query out, it will later be used from more places. Split off from #8430. Cc: #6994

neondatabase/cloud#15899

We've noticed increased memory usage with the latest release. Drain the joinset of `page_service` connection handlers to avoid leaking them until shutdown. An alternative would be to use a TaskTracker. TaskTracker was not discussed in original PR #8339 review, so not hot fixing it in here either.

Part of #8130, [RFC: Direct IO For Pageserver](https://github.com/neondatabase/neon/blob/problame/direct-io-rfc/docs/rfcs/034-direct-io-for-pageserver.md)

## Description

Add pageserver config for evaluating/enabling direct I/O.

- Disabled: current default, uses buffered io as is.
- Evaluate: still uses buffered io, but could do alignment checking and perf simulation (pad latency by direct io RW to a fake file).
- Enabled: uses direct io, behavior on alignment error is configurable.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
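The three modes listed above map naturally onto an enum. The following is only a sketch: the type name `DirectIoMode`, the lowercase string spellings and the `uses_direct_io` helper are assumptions made for illustration, and the configurable alignment-error behavior of `Enabled` is omitted.

```rust
use std::str::FromStr;

/// Hypothetical config value mirroring the three modes in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum DirectIoMode {
    /// Current default: buffered I/O, unchanged.
    #[default]
    Disabled,
    /// Still buffered I/O, but may run alignment checks and perf simulation.
    Evaluate,
    /// Actually use direct I/O.
    Enabled,
}

impl DirectIoMode {
    /// Only `Enabled` switches the real I/O path.
    fn uses_direct_io(self) -> bool {
        matches!(self, DirectIoMode::Enabled)
    }
}

impl FromStr for DirectIoMode {
    type Err = String;

    /// Parse an assumed lowercase spelling of the mode.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "disabled" => Ok(DirectIoMode::Disabled),
            "evaluate" => Ok(DirectIoMode::Evaluate),
            "enabled" => Ok(DirectIoMode::Enabled),
            other => Err(format!("unknown direct I/O mode: {other}")),
        }
    }
}
```

Keeping `Evaluate` as its own variant, rather than a boolean plus extra flags, makes it clear that evaluation never changes the I/O path that serves requests.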

## Problem

Some developers build on MacOS, which doesn't have io_uring.

## Summary of changes

- Add `io_engine_for_bench`, which on linux will give io_uring or panic if it's unavailable, and on MacOS will always panic.

We do not want to run such benchmarks with StdFs: the results aren't interesting, and will actively waste the time of any developers who start investigating performance before they realize they're using a known-slow I/O backend.

Why not just conditionally compile this benchmark on linux only? Because even on linux, I still want it to refuse to run if it can't get io_uring.

github-actions · 2024-08-08T10:19:27Z

2108 tests run: 2039 passed, 0 failed, 69 skipped (full report)

Code coverage* (full report)

functions: 32.6% (7163 of 22005 functions)
lines: 50.6% (57901 of 114500 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results
32e595d at 2024-08-08T13:56:43.382Z

conradludgate · 2024-08-08T12:48:58Z

Relevant changes:

proxy: random changes #8602
Merge conflict resolution changing the cancellation redis channel sink

problame and others added 30 commits July 25, 2024 10:24

followup(#8359): pre-initialize circuitbreaker metrics (#8491)

d57412a

Update pgrx to v 0.11.3 (#8515)

9bfa180

* update pg_jsonschema extension to v0.3.1
* update pg_graphql extension to v1.5.7
* update pgx_ulid extension to v0.1.5
* update pg_tiktoken extension, patch Cargo.toml to use new pgrx

Update links in synthetic-size.md (#8501)

857a182

Using own registry to cache layers in docker build workflows (#8521)

8182bfd

## Problem

Follow-up for #8475

## Summary of changes

Use our own private docker registry in the `cache-from` and `cache-to` settings in docker build-push actions

CODEOWNERS: collapse safekeepers into storage (#8510)

eddfd62

## Problem

- The intersection of "safekeepers" and "storage" is just one person

refactor(layer load API): all errors are permanent (#8527)

8154e88

I am not aware of a case of "transient" VirtualFile errors as mentioned in #5880 Private DM with Joonas discussing this: https://neondb.slack.com/archives/D049K7HJ9JM/p1721836424615799

Persist archival information (#8479)

b5e95f6

Persists whether a timeline is archived or not in `index_part.json`. We only return success if the upload has actually succeeded. Also introduces a new `index_part.json` version number.

Fixes #8459

Part of #8088

Add a test for clickhouse as a logical replication consumer (#8408)

1d8cf5b

## Problem

We need to test logical replication with 3rd-party tools regularly.

## Summary of changes

Added a test using ClickHouse as a client

Co-authored-by: Alexander Bayandin <alexander@neon.tech>

arpad-m and others added 10 commits August 7, 2024 09:29

Add missing colon to ArchivalConfigRequest specification (#8627)

4d7c0da

Add a missing colon to the API specification of `ArchivalConfigRequest`. The `state` field is required. Pointed out by Gleb.

fix(pageserver): dump the key when it's invalid (#8633)

b3eea45

We see an assertion error in staging. Dump the key to guess where it was from, and then we can fix it. Signed-off-by: Alex Chi Z <chi@neon.tech>

cicd: push build-tools image to ACR as well (#8638)

8468d51

neondatabase/cloud#15899

vipvap requested review from a team as code owners August 8, 2024 06:02

vipvap requested review from ololobus, petuhovskiy, conradludgate and yliang412 and removed request for a team August 8, 2024 06:02

Merge branch 'release-proxy' into rc/proxy/2024-08-08

32e595d

conradludgate force-pushed the rc/proxy/2024-08-08 branch from 9e3f806 to 32e595d Compare August 8, 2024 12:53

conradludgate approved these changes Aug 8, 2024


conradludgate merged commit 73935ea into release-proxy Aug 8, 2024
70 checks passed

conradludgate deleted the rc/proxy/2024-08-08 branch August 8, 2024 14:37
