Proxy release 2024-06-27 #8178

Merged
merged 52 commits into release-proxy from rc/proxy/2024-06-27
Jun 27, 2024

Conversation

vipvap

@vipvap vipvap commented Jun 27, 2024

Proxy release 2024-06-27

Please merge this Pull Request using the 'Create a merge commit' button

problame and others added 30 commits June 20, 2024 11:56
part of Epic #7386

# Motivation

The materialized page cache adds complexity to the code base, which
increases the maintenance burden and risk for subtle and hard to
reproduce bugs such as #8050.

Further, the best hit rate that we currently achieve in production is about
1% of materialized page cache lookups for
`task_kind=PageRequestHandler`. Other task kinds have hit rates below 0.2%.

Last, caching page images in Pageserver rewards under-sized caches in
Computes because reading from Pageserver's materialized page cache over
the network is often sufficiently fast (low hundreds of microseconds).
Such Computes should upscale their local caches to fit their working
set, rather than repeatedly requesting the same page from Pageserver.

Some more discussion and context in internal thread
https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459

# Changes

This PR removes the materialized page cache code & metrics.

The infrastructure for different key kinds in `PageCache` is left in
place, even though the "Immutable" key kind is the only remaining one.
This can be further simplified in a future commit.

Some tests started failing because their total runtime depended on
high materialized page cache hit rates. This PR makes them
fixed-runtime or raises their pytest timeouts:
* test_local_file_cache_unlink
* test_physical_replication
* test_pg_regress

# Performance

I focused on ensuring that this PR will not result in a performance
regression in prod.

* **getpage** requests: our production metrics have shown the
materialized page cache to be irrelevant (low hit rate). Also,
Pageserver is the wrong place to cache page images; that should happen in
compute.
* **ingest** (`task_kind=WalReceiverConnectionHandler`): prod metrics
show a 0% hit rate, so removing the cache will not be a regression.
* **get_lsn_by_timestamp**: important API for branch creation, used by
the control plane. The clog pages that this code uses are not
materialized-page-cached because they're not 8k. No risk of introducing a
regression here.

We will watch the various nightly benchmarks closely for more results
before shipping to prod.
Without this patch, the Pageserver 4 Golden Signals dashboard shows no
data if there are no basebackups (observed in pre-prod).
## Problem
When creating a new shard, the storage controller schedules via
`Scheduler::schedule_shard`. This does not take the number of attached
shards into account; what it does take into account is node affinity:
when a shard is scheduled, all of its nodes (primaries and secondaries) get
their affinity incremented.

For two-node clusters and shards with one secondary, we have a
pathological case where all primaries are scheduled on the same node.
Now that we track the count of attached shards per node, this is trivial
to fix. Still, the "proper" fix is to use the pageserver's utilization
score.

Closes #8041

## Summary of changes
Use attached shard count when deciding which node to schedule a fresh
shard on.
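As an illustration only (hypothetical types, not the controller's actual `Scheduler`), the selection boils down to preferring the node with the fewest attached shards:

```rust
// Sketch: pick the node with the fewest attached shards. The real
// scheduler also weighs node affinity and, eventually, utilization.
struct NodeStats {
    node_id: u64,
    attached_shard_count: usize,
}

fn pick_node_for_new_shard(nodes: &[NodeStats]) -> Option<u64> {
    nodes
        .iter()
        .min_by_key(|n| n.attached_shard_count)
        .map(|n| n.node_id)
}
```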
## Problem
Storage controller shared locks do not print a warning when held for long time spans.

## Summary of changes
Issue #7108 added tracing for exclusive locks in `id_lock_map`; this
extends the same to shared locks. It was mentioned in comment
#7397 (comment)
## Problem

The fill planner introduced in #8014 selects tenant shards to promote
strictly based on attached shard count (tenant shards on nodes with the
most attached shards are considered first). This approach runs the risk
of migrating too many shards belonging to the same tenant onto the same
primary node.

This is bad for availability and causes extra reconciles via the storage
controller's background optimisations.

Also see #8014 (comment).

## Summary of changes
Refine the fill plan to avoid promoting too many shards belonging to the
same tenant on the same node.
We allow for `max(1, shard_count / node_count)` shards belonging to the
same tenant to be promoted.
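A minimal sketch of that limit (hypothetical helper, not the controller's actual code):

```rust
/// Upper bound on shards of one tenant that the fill plan will promote
/// onto a single node: max(1, shard_count / node_count).
fn per_tenant_promotion_limit(shard_count: usize, node_count: usize) -> usize {
    // Guard against division by zero if the node count is somehow zero.
    std::cmp::max(1, shard_count / node_count.max(1))
}
```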
## Problem

These APIs have been unused for some time. They were superseded by
/location_conf: the equivalent of ignoring a tenant is now to put it in
secondary mode.

## Summary of changes

- Remove APIs
- Remove tests & helpers that used them
- Remove error variants that are no longer needed.
… controller in test cases to make test cases less flaky (#8079)

## Problem

see #8070

## Summary of changes

The neon_local subcommands to
- start neon
- start pageserver
- start safekeeper
- start storage controller

get a new option `-t=xx` or `--start-timeout=xx`, which allows specifying a
longer timeout (in seconds) to wait for the process to start.
This is useful in test cases where the pageserver has to read a lot of
layer data, like in pagebench test cases.

In addition, we use the new timeout option in the Python test
infrastructure (Python fixtures) and modify the flaky test case to
increase the timeout from 10 seconds to 1 minute.

Example from the test execution

```bash
RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release     ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
...
2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s"
2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001"
2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s"
2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s"
```
This is a preparation for #8022, to make the PR both backwards and
forward compatible.

This commit adds an `eviction_state` field to the control file. It adds
support for reading it, but writes the control file in the old format where
possible, to keep the disk format forward compatible.

Note: in `patch_control_file`, the new field gets serialized to JSON like
this:
- `"eviction_state": "Present"`
- `"eviction_state": {"Offloaded": "0/8F"}`
In #7957 we enabled deletion without attachment, but retained the
old-style deletion (return 202, delete in background) for attached
tenants. In this PR, we remove the old-style deletion path, such that if
the tenant delete API is invoked while a tenant is attached, it is
simply detached before completing the deletion.

This intentionally doesn't rip out all the old deletion code: in case a
deletion was in progress at time of upgrade, we keep around the code for
finishing it for one release cycle. The rest of the code removal happens
in #8091

Now that deletion will always be via the new path, the new path is also
updated to use some retries around remote storage operations, to avoid
tripping up the control plane with 500s if S3 has an intermittent issue.
…tions (#8029)

## Problem
There's no way to cancel drain and fill operations.

## Summary of changes
Implement HTTP endpoints to allow cancellation of background operations.
When the operation is cancelled successfully, the node scheduling policy
reverts to `Active`.
…8123)

## Problem

For testing, the storage controller has a built-in hack that loads
neon_local endpoint config from disk, and uses it to reconfigure
endpoints when the attached pageserver changes.

Some tests that stop an endpoint while the storage controller is running
could occasionally fail on log errors from the controller trying to use
its special test-mode calls into neon local Endpoint.

Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8117/9592392425/index.html#/testresult/9d2bb8623d0d53f8

## Summary of changes

- Give NotifyError an explicit NeonLocal variant, to avoid munging these
into generic 500s (I don't want to ignore 500s in general)
- Allow-list errors related to the local notification hook.

The expectation is that tests using endpoints/workloads should be
independently checking that those endpoints work: if neon_local
generates an error inside the storage controller, that's ignorable.
## Problem
While adapting the storage controller scale test to do graceful rolling
restarts via drain and fill, I noticed that secondaries were also being
rescheduled, which, in turn, caused the storage controller to optimise
attachments.

## Summary of changes
* Introduce a transactional-looking rescheduling primitive (i.e. "try to
schedule to this secondary, but leave everything as is if you can't")
* Use it for the drain and fill stages to avoid calling into
`Scheduler::schedule` and having secondaries move around.
Adds a `Deserialize` impl to `RemoteStorageConfig`. We thus achieve the
same as #7743 but with less repetitive code, by deriving `Deserialize`
impls on `S3Config`, `AzureConfig`, and `RemoteStorageConfig`. The
disadvantage is less useful error messages.
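Schematically (field names here are illustrative, not the actual config schema), the derive-based approach looks like this:

```rust
use serde::Deserialize;

// Illustrative sketch: the real structs carry more fields and defaults.
#[derive(Deserialize)]
struct S3Config {
    bucket_name: String,
    bucket_region: String,
}

#[derive(Deserialize)]
struct AzureConfig {
    container_name: String,
}

#[derive(Deserialize)]
struct RemoteStorageConfig {
    s3: Option<S3Config>,
    azure: Option<AzureConfig>,
}
```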

The git history of this PR contains a state where we go via an
intermediate representation, leveraging the `serde_json` crate,
without it ever being actual JSON, though.

Also, the PR adds deserialization tests.

Alternative to #7743 .
## Problem

Fixes #1287

## Summary of changes

tokio-postgres now supports arbitrary server params through the
`param(key, value)` method. Some keys are special so we explicitly
filter them out.
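A sketch of that filtering step (the special-key list here is hypothetical, not the proxy's actual list); the retained pairs would then be forwarded through the `param(key, value)` method mentioned above:

```rust
// Drop keys that are configured through dedicated setters and pass
// everything else through as arbitrary server params.
fn filter_server_params<'a>(
    params: impl IntoIterator<Item = (&'a str, &'a str)>,
) -> Vec<(&'a str, &'a str)> {
    // Hypothetical list of specially handled keys.
    const SPECIAL: &[&str] = &["user", "database", "options"];
    params
        .into_iter()
        .filter(|(key, _)| !SPECIAL.contains(key))
        .collect()
}
```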
Moves `RemoteStorageConfig` and related structs and functions into a
dedicated module. Also implements `Serialize` for the config structs
(requested in #8126).

Follow-up of #8126
…8091)

#8082 removed the legacy deletion path, but retained code for completing
deletions that were started before a pageserver restart. This PR cleans
up that remaining code, and removes all the pageserver code that dealt
with tenant deletion markers and resuming tenant deletions.

The release at #8138 contains
#8082, so we can now merge this
to `main`
## Problem

`test_sharding_autosplit` is occasionally failing on warnings about
shard splits taking longer than expected (`Exclusive lock by ShardSplit
was held for`...)

It's not obvious which part is taking the time (I suspect remote storage
uploads).

Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9618788427/index.html#testresult/b395294d5bdeb783/

## Summary of changes

- Since shard splits are infrequent events, we can afford to be very
chatty: add a bunch of info-level logging throughout the process.
## Problem

This test could occasionally trigger a "removing local file ... because
it has unexpected length" log message when the
`compact-shard-ancestors-persistent` failpoint is in use, which is
unexpected because that failpoint stops the process while the remote
metadata is in sync with local files.

It happened because there are two shards on the same pageserver, and while
the one being compacted explicitly stops at the failpoint, another shard
was compacting in the background and failing at an unclean point. The
test intends to disable background compaction, but mistakenly dropped
the override of `compaction_period` when it updated
`pitr_interval`.

Example failure:

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8123/9602976462/index.html#/testresult/7dd6165da7daef40

## Summary of changes

- Update `TENANT_CONF` in the test to use properly typed values, so that
it is usable in pageserver APIs as well as via neon_local.
- When updating tenant config with `pitr_interval`, retain the overrides
from the start of the test, so that there won't be any background
compaction going on during the test.
Previously in safekeeper code, a new segment file was opened without the
truncate option. I don't think there was a reason to do that; this commit
replaces it with `File::create` to make it simpler and to remove the
`clippy::suspicious_open_options` lint warning.
We now have a `vX` number in the file name, i.e.,
`000000067F0000000400000B150100000000-000000067F0000000400000D350100000000__00000000014B7AC8-v1-00000001`

The related pull request for the new-style path, #7660, was merged a
month ago.

## Summary of changes

Fixed the draw timeline dir command to handle it.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Part of #8002

This pull request adds tests for bottom-most gc-compaction with delta
records. Also fixed a bug in the compaction process that creates
overlapping delta layers by force splitting at the original delta layer
boundary.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem

This command was used in the very early days of sharding, before the
storage controller had anti-affinity + scheduling optimization to spread
out shards.

## Summary of changes

- Remove `storcon_cli tenant-scatter`
## Problem

- `test_storage_controller_many_tenants` can fail with warnings in the
storage controller about tenant creation holding a lock for too long,
because this test stresses the machine running the test with many
concurrent timeline creations
- `test_tenant_delete_smoke` can fail when synthetic remote storage
errors show up

## Summary of changes

- tolerate warnings about slow timeline creation in
test_storage_controller_many_tenants
- tolerate both possible errors during error_tolerant_delete
…PI (#8104)

Part of #7497, closes #8072.

## Problem

Currently, the `get_lsn_by_timestamp` and branch creation pageserver APIs do not provide a pleasant client experience: the looked-up LSN might be GC-ed between the two API calls.

This PR attempts to prevent common races between GC and branch creation by making use of LSN leases provided in #8084. A lease can be optionally granted to a looked-up LSN. With the lease, GC will not touch layers needed to reconstruct all pages at this LSN for the duration of the lease.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
We did not recover the subtransaction IDs of prepared transactions when
starting a hot standby from a shutdown checkpoint. As a result, such
subtransactions were considered as aborted, rather than in-progress.
That would lead to hint bits being set incorrectly, and the
subtransactions suddenly becoming visible to old snapshots when the
prepared transaction was committed.

To fix, update pg_subtrans with prepared transactions' subxids when
starting hot standby from a shutdown checkpoint. The snapshots taken
from that state need to be marked as "suboverflowed", so that we also
check the pg_subtrans.

Discussion:
https://www.postgresql.org/message-id/6b852e98-2d49-4ca1-9e95-db419a2696e0%40iki.fi

NEON: cherry-picked from the upstream thread ahead of time, to unblock
#7288. I expect this to be
committed to upstream in the next few days, superseding this. NOTE: I
did not include the new regression test on v15 and v14 branches, because
the test would need some adapting, and we don't run the perl tests on
Neon anyway.
## Problem
The storage controller runs into a memory corruption issue on the drain/fill
code paths.

## Summary of changes
Update DB-related dependencies in the unlikely case that the issue was
fixed in diesel.
…8154)

We only use `keys` to check if it's empty so we can bail out early. No
need to collect the keys for that.

Found this while doing research for
#7418
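The general pattern, as a sketch (illustrative helper, not the actual pageserver code):

```rust
// Checking whether an iterator yields anything does not require
// collecting it into a Vec first:
// before: let keys: Vec<_> = iter.collect(); if keys.is_empty() { ... }
fn has_any_keys<I: IntoIterator>(keys: I) -> bool {
    keys.into_iter().next().is_some()
}
```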
## Problem
The Docker build cache gets invalidated by PRs

## Summary of changes
- Upload cache only from the main branch
…efresh (#8147)

Part of #7497, closes #8120.

## Summary of changes

This PR adds a metric to track the number of valid leases after `GCInfo`
gets refreshed each time.
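As a sketch (the metric name is illustrative, not the actual pageserver metric), such a gauge would be updated right after each `GCInfo` refresh:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge, IntGauge};

// Illustrative metric name; the real pageserver metric differs.
static VALID_LSN_LEASE_COUNT: Lazy<IntGauge> = Lazy::new(|| {
    register_int_gauge!(
        "pageserver_valid_lsn_lease_count",
        "Number of valid LSN leases after the last GCInfo refresh"
    )
    .expect("failed to register metric")
});

fn on_gc_info_refresh(valid_lease_count: usize) {
    VALID_LSN_LEASE_COUNT.set(valid_lease_count as i64);
}
```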

Besides this metric, we should also track disk space and synthetic size
(after #8071 is closed) to make sure leases are used properly.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
petuhovskiy and others added 9 commits June 26, 2024 16:46
I saw this compiler warning on my laptop:

pgxn/neon_walredo/walredoproc.c:178:10: warning: using the result of an
assignment as a condition without parentheses [-Wparentheses]
            if (err = close_range_syscall(3, ~0U, 0)) {
                ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pgxn/neon_walredo/walredoproc.c:178:10: note: place parentheses around
the assignment to silence this warning
            if (err = close_range_syscall(3, ~0U, 0)) {
                    ^
                (                                   )
pgxn/neon_walredo/walredoproc.c:178:10: note: use '==' to turn this
assignment into an equality comparison
            if (err = close_range_syscall(3, ~0U, 0)) {
                    ^
                    ==
    1 warning generated.

I'm not sure what compiler version or options cause that, but it's a
good warning. Write the call a little differently, to avoid the warning
and to make it a little more clear anyway. (The 'err' variable wasn't
used for anything, so I'm surprised we were not seeing a compiler
warning on the unused value, too.)
#8089)

I was looking for metrics on how many computes are still using protocol
version 1 and 2. This provides counters for that as "pagestream" and
"pagestream_v2" commands, but also all the other commands. The new
metrics are global for the whole pageserver instance rather than
per-tenant, so the additional metrics bloat should be fairly small.
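A sketch of such a global per-command counter (metric and label names are illustrative, not the pageserver's actual names):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Illustrative names; one counter per command label, shared by the whole
// pageserver instance rather than tracked per tenant.
static PAGE_SERVICE_COMMANDS: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_page_service_commands_total",
        "Page service commands received, by command name",
        &["command"]
    )
    .expect("failed to register metric")
});

fn record_command(command: &str) {
    // e.g. command = "pagestream" or "pagestream_v2"
    PAGE_SERVICE_COMMANDS.with_label_values(&[command]).inc();
}
```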
## Problem
We have seen numerous segfault and memory corruption issues for clients
using libpq and openssl 3.2.2. I don't know if this is a bug in openssl
or libpq. Downgrading to 1.1.1w fixes the issues for the storage
controller and pgbench.

## Summary of changes
Use openssl 1.1.1w instead of 3.2.2
Fixes #6337

Add safekeeper support to switch between `Present` and
`Offloaded(flush_lsn)` states. The offloading is disabled by default,
but can be controlled using new cmdline arguments:

```
      --enable-offload
          Enable automatic switching to offloaded state
      --delete-offloaded-wal
          Delete local WAL files after offloading. When disabled, they will be left on disk
      --control-file-save-interval <CONTROL_FILE_SAVE_INTERVAL>
          Pending updates to control file will be automatically saved after this interval [default: 300s]
```

The manager watches state updates and detects when there is no activity on
the timeline and the partial backup has actually been uploaded to remote
storage. When all conditions are met, the state can be switched to offloaded.

In `timeline.rs` there is `StateSK` enum to support switching between
states. When offloaded, code can access only control file structure and
cannot use `SafeKeeper` to accept new WAL.
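Conceptually (placeholder types, not the actual safekeeper definitions), the enum looks roughly like this:

```rust
// Placeholder stubs to keep the sketch self-contained.
struct SafeKeeper;               // full in-memory state, can accept new WAL
struct TimelinePersistentState;  // control file contents only

enum StateSK {
    Loaded(SafeKeeper),
    Offloaded(TimelinePersistentState),
}

impl StateSK {
    // Only a loaded timeline exposes the SafeKeeper needed to process WAL.
    fn safekeeper_mut(&mut self) -> Option<&mut SafeKeeper> {
        match self {
            StateSK::Loaded(sk) => Some(sk),
            StateSK::Offloaded(_) => None,
        }
    }
}
```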

`FullAccessTimeline` is now renamed to `WalResidentTimeline`. This
struct contains a guard to notify the manager about active tasks requiring
on-disk WAL access. All guards are issued by the manager, and all requests
are sent via a channel using `ManagerCtl`. When the manager receives a
request to issue a guard, it unevicts the timeline if it is currently evicted.

Fixed a bug in partial WAL backup: previously it used `term` instead of
`last_log_term`.

After this commit is merged, the next step is to roll this change out, as
described in issue #6338.
…ration none/broken usages (#7947)

## Problem

In #5299, the new config-v1
tenant config file was added to hold the LocationConf type. We left the
old config file in place for forward compat, and because running without
generations (and therefore without LocationConf) was still useful before the
storage controller was ready for prime time.

Closes: #5388

## Summary of changes

- Remove code for reading and writing the legacy config file
- Remove Generation::Broken: it was unused.
- Treat missing config file on disk as an error loading a tenant, rather
than defaulting it. We can now remove LocationConf::default, and thereby
guarantee that we never construct a tenant with a None generation.
- Update some comments + add some assertions to clarify that
Generation::None is only used in layer metadata, not in the state of a
running tenant.
- Update docker compose test to create tenants with a generation
Use aux file v2 in benchmarks.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Also, modify the "neon_local timeline import" command so that it
doesn't create the endpoint any more. I don't see any reason to bundle
that in the same command; the "timeline create" and "timeline branch"
commands don't do that either.

I plan to add more tests similar to 'test_import_at_2bil'; this will
help reduce the copy-pasting.
Before we consolidate workflows, we want them to be triggered by merges to main.

neondatabase/cloud#14862
@vipvap vipvap requested review from a team as code owners June 27, 2024 06:01
@vipvap vipvap requested review from hlinnaka, petuhovskiy, conradludgate and piercypixel and removed request for a team June 27, 2024 06:01

github-actions bot commented Jun 27, 2024

2928 tests run: 2811 passed, 0 failed, 117 skipped (full report)


Code coverage* (full report)

  • functions: 32.7% (6900 of 21117 functions)
  • lines: 50.1% (53984 of 107829 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
d81ef3f at 2024-06-27T09:38:15.990Z :recycle:

@conradludgate
Contributor

conradludgate commented Jun 27, 2024

Reverted the arbitrary-config support as it is potentially a breaking change for our customers and needs a proper dedicated rollout. I don't want to delay this release, though, as it has some changes we need.

Changes in this proxy release:

@conradludgate conradludgate merged commit cfb03d6 into release-proxy Jun 27, 2024
64 checks passed
@conradludgate conradludgate deleted the rc/proxy/2024-06-27 branch June 27, 2024 10:35