Release 2024-01-22 #6420

vipvap · 2024-01-22T06:01:27Z

Release 2024-01-22

Please merge this PR using 'Create a merge commit'!

After PR #6325, when running without --runtime, we wouldn't wait for start_work_barrier, causing the benchmark to not start at all.

## Problem neondatabase/serverless#51 (comment) ## Summary of changes 1. When we have a db_error, use db_error.message() as the message. 2. include error position. 3. line should be a string (weird?) 4. `datatype` -> `dataType`

We've removed this input from the deploy-prod workflow.

Fixes #6343. ## Problem PAGE_CACHE_ACQUIRE_PINNED_SLOT_TIME is used on hot path and it adds noticeable latency to GetPage@LSN. ## Refs https://discordapp.com/channels/1176467419317940276/1195022264115151001/1196370689268125716

## Problem Using the same domain name () for serverless driver can help with connection caching. #6290 ## Summary of changes Relax SNI check.

The theme of the changes in this PR is that they're enablers for #6251 which are superficial struct/api changes. This is a spinoff from #6251: - Various APIs + clients thereof take TenantShardId rather than TenantId - The creation API gets a ShardParameters member, which may be used to configure shard count and stripe size. This enables the attachment service to present a "virtual pageserver" creation endpoint that creates multiple shards. - The attachment service will use tenant size information to drive shard splitting. Make a version of `TenantHistorySize` that is usable for decoding these API responses. - ComputeSpec includes a shard stripe size.

## Problem In #5980 the page service connection handler gets a simple piece of logic for finding the right Timeline: at connection time, it picks an arbitrary Timeline, and then when handling individual page requests it checks if the original timeline is the correct shard, and if not looks one up. This is pretty slow in the case where we have to go look up the other timeline, because we take the big tenants manager lock. ## Summary of changes - Add a `shard_timelines` map of ShardIndex to Timeline on the page service connection handler - When looking up a Timeline for a particular ShardIndex, consult `shard_timelines` to avoid hitting the TenantsManager unless we really need to. - Re-work the CancellationToken handling, because the handler now holds gateguards on multiple timelines, and so must respect cancellation of _any_ timeline it has in its cache, not just the timeline related to the request it is currently servicing. --------- Co-authored-by: Vlad Lazar <vlad@neon.tech>

Follows #6123 Closes: #5342 The approach here is to avoid using `Layer` from secondary tenants, and instead make the eviction types (e.g. `EvictionCandidate`) have a variant that carries a Layer for attached tenants, and a different variant for secondary tenants. Other changes: - EvictionCandidate no longer carries a `Timeline`: this was only used for providing a witness reference to remote timeline client. - The types for returning eviction candidates are all in disk_usage_eviction_task.rs now, whereas some of them were in timeline.rs before. - The EvictionCandidate type replaces LocalLayerInfoForDiskUsageEviction type, which was basically the same thing.

This implements the `copy` operation for azure blobs, added to S3 by #6091, and adds tests both to s3 and azure ensuring that the copy operation works.

The remote_storage crate contains two copies of each test, one for azure and one for S3. The repetition is not necessary and makes the tests more prone to drift, so we remove it by moving the tests into a shared module. The module has a different name depending on where it is included, so that each test still has "s3" or "azure" in its full path, allowing you to just test the S3 test or just the azure tests. Earlier PR that removed some duplication already: #6176 Fixes #6146.

Previously, if we: 1. created a new timeline B from a different timeline's A initdb 2. deleted timeline A the initdb for timeline B would be gone, at least in a world where we are deleting initdbs upon timeline deletion. This world is imminent (#6226). Therefore, if the pageserver is instructed to load the initdb from a different timeline ID, copy it to the newly created timeline's directory in S3. This ensures that we can disaster recover the new timeline as well, regardless of whether the original timeline was deleted or not. Part of #5282.

## Problem The `/v1/tenant` listing API only applies to attached tenants. For an external service to implement a global reconciliation of its list of shards vs. what's on the pageserver, we need a full view of what's in TenantManager, including secondary tenant locations, and InProgress locations. Dependency of #6251 ## Summary of changes - Add methods to Tenant and SecondaryTenant to reconstruct the LocationConf used to create them. - Add `GET /v1/location_config` API

## Problem To test sharding, we need something to control it. We could write python code for doing this from the test runner, but this wouldn't be usable with neon_local run directly, and when we want to write tests with large number of shards/tenants, Rust is a better fit efficiently handling all the required state. This service enables automated tests to easily get a system with sharding/HA without the test itself having to set this all up by hand: existing tests can be run against sharded tenants just by setting a shard count when creating the tenant. ## Summary of changes Attachment service was previously a map of TenantId->TenantState, where the principal state stored for each tenant was the generation and the last attached pageserver. This enabled it to serve the re-attach and validate requests that the pageserver requires. In this PR, the scope of the service is extended substantially to do overall management of tenants in the pageserver, including tenant/timeline creation, live migration, evacuation of offline pageservers etc. This is done using synchronous code to make declarative changes to the tenant's intended state (`TenantState.policy` and `TenantState.intent`), which are then translated into calls into the pageserver by the `Reconciler`. Top level summary of modules within `control_plane/attachment_service/src`: - `tenant_state`: structure that represents one tenant shard. - `service`: implements the main high level such as tenant/timeline creation, marking a node offline, etc. - `scheduler`: for operations that need to pick a pageserver for a tenant, construct a scheduler and call into it. - `compute_hook`: receive notifications when a tenant shard is attached somewhere new. Once we have locations for all the shards in a tenant, emit an update to postgres configuration via the neon_local `LocalEnv`. - `http`: HTTP stubs. These mostly map to methods on `Service`, but are separated for readability and so that it'll be easier to adapt if/when we switch to another RPC layer. - `node`: structure that describes a pageserver node. The most important attribute of a node is its availability: marking a node offline causes tenant shards to reschedule away from it. This PR is a precursor to implementing the full sharding service for prod (#6342). What's the difference between this and a production-ready controller for pageservers? - JSON file persistence to be replaced with a database - Limited observability. - No concurrency limits. Marking a pageserver offline will try and migrate every tenant to a new pageserver concurrently, even if there are thousands. - Very simple scheduler that only knows to pick the pageserver with fewest tenants, and place secondary locations on a different pageserver than attached locations: it does not try to place shards for the same tenant on different pageservers. This matters little in tests, because picking the least-used pageserver usually results in round-robin placement. - Scheduler state is rebuilt exhaustively for each operation that requires a scheduler. - Relies on neon_local mechanisms for updating postgres: in production this would be something that flows through the real control plane. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>

## Problem Currently relation hash size is limited by "neon.relsize_hash_size" GUC with default value 64k. 64k relations is not so small number... but it is enough to create 376 databases to exhaust it. ## Summary of changes Use LRU replacement algorithm to prevent hash overflow ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

Notes: https://github.com/hyperium/h2/releases/tag/v0.3.24 Related: https://rustsec.org/advisories/RUSTSEC-2024-0003

I just failed to see this earlier on #6136. layer counts are used as an abstraction, and each of the two tenants lose proportionally about the same amount of layers. sadly there is no difference in between `relative_spare` and `relative_equal` as both of these end up evicting the exact same amount of layers, but I'll try to add later another test for those. Cc: #5304

## Problem tenant_id/timeline_id is no longer a full identifier for metrics from a `Tenant` or `Timeline` object. Closes: #5953 ## Summary of changes Include `shard_id` label everywhere we have `tenant_id`/`timeline_id` label.

- Start pgbouncer in VM from postgres user, to allow connection to pgbouncer admin console. - Remove unused compute_ctl options --pgbouncer-connstr and --pgbouncer-ini-path. - Fix and cleanup code of connection to pgbouncer, add retries because pgbouncer may not be instantly ready when compute_ctl starts.

## Problem Use [NEON_SMGR] for all log messages produced by neon extension. ## Summary of changes ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

…bsolute (#6384) With testing the new eviction order there is a problem of all of the (currently rare) disk usage based evictions being rare and unique; this PR adds a human readable summary of what absolute order would had done and what the relative order does. Assumption is that these loggings will make the few evictions runs in staging more useful. Cc: #5304 for allowing testing in the staging

Before this patch, pagebench get-page-latest-lsn would sometimes cause noisy errors in pageserver log about `CopyFail` protocol message. refs #6390

In 7f82889 we changed the logic for persisting control_files. Previously it was updated if `peer_horizon_lsn` jumped more than one segment, which made `peer_horizon_lsn` initialized on disk as soon as safekeeper has received a first `AppendRequest`. This caused an issue with `truncateLsn`, which now can be zero sometimes. This PR fixes it, and now `truncateLsn/peer_horizon_lsn` can never be zero once we know `timeline_start_lsn`. Closes #6248

## Problem For PRs with `run-benchmarks` label, we don't upload results to the db, making it harder to debug such tests. The only way to see some numbers is by examining GitHub Action output which is really inconvenient. This PR adds zenbenchmark metrics to Allure reports. ## Summary of changes - Create a json file with zenbenchmark results and attach it to allure report

In the most straightforward way; safekeeper performs it in DELETE endpoint implementation, with no coordination between sks. delete_force endpoint in the code is renamed to delete as there is only one way to delete.

…#6388) From #6037 on, until this patch, if the client opens the connection but doesn't send a `PagestreamFeMessage` within the first 10ms, we'd close the connection because `self.timeline_cancelled()` returns. It returns because `self.shard_timelines` is still empty at that point: it gets filled lazily within the handlers for the incoming messages. Changes ------- The question is: if we can't check for timeline cancellation, what else do we need to be cancellable for? `tenant.cancel` is also a bad choice because the `tenant` (shard) we pick at the top of handle_pagerequests might indeed go away over the course of the connection lifetime, but other shards may still be there. The correct solution, I think, is to be responsive to task_mgr cancellation, because the connection handler runs in a task_mgr task and it is already the current canonical way how we shut down a tenant's / timelin's page_service connections (see `Tenant::shutdown` / `Timeline::shutdown`). So, rename the function and make it sensitive to task_mgr cancellation.

## Problem Some fields were missed in the initial spec. ## Summary of changes Adds a success boolean (defaults to false unless specifically marked as successful). Adds a duration_us integer that tracks how many microseconds were taken from session start through to request completion.

## Problem Name in notifications is not compatible with console name. ## Summary of changes Rename fields to make it compatible.

configures nextest to kill tests after 1 minute. slow period is set to 20s which is how long our tests currently take in total, there will be 2 warnings and then the test will be killed and it's output logged. Cc: #6361 Cc: #6368 -- likely this will be enough for longer time, but it will be counter productive when we want to attach and debug; the added line would have to be commented out.

update_next_xid() doesn't have any special treatment for the invalid or other special XIDs, so it will treat InvalidTransactionId (0) as a regular XID. If old nextXid is smaller than 2^31, 0 will look like a very old XID, and nothing happens. But if nextXid is greater than 2^31 0 will look like a very new XID, and update_next_xid() will incorrectly bump up nextXID.

## Problem Currently we store in cache even if the project is undefined. That makes invalidation impossible. ## Summary of changes Do not store if project id is empty.

## Problem In #6283 I did a couple changes that weren't directly related to the goal of extracting the state machine, so I'm putting them here ## Summary of changes - move postgres vs console provider into another enum - reduce error cases for link auth - slightly refactor link flow

## Problem See https://neondb.slack.com/archives/C06F5UJH601/p1705731304237889 Adding 1 to xid in `update_next_xid` can cause overflow in debug mode. 0xffffffff is valid transaction ID. ## Summary of changes Use `wrapping_add` ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

github-actions · 2024-01-22T06:38:18Z

2268 tests run: 2178 passed, 0 failed, 90 skipped (full report)

Flaky tests (2)

Postgres 16

test_statvfs_pressure_usage: debug

Postgres 15

test_secondary_mode_eviction: release

Code coverage (full report)

functions: 54.6% (10598 of 19424 functions)
lines: 81.5% (60659 of 74431 lines)

_{The comment gets automatically updated with the latest test results
90e689a at 2024-01-22T17:04:08.146Z :recycle:}

## Problem https://rustsec.org/advisories/RUSTSEC-2024-0006 ## Summary of changes `cargo update -p shlex` (cherry picked from commit 5559b16)

jcsp · 2024-01-22T10:34:20Z

The test failure is #6385, retrying

## Problem Gc currently doesn't work properly. ## Summary of changes Change statement on running gc.

Before this patch, the select! still retured immediately if `futs` was empty. Must have tested a stale build in my manual testing of #6388. (cherry picked from commit 15c0df4)

## Problem When a tenant is in Attaching state, and waiting for the `concurrent_tenant_warmup` semaphore, it also listens for the tenant cancellation token. When that token fires, Tenant::attach drops out. Meanwhile, Tenant::set_stopping waits forever for the tenant to exit Attaching state. Fixes: #6423 ## Summary of changes - In the absence of a valid state for the tenant, it is set to Broken in this path. A more elegant solution will require more refactoring, beyond this minimal fix. (cherry picked from commit 93572a3)

danieltprice · 2024-01-25T14:41:48Z

Reviewed for Friday changelog. The only item for the changelog appears to be the serverless driver error messaging improvement. Please let me know if I missed something that should be mentioned.

problame and others added 30 commits January 15, 2024 08:54

fix(pagebench): #6325 broke running without --runtime (#6351)

0e1ef37

After PR #6325, when running without --runtime, we wouldn't wait for start_work_barrier, causing the benchmark to not start at all.

proxy: fix serverless error message info (#6279)

0bac8dd

## Problem neondatabase/serverless#51 (comment) ## Summary of changes 1. When we have a db_error, use db_error.message() as the message. 2. include error position. 3. line should be a string (weird?) 4. `datatype` -> `dataType`

do not provide disclaimer input for the deploy-prod workflow (#6360)

d34adf4

We've removed this input from the deploy-prod workflow.

Proxy relax sni check (#6323)

3f2187e

## Problem Using the same domain name () for serverless driver can help with connection caching. #6290 ## Summary of changes Relax SNI check.

Add copy operation tests and implement them for azure blobs (#6362)

4b0204e

This implements the `copy` operation for azure blobs, added to S3 by #6091, and adds tests both to s3 and azure ensuring that the copy operation works.

build: update h2 (#6383)

e247ddb

Notes: https://github.com/hyperium/h2/releases/tag/v0.3.24 Related: https://rustsec.org/advisories/RUSTSEC-2024-0003

pageserver: add shard_id to metric labels (#6308)

bd19290

## Problem tenant_id/timeline_id is no longer a full identifier for metrics from a `Tenant` or `Timeline` object. Closes: #5953 ## Summary of changes Include `shard_id` label everywhere we have `tenant_id`/`timeline_id` label.

pagebench: use tracing panic hook (#6393)

00936d1

pagebench: avoid noise about CopyFail in PS logs (#6392)

e8f7733

Before this patch, pagebench get-page-latest-lsn would sometimes cause noisy errors in pageserver log about `CopyFail` protocol message. refs #6390

Delete WAL segments from s3 when timeline is deleted.

88df057

In the most straightforward way; safekeeper performs it in DELETE endpoint implementation, with no coordination between sks. delete_force endpoint in the code is renamed to delete as there is only one way to delete.

Remove rename in parameters (#6411)

f003dd6

## Problem Name in notifications is not compatible with console name. ## Summary of changes Rename fields to make it compatible.

khanova and others added 3 commits January 20, 2024 16:14

Proxy: do not store empty key (#6415)

9ace36d

## Problem Currently we store in cache even if the project is undefined. That makes invalidation impossible. ## Summary of changes Do not store if project id is empty.

vipvap requested review from a team as code owners January 22, 2024 06:01

vipvap requested review from arssher, khanova, conradludgate, jcsp and mtyazici and removed request for a team January 22, 2024 06:01

khanova approved these changes Jan 22, 2024

View reviewed changes

arssher approved these changes Jan 22, 2024

View reviewed changes

bump shlex (#6421)

7234208

## Problem https://rustsec.org/advisories/RUSTSEC-2024-0006 ## Summary of changes `cargo update -p shlex` (cherry picked from commit 5559b16)

jcsp approved these changes Jan 22, 2024

View reviewed changes

khanova and others added 3 commits January 22, 2024 14:39

Proxy: fix gc (#6426)

299d947

## Problem Gc currently doesn't work properly. ## Summary of changes Change statement on running gc.

fixup(#6037): actually fix the issue, #6388 failed to do so (#6429)

f0b2d4b

Before this patch, the select! still retured immediately if `futs` was empty. Must have tested a stale build in my manual testing of #6388. (cherry picked from commit 15c0df4)

jcsp merged commit a1a74ee into release Jan 22, 2024
46 checks passed

jcsp deleted the releases/2024-01-22 branch January 22, 2024 17:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release 2024-01-22 #6420

Release 2024-01-22 #6420

vipvap commented Jan 22, 2024

github-actions bot commented Jan 22, 2024 •

edited

Loading

Postgres 16

Postgres 15

jcsp commented Jan 22, 2024

danieltprice commented Jan 25, 2024

Release 2024-01-22 #6420

Release 2024-01-22 #6420

Conversation

vipvap commented Jan 22, 2024