
Release 2024-02-05 #6617

Merged

67 commits merged into release from releases/2024-02-05 on Feb 5, 2024

Conversation


@vipvap vipvap commented Feb 5, 2024

Release 2024-02-05

Please merge this PR using 'Create a merge commit'!

conradludgate and others added 30 commits January 29, 2024 07:26
## Problem

Measuring cardinality using logs is expensive and slow.

## Summary of changes

Implement a pre-aggregated HyperLogLog-based cardinality estimate.
HyperLogLog estimates the cardinality of a set by exploiting the fact that
the probability of a uniform hash of a value ending in a run of `n` zeros is
`1/2^n`; therefore, having observed a run of `n` zeros suggests we have seen
roughly `2^n` distinct values. By using multiple shards, we can take the
harmonic mean to get a more accurate estimate.

We record this into a Prometheus time series. HyperLogLog counts can be
merged by taking the `max` of each shard, so we can apply `max_over_time`
to estimate the cardinality of distinct values over time.
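
As a rough illustration of the estimation idea, here is a minimal sketch (not the proxy's actual implementation; the shard count, hash function, and bias constant are illustrative):

```
// Minimal HyperLogLog-style sketch. Each shard remembers the longest run of
// trailing zeros seen among the hashes routed to it; the harmonic mean of the
// per-shard estimates 2^n gives the combined cardinality estimate.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SHARDS: usize = 64; // illustrative shard count

struct Hll {
    // Longest observed run of trailing zeros per shard.
    zeros: [u8; SHARDS],
}

impl Hll {
    fn new() -> Self {
        Hll { zeros: [0; SHARDS] }
    }

    fn observe<T: Hash>(&mut self, value: &T) {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        let hash = h.finish();
        // Low bits pick the shard, the rest feed the trailing-zero count.
        let shard = (hash as usize) % SHARDS;
        let rest = hash >> 6;
        let zeros = rest.trailing_zeros().min(57) as u8 + 1;
        self.zeros[shard] = self.zeros[shard].max(zeros);
    }

    // Two sketches merge by taking the per-shard max, which is what makes
    // `max_over_time` on the exported series meaningful.
    fn merge(&mut self, other: &Hll) {
        for (a, b) in self.zeros.iter_mut().zip(other.zeros.iter()) {
            *a = (*a).max(*b);
        }
    }

    fn estimate(&self) -> f64 {
        // Harmonic mean of the per-shard estimates (bias correction omitted).
        let m = SHARDS as f64;
        let sum: f64 = self.zeros.iter().map(|&z| 2f64.powi(-(z as i32))).sum();
        0.709 * m * m / sum
    }
}
```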
This is the "partial revert" of #6384. The summaries turned out to be
expensive due to naive vec usage, but also inconclusive because of the
additional context required. In addition to removing summary traces,
small refactoring is done.
## Problem
There's no efficient way of querying the layer map for a range.

## Summary of changes
Introduce a range query for the layer map (`LayerMap::range_search`).
There are two broad steps to it:
1. Find all coverage changes for layers that intersect the queried range
(see `LayerCoverage::range_overlaps`). The slightly tricky part is dealing
with the start of the range: we can either be aligned with a layer or not,
and the two cases need to be treated differently.
2. Iterate over the coverage changes and collect the result. For this we
use a two-pointer approach: the trailing pointer tracks the start of the
current range (the current location in the key space) and the forward
pointer tracks the next coverage change (see the sketch below).

Plugging the range search into the read path is deferred to a future PR.
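
For illustration, a simplified sketch of step 2's two-pointer collection over a sorted set of coverage changes; the `Key`/`Layer` types and the `BTreeMap` representation are hypothetical stand-ins, not the real `LayerMap` internals:

```
use std::collections::BTreeMap;
use std::ops::{Bound, Range};

// Hypothetical stand-ins for the real layer map types.
type Key = u64;
type Layer = String;

/// Collect (sub-range, coverage) pairs for `range` from a sorted set of
/// coverage changes (key -> layer covering the space from that key onward,
/// or None for a gap). The trailing pointer `start` tracks the start of the
/// current sub-range; the iterator advances to the next coverage change.
fn range_search(
    changes: &BTreeMap<Key, Option<Layer>>,
    range: Range<Key>,
) -> Vec<(Range<Key>, Option<Layer>)> {
    assert!(range.start < range.end, "non-empty query range");
    let mut results = Vec::new();

    // Coverage in effect at the start of the queried range: either a change
    // sits exactly at `range.start` (the aligned case) or we inherit the
    // last change before it.
    let mut current = changes
        .range(..=range.start)
        .next_back()
        .and_then(|(_, layer)| layer.clone());
    let mut start = range.start;

    // Walk the coverage changes that fall strictly inside the queried range.
    for (&change_key, layer) in
        changes.range((Bound::Excluded(range.start), Bound::Excluded(range.end)))
    {
        if change_key > start {
            results.push((start..change_key, current.clone()));
        }
        start = change_key;
        current = layer.clone();
    }

    if start < range.end {
        results.push((start..range.end, current));
    }
    results
}
```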

## Performance
I adapted the layer map benchmarks on a local branch. Range searches are
between 2x and 2.5x slower than point searches. That's in line with what I
expected, since we query the layer map twice.

Since `Timeline::get` will proxy to `Timeline::get_vectored`, we can
special-case the one-element layer map range search at that point.
## Problem

Creating sharded tenants will require an instance of the sharding
service -- the initial goal is to deploy one of these in a staging
region (neondatabase/cloud#9718). It will run
as a Kubernetes container, similar to the storage broker, so it needs to be
built into the container image.

## Summary of changes

Add `attachment_service` binary to container image
## Problem
The `rdkit` extension is built with `RDK_BUILD_FREETYPE_SUPPORT=ON` (by
default), which requires a bunch of additional dependencies, but FreeType
font support isn't required for Postgres.


With `RDK_BUILD_FREETYPE_SUPPORT=ON`:
```
ldd /usr/local/pgsql/lib/rdkit.so
	linux-vdso.so.1 (0x0000ffff82ea8000)
	libfreetype.so.6 => /usr/lib/aarch64-linux-gnu/libfreetype.so.6 (0x0000ffff825e5000)
	libboost_serialization.so.1.74.0 => /usr/lib/aarch64-linux-gnu/libboost_serialization.so.1.74.0 (0x0000ffff82590000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff8255f000)
	libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffff82387000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff822dc000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff822b8000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff82144000)
	libpng16.so.16 => /usr/lib/aarch64-linux-gnu/libpng16.so.16 (0x0000ffff820fd000)
	libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000ffff820d3000)
	libbrotlidec.so.1 => /usr/lib/aarch64-linux-gnu/libbrotlidec.so.1 (0x0000ffff820b8000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffff82e78000)
	libbrotlicommon.so.1 => /usr/lib/aarch64-linux-gnu/libbrotlicommon.so.1 (0x0000ffff82087000)
```

With `RDK_BUILD_FREETYPE_SUPPORT=OFF`:
```
ldd /usr/local/pgsql/lib/rdkit.so
	linux-vdso.so.1 (0x0000ffffbba75000)
	libboost_serialization.so.1.74.0 => /usr/lib/aarch64-linux-gnu/libboost_serialization.so.1.74.0 (0x0000ffffbb259000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffbb228000)
	libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffffbb050000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffffbafa5000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffbaf81000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffbae0d000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffffbba45000)
```

## Summary of changes
- Build `rdkit` with `RDK_BUILD_FREETYPE_SUPPORT=OFF`
- Remove extra dependencies from the Compute image
## Problem

The tenants we want to recover might have tens of thousands of keys, or
more. At that point, the AWS API returns a paginated response.

## Summary of changes

Support paginated responses for `list_object_versions` requests.

Follow-up of #6155, part of neondatabase/cloud#8233
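
For illustration, a minimal sketch of the pagination contract (key/version-id markers plus an `is_truncated` flag); the `ListPage` type and `fetch_page` closure are hypothetical stand-ins for the actual SDK call:

```
// Hypothetical page type; only the pagination contract matters here, not the
// exact SDK types.
struct ListPage {
    versions: Vec<String>,
    delete_markers: Vec<String>,
    is_truncated: bool,
    next_key_marker: Option<String>,
    next_version_id_marker: Option<String>,
}

/// Drain every page of a versions listing. `fetch_page` stands in for one
/// `list_object_versions` call, given the markers of the previous page.
fn list_all_versions(
    mut fetch_page: impl FnMut(Option<&str>, Option<&str>) -> ListPage,
) -> (Vec<String>, Vec<String>) {
    let mut versions = Vec::new();
    let mut delete_markers = Vec::new();
    let mut key_marker: Option<String> = None;
    let mut version_id_marker: Option<String> = None;

    loop {
        let page = fetch_page(key_marker.as_deref(), version_id_marker.as_deref());
        versions.extend(page.versions);
        delete_markers.extend(page.delete_markers);
        if !page.is_truncated {
            break;
        }
        // Continue the listing from where the previous page left off.
        key_marker = page.next_key_marker;
        version_id_marker = page.next_version_id_marker;
    }
    (versions, delete_markers)
}
```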
## Problem

Taking my ideas from #6283, but making less radical changes, in smaller
commits.

The proxy flow was quite deeply nested, which makes adding more interesting
error handling quite tricky.

## Summary of changes

I recommend reviewing commit by commit.

1. Move handshake logic into a separate file
2. Move passthrough logic into a separate file
3. No longer accept a closure in CancelMap session logic
4. Remove connect_to_db; copy the logic into handle_client
5. Flatten auth_and_wake_compute in authenticate
6. Record info for link auth
Update pgvector extension from 0.5.1 to 0.6.0
When using spawn + wait_with_output instead of
`std::process::Command::output` or `tokio::process::Command::output`, we
must configure the redirection ourselves.

Fixes #6523 by discarding stdout completely; we only care about stderr, if
any.
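
For illustration, a minimal sketch of the redirection setup with `std::process` (the command and error handling are illustrative, not the code from this PR):

```
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Unlike `Command::output()`, which pipes stdout/stderr for you,
    // `spawn()` + `wait_with_output()` inherits the parent's streams unless
    // the redirection is configured explicitly.
    let child = Command::new("ls")
        .arg("-l")
        .stdout(Stdio::null())   // discard stdout entirely
        .stderr(Stdio::piped())  // capture stderr, the only stream we care about
        .spawn()?;

    let output = child.wait_with_output()?;
    if !output.status.success() {
        eprintln!("command failed: {}", String::from_utf8_lossy(&output.stderr));
    }
    Ok(())
}
```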
…ng basebackup (#6400)

Before this patch, when requesting basebackup for a not-found tenant or
timeline, we'd emit an ERROR-level log entry with a huge stack trace.
See the "Details" section of #6366 for an example.

With this patch, we log at INFO level and only a single line.
Example:

```
2024-01-19T14:16:11.479800Z  INFO page_service_conn_main{peer_addr=127.0.0.1:43448}: query handler for 'basebackup d69a536d529a68fcf85bc070030cdf4b 035484e9c28d8d0138a492caadd03ffd 0/2204340 --gzip' entity not found: Tenant d69a536d529a68fcf85bc070030cdf4b not found
2024-01-19T14:19:35.807819Z  INFO page_service_conn_main{peer_addr=127.0.0.1:48862}: query handler for 'basebackup d69a536d529a68fcf85bc070030cdf4a 035484e9c28d8d0138a492caadd03ffd 0/2204340 --gzip' entity not found: Timeline d69a536d529a68fcf85bc070030cdf4a/035484e9c28d8d0138a492caadd03ffd was not found
```

fixes #6366

Changes
-------

- Change `handle_basebackup_request` to return a `QueryError`
- The new `impl From<WaitLsnError> for QueryError` is needed so the `?`
at `wait_lsn()` call in `handle_basebackup_request` works again. It's
duplicating `impl From<WaitLsnError> for PageStreamError`.
- Remove the hard-to-spot conversion of the `handle_basebackup_request`
return value to `anyhow::Result` (the place where I replaced `anyhow::Ok`
with `Result::<(), QueryError>::Ok(())`).
- Add forgotten distinguished handling for "Tenant not found" case in
`impl From<GetActiveTenantError> for QueryError`

This was not at all pleasant, and I find it very hard to follow the
various error conversions.
It took me a while to spot the hard-to-spot `anyhow::Ok` thing above.
It would have been caught by the compiler if we weren't auto-converting
`anyhow::Error` into `QueryError::Other`.
We should move away from that, in my opinion, instead forcing each
`.context()` site to become `.context().map_err(QueryError::Other)`.
But that's for a future PR.
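
To illustrate the point, a hedged sketch with a simplified stand-in for `QueryError` (not the pageserver's real definition), contrasting the blanket `From<anyhow::Error>` conversion with the explicit `.map_err(QueryError::Other)` style:

```
use anyhow::Context;

// Simplified stand-in for the pageserver's QueryError.
#[derive(Debug)]
enum QueryError {
    NotFound(String),
    Other(anyhow::Error),
}

// With a blanket From impl, any `?` on an anyhow error silently becomes
// QueryError::Other, which is what made the `anyhow::Ok` mix-up easy to miss.
impl From<anyhow::Error> for QueryError {
    fn from(e: anyhow::Error) -> Self {
        QueryError::Other(e)
    }
}

fn load_config() -> anyhow::Result<String> {
    std::fs::read_to_string("config.toml").context("read config")
}

// Auto-conversion: compiles even though nothing names the conversion.
fn handler_implicit() -> Result<String, QueryError> {
    Ok(load_config().context("while handling query")?)
}

// The explicit style proposed for a future PR: each `.context()` site names
// the conversion, so mismatched Ok/Err types are flagged by the compiler.
fn handler_explicit() -> Result<String, QueryError> {
    load_config()
        .context("while handling query")
        .map_err(QueryError::Other)
}
```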
It hung if the file size is less than that of a normal segment. Normally that
doesn't happen, but it might in case of a crash during segment init. We're
going to fix that half-initialized segment by durably renaming it after
cooking, so this fix won't be needed, but better to avoid a busy loop anyway.

fixes #6401
Since fdatasync is used for flushing WAL, changing the file size is unsafe.
Make segment creation atomic by using a tmp file + rename, to avoid using
partially initialized segments.

fixes #6402
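
For illustration, a minimal sketch of the tmp-file-plus-rename pattern (the general pattern only, not the safekeeper's exact code; paths and the helper name are made up):

```
use std::fs::{self, File, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Create a fully-initialized file atomically: write and fsync a temporary
/// file, rename it into place, then fsync the directory so the rename itself
/// is durable. Readers never observe a partially initialized segment.
fn create_segment_atomically(dir: &Path, name: &str, contents: &[u8]) -> std::io::Result<()> {
    let tmp_path = dir.join(format!("{name}.tmp"));
    let final_path = dir.join(name);

    let mut tmp = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(&tmp_path)?;
    tmp.write_all(contents)?;
    tmp.sync_all()?; // flush data and metadata of the tmp file

    fs::rename(&tmp_path, &final_path)?;
    File::open(dir)?.sync_all()?; // make the rename itself durable

    Ok(())
}
```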
## Problem

PR #6500 has removed the limiting by number of versions/deletions for
time travel calls. We never get informed about how many versions there
are, and thus the call would just hang without any indication of
progress.

## Summary of changes

We improve the pageserver's behaviour with large prefixes, i.e. those with
many keys, whether removed or still available.

* Add a hard limit of 100k versions/deletions. For the reasoning see
neondatabase/cloud#8233 (comment), but the TL;DR is that it will roughly
support tenants of 2 TiB size, depending of course on general write activity
and the duration of the S3 retention window. The goal is to have a limit at
all, so that the process doesn't accumulate an ever-increasing number of
versions until an eventual crash.
* Lower the RAM footprint of the `VerOrDelete` data structure. We now don't
cache a lot of redundant metadata in RAM, such as the owner ID. The top-level
data structure's footprint goes down from 264 bytes to 80 (but it contains
strings that are not counted in that figure).

Follow-up of #6500, part of neondatabase/cloud#8233

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem

`pgvector` requires a patch to work well with Neon (a patch created by
@hlinnaka)

## Summary of changes
- Apply the patch to `pgvector`
## Problem
We currently can't create subscriptions in PG14 and PG15 because only
superusers can, and PG16 requires adding roles to `pg_create_subscription`.

## Summary of changes
I added changes to PG14 and PG15 that allow `neon_superuser` to bypass the
superuser requirement. For PG16 I didn't do that, but instead added a
migration that adds `neon_superuser` to `pg_create_subscription`. Also added
a test to make sure it works.
…ned by PS (tenant not found) (#6522)

## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1706531433057289

## Summary of changes

1. Do not decrease the reconnect timeout until the maximal interval value
(1 second) is reached.
2. Compute the reconnect time after the connection attempt, so the connect
time itself is excluded from the interval measurement.

So now a backend should not perform more than four reconnect attempts per
second (see the sketch below). Note, however, that backoff is performed
locally in each backend, so if there are many active backends, the connection
(and therefore error) rate may be much higher.
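
For illustration, a minimal sketch of the intended pacing (illustrative only; the real logic lives in each backend's reconnect handling, and the initial delay here is made up):

```
use std::time::Duration;

const MAX_RECONNECT_INTERVAL: Duration = Duration::from_secs(1);

// The pause starts only after a failed attempt returns, so connect time is
// excluded from the interval, and the delay keeps growing until it reaches
// the 1 second cap.
fn reconnect_loop(mut try_connect: impl FnMut() -> bool) {
    let mut delay = Duration::from_millis(125); // illustrative starting delay

    loop {
        if try_connect() {
            return;
        }
        // Measured from after the attempt, not from its start.
        std::thread::sleep(delay);
        if delay < MAX_RECONNECT_INTERVAL {
            delay = std::cmp::min(delay * 2, MAX_RECONNECT_INTERVAL);
        }
    }
}
```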

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Depends on: #6468

## Problem

The sharding service will be used as a "virtual pageserver" by the
control plane -- so it needs the set of pageserver APIs that the control
plane uses, and to present them under identical URLs, including prefix
(/v1).

## Summary of changes

- Add missing APIs:
  - Tenant deletion
  - Timeline deletion
  - Node list (used in test now, later in tools)
  - `/location_config` API (for migrating tenants into the sharding service)
- Rework attachment service URLs:
  - `/v1` prefix is used for pageserver-compatible APIs
  - `/upcall/v1` prefix is used for APIs that are called by the pageserver (re-attach and validate)
  - `/debug/v1` prefix is used for endpoints that are for testing
  - `/control/v1` prefix is used for new sharding service APIs that do not mimic a pageserver API, such as registering and configuring nodes.
- Add `test_sharding_service`. The sharding service already had some collateral coverage from its use in general tests, but this is the first dedicated testing for it.
## Summary of changes

Experiment with jemalloc in proxy
This removes the last remnants of the version param added by #5608,
concluding the transition plan laid out in
neondatabase/cloud#7553 (comment).
It follows PR neondatabase/cloud#9202, which we
now assume has been deployed to all environments.

Full history:

* #5608 
* neondatabase/cloud#7553
* #6178
* neondatabase/cloud#9202

Some tests that are essentially unit tests do not need to run on different
Postgres versions. The logging test, which I came across for unrelated
reasons, is one of them.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem

The `--path` argument is only used in testing, for compat tests that use
a JSON snapshot of state rather than the Postgres database. In regular
deployments it should be omitted (currently one has to specify `--path ""`).

## Summary of changes

Make `--path` optional.
## Problem

See neondatabase/cloud#8673

## Summary of changes


Download missing SLRU segments from the pageserver.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Fixes some duplication caused by extra or misconfigured `#[instrument]`
attributes, while filling in the `timeline_id` for the delete timeline flow
calls.

Changes:
- two messages instead of a message every second while a gate was closing
- replace the gate name string with a pointer
- slow GateGuards are likely to log who they were (see example)

example found in regress tests: <#6542 (comment)>
## Problem

Right now, if the get_role_secret response wasn't cached (e.g. the cache
already reached its max size), we send a second, identical request.

## Summary of changes

Avoid the needless request.
This refactoring makes it easier to experimentally replace
BACKGROUND_RUNTIME with a single-threaded runtime. Found this useful
[during benchmarking](#6555).
Before this patch, pagebench was always showing the same value.

refs #6509
arpad-m and others added 6 commits February 3, 2024 02:16
* log when `lsn_by_timestamp` finished together with its result
* add back logging of the layer name as suggested in
#6549 (comment)
Author: Alexander Bayandin <alexander@neon.tech>
This includes a compatibility patch that is needed because pgvector
now skips WAL-logging during the index build, and WAL-logs the index
only in one go at the end. That's how GIN, GiST and SP-GIST index
builds work in core PostgreSQL too, but we need some Neon-specific
calls to mark the beginning and end of those build phases.

pgvector is the first index AM that does that with parallel workers,
so I had to modify those functions in the Neon extension to be aware
of parallel workers. Only the leader needs to create the underlying
file and perform the WAL-logging. (In principle, the parallel workers
could participate in the WAL-logging too, but pgvector doesn't do
that. This will need some further work if that changes.)

The previous attempt at this (#6592) missed that parallel workers
needed those changes, and segfaulted in parallel build that spilled to
disk.

Testing
-------

We don't have a place for regression tests of extensions at the
moment. I tested this manually with the following script:

```
CREATE EXTENSION IF NOT EXISTS vector;

DROP TABLE IF EXISTS tst;
CREATE TABLE tst (i serial, v vector(3));

INSERT INTO tst (v) SELECT ARRAY[random(), random(), random()] FROM generate_series(1, 15000) g;

-- Serial build, in memory
ALTER TABLE tst SET (parallel_workers=0);
SET maintenance_work_mem='50 MB';
CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops);

-- Test that the index works. (The table contents are random, and the
-- search is approximate anyway, so we cannot check the exact values.
-- For now, just eyeball that they look reasonable)
set enable_seqscan=off;
explain SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5;
SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5;

DROP INDEX idx;

-- Serial build, spills to disk

ALTER TABLE tst SET (parallel_workers=0);
SET maintenance_work_mem='5 MB';
CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops);
SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5;
DROP INDEX idx;

-- Parallel build, in memory

ALTER TABLE tst SET (parallel_workers=4);
SET maintenance_work_mem='50 MB';
CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops);
SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5;
DROP INDEX idx;

-- Parallel build, spills to disk

ALTER TABLE tst SET (parallel_workers=4);
SET maintenance_work_mem='5 MB';
CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops);
SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5;
DROP INDEX idx;
```
Replace a TODO with an existing implementation via `BufMut::writer`.
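
For reference, a small usage sketch of the `writer` adapter from the `bytes` crate (the values are illustrative, not from this PR):

```
use bytes::BufMut;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // `BufMut::writer` adapts any BufMut (here a Vec<u8>) into a
    // std::io::Write, so code that expects a writer can fill the buffer
    // directly instead of going through a hand-rolled adapter.
    let mut writer = Vec::<u8>::with_capacity(64).writer();
    write!(writer, "lsn={:#x}", 0x16960E8)?;
    let buf: Vec<u8> = writer.into_inner();
    assert_eq!(buf, b"lsn=0x16960e8".to_vec());
    Ok(())
}
```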
## Problem
 Found typos while reading the docs

## Summary of changes
Fixed the typos found
The issue is still unsolved because of the shmem size in VMs. We need to figure it out before applying this patch.

For more details:

```
ERROR:  could not resize shared memory segment "/PostgreSQL.2892504480" to 16774205952 bytes: No space left on device
```

As an example, see the same issue in upstream pgvector: pgvector/pgvector#453.
@vipvap vipvap requested review from a team as code owners February 5, 2024 06:01
@vipvap vipvap requested review from save-buffer, petuhovskiy, conradludgate, koivunej and mtyazici and removed request for a team February 5, 2024 06:01

github-actions bot commented Feb 5, 2024

2388 tests run: 2275 passed, 0 failed, 113 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_ondemand_download_large_rel: debug

Code coverage (full report)

  • functions: 54.4% (11286 of 20740 functions)
  • lines: 81.5% (63538 of 77975 lines)

The comment gets automatically updated with the latest test results.
d0cb4b8 at 2024-02-05T10:48:37.749Z :recycle:

There is currently no cleanup done after a delta layer creation error,
so delta layers can accumulate. The problem gets worse as the operation
gets retried and delta layers accumulate on the disk. Therefore, delete
them from disk (if something has been written to disk).
@VladLazar VladLazar merged commit b923805 into release Feb 5, 2024
46 checks passed
@VladLazar VladLazar deleted the releases/2024-02-05 branch February 5, 2024 12:50
@danieltprice
Contributor

Reviewed 02-09-2024 changelog
