Release 2024-03-04 #6993

vipvap · 2024-03-04T06:17:43Z

Release 2024-03-04

Please merge this Pull Request using 'Create a merge commit' button

## Problem

Sometimes folks prefer not to expose secrets as CLI args.

## Summary of changes

- Add ability to load secrets from environment variables.

We can eventually remove the AWS SM code path here if nobody is using it -- we don't need to maintain three ways to load secrets.
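As an illustration only, the lookup order described above could be sketched as follows. This is Python rather than the project's own code, and the function name `load_secret`, its parameters, and the "CLI value wins over environment" precedence are all assumptions made for the example, not something this PR states.

```python
import os
from typing import Mapping, Optional


def load_secret(
    name: str,
    cli_value: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return a secret from the CLI argument if given, else from the environment.

    Hypothetical helper: the precedence (CLI first, then env) is an assumption.
    """
    if cli_value is not None:
        return cli_value
    # Fall back to the environment variable; None means "not configured".
    return env.get(name)
```

Passing `env` explicitly keeps the helper easy to test without mutating the real process environment.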

Add off-by-default support for lazy queued tenant activation on attach. This should be useful for bulk migrations, as some tenants will be activated sooner by operations or endpoint startup. Eventually all tenants will get activated by reusing the same mechanism we have at startup (`PageserverConf::concurrent_tenant_warmup`).

The difference between lazily attached tenants and startup ones is that we leave their initial logical size calculation to be triggered by WalReceiver or consumption metrics.

Fixes: #6315

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>

…LFC (#6935)

## Summary of changes

Calculate the number of unique page accesses at compute. It can be used to estimate working set size and adjust cache size (shared_buffers or local file cache). The approximation is made using the HyperLogLog algorithm. It is performed by the local file cache and so is available only when the local file cache is enabled.

This calculation doesn't take into account accesses to pages present in shared buffers, but includes pages available in the local file cache.

This information can be retrieved using the approximate_working_set_size(reset bool) function from the neon extension. The reset parameter can be used to reset the statistic and so collect unique accesses for a particular interval.

Below is an example of estimating working set size after pgbench -c 10 -S -T 100 -s 10:

```
postgres=# select approximate_working_set_size(false);
 approximate_working_set_size
------------------------------
                        19052
(1 row)

postgres=# select pg_table_size('pgbench_accounts')/8192;
 ?column?
----------
    16402
(1 row)
```

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
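For readers unfamiliar with HyperLogLog, a minimal textbook sketch of the estimator is shown below. It is not the neon extension's implementation: the class name, the register count (`p = 12`), the BLAKE2b hash, and the `relation:block` page key are assumptions chosen for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers and a 64-bit hash."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: bytes) -> None:
        h = int.from_bytes(hashlib.blake2b(item, digest_size=8).digest(), "big")
        idx = h >> (64 - self.p)                # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def count_unique_pages(accesses) -> float:
    """Estimate unique (relation, block) pairs in a stream of page accesses."""
    hll = HyperLogLog()
    for rel, block in accesses:
        hll.add(f"{rel}:{block}".encode())
    return hll.estimate()
```

Repeated accesses to the same page hash to the same register and rank, so they do not change the estimate; only distinct pages do. With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%.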

The test token expired earlier today (1709200879). I regenerated the token, but without an expiration date this time.

ref #6969 Signed-off-by: Alex Chi Z <chi@neon.tech>

…questContextAdaptor` uses it (#6961) Extracted from #6953 Part of #5899

…rics (#6131)

Because of bugs, evictions could hang and pause the disk usage eviction task. One such bug is known and fixed in #6928. Guard each layer eviction with a modest timeout, treating timed-out evictions as failures, to be conservative.

In addition, add logging and metrics recording on each eviction iteration:

- log collection completed with duration and amount of layers
  - per tenant collection time is observed in a new histogram
  - per tenant layer count is observed in a new histogram
- record metric for collected, selected and evicted layer counts
- log if eviction takes more than 10s
- log eviction completion with eviction duration

Additionally, remove dead code for which no dead code warnings appeared in the earlier PR.

Follow-up to: #6060.
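The "guard each eviction with a timeout and count a timeout as a failure" idea can be sketched as below. This is a Python `asyncio` illustration, not the pageserver's code; `evict_with_timeout` and the shape of the `evict` callable are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def evict_with_timeout(
    evict: Callable[[Any], Awaitable[None]],
    layer: Any,
    timeout_s: float,
) -> bool:
    """Run one eviction under a timeout; report a timeout as a failure.

    Returns True if the eviction finished in time, False if it timed out.
    """
    try:
        await asyncio.wait_for(evict(layer), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # Conservative choice: a hung eviction is counted as failed, so the
        # caller never assumes the space was actually freed.
        return False
```

Because the guard wraps each layer individually, one hung eviction can delay the task by at most `timeout_s` instead of pausing it indefinitely.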

…in callers (#6960)

Extracted from #6953

Part of #5899

Core Change
-----------

In #6953, we need the ability to scan the log _after_ a specific line and ignore anything before that line.

This PR changes `log_contains` to return a tuple of `(matching line, cursor)`. Hand that cursor to a subsequent `log_contains` call to search the log for the next occurrence of the pattern.

Other Changes
-------------

- Inspect all the callsites of `log_contains` to handle the new tuple return type.
- The above inspection revealed that many callers aren't using `assert log_contains(...) is not None` but some weaker version of the code that breaks if `log_contains` ever returns a not-None but falsy value. Fix that.
- The above changes revealed that `test_remote_storage_upload_queue_retries` was using `wait_until` incorrectly; after fixing the usage, I had to raise the `wait_until` timeout. So, maybe this will fix its flakiness.
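A minimal sketch of the cursor-returning search, assuming the log is already available as a list of lines and that the cursor is simply the index of the next line to scan. The real test helper reads the pageserver log and its cursor type may differ; the signature here is hypothetical.

```python
import re
from typing import List, Optional, Tuple


def log_contains(
    lines: List[str], pattern: str, offset: int = 0
) -> Optional[Tuple[str, int]]:
    """Return (matching line, cursor) for the first match at or after offset.

    The cursor points just past the match, so passing it back in continues
    the search after that line. Returns None when nothing matches.
    """
    regex = re.compile(pattern)
    for i in range(offset, len(lines)):
        if regex.search(lines[i]):
            return lines[i], i + 1
    return None


log = ["start", "error A", "ok", "error B"]
first = log_contains(log, "error")
assert first is not None          # explicit None check, not truthiness
second = log_contains(log, "error", offset=first[1])
```

Checking `is not None` rather than relying on truthiness matters here for the reason given in the list above: a caller that only tests truthiness would misbehave if the helper ever returned a not-None but falsy value.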

…6980)

## Problem

PR #6935 introduced a new function in the neon extension: approximate_working_set_size

This test case verifies that it works correctly.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>

## Problem

shard_id in span is repeated:

- #6723

Closes: #6723

## Summary of changes

- Only add shard_id to the span when fetching a cached timeline, as it is already added when loading an uncached timeline.

On eu-west-1 during benchmarks we sometimes lose samples. Add more time measurements.

## Problem

PR #6851 implemented new output in PostgreSQL EXPLAIN. This is a test case for the new output.

## Summary of changes

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [x] If it is a core feature, I have added thorough tests.
- [no ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [no] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above checklist

## Problem

#6661 changed the layer flushing logic and led to OOMs in staging. The issue turned out to be holding on to in-memory layers for too long. After OOMing we'd need to replay potentially a lot of WAL.

## Summary of changes

Test that open layers get flushed after the `checkpoint_timeout` config and do not require WAL reingest upon restart.

The workload creates a number of timelines and writes some data to each, but not enough to trigger flushes via the `checkpoint_distance` config.

I ran this test against #6661 and it was indeed failing.

Nightly has added a bunch of compiler and linter warnings. There are also two dependencies that fail compilation on latest nightly due to using the old `stdsimd` feature name. This PR fixes them.

## Problem

PR #6837 fixed secondary locations to avoid spamming log warnings on temp files, but we also have ".temp_download" files to consider.

## Summary of changes

- Give temp_download files the same behavior as temp files.
- Refactor the relevant helper to pub(crate) from pub

## Problem

At high ingest rates, pageservers spuriously disconnect from safekeepers because stats updates don't come in frequently enough to keep the broker/safekeeper LSN delta under the wal lag limit.

## Summary of changes

- Increase DEFAULT_MAX_WALRECEIVER_LSN_WAL_LAG from 10MiB to 1GiB. This should be enough for realistic per-timeline throughputs.
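A back-of-the-envelope calculation shows why the old limit was easy to exceed. The sketch below uses a deliberately simplified model (lag grows at the ingest rate between stats updates), and the 100 MiB/s ingest rate is an assumed figure for illustration, not a measurement from this PR.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def seconds_until_lag_limit(limit_bytes: int, ingest_bytes_per_s: float) -> float:
    """Time for WAL lag to reach the limit if no stats update arrives meanwhile."""
    return limit_bytes / ingest_bytes_per_s


ASSUMED_INGEST = 100 * MIB  # hypothetical per-timeline ingest rate, bytes/s

old_window = seconds_until_lag_limit(10 * MIB, ASSUMED_INGEST)  # 0.1 s
new_window = seconds_until_lag_limit(1 * GIB, ASSUMED_INGEST)   # 10.24 s
```

Under that assumed rate, the 10MiB limit tolerates only about a tenth of a second between updates, while 1GiB tolerates roughly ten seconds.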

The user created with the `--create-test-user` flag is `test` instead of `user`. ref #6848 Signed-off-by: Alex Chi Z <chi@neon.tech>

…y, configuration updates (#6521)

During onboarding, the control plane may attempt ad-hoc creation of a secondary location to facilitate live migration. This gives us two problems to solve:

- Accept 'Secondary' mode in /location_config and use it to put the tenant into secondary mode on some physical pageserver, then pass through /tenant/xyz/secondary/download requests
- Create tenants with no generation initially, since the initial `Secondary` mode call will not provide us a generation.

This PR also fixes modification of a tenant's TenantConf during /location_conf, which was previously ignored, and refines the flow for config modification:

- avoid bumping generations when the only reason we're reconciling an attached location is a config change
- increment TenantState.sequence when spawning a reconciler: usually schedule() does this, but when we do config changes that doesn't happen, so without this change waiters would think reconciliation was done immediately. `sequence` is a bit of a murky thing right now, as it's dual-purposed for tracking waiters, and for checking if an existing reconciliation is already making updates to our current sequence. I'll follow up at some point to clarify its purpose.
- test config modification at the end of onboarding test

github-actions · 2024-03-04T07:03:12Z

2484 tests run: 2361 passed, 0 failed, 123 skipped (full report)

Code coverage* (full report)

functions: 28.7% (6932 of 24161 functions)
lines: 47.2% (42560 of 90170 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
20d0939 at 2024-03-04T07:03:11.367Z

jcsp and others added 18 commits February 29, 2024 10:00

libs: fix expired token in auth decode test (#6963)

5984eda

The test token expired earlier today (1709200879). I regenerated the token, but without an expiration date this time.

test: disable test_superuser on pg15 (#6972)

76ab57f

ref #6969 Signed-off-by: Alex Chi Z <chi@neon.tech>

refactor(compaction): RequestContext shouldn't be Clone, only `Re…

502b69b

…questContextAdaptor` uses it (#6961) Extracted from #6953 Part of #5899

pageserver: fix duplicate shard_id in span (#6981)

f8bdce1

## Problem

shard_id in span is repeated:

- #6723

Closes: #6723

## Summary of changes

- Only add shard_id to the span when fetching a cached timeline, as it is already added when loading an uncached timeline.

metrics: record more details of the responding (#6979)

5ab10d0

On eu-west-1 during benchmarks we sometimes lose samples. Add more time measurements.

Fix warnings and compile errors on nightly (#6886)

82853cc

Nightly has added a bunch of compiler and linter warnings. There are also two dependencies that fail compilation on latest nightly due to using the old `stdsimd` feature name. This PR fixes them.

neon_local: improved docs and fix wrong connstr (#6954)

ea0d35f

The user created with the `--create-test-user` flag is `test` instead of `user`. ref #6848 Signed-off-by: Alex Chi Z <chi@neon.tech>

vipvap requested review from a team as code owners March 4, 2024 06:17

vipvap requested review from arssher, khanova and save-buffer and removed request for a team March 4, 2024 06:17

vipvap requested review from problame and mattpodraza and removed request for a team March 4, 2024 06:17

arssher approved these changes Mar 4, 2024

problame requested review from lubennikovaav and removed request for khanova, save-buffer and mattpodraza March 4, 2024 09:45

lubennikovaav approved these changes Mar 4, 2024

problame approved these changes Mar 4, 2024

problame merged commit bb7949b into release Mar 4, 2024
144 of 149 checks passed

problame deleted the rc/2024-03-04 branch March 4, 2024 12:08
