Release 2024-03-26 #7248

vipvap · 2024-03-26T14:24:54Z

Release 2024-03-26

Please merge this Pull Request using 'Create a merge commit' button

## Problem

We recently introduced log file validation for the storage controller. The heartbeater logs a WARN when a heartbeat fails for a node, which makes the test fail.

Closes #7159

## Summary of changes

* Warn only once for each set of heartbeat retries
* Add heartbeat warnings to the log allow list
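The "warn once per set of retries" behavior can be sketched as below. This is a minimal Python illustration, not the storage controller's code: the function name `heartbeat_with_retries`, the `send` callable, and the retry count are all hypothetical.

```python
import logging

logger = logging.getLogger("heartbeater")


def heartbeat_with_retries(send, node_id, max_retries=3):
    """Call send(node_id) up to max_retries times.

    Hypothetical helper: emits at most one WARNING per call, no matter
    how many of the retries fail. Later failures are logged at DEBUG.
    """
    warned = False
    for attempt in range(1, max_retries + 1):
        try:
            return send(node_id)
        except ConnectionError as exc:
            if not warned:
                logger.warning("heartbeat to node %s failed: %s", node_id, exc)
                warned = True
            else:
                logger.debug(
                    "heartbeat retry %d to node %s failed: %s",
                    attempt, node_id, exc,
                )
    # All retries failed; the caller decides how to treat the node.
    return None
```

With this shape, a node that is down for a whole retry cycle produces a single warning line, which keeps an allow list for log validation short.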

- Remove code for using AWS secrets manager, as we're deploying with k8s->env vars instead
- Load each secret independently, so that one can mix CLI args with environment variables, rather than requiring that all secrets are loaded with the same mechanism.
- Add a 'strict mode', enabled by default, which will refuse to start if secrets are not loaded. This avoids the risk of accidentally disabling auth by omitting the public key, for example.
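The per-secret loading and strict mode described above could look roughly like this. It is a Python sketch under assumptions: the helper names, the `MissingSecretError` type, and the rule that a CLI value takes precedence over an environment variable are illustrative and not taken from the actual change.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised in strict mode when a required secret could not be loaded."""


def load_secret(name, cli_value=None, env=None):
    """Resolve one secret on its own: CLI value first, then environment.

    The precedence order is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    return env.get(name) or None


def load_secrets(names, cli_args, env=None, strict=True):
    """Load each named secret independently; refuse to proceed if strict."""
    secrets = {name: load_secret(name, cli_args.get(name), env) for name in names}
    missing = sorted(name for name, value in secrets.items() if value is None)
    if strict and missing:
        raise MissingSecretError(f"secrets not loaded: {', '.join(missing)}")
    return secrets
```

Because each secret is resolved separately, one can come from a CLI argument while another comes from the environment. With `strict=True`, a forgotten public key stops startup instead of silently disabling auth.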

## Problem

Large quantities of ephemeral layer data can lead to excessive memory consumption (#6939). We currently don't have a way to know how much ephemeral layer data is present on a pageserver.

Before we can add new behaviors to proactively roll layers in response to too much ephemeral data, we must calculate that total.

Related: #6916

## Summary of changes

- Create GlobalResources and GlobalResourceUnits types, where timelines carry a GlobalResourceUnits in their TimelineWriterState.
- Periodically update the size in GlobalResourceUnits:
  - During tick()
  - During layer roll
  - During put() if the latest value has drifted more than 10MB since our last update
- Expose the value of the global ephemeral layer bytes counter as a prometheus metric.
- Extend the lifetime of TimelineWriterState:
  - Instead of dropping it in TimelineWriter::drop, let it remain.
  - Drop TimelineWriterState in roll_layer: this drops our guard on the global byte count to reflect the fact that we're freezing the layer.
  - Ensure the validity of the layer in the writer state by clearing the state in the same place we freeze layers, and asserting on the write-ability of the layer in `writer()`
- Add a 'context' parameter to `get_open_layer_action` so that it can skip the prev_lsn==lsn check when called in tick() -- this is needed because now tick is called with a populated state, where prev_lsn==Some(lsn) is true for an idle timeline.
- Extend layer rolling test to use this metric
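The drift-based accounting can be illustrated with a small Python sketch. It is not the pageserver's Rust implementation: `GlobalCounter` and `ResourceUnits` are hypothetical stand-ins for GlobalResources and GlobalResourceUnits, and reading "10MB" as 10 * 1024 * 1024 bytes is an assumption.

```python
import threading

# Assumption: "10MB" is read as 10 MiB for this sketch.
MAX_DRIFT_BYTES = 10 * 1024 * 1024


class GlobalCounter:
    """Process-wide total of ephemeral layer bytes (stand-in for GlobalResources)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._total = 0

    def add(self, delta):
        with self._lock:
            self._total += delta

    def total(self):
        with self._lock:
            return self._total


class ResourceUnits:
    """One timeline's published share of the total (stand-in for GlobalResourceUnits)."""

    def __init__(self, counter):
        self._counter = counter
        self._published = 0

    def publish(self, size):
        # Unconditional update, as on tick() or layer roll.
        self._counter.add(size - self._published)
        self._published = size

    def maybe_publish(self, size):
        # Cheap path for put(): only touch the shared counter on large drift.
        if abs(size - self._published) > MAX_DRIFT_BYTES:
            self.publish(size)
            return True
        return False

    def release(self):
        # Dropping the guard when the layer is frozen returns the bytes.
        self.publish(0)
```

The design choice this shows is that the hot write path only updates shared state when the local value has moved far from what was last published, so the global total is approximate between ticks but cheap to maintain.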

Postgres can always write some more WAL, so previous checks that WAL doesn't change after something had been crafted were wrong; remove them. Add comments here and there.

Should fix #4691.

This test had two flaky failure modes:

- pageserver log error for timeline not found: this resulted from changes for DR when timeline destroy/create was added, but endpoint was left running during that operation.
- storage controller log error because the test was running for long enough that a background reconcile happened at almost the exact moment of test teardown, and our test fixtures tear down the pageservers before the controller.

Closes: #7224

Signed-off-by: availhang <mayangang@outlook.com>

…7234) preliminary refactoring for #7233 part of #7062

…res (#7223)

## Problem

While most forms of split rollback don't interrupt clients, there are a couple of cases that do -- this interruption is brief, driven by the time it takes the controller to kick off Reconcilers during the async abort of the split, so it's operationally fine, but can trip up a test.

- #7148

## Summary of changes

- Relax test check to require that the tenant is eventually available after split failure, rather than immediately. In the vast majority of cases this will pass on the first iteration.
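An "eventually available" check of this kind is usually a bounded retry loop around the original assertion. The following Python sketch shows the idea; the helper name `wait_until`, its parameters, and the injectable `sleep` are assumptions for illustration and are not claimed to match the project's test fixtures.

```python
import time


def wait_until(check, attempts=10, interval=0.5, sleep=time.sleep):
    """Run check() until it stops raising, up to `attempts` times.

    Returns check()'s result on the first success. If every attempt
    fails, raises TimeoutError chained to the last failure.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return check()
        except Exception as exc:  # the check signals "not yet" by raising
            last_error = exc
            if attempt + 1 < attempts:
                sleep(interval)
    raise TimeoutError(f"condition not met after {attempts} attempts") from last_error
```

When the condition already holds, the loop returns on the first iteration without sleeping, so the relaxed check adds no delay in the common case.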

## Problem

- #6966

This test occasionally failed with some layers unexpectedly not present on the secondary pageserver. The cause of that failure is the attached pageserver uploading heatmaps that refer to not-yet-uploaded layers.

## Summary of changes

After uploading the heatmap, drain the upload queue on the attached pageserver, to guarantee that all the layers referenced in the heatmap are uploaded.

## Problem

https://github.com/neondatabase/cloud/issues/11599

## Summary of changes

Reuse the same sess_id for all requests within a single session.

TODO: get rid of `session_id` in query params.

github-actions · 2024-03-26T15:10:30Z

2718 tests run: 2582 passed, 0 failed, 136 skipped (full report)

Flaky tests (1)

Postgres 15

test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

functions: 28.2% (6295 of 22343 functions)
lines: 47.0% (44208 of 94138 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
ad072de at 2024-03-26T15:10:29.479Z

VladLazar and others added 11 commits March 25, 2024 09:38

Try to fix test_crafted_wal_end flakiness.

a6c1fdc


chore: remove repetitive words (#7206)

d837ce0


refactor(remote_timeline_client): infallible stop() and shutdown() (#…

f72415e

…7234) preliminary refactoring for #7233 part of #7062

proxy: reuse sess_id as request_id for the cplane requests (#7245)

6c18109


Revert "pageserver: use a single tokio runtime (#6555)" (#7246)

ad072de

vipvap requested review from a team as code owners March 26, 2024 14:24

vipvap requested review from khanova, problame, shayanh, conradludgate and petuhovskiy and removed request for a team March 26, 2024 14:24

jcsp approved these changes Mar 26, 2024

jcsp merged commit 4e5724d into release Mar 26, 2024
98 of 101 checks passed

jcsp deleted the rc/2024-03-26 branch March 26, 2024 15:17
