Release 2024-05-20 #7807

vipvap · 2024-05-20T06:04:17Z

Release 2024-05-20

Please merge this Pull Request using 'Create a merge commit' button

In timeline detach ancestor tests there is no way to really be sure that there were no subtle off-by one bugs. One such bug is demoed and reverted. Add verifying fullbackup is equal before and after detaching ancestor. Fullbackup is expected to be equal apart from `zenith.signal`, which is known to be good because endpoint can be started without the detached branch receiving writes.

The test has been flaky since 2024-04-11 for unknown reason, and the logging was off. Fix the logging and raise the limit a bit. The problematic ratio reproduces with pg14 and added sleep (not included) but not on pg15. The new ratio abs diff limit works for all inspected examples. Cc: #7536

Per [evidence] the timeline ancestor detach tests can panic while shutting down on vectored get validation. Allow the error because tenant is restarted twice in the test. [evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7708/9058185709/index.html#suites/a1c2be32556270764423c495fad75d47/d444f7e5c0a18ce9

…ethods (#7720) This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386

## Problem Since #6769, the pageserver is intentionally not usable without remote storage: it's purpose is to act as a cache to an object store, rather than as a source of truth in its own right. ## Summary of changes - Make remote storage configuration mandatory: the pageserver will refuse to start if it is not provided. This is a precursor that will make it safe to subsequently remove all the internal Option<>s

See #7718. Fix it by renaming all `types.py` to `common_types.py`. Additionally, add an advert for using `allowed_errors.py` to test any added regex.

## Problem Closes [test_lock_time_tracing](#7691) ## Summary of changes Taking a look at the execution of the same test in logs, it can be concluded that the time we are holding the lock is sometimes not enough(must be above 30s) to cause the second log to be shown by the thread that is creating a timeline. In the [successful execution](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7663/9021247520/index.html#testresult/a21bce8c702b37f0) it can be seen that the log `Operation TimelineCreate on key 5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared lock` was on the edge of being logged, if it was below 30s it would not be shown. ``` 2024-05-09T18:02:32.552093Z WARN request{method=PUT path=/control/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/policy request_id=af7e4a04-d181-4acb-952f-9597c8eba5a8}: Lock on UpdatePolicy was held for 31.001892592s 2024-05-09T18:02:32.552109Z INFO request{method=PUT path=/control/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/policy request_id=af7e4a04-d181-4acb-952f-9597c8eba5a8}: Request handled, status: 200 OK 2024-05-09T18:02:32.552271Z WARN request{method=POST path=/v1/tenant/5e088fc2dd14945020d0fa6d9efd1e36/timeline request_id=d3af756e-dbb3-476b-89bd-3594f19bbb67}: Operation TimelineCreate on key 5e088fc2dd14945020d0fa6d9efd1e36 has waited 30.000887709s for shared lock ``` In the [failed execution](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7663/9022743601/index.html#/testresult/deb90136aeae4fce): ``` 2024-05-09T20:14:33.526311Z INFO request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Creating timeline 68194ffadb61ca11adcbb11cbeb4ec6e/f72185990ed13f0b0533383f81d877af 2024-05-09T20:14:36.441165Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:41.441657Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:41.535227Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: Handling request 2024-05-09T20:14:41.535269Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: handle_validate: 68194ffadb61ca11adcbb11cbeb4ec6e(gen 1): valid=true (latest Some(00000001)) 2024-05-09T20:14:41.535284Z INFO request{method=POST path=/upcall/v1/validate request_id=94a7be88-474e-4163-92f8-57b401473add}: Request handled, status: 200 OK 2024-05-09T20:14:46.441854Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:51.441151Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:14:56.441199Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:15:01.440971Z INFO Heartbeat round complete for 1 nodes, 0 offline 2024-05-09T20:15:03.516320Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: failpoint "tenant-update-policy-exclusive-lock": sleep done 2024-05-09T20:15:03.518474Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Updated scheduling policy to Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518512Z WARN request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Scheduling is disabled by policy Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518540Z WARN request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Lock on UpdatePolicy was held for 31.003712703s 2024-05-09T20:15:03.518570Z INFO request{method=PUT path=/control/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/policy request_id=0edfdb5b-2b05-486b-9879-d83f234d2f0d}: Request handled, status: 200 OK 2024-05-09T20:15:03.518804Z WARN request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Scheduling is disabled by policy Stop tenant_id=68194ffadb61ca11adcbb11cbeb4ec6e shard_id=0000 2024-05-09T20:15:03.518815Z INFO request{method=POST path=/v1/tenant/68194ffadb61ca11adcbb11cbeb4ec6e/timeline request_id=1daa8c31-522d-4805-9114-68cdcffb9823}: Creating timeline on shard 68194ffadb61ca11adcbb11cbeb4ec6e/f72185990ed13f0b0533383f81d877af, attached to node 1 (localhost) ``` we can see that the difference between starting to create timeline `2024-05-09T20:14:33.526311Z` and creating timeline `2024-05-09T20:15:03.518815Z` is not above 30s and will not cause any logs to appear. The proposed solution is to prolong how long we will pause to ensure that the thread that creates the timeline waits above 30s.

## Problem "John pointed out that the switch to protocol version 2 made test_gc_aggressive test flaky: #7692. I tracked it down, and that is indeed an issue. Conditions for hitting the issue: The problem occurs in the primary GC horizon is set to a very low value, e.g. 0. If the primary is actively writing WAL, and GC runs in the pageserver at the same time that the primary sends a GetPage request, it's possible that the GC advances the GC horizon past the GetPage request's LSN. I'm working on a fix here: #7708." - Heikki ## Summary of changes Use protocol version 1 as default.

…ethods (#7725) This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386 Co-authored-by: Joonas Koivunen <joonas@neon.tech>

…7678) ## Problem `benchmarks` job that we run on the main doesn't block anything, so it's easy to miss its failure. Ref neondatabase/cloud#13087 ## Summary of changes - Add `report-benchmarks-failures` job that report failures of `benchmarks` job to a Slack channel

) The background task loop permit metrics do two of `with_label_values` very often. Change the codepath to cache the counters on first access into a `Lazy` with `enum_map::EnumMap`. The expectation is that this should not fix for metric collection failures under load, but it doesn't hurt. Cc: #7161

ref #7443 ## Summary of changes This pull request adds a size estimator for aux files. Each timeline stores a cached `isize` for the estimated total size of aux files. It gets reset on basebackup, and gets updated for each aux file modification. TODO: print a warning when it exceeds the size. The size metrics is not accurate. Race between `on_basebackup` and other functions could create a negative basebackup size, but the chance is rare. Anyways, this does not impose any extra I/Os to the storage as everything is computed in-memory. The aux files are only stored on shard 0. As basebackups are only generated on shard 0, only shard 0 will report this metrics. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem In #7531, I had a test flaky because the GC API endpoint fails if the tenant happens not to be active yet. ## Summary of changes While adding that wait for the tenant to be active, I noticed that this endpoint is kind of strange (spawns a TaskManager task) and has a comment `// TODO: spawning is redundant now, need to hold the gate`, so this PR cleans it up to just run the GC inline while holding a gate. The GC code is updated to avoid assuming it runs inside a task manager task. Avoiding checking the task_mgr cancellation token is safe, because our timeline shutdown always cancels Timeline::cancel.

) ## Problem When layer paths include generations, the lsn parsing does not work and `pagectl` errors out. ## Summary of changes If the last "word" of the layer path contains 8 characters, discard it for the purpose of lsn parsing.

) ## Problem Secondary downloads are a low priority task, and intentionally do not try to max out download speeds. This is almost always fine when they are used through the life of a tenant shard as a continuous "trickle" of background downloads. However, there are sometimes circumstances where we would like to populate a secondary location as fast as we can, within the constraint that we don't want to impact the activity of attached tenants: - During node removal, where we will need to create replacements for secondary locations on the node being removed - After a shard split, we need new secondary locations for the new shards to populate before the shards can be migrated to their final location. ## Summary of changes - Add an activity() function to the remote storage interface, enabling callers to query how busy the remote storage backend is - In the secondary download code, use a very modest amount of concurrency, driven by the remote storage's state: we only use concurrency if the remote storage semaphore is 75% free, and scale the amount of concurrency used within that range. This is not a super clever form of prioritization, but it should accomplish the key goals: - Enable secondary downloads to happen faster when the system is idle - Make secondary downloads a much lower priority than attached tenants when the remote storage is busy. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>

…7741) Use the old compute node for compat tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

In general, tiered compaction is splitting delta layers along the key dimension, but this can only continue until a single key is reached: if the changes from a single key don't fit into one layer file, we used to create layer files of unbounded sizes. This patch implements the method listed as TODO/FIXME in the source code. It does the following things: * Make `accum_key_values` take the target size and if one key's modifications exceed it, make it fill `partition_lsns`, a vector of lsns to use for partitioning. * Have `retile_deltas` use that `partition_lsns` to create delta layers separated by lsn. * Adjust the `test_many_updates_for_single_key` to allow layer files below 0.5 the target size. This situation can create arbitarily small layer files: The amount of data is arbitrary that sits between having just cut a new delta, and then stumbling upon the key that needs to be split along lsn. This data will end up in a dedicated layer and it can be arbitrarily small. * Ignore single-key delta layers for depth calculation: in theory we might have only single-key delta layers in a tier, and this might confuse depth calculation as well, but this should be unlikely. Fixes #7243 Part of #7554 --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

We had a lot of code that passed around the two LSNs that are associated with each GetPage request. Introduce a new struct to encapsulate them. I'm about to add a third LSN to the struct in the next commit, this is a mechanical refactoring in preparation for that.

The new protocol version supports sending two LSNs to the pageserver: request LSN and a "not_modified_since" hint. A primary always wants to read the latest version of each page, so having two values was not strictly necessary, and the old protocol worked fine with just the "not_modified_since" LSN and a flag to request the latest page version. Nevertheless, it seemed like a good idea to set the request LSN to the current insert/flush LSN, because that's logically the page version that the primary wants to read. However, that made the test_gc_aggressive test case flaky. When the primary requests a page with the last inserted or flushed LSN, it's possible that by the time that the pageserver processes the request, more WAL has been generated by other processes in the compute and already digested by the pageserver. Furthermore, if the PITR horizon in the pageserver is set to 0, and GC runs during that window, it's possible that the GC horizon has advances past the request LSN, before the pageserver processes the request. It is still correct to send the latest page version in that case, because the compute either has the page locked so the it cannot have been modified in the primary, or if it's a prefetch request, and we will validate the LSNs when the prefetch response is processed and discard it if the page has been modified. But the pageserver doesn't know that and rightly complains. To fix, modify the compute so that the primary always uses Lsn::MAX in the requests. This reverts the primary's behavior to how the protocol version 1 worked. In protocol version 1, there was only one LSN, the "not_modified_since" hint, and a flag was set to read the latest page version, whatever that might be. Requests from computes that are still using protocol version 1 were already mapped to Lsn::MAX in the pageserver, now we do the same with protocol version 2 for primary's requests. (I'm a bit sad about losing the information in the pageserver, what the last LSN was at the time that the request wa made. We never had it with protocol version 1, but I wanted to make it available for debugging purposes.) Add another field, 'effective_request_lsn', to track what the flush LSN was when the request was made. It's not sent to the pageserver, Lsn::MAX is now used as the request LSN, but it's still needed internally in the compute to track the validity of prefetch requests. Fixes issue #7692

## Problem Currently we do a large number of heatmap uploads for tiny tenants. "tiny" in this context is defined as being less than a single layer in size. These uploads are triggered by atime changes rather than changes in the set of layers. Uploading heatmaps for atime changes on small tenants isn't useful, because even without bumping these atimes, disk usage eviction still avoids evicting the largest resident layer of a tenant, which in practice keeps tiny/empty tenants mostly resident irrespective of atimes. ## Summary of changes - For tenants smaller than one checkpoint interval, only upload heatmap if the set of layers has changed, not if only the atimes have changed. - Include the heatmap period in the uploaded heatmap, as a precursor to implementing #6200 (auto-adjusting download intervals to match upload intervals)

We recently added support for local layer paths that contain a generation number: - #7609 - #7640 Now that we've cut a [release](#7735) that includes those changes, we can proceed to enable writing the new format without breaking forward compatibility.

## Problem `report-benchmarks-failures` job is triggered for any failure in the CI pipeline, but we need it to be triggered only for failed `benchmarks` job ## Summary of changes - replace `failure()` with `needs.benchmarks.result == 'failure'` in the condition

A test for #7684. This pull request checks if the pageserver version we specified is the one actually running by comparing the git hash in forward compatibility tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem Shards with number >0 could hang waiting for `await_initial_logical_size`, as we don't calculate logical size on these shards. This causes them to hold onto semaphore units and starve other tenants out from proceeding with warmup activation. That doesn't hurt availability (we still have on-demand activation), but it does mean that some background tasks like consumption metrics would omit some tenants. ## Summary of changes - Skip waiting for logical size calculation on shards >0 - Upgrade unexpected code paths to use debug_assert!(), which acts as an implicit regression test for this issue, and make the info() one into a warn()

The [2.31.0 release](https://github.com/rui314/mold/releases/tag/v2.31.0) of mold includes a 10% speed improvement for binaries with a lot of debug info. As we have such, it might be useful to update mold to the latest release. The jump is from 2.4.0 to 2.31.0, but it's not been many releases in between as the version number was raised by the mold maintainers to 2.30.0 after 2.4.1 [to avoid confusion for some tools](https://github.com/rui314/mold/releases/tag/v2.30.0).

…#7747) This PR allows setting the `PAGESERVER_DEFAULT_TENANT_CONFIG_COMPACTION_ALGORITHM` env var to override the `tenant_config.compaction_algorithm` field in the initial `pageserver.toml` for all tests. I tested manually that this works by halting a test using pdb and inspecting the `effective_config` in the tenant status managment API. If the env var is set, the tests are parametrized by the `kind` tag field, allowing to do a matrix build in CI and let Allure summarize everything in a nice report. If the env var is not set, the tests are not parametrized. So, merging this PR doesn't cause problems for flaky test detection. In fact, it doesn't cause any runtime change if the env var is not set. There are some tests in the test suite that set used to override the entire tenant_config using `NeonEnvBuilder.pageserver_config_override`. Since config overrides are merged non-recursively, such overrides that don't specify `kind = ` cause a fallback to pageserver's built-in `DEFAULT_COMPACTION_ALGORITHM`. Such cases can be found using ``` ["']tenant_config\s*[='"] ``` We'll deal with these tests in a future PR. closes #7555

…#7740) ## Problem There are not enough arm runners and jobs in `neon-extra-builds` workflow take about the same amount of time on a small-arm runner as on large-arm. ## Summary of changes - Switch `neon-extra-builds` workflow from `large-arm64` to `small-arm64` runners

…file refs (#7752) ## Problem This is historical baggage from when the pageserver could be run with local disk only: we had a bunch of places where we had to treat remote storage as optional. Closes: #6890 ## Changes - Remove Option<> around remote storage (in #7722 we made remote storage clearly mandatory) - Remove code for deleting old metadata files: they're all gone now. - Remove other references to metadata files when loading directories, as none exist. I checked last 14 days of logs for "found legacy metadata", there are no instances.

They have merged our PR nical/rust_debug#4 but they haven't released a new crate version yet. refs #7763

@problame

pointed out by @problame : we use the literal 8192 instead of a properly defined constant. replace the literal by a PAGE_SZ constant.

As of #6202 we support `AWS_PROFILE` as well, which is more convenient. Change the docs to using it instead of `SSO_ACCOUNT_ID`. Also, remove `SSO_ACCOUNT_ID` from BucketConfig as it is confusing to the code's reader: it's not the "main" way of setting up authentication for the scrubber any more. It is a breaking change for the on-disk format as we persist `sso_account_id` to disk, but it was quite inconsistent with the other methods which are not persistet. Also, I don't think we want to support the case where one version writes the json and another version reads it. Related: #7667

Tiered compaction employs two sliding windows over the keyspace: `KeyspaceWindow` for the image layer generation and `Window` for the delta layer generation. Do some fixes to both windows: * The distinction between the two windows is not very clear. Do the absolute minimum to mention where they are used in the rustdoc description of the struct. Maybe we should rename them (say `WindowForImage` and `WindowForDelta`) or merge them into one window implementation. * Require the keys to strictly increase. The `accum_key_values` already combines the key, so there is no logic needed in `Window::feed` for the same key repeating. This is a follow-up to address the request in #7671 (review) * In `choose_next_delta`, we claimed in the comment to use 1.25 as the factor but it was 1.66 instead. Fix this discrepancy by using `*5/4` as the two operations.

…propriate (#7771) Before this PR, the changed tests would overwrite the entire `tenant_config` because `pageserver_config_override` is merged non-recursively into the `ps_cfg`. This meant they would override the `PAGESERVER_DEFAULT_TENANT_CONFIG_COMPACTION_ALGORITHM`, impacting our matrix build for `compaction_algorithm=Tiered|Legacy` in #7748. I found the tests fixed in this PR using the `NEON_PAGESERVER_PANIC_ON_UNSPECIFIED_COMPACTION_ALGORITHM` env var that I added in #7748. Therefore, I think this is an exhaustive fix. This is better than just searching the code base for `tenant_config`, which is what I had sketched in #7747. refs #7749

by having 100 copy operations in flight twe climb up to 2500 requests per min or 41/s. This is still probably less than is allowed, but fast enough for our purposes.

## Problem - When a layer with legacy local path format is evicted and then re-downloaded, a panic happened because the path downloaded by remote storage didn't match the path stored in Layer. - While investigating, I also realized that secondary locations would have a similar issue with evictions. Closes: #7783 ## Summary of changes - Make remote timeline client take local paths as an input: it should not have its own ideas about local paths, instead it just uses the layer path that the Layer has. - Make secondary state store an explicit local path, populated on scan of local disk at startup. This provides the same behavior as for Layer, that our local_layer_path is a _default_, but the layer path can actually be anything (e.g. an old style one). - Add tests for both cases.

## Problem The storage controller generally assumes that things like updating generation numbers are atomic: it should use a strict isolation level. ## Summary of changes - Wrap all database operations in a SERIALIZABLE transaction. - Retry serialization failures, as these do not indicate problems and are normal when plenty of concurrent work is happening. Using this isolation level for all reads is overkill, but much simpler than reasoning about it on a per-operation basis, and does not hurt performance. Tested this with a modified version of storage_controller_many_tenants test with 128k shards, to check that our performance is still fine: it is.

## Problem Currently tenants are only split into multiple shards if a human being calls the API to do it. Issue: #7388 ## Summary of changes - Add a pageserver API for returning the top tenants by size - Add a step to the controller's background loop where if there is no reconciliation or optimization to be done, it looks for things to split. - Add a test that runs pgbench on many tenants concurrently, and checks that splitting happens as expected as tenants grow, without interrupting the client I/O. This PR is quite basic: there is a tasklist in #7388 for further work. This PR is meant to be safe (off by default), and sufficient to enable our staging environment to run lots of sharded tenants without a human having to set them up.

Part of #7462 ## Summary of changes Tenant config is not persisted unless it's attached on the storage controller. In this pull request, we persist the aux file policy flag in the `index_part.json`. Admins can set `switch_aux_file_policy` in the storage controller or using the page server API. Upon the first aux file gets written, the write path will compare the aux file policy target with the current policy. If it is switch-able, we will do the switch. Otherwise, the original policy will be used. The test cases show what the admins can do / cannot do. The `last_aux_file_policy` is stored in `IndexPart`. Updates to the persisted policy are done via `schedule_index_upload_for_aux_file_policy_update`. On the write path, the writer will update the field. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>

part of #7462 ## Summary of changes This pull request adds two APIs to the pageserver management API: list_aux_files and ingest_aux_files. The aux file pagebench is intended to be used on an empty timeline because the data do not go through the safekeeper. LSNs are advanced by 8 for each ingestion, to avoid invariant checks inside the pageserver. For now, I only care about space amplification / read amplification, so the bench is designed in a very simple way: ingest 10000 files, and I will manually dump the layer map to analyze. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

The comment says that this checks if there's enough space on the page for logical message *and* an XLOG_SWITCH. So the sizes of the logical message and the XLOG_SWITCH record should be added together, not subtracted. I saw a panic in the test that led me to investigate and notice this (https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7803/9142396223/index.html): RuntimeError: Run ['/tmp/neon/bin/wal_craft', 'in-existing', 'last_wal_record_xlog_switch_ends_on_page_boundary', "host=localhost port=16165 user=cloud_admin dbname=postgres options='-cstatement_timeout=120s '"] failed: stdout: stderr: thread 'main' panicked at libs/postgres_ffi/wal_craft/src/lib.rs:370:27: attempt to subtract with overflow stack backtrace: 0: rust_begin_unwind at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5 1: core::panicking::panic_fmt at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14 2: core::panicking::panic at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:145:5 3: <wal_craft::LastWalRecordXlogSwitchEndsOnPageBoundary as wal_craft::Crafter>::craft::<postgres::client::Client> at libs/postgres_ffi/wal_craft/src/lib.rs:370:27 4: wal_craft::main::{closure#0} at libs/postgres_ffi/wal_craft/src/bin/wal_craft.rs:21:17 5: wal_craft::main at libs/postgres_ffi/wal_craft/src/bin/wal_craft.rs:66:47 6: <fn() -> core::result::Result<(), anyhow::Error> as core::ops::function::FnOnce<()>>::call_once at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5 note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

## Summary of changes Updates the parquet lib. one change left that we need is in an open PR against upstream, hopefully we can remove the git dependency by 52.0.0 apache/arrow-rs#5773 I'm not sure why the parquet files got a little bit bigger. I tested them and they still open fine. 🤷 side effect of the update, chrono updated and added yet another deprecation warning (hence why the safekeepers change)

github-actions · 2024-05-20T06:52:16Z

3078 tests run: 2951 passed, 0 failed, 127 skipped (full report)

Code coverage* (full report)

functions: 31.3% (6383 of 20401 functions)
lines: 47.7% (48647 of 101902 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
a5ecca9 at 2024-05-20T06:52:15.515Z :recycle:}

danieltprice · 2024-05-23T22:27:07Z

Reviewed for changelog.

koivunej and others added 30 commits May 13, 2024 10:58

chore(pageserver): plumb through RequestContext to VirtualFile read m…

b58a615

…ethods (#7720) This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386

python: allow using allowed_errors.py (#7719)

d9dcbff

See #7718. Fix it by renaming all `types.py` to `common_types.py`. Additionally, add an advert for using `allowed_errors.py` to test any added regex.

chore(pageserver): plumb through RequestContext to VirtualFile open m…

6ff7429

…ethods (#7725) This PR introduces no functional changes. The `open()` path will be done separately. refs #6107 refs #7386 Co-authored-by: Joonas Koivunen <joonas@neon.tech>

fix(test): ensure compatibility test uses the correct compute node (#…

9ffb852

…7741) Use the old compute node for compat tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

chore(test): add version check for forward compat test (#7685)

30d15ad

A test for #7684. This pull request checks if the pageserver version we specified is the one actually running by comparing the git hash in forward compatibility tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

Add restart_lsn metric for logical slots.

1a2a3cb

chore(deps): use upstream svg_fmt after they merged our PR (#7764)

bc78b0e

They have merged our PR nical/rust_debug#4 but they haven't released a new crate version yet. refs #7763

arpad-m and others added 12 commits May 16, 2024 16:48

tiered compaction: introduce PAGE_SZ constant and use it (#7785)

ec069dc

pointed out by @problame : we use the literal 8192 instead of a properly defined constant. replace the literal by a PAGE_SZ constant.

chore: update defaults for timeline_detach_ancestor (#7779)

c1390bf

by having 100 copy operations in flight twe climb up to 2500 requests per min or 41/s. This is still probably less than is allowed, but fast enough for our purposes.

vipvap requested review from a team as code owners May 20, 2024 06:04

vipvap requested review from arssher, conradludgate, hlinnaka and nikitakalyanov and removed request for a team May 20, 2024 06:04

petuhovskiy approved these changes May 20, 2024

View reviewed changes

petuhovskiy merged commit fc538a3 into release May 20, 2024
119 of 123 checks passed

petuhovskiy deleted the rc/2024-05-20 branch May 20, 2024 11:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release 2024-05-20 #7807

Release 2024-05-20 #7807

vipvap commented May 20, 2024

github-actions bot commented May 20, 2024

danieltprice commented May 23, 2024

Release 2024-05-20 #7807

Release 2024-05-20 #7807

Conversation

vipvap commented May 20, 2024