Release 2023-09-12 #5283

github-actions · 2023-09-12T07:04:53Z

No description provided.

## Problem We want to convert the `VirtualFile` APIs to async fn so that we can adopt one of the async I/O solutions. ## Summary of changes Convert the following APIs of `VirtualFile` to async fn (as well as all of the APIs calling it): * `VirtualFile::seek` * `VirtualFile::metadata` * Also, prepare for deletion of the write impl by writing the summary to a buffer before writing it to disk, as suggested in #4743 (comment) . This change adds an additional warning for the case when the summary exceeds a block. Previously, we'd have silently corrupted data in this (unlikely) case. * `WriteBlobWriter::write_blob`, in preparation for making `VirtualFile::write_all` async.

## Problem When a remote custom extension build fails, it looks a bit confusing on neon CI: - `trigger-custom-extensions-build` is green - `wait-for-extensions-build` is red - `build-and-upload-extensions` is red But to restart the build (to get everything green), you need to restart the only passed `trigger-custom-extensions-build`. ## Summary of changes - Merge `trigger-custom-extensions-build` and `wait-for-extensions-build` jobs into `trigger-custom-extensions-build-and-wait`

Fix issue where updating the size of the Local File Cache could lead to invalid reads: ## Problem LFC cache can get re-enabled when lfc_max_size is set, e.g. through an autoscaler configuration, or PostgreSQL not liking us setting the variable. 1. initialize: LFC enabled (lfc_size_limit > 0; lfc_desc = 0) 2. Open LFC file fails, lfc_desc = -1. lfc_size_limit is set to 0; 3. lfc_size_limit is updated by autoscaling to >0 4. read() now thinks LFC is enabled (size_limit > 0) and lfc_desc is valid, but doesn't try to read from the invalid file handle and thus doesn't update the buffer content with the page's data, but does think the data was read... Any buffer we try to read from local file cache is essentially uninitialized memory. Those are likely 0-bytes, but might potentially be any old buffer that was previously read from or flushed to disk. Fix this by adding a more definitive disable flag, plus better invalid state handling.

## Problem We want to display coverage information for each PR. - an example of a full coverage report: https://neon-github-public-dev.s3.amazonaws.com/code-coverage/abea64800fb390c32a3efe6795d53d8621115c83/lcov/index.html - an example of GitHub auto-comment with coverage information: #4999 (comment) ## Summary of changes - Use patched[*](neondatabase/lcov@426e7e7) lcov to generate coverage report - Upload HTML coverage report to S3 - `scripts/comment-test-report.js`: add coverage information

## Problem `approved-for-ci-run` label logic doesn't work as expected: - #4722 (comment) - #4722 (comment) Continuation of #4663 Closes #2222 (hopefully) ## Summary of changes - Create a twin PR automatically - Allow `GITHUB_TOKEN` to manipulate with labels

## Problem Continuation of #4663, #5210 We're still getting an error: ``` GraphQL: Resource not accessible by integration (removeLabelsFromLabelable) ``` ## Summary of changes - trigger `approved-for-ci-run.yml` workflow on `pull_request_target` instead of `pull_request`

…5188) Fixes #3830 by adding the `#[cfg(not(feature = "testing"))]` attribute to unnecessary loggings in `pageserver/src/tenant/tasks.rs`. Co-authored-by: Joonas Koivunen <joonas@neon.tech>

## Problem It's hard to find out which DB size we use for OLAP benchmarks (TPC-H in particular). This PR adds handling of `TEST_OLAP_SCALE` env var, which is get added to a test name as a parameter. This is required for performing larger periodic benchmarks. ## Summary of changes - Handle `TEST_OLAP_SCALE` in `test_runner/performance/test_perf_olap.py` - Set `TEST_OLAP_SCALE` in `.github/workflows/benchmarking.yml` to a TPC-H scale

## Problem - Scrubber's `tidy` command requires presence of a control plane - Scrubber has no tests at all ## Summary of changes - Add re-usable async streams for reading metadata from a bucket - Add a `scan-metadata` command that reads from those streams and calls existing `checks.rs` code to validate metadata, then returns a summary struct for the bucket. Command returns nonzero status if errors are found. - Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that tests using remote storage can request to have the scrubber run after they finish - Enable remote storarge and scrub_on_exit in test_pageserver_restart and test_pageserver_chaos This is a "toe in the water" of the overall space of validating the scrubber. Later, we should: - Enable scrubbing at end of tests using remote storage by default - Make the success condition stricter than "no errors": tests should declare what tenants+timelines they expect to see in the bucket (or sniff these from the functions tests use to create them) and we should require that the scrubber reports on these particular tenants/timelines. The `tidy` command is untouched in this PR, but it should be refactored later to use similar async streaming interface instead of the current batch-reading approach (the streams are faster with large buckets), and to also be covered by some tests. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Conrad Ludgate <conrad@neon.tech>

## Problem Pull Requests created by GitHub Actions bot doesn't have access to secrets, so we need to use our bot for it to auto-trigger a tests run See previous PRs #4663, #5210, #5212 ## Summary of changes - Use our bot to create PRs

) ## Problem - #5050 Closes: #5136 ## Summary of changes - A new configuration property `control_plane_api` controls other functionality in this PR: if it is unset (default) then everything still works as it does today. - If `control_plane_api` is set, then on startup we call out to control plane `/re-attach` endpoint to discover our attachments and their generations. If an attachment is missing from the response we implicitly detach the tenant. - Calls to pageserver `/attach` API may include a `generation` parameter. If `control_plane_api` is set, then this parameter is mandatory. - RemoteTimelineClient's loading of index_part.json is generation-aware, and will try to load the index_part with the most recent generation <= its own generation. - The `neon_local` testing environment now includes a new binary `attachment_service` which implements the endpoints that the pageserver requires to operate. This is on by default if running `cargo neon` by hand. In `test_runner/` tests, it is off by default: existing tests continue to run with in the legacy generation-less mode. Caveats: - The re-attachment during startup assumes that we are only re-attaching tenants that have previously been attached, and not totally new tenants -- this relies on the control plane's attachment logic to keep retrying so that we should eventually see the attach API call. That's important because the `/re-attach` API doesn't tell us which timelines we should attach -- we still use local disk state for that. Ref: #5173 - Testing: generations are only enabled for one integration test right now (test_pageserver_restart), as a smoke test that all the machinery basically works. Writing fuller tests that stress tenant migration will come later, and involve extending our test fixtures to deal with multiple pageservers. - I'm not in love with "attachment_service" as a name for the neon_local component, but it's not very important because we can easily rename these test bits whenever we want. - Limited observability when in re-attach on startup: when I add generation validation for deletions in a later PR, I want to wrap up the control plane API calls in some small client class that will expose metrics for things like errors calling the control plane API, which will act as a strong red signal that something is not right. Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>

## Problem `CI_ACCESS_TOKEN` has quite limited access (which is good), but this doesn't allow it to remove labels from PRs (which is bad) ## Summary of changes - Use `GITHUB_TOKEN` to remove labels - Use `CI_ACCESS_TOKEN` to create PRs

… async (#5203) ## Problem We want to convert the `VirtualFile` APIs to async fn so that we can adopt one of the async I/O solutions. ## Summary of changes This PR is a follow-up of #5189, #5190, and #5195, and does the following: * Move the used `Write` trait functions of `VirtualFile` into inherent functions * Add optional buffering to `WriteBlobWriter`. The buffer is discarded on drop, similarly to how tokio's `BufWriter` does it: drop is neither async nor does it support errors. * Remove the generics by `Write` impl of `WriteBlobWriter`, alwaays using `VirtualFile` * Rename `WriteBlobWriter` to `BlobWriter` * Make various functions in the write path async, like `VirtualFile::{write,write_all}`. Part of #4743.

## Problem When the next release is coming, we want to let everyone know about it by posting a message to the Slack channel with a list of commits. ## Summary of changes - `.github/workflows/release-notify.yml` is added - the workflow sends a message to `vars.SLACK_UPCOMING_RELEASE_CHANNEL_ID` (or [#test-release-notifications](https://neondb.slack.com/archives/C05QQ9J1BRC) if not configured) - On each PR update, the workflow updates the list of commits in the message (it doesn't send additional messages)

## Problem - `SCALE: unbound variable` from #5079 - The layout of the GitHub auto-comment is broken if the code coverage section follows flaky test section from #4999 ## Summary of changes - `benchmarking.yml`: Rename `SCALE` to `TEST_OLAP_SCALE` - `comment-test-report.js`: Add an extra new-line before Code coverage section

Fixes #3894 by: - Refactor the pageserver router creation flow - Create the router state in `pageserver/src/bin/pageserver.rs`

On a clean system `lsof` needs to be installed. Add it to the list just to keep things nice and copy-pasteable (except for poetry).

With #5181, the generics for `FileBlockReader` have been removed, so having a `Virtual` postfix makes less sense now.

It was easy to interpret comment in the page cache initialization code to be about justifying why we leak here at all, not just why this specific type of leaking is done (which the comment was actually meant to describe). See #5125 (comment) --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>

… endpoint (#5196) Add a `walreceiver_state` field to `TimelineInfo` (response of `GET /v1/tenant/:tenant_id/timeline/:timeline_id`) and while doing that, refactor out a common `Timeline::walreceiver_state(..)`. No OpenAPI changes, because this is an internal debugging addition. Fixes #3115. Co-authored-by: Joonas Koivunen <joonas.koivunen@gmail.com>

## Problem We've got `approved-for-ci-run` to work 🎉 But it's still a bit rough, this PR should improve the UX for external contributors. ## Summary of changes - `build_and_test.yml`: add `check-permissions` job, which fails if PR is created from a fork. Make all jobs in the workflow to be dependant on `check-permission` to fail fast - `approved-for-ci-run.yml`: add `cleanup` job to close `ci-run/pr-*` PRs and delete linked branches when the parent PR is closed - `approved-for-ci-run.yml`: fix the layout for the `ci-run/pr-*` PR description - GitHub Autocomment: add a comment with tests result to the original PR (instead of a PR from `ci-run/pr-*` )

## Problem We likely need this to support Postgres 16 It's also been asked by a user #5042 The latest version is 3.2.0, but it requires some changes in the build script (which I haven't checked, but it didn't work right away) ## Summary of changes ``` 3.1.8 2023-08-01 - force v8 to compile in release mode 3.1.7 2023-06-26 - fix byteoffset issue with arraybuffers - support postgres 16 beta 3.1.6 2023-04-08 - fix crash issue on fetch apply - fix interrupt issue ``` From https://github.com/plv8/plv8/blob/v3.1.8/Changes

## Problem Fix pg_config version parsing ## Summary of changes Use regex to capture major version of postgres #5146

It's a good idea to keep up-to-date in general. One noteworthy change is that PostGIS 3.3.3 adds support for PostgreSQL v16. We'll need that. PostGIS 3.4.0 has already been released, and we should consider upgrading to that. However, it's a major upgrade and requires running "SELECT postgis_extensions_upgrade();" in each database, to upgrade the catalogs. I don't want to deal with that right now.

This includes v16 support.

…} async fn (#5224) ## Problem Once we use async file system APIs for `VirtualFile`, these functions will also need to be async fn. ## Summary of changes Makes the functions `open, open_with_options, create,sync_all,with_file` of `VirtualFile` async fn, including all functions that call it. Like in the prior PRs, the actual I/O operations are not using async APIs yet, as per request in the #4743 epic. We switch towards not using `VirtualFile` in the par_fsync module, hopefully this is only temporary until we can actually do fully async I/O in `VirtualFile`. This might cause us to exhaust fd limits in the tests, but it should only be an issue for the local developer as we have high ulimits in prod. This PR is a follow-up of #5189, #5190, #5195, and #5203. Part of #4743.

It makes the logs too verbose. ref https://neondb.slack.com/archives/C03F5SM1N02/p1694281232874719?thread_ts=1694272777.207109&cid=C03F5SM1N02

Some VMs, when already scaled up as much as possible, end up spamming the autoscaler-agent with upscale requests that will never be fulfilled. If postgres is using memory greater than the cgroup's memory.high, it can emit new memory.high events 1000 times per second, which... just means unnecessary load on the rest of the system. This changes the vm-monitor so that we skip sending upscale requests if we already sent one within the last second, to avoid spamming the autoscaler-agent. This matches previous behavior that the vm-informant hand.

## Problem Add a CI pipeline that checks GitHub Workflows with https://github.com/rhysd/actionlint (it uses `shellcheck` for shell scripts in steps) To run it locally: `SHELLCHECK_OPTS=--exclude=SC2046,SC2086 actionlint` ## Summary of changes - Add `.github/workflows/actionlint.yml` - Fix actionlint warnings

## Problem When PR `ci-run/pr-*` is created the GitHub Autocomment with test results are supposed to be posted to the original PR, currently, this doesn't work. I created this PR from a personal fork to debug and fix the issue. ## Summary of changes - `scripts/comment-test-report.js`: use `pull_request.head` instead of `pull_request.base` 🤦

@vipvap

## Problem Another thing I overlooked regarding'approved-for-ci-run`: - When we create a PR, the action is associated with @vipvap and this triggers the pipeline — this is good. - When we update the PR by force-pushing to the branch, the action is associated with @github-actions, which doesn't trigger a pipeline — this is bad. Initially spotted in #5239 / #5211 ([link](https://github.com/neondatabase/neon/actions/runs/6122249456/job/16633919558?pr=5239)) — `check-permissions` should not fail. ## Summary of changes - Use `CI_ACCESS_TOKEN` to check out the repo (I expect this token will be reused in the following `git push`)

## Problem Detaching a tenant can involve many thousands of local filesystem metadata writes, but the control plane would benefit from us not blocking detach/delete responses on these. ## Summary of changes After rename of local tenant directory ack tenant detach and delete tenant directory in background #5183 --------- Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>

Assorted flakyness fixes from #5198, might not be flaky on `main`. Migrate some tests using neon_simple_env to just neon_env_builder and using initial_tenant to make flakyness understanding easier. (Did not understand the flakyness of `test_timeline_create_break_after_uninit_mark`.) `test_download_remote_layers_api` is flaky because we have no atomic "wait for WAL, checkpoint, wait for upload and do not receive any more WAL". `test_tenant_size` fixes are just boilerplate which should had always existed; we should wait for the tenant to be active. similarly for `test_timeline_delete`. `test_timeline_size_post_checkpoint` fails often for me with reading zero from metrics. Give it a few attempts.

…ive operations (#5273) Introduce the `StorageIoOperation` enum, `StorageIoTime` struct, and `STORAGE_IO_TIME_METRIC` static which provides lockless access to histograms consumed by `VirtualFile`. Closes #5131 Co-authored-by: Joonas Koivunen <joonas@neon.tech>

## Problem For #4743, we want to convert everything up to the actual I/O operations of `VirtualFile` to `async fn`. ## Summary of changes This PR is the last change in a series of changes to `VirtualFile`: #5189, #5190, #5195, #5203, and #5224. It does the last preparations before the I/O operations are actually made async. We are doing the following things: * First, we change the locks for the file descriptor cache to tokio's locks that support Send. This is important when one wants to hold locks across await points (which we want to do), otherwise the Future won't be Send. Also, one shouldn't generally block in async code as executors don't like that. * Due to the lock change, we now take an approach for the `VirtualFile` destructors similar to the one proposed by #5122 for the page cache, to use `try_write`. Similarly to the situation in the linked PR, one can make an argument that if we are in the destructor and the slot has not been reused yet, we are the only user accessing the slot due to owning the lock mutably. It is still possible that we are not obtaining the lock, but the only cause for that is the clock algorithm touching the slot, which should be quite an unlikely occurence. For the instance of `try_write` failing, we spawn an async task to destroy the lock. As just argued however, most of the time the code path where we spawn the task should not be visited. * Lastly, we split `with_file` into a macro part, and a function part that contains most of the logic. The function part returns a lock object, that the macro uses. The macro exists to perform the operation in a more compact fashion, saving code from putting the lock into a variable and then doing the operation while measuring the time to run it. We take the locks approach because Rust has no support for async closures. One can make normal closures return a future, but that approach gets into lifetime issues the moment you want to pass data to these closures via parameters that has a lifetime (captures work). For details, see [this](https://smallcultfollowing.com/babysteps/blog/2023/03/29/thoughts-on-async-closures/) and [this](https://users.rust-lang.org/t/function-that-takes-an-async-closure/61663) link. In #5224, we ran into a similar problem with the `test_files` function, and we ended up passing the path and the `OpenOptions` by-value instead of by-ref, at the expense of a few extra copies. This can be done as the data is cheaply copyable, and we are in test code. But here, we are not, and while `File::try_clone` exists, it [issues system calls internally](https://github.com/rust-lang/rust/blob/1e746d7741d44551e9378daf13b8797322aa0b74/library/std/src/os/fd/owned.rs#L94-L111). Also, it would allocate an entirely new file descriptor, something that the fd cache was built to prevent. * We change the `STORAGE_IO_TIME` metrics to support async. Part of #4743.

## Problem Previously, we were using `observe_closure_duration` in `VirtualFile` file opening code, but this doesn't support async open operations, which we want to use as part of #4743. ## Summary of changes * Move the duration measurement from the `with_file` macro into a `observe_duration` macro. * Some smaller drive-by fixes to replace the old strings with the new variant names introduced by #5273 Part of #4743, follow-up of #5247.

## Problem `block_in_place` is a quite expensive operation, and if it is used, we should explicitly have to opt into it by allowing the `clippy::disallowed_methods` lint. For more, see #5023 (comment). Similar arguments exist for `Handle::block_on`, but we don't do this yet as there is still usages. ## Summary of changes Adds a clippy.toml file, configuring the [`disallowed_methods` lint](https://rust-lang.github.io/rust-clippy/master/#/disallowed_method).

bayandin · 2023-09-12T08:05:09Z

Hah, I've just realised that GitHub doesn't trigger workflows on PRs into the release if the current branch doesn't contain any changes from the main. It just carried statuses from the main 🤔

This could be related to the thing that PR is created by @github-actions bot, and it doesn't have permission to trigger workflows.

The sequence that can lead to a deadlock: 1. DELETE request gets all the way to `tenant.shutdown(progress, false).await.is_err() ` , while holding TENANTS.read() 2. POST request for tenant creation comes in, calls `tenant_map_insert`, it does `let mut guard = TENANTS.write().await;` 3. Something that `tenant.shutdown()` needs to wait for needs a `TENANTS.read().await`. The only case identified in exhaustive manual scanning of the code base is this one: Imitate size access does `get_tenant().await`, which does `TENANTS.read().await` under the hood. In the above case (1) waits for (3), (3)'s read-lock request is queued behind (2)'s write-lock, and (2) waits for (1). Deadlock. I made a reproducer/proof-that-above-hypothesis-holds in #5281 , but, it's not ready for merge yet and we want the fix _now_. fixes #5284

github-actions · 2023-09-12T11:38:31Z

1644 tests run: 1569 passed, 0 failed, 75 skipped (full report)

Flaky tests (1)

Postgres 14

test_get_tenant_size_with_multiple_branches: release

Code coverage (full report)

functions: 53.0% (7584 of 14310 functions)
lines: 81.6% (44824 of 54965 lines)

_{The comment gets automatically updated with the latest test results
521438a at 2023-09-12T11:53:38.227Z :recycle:}

## Problem If @github-actions creates release PR, the CI pipeline is not triggered (but we have `release-notify.yml` workflow that we expect to run on this event). I suspect this happened because @github-actions is not a repository member. Ref #5283 (comment) ## Summary of changes - Use `CI_ACCESS_TOKEN` to create a PR - Use `gh` instead of `thomaseizinger/create-pull-request` - Restrict permissions for GITHUB_TOKEN to `contents: write` only (required for `git push`)

arpad-m and others added 30 commits September 5, 2023 12:55

Create Release PR at 7:00 UTC every Tuesday (#5213)

88b1ac4

chore(pageserver): remove unnecessary logging in tenant task loops (#…

4fec48f

…5188) Fixes #3830 by adding the `#[cfg(not(feature = "testing"))]` attribute to unnecessary loggings in `pageserver/src/tenant/tasks.rs`. Co-authored-by: Joonas Koivunen <joonas@neon.tech>

approved-for-ci-run.yml: use our bot (#5216)

da60f69

## Problem Pull Requests created by GitHub Actions bot doesn't have access to secrets, so we need to use our bot for it to auto-trigger a tests run See previous PRs #4663, #5210, #5212 ## Summary of changes - Use our bot to create PRs

remove the duplicate entries in postgresql.conf (#5090)

dfe2e51

Complete the missing metrics for files_created/bytes_written (#5120)

9a9187b

refactor(pageserver): refactor pageserver router state creation (#5165)

31e1568

Fixes #3894 by: - Refactor the pageserver router creation flow - Create the router state in `pageserver/src/bin/pageserver.rs`

doc: tests expect lsof (#5226)

3a96685

On a clean system `lsof` needs to be installed. Add it to the list just to keep things nice and copy-pasteable (except for poetry).

Remove Virtual from FileBlockReaderVirtual variant name (#5225)

6243b44

With #5181, the generics for `FileBlockReader` have been removed, so having a `Virtual` postfix makes less sense now.

Fix pg_config version parsing (#5200)

485a2cf

## Problem Fix pg_config version parsing ## Summary of changes Use regex to capture major version of postgres #5146

Misc workflows: fix quotes in bash (#5235)

d15563f

Upgrade h3-pg to version 4.1.3. (#5237)

e5adc4e

This includes v16 support.

sharnoff and others added 12 commits September 10, 2023 20:15

vm-monitor: Don't include Args in top-level span (#5264)

853552d

It makes the logs too verbose. ref https://neondb.slack.com/archives/C03F5SM1N02/p1694281232874719?thread_ts=1694272777.207109&cid=C03F5SM1N02

Measure pageserver wal recovery time and fix flush() method (#5240)

c0ed362

github-actions bot requested review from a team as code owners September 12, 2023 07:04

github-actions bot requested review from conradludgate, hlinnaka and piercypixel and removed request for a team September 12, 2023 07:04

bayandin mentioned this pull request Sep 12, 2023

Use CI_ACCESS_TOKEN to create release PR #5286

Merged

5 tasks

shanyp approved these changes Sep 12, 2023

View reviewed changes

shanyp merged commit 0e6fdc8 into release Sep 12, 2023
36 checks passed

shanyp deleted the releases/2023-09-12 branch September 12, 2023 11:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release 2023-09-12 #5283

Release 2023-09-12 #5283

github-actions bot commented Sep 12, 2023

bayandin commented Sep 12, 2023

github-actions bot commented Sep 12, 2023 •

edited

Loading

Postgres 14

Release 2023-09-12 #5283

Release 2023-09-12 #5283

Conversation

github-actions bot commented Sep 12, 2023

bayandin commented Sep 12, 2023

github-actions bot commented Sep 12, 2023 • edited Loading

1644 tests run: 1569 passed, 0 failed, 75 skipped (full report)

Postgres 14

Code coverage (full report)

github-actions bot commented Sep 12, 2023 •

edited

Loading