pageserver: run all Rust tests with remote storage enabled #5164

problame · 2023-08-31T16:00:32Z

For #5086 we will require remote storage to be configured in pageserver.

This PR enables localfs-based storage for all Rust unit tests.

Changes:

- In `TenantHarness`, set up localfs remote storage for the tenant.
- `create_test_timeline` should mimic what real timeline creation does, and real timeline creation waits for the timeline to reach remote storage. With this PR, `create_test_timeline` now does that as well.
- All the places that create the harness tenant twice need to shut down the tenant before re-creating it through a second call to `try_load` or `load`.
  - Without shutting down, upload tasks initiated through the first incarnation of the harness tenant might still be ongoing when the second incarnation is `try_load`/`load`ed. That doesn't make sense in the tests that do this, because they generally try to set up a scenario similar to a pageserver stop & start.
- There was one test that recreates a timeline, not the tenant. For that case, I needed to create a `Timeline::shutdown` method. It's a refactoring of the existing `Tenant::shutdown` method.
- The `remote_timeline_client` tests previously set up their own `GenericRemoteStorage` and `RemoteTimelineClient`. Now they re-use the ones pre-created by the `TenantHarness`. Some assertions needed adjusting because they now have to account for the initial image layer that `create_test_timeline` creates.
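
To illustrate why the first tenant incarnation has to be shut down before the second `load`, here is a minimal sketch using only the standard library. `ToyTenant`, `RemoteStorage`, and `restart_scenario` are hypothetical names invented for this example; they are not the pageserver's `Tenant`, `TenantHarness`, or remote storage types, and the background uploads are modelled as plain threads rather than tokio tasks.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for remote storage that outlives a tenant incarnation.
type RemoteStorage = Arc<Mutex<Vec<String>>>;

/// Hypothetical stand-in for one incarnation of a tenant.
struct ToyTenant {
    remote: RemoteStorage,
    uploads: Vec<JoinHandle<()>>,
}

impl ToyTenant {
    fn load(remote: RemoteStorage) -> Self {
        Self { remote, uploads: Vec::new() }
    }

    /// Start an upload in the background; it may finish at any later time.
    fn schedule_upload(&mut self, layer: &str) {
        let remote = Arc::clone(&self.remote);
        let layer = layer.to_string();
        self.uploads.push(thread::spawn(move || {
            remote.lock().unwrap().push(layer);
        }));
    }

    /// Wait for every in-flight upload, consuming this incarnation.
    fn shutdown(self) {
        for handle in self.uploads {
            handle.join().expect("upload task panicked");
        }
    }
}

/// Simulates "stop & start": shut down the first incarnation, then load again.
fn restart_scenario() -> Vec<String> {
    let remote: RemoteStorage = Arc::new(Mutex::new(Vec::new()));

    let mut first = ToyTenant::load(Arc::clone(&remote));
    first.schedule_upload("initial_image_layer");
    // Without this call, the upload could still be running during the reload.
    first.shutdown();

    let second = ToyTenant::load(Arc::clone(&remote));
    let seen = second.remote.lock().unwrap().clone();
    second.shutdown();
    seen
}

fn main() {
    let seen = restart_scenario();
    assert_eq!(seen, vec!["initial_image_layer".to_string()]);
    println!("second incarnation sees: {:?}", seen);
}
```

Because `shutdown` joins all upload threads, the second incarnation always observes the completed upload. Dropping the `first.shutdown()` call would make the content of `seen` depend on thread timing, which is the kind of race the bullet above describes.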

github-actions · 2023-08-31T16:32:25Z

1624 tests run: 1550 passed, 0 failed, 74 skipped (full report)

Flaky tests (1)

Postgres 14

test_pageserver_lsn_wait_error_safekeeper_stop: debug

The comment gets automatically updated with the latest test results: 05bc06c at 2023-09-01T15:08:15.523Z

pageserver/src/tenant.rs

I think it was a pre-existing issue?

The `remote_timeline_client` tests use `#[tokio::test]` and rely on the fact that the test runtime that is set up by this macro is single-threaded.

In PR #5164, we observed interesting flakiness with the `upload_scheduling` test case: it would observe the upload of the third layer (`layer_file_name_3`) before we did `wait_completion`.

Under the single-threaded-runtime assumption, that wouldn't be possible, because the test code doesn't await in between scheduling the upload and calling `wait_completion`.

However, `RemoteTimelineClient` was actually using `BACKGROUND_RUNTIME`. That means there was parallelism where the tests didn't expect it, leading to flakiness such as execution of an `UploadOp` task before the test calls `wait_completion`.

The most confusing scenario is code like this:

```
schedule_upload(A);
wait_completion.await; // B
schedule_upload(C);
wait_completion.await; // D
```

On a single-threaded executor, it is guaranteed that the upload of `C` doesn't run before `D`, because we (the test) don't relinquish control to the executor before `D`'s `await` point.

However, `RemoteTimelineClient` actually scheduled onto the `BACKGROUND_RUNTIME`, so `A` could start running before `B` and `C` could start running before `D`.

This would cause flaky tests when making assertions about the state manipulated by the operations. The concrete issue that led to the discovery of this bug was an assertion about `remote_fs_dir` state in #5164.
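
The single-threaded guarantee described above can be sketched without tokio. In the toy model below, scheduled operations sit in a queue and only run when the caller explicitly drives it, which plays the role of the `await` points `B` and `D`. `LocalQueue` and `run_scenario` are hypothetical names for this illustration only; they are not part of `RemoteTimelineClient` or of tokio.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::rc::Rc;

/// Hypothetical cooperative queue: nothing runs until `wait_completion`.
struct LocalQueue {
    pending: VecDeque<Box<dyn FnOnce()>>,
}

impl LocalQueue {
    fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// Enqueue an operation; on a single thread it cannot start yet.
    fn schedule_upload(&mut self, op: impl FnOnce() + 'static) {
        self.pending.push_back(Box::new(op));
    }

    /// Stand-in for the await point: run everything that is queued.
    fn wait_completion(&mut self) {
        while let Some(op) = self.pending.pop_front() {
            op();
        }
    }
}

/// Returns a snapshot of the "remote" state after each of the four steps.
fn run_scenario() -> Vec<Vec<&'static str>> {
    let remote: Rc<RefCell<Vec<&'static str>>> = Rc::new(RefCell::new(Vec::new()));
    let mut queue = LocalQueue::new();
    let mut snapshots = Vec::new();

    let r = Rc::clone(&remote);
    queue.schedule_upload(move || r.borrow_mut().push("A"));
    snapshots.push(remote.borrow().clone()); // A is scheduled but has not run
    queue.wait_completion(); // B
    snapshots.push(remote.borrow().clone());

    let r = Rc::clone(&remote);
    queue.schedule_upload(move || r.borrow_mut().push("C"));
    snapshots.push(remote.borrow().clone()); // C is scheduled but has not run
    queue.wait_completion(); // D
    snapshots.push(remote.borrow().clone());

    snapshots
}

fn main() {
    let snapshots = run_scenario();
    assert_eq!(snapshots[0], Vec::<&str>::new());
    assert_eq!(snapshots[1], vec!["A"]);
    assert_eq!(snapshots[2], vec!["A"]);
    assert_eq!(snapshots[3], vec!["A", "C"]);
    println!("{:?}", snapshots);
}
```

If `schedule_upload` instead handed the closure to another thread or a separate multi-threaded runtime, the snapshots taken right after scheduling could already contain the new entry, so assertions on them would pass or fail depending on timing.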

problame · 2023-09-01T11:17:32Z

Preliminary: #5177


problame added 3 commits August 31, 2023 16:26

WIP

6cffe96

WIP

1ed63bf

tests pass

ea5df3d

problame requested a review from koivunej August 31, 2023 16:09

cargo fmt

7f706bc

problame mentioned this pull request Aug 31, 2023

rfc: Crash-Consistent Layer Map Updates By Leveraging index_part.json #5086

Merged

forgot to pass on the span

5bed7dd

problame mentioned this pull request Sep 1, 2023

Epic: crash-consistent layer map through index_part.json #5172

Closed

fix python tests sensitive to log message changes

e844568

problame marked this pull request as ready for review September 1, 2023 08:37

problame requested review from a team as code owners September 1, 2023 08:37

problame requested review from fprasx and removed request for a team September 1, 2023 08:37

koivunej approved these changes Sep 1, 2023

koivunej reviewed Sep 1, 2023

pageserver/src/tenant.rs

koivunej reviewed Sep 1, 2023

pageserver/src/tenant.rs

fix flakiness/raciness observed in previous runs

b1202cf

problame mentioned this pull request Sep 1, 2023

remote_timeline_client: tests: run upload ops on the tokio::test runtime #5177

Merged

Merge remote-tracking branch 'origin/main' into problame/always-remote-layer-map/1-rust-tests

05bc06c

problame merged commit cfc0fb5 into main Sep 1, 2023
33 checks passed

problame deleted the problame/always-remote-layer-map/1-rust-tests branch September 1, 2023 16:10

This pull request was closed.
