Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: run all Rust tests with remote storage enabled #5164

Merged
merged 8 commits into from
Sep 1, 2023

Conversation

problame
Copy link
Contributor

@problame problame commented Aug 31, 2023

For #5086 we will require remote storage to be configured in pageserver.

This PR enables localfs-based storage for all Rust unit tests.

Changes:

  • In TenantHarness, set up localfs remote storage for the tenant.
  • create_test_timeline should mimic what real timeline creation does, and real timeline creation waits for the timeline to reach remote storage. With this PR, create_test_timeline now does that as well.
  • All the places that create the harness tenant twice need to shut down the tenant before the re-create through a second call to try_load or load.
    • Without shutting down, upload tasks initiated by/through the first incarnation of the harness tenant might still be ongoing when the second incarnation of the harness tenant is try_load/loaded. That doesn't make sense in the tests that do that, they generally try to set up a scenario similar to pageserver stop & start.
  • There was one test that recreates a timeline, not the tenant. For that case, I needed to create a Timeline::shutdown method. It's a refactoring of the existing Tenant::shutdown method.
  • The remote_timeline_client tests previously set up their own GenericRemoteStorage and RemoteTimelineClient. Now they re-use the one that's pre-created by the TenantHarness. Some adjustments to the assertions were needed because the assertions now need to account for the initial image layer that's created by create_test_timeline to be present.

@problame problame requested a review from koivunej August 31, 2023 16:09
@github-actions
Copy link

github-actions bot commented Aug 31, 2023

1624 tests run: 1550 passed, 0 failed, 74 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_pageserver_lsn_wait_error_safekeeper_stop: debug
The comment gets automatically updated with the latest test results
05bc06c at 2023-09-01T15:08:15.523Z :recycle:

@problame problame marked this pull request as ready for review September 1, 2023 08:37
@problame problame requested review from a team as code owners September 1, 2023 08:37
@problame problame requested review from fprasx and removed request for a team September 1, 2023 08:37
I think it was a pre-existing issue?
problame added a commit that referenced this pull request Sep 1, 2023
The `remote_timeline_client` tests use `#[tokio::test]` and rely on the
fact that the test runtime that is set up by this macro is
single-threaded.

In PR #5164, we observed
interesting flakiness with the `upload_scheduling` test case:
it would observe the upload of the third layer (`layer_file_name_3`)
before we did `wait_completion`.

Under the single-threaded-runtime assumption, that wouldn't be possible,
because the test code doesn't await inbetween scheduling the upload
and calling `wait_completion`.

However, RemoteTimelineClient was actually using `BACKGROUND_RUNTIME`.
That means there was parallelism where the tests didn't expect it,
leading to flakiness such as execution of an UploadOp task before
the test calls `wait_completion`.

The most confusing scenario is code like this:

```
schedule upload(A);
wait_completion.await; // B
schedule_upload(C);
wait_completion.await; // D
```

On a single-threaded executor, it is guaranteed that the upload up C
doesn't run before D, because we (the test) don't relinquish control
to the executor before D's `await` point.

However, RemoteTimelineClient actually scheduled onto the
BACKGROUND_RUNTIME, so, `A` could start running before `B` and
`C` could start running before `D`.

This would cause flaky tests when making assertions about the state
manipulated by the operations. The concrete issue that led to discover
of this bug was an assertion about `remote_fs_dir` state in #5164.
@problame
Copy link
Contributor Author

problame commented Sep 1, 2023

Preliminary: #5177

problame added a commit that referenced this pull request Sep 1, 2023
…ime (#5177)

The `remote_timeline_client` tests use `#[tokio::test]` and rely on the
fact that the test runtime that is set up by this macro is
single-threaded.

In PR #5164, we observed
interesting flakiness with the `upload_scheduling` test case:
it would observe the upload of the third layer (`layer_file_name_3`)
before we did `wait_completion`.

Under the single-threaded-runtime assumption, that wouldn't be possible,
because the test code doesn't await inbetween scheduling the upload
and calling `wait_completion`.

However, RemoteTimelineClient was actually using `BACKGROUND_RUNTIME`.
That means there was parallelism where the tests didn't expect it,
leading to flakiness such as execution of an UploadOp task before
the test calls `wait_completion`.

The most confusing scenario is code like this:

```
schedule upload(A);
wait_completion.await; // B
schedule_upload(C);
wait_completion.await; // D
```

On a single-threaded executor, it is guaranteed that the upload up C
doesn't run before D, because we (the test) don't relinquish control
to the executor before D's `await` point.

However, RemoteTimelineClient actually scheduled onto the
BACKGROUND_RUNTIME, so, `A` could start running before `B` and
`C` could start running before `D`.

This would cause flaky tests when making assertions about the state
manipulated by the operations. The concrete issue that led to discover
of this bug was an assertion about `remote_fs_dir` state in #5164.
@problame problame merged commit cfc0fb5 into main Sep 1, 2023
33 checks passed
@problame problame deleted the problame/always-remote-layer-map/1-rust-tests branch September 1, 2023 16:10
jcsp pushed a commit that referenced this pull request Sep 4, 2023
…ime (#5177)

The `remote_timeline_client` tests use `#[tokio::test]` and rely on the
fact that the test runtime that is set up by this macro is
single-threaded.

In PR #5164, we observed
interesting flakiness with the `upload_scheduling` test case:
it would observe the upload of the third layer (`layer_file_name_3`)
before we did `wait_completion`.

Under the single-threaded-runtime assumption, that wouldn't be possible,
because the test code doesn't await inbetween scheduling the upload
and calling `wait_completion`.

However, RemoteTimelineClient was actually using `BACKGROUND_RUNTIME`.
That means there was parallelism where the tests didn't expect it,
leading to flakiness such as execution of an UploadOp task before
the test calls `wait_completion`.

The most confusing scenario is code like this:

```
schedule upload(A);
wait_completion.await; // B
schedule_upload(C);
wait_completion.await; // D
```

On a single-threaded executor, it is guaranteed that the upload up C
doesn't run before D, because we (the test) don't relinquish control
to the executor before D's `await` point.

However, RemoteTimelineClient actually scheduled onto the
BACKGROUND_RUNTIME, so, `A` could start running before `B` and
`C` could start running before `D`.

This would cause flaky tests when making assertions about the state
manipulated by the operations. The concrete issue that led to discover
of this bug was an assertion about `remote_fs_dir` state in #5164.
jcsp pushed a commit that referenced this pull request Sep 4, 2023
For
[#5086](#5086 (comment))
we will require remote storage to be configured in pageserver.

This PR enables `localfs`-based storage for all Rust unit tests.

Changes:

- In `TenantHarness`, set up localfs remote storage for the tenant.
- `create_test_timeline` should mimic what real timeline creation does,
and real timeline creation waits for the timeline to reach remote
storage. With this PR, `create_test_timeline` now does that as well.
- All the places that create the harness tenant twice need to shut down
the tenant before the re-create through a second call to `try_load` or
`load`.
- Without shutting down, upload tasks initiated by/through the first
incarnation of the harness tenant might still be ongoing when the second
incarnation of the harness tenant is `try_load`/`load`ed. That doesn't
make sense in the tests that do that, they generally try to set up a
scenario similar to pageserver stop & start.
- There was one test that recreates a timeline, not the tenant. For that
case, I needed to create a `Timeline::shutdown` method. It's a
refactoring of the existing `Tenant::shutdown` method.
- The remote_timeline_client tests previously set up their own
`GenericRemoteStorage` and `RemoteTimelineClient`. Now they re-use the
one that's pre-created by the TenantHarness. Some adjustments to the
assertions were needed because the assertions now need to account for
the initial image layer that's created by `create_test_timeline` to be
present.
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants