
Add performance regress test_ondemand_download_churn.py #7242

Merged

Conversation

@jbajic (Contributor) commented Mar 26, 2024

Summary of changes

Add performance regress test test_ondemand_download_churn.py.

Closes #7146

Tagging @problame

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jbajic jbajic requested a review from a team as a code owner March 26, 2024 10:32
@jbajic jbajic requested a review from jcsp March 26, 2024 10:32
@jcsp jcsp requested review from problame and removed request for jcsp March 26, 2024 10:57
@problame (Contributor) left a comment

Thanks for working on this! A couple of thoughts:

  1. Latency stats are useful, and you copied the right pattern with the task-locals; but they need to be partitioned by the request type, i.e., is it an eviction or an on-demand download. Also, I'm afraid that the histogram range might be optimized for getpage-level latencies from the get-page-latest-lsn sub-benchmark, not on-demand download latencies. So, overall, I don't know whether it makes sense to have latency stats yet.
  2. Instead, the idea behind on-demand download churn was to determine max throughput.
  3. I don't think we need the n_tenants and pgbench_scale cruft that you likely copied over from the get-page-latest-lsn sub-benchmark. Please read the public Notion page I linked from the issue; it describes the setup: a single tenant initialized with pgbench using the parameters described in that Notion doc should be sufficient to max out throughput from real S3 to an i3en.3xlarge EC2 instance.
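On point 1, partitioning latencies by request type could look like this minimal pure-Python sketch (illustrative only; not the actual pagebench task-local implementation, and class/method names are my own):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class LatencyStats:
    """Latency samples partitioned by request type (download vs. eviction).

    A stand-in to illustrate partitioning per-request latencies by type,
    as suggested above; not the real benchmark code.
    """

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def measure(self, request_type):
        # Record wall-clock latency under the given request type.
        start = time.monotonic()
        try:
            yield
        finally:
            self.samples[request_type].append(time.monotonic() - start)

    def summary(self, request_type):
        # Simple order-statistics summary; returns None if no samples.
        xs = sorted(self.samples[request_type])
        if not xs:
            return None
        return {"count": len(xs), "p50": xs[len(xs) // 2], "max": xs[-1]}
```

Each request site would wrap its work in `with stats.measure("on_demand_download"): ...` or `with stats.measure("eviction"): ...`, so the two request types end up in separate histograms.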

Of course, if you don't want to spend the money on AWS, you can set up a local minio instance for testing.

I don't think we have good documentation for how to run the test suite locally with real S3.
Probably the best instructions for that are in:

# Run separate tests for real S3
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty
export REMOTE_STORAGE_S3_BUCKET=neon-github-ci-tests
export REMOTE_STORAGE_S3_REGION=eu-central-1
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
${cov_prefix} cargo nextest run $CARGO_FLAGS -E 'package(remote_storage)' -E 'test(test_real_s3)'

Feel free to augment the "Useful environment variables" section of test_runner/README.md once you get it running locally.
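A test can gate itself on the same environment variables shown in the snippet above; a minimal stdlib sketch (the function name is illustrative, not the suite's actual helper):

```python
import os

# Variables from the real-S3 snippet above; all must be non-empty
# for the real-S3 code paths to be usable.
REAL_S3_ENV_VARS = [
    "ENABLE_REAL_S3_REMOTE_STORAGE",
    "REMOTE_STORAGE_S3_BUCKET",
    "REMOTE_STORAGE_S3_REGION",
]

def real_s3_configured():
    # True only when every required variable is set and non-empty.
    return all(os.getenv(var) for var in REAL_S3_ENV_VARS)
```

A test could then skip itself early when `real_s3_configured()` returns False instead of failing midway.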

Again, thanks for taking a stab at this. As described in the issue, it's a fairly self-contained task, but it requires some ramp-up with the test suite.

@problame (Contributor)

Please re-request review once you've made changes. I'll be unavailable until next Tuesday.

@jbajic jbajic requested a review from problame April 11, 2024 13:25
@jbajic (Contributor, Author) commented Apr 11, 2024

The tests with 10 and 100 concurrency_per_target fail at the end but still produce results. I will try to investigate it a bit further.

@problame problame self-assigned this Apr 15, 2024
@jbajic (Contributor, Author) commented Apr 19, 2024

> The tests with 10 and 100 concurrency_per_target fail at the end but still produce results. I will try to investigate it a bit further.

I have found the issue and fixed it with the last commit: the requests are dropped at the end of the test, and that was detected as an error in the logs.

@problame (Contributor) left a comment

Sorry for the late reply.

I merged from main and pushed some changes, mostly orthogonal to yours.

Couple of high-level remarks:

  • The ondemand_download_churn.rs should produce absolute numbers instead of the per-second average.
  • The addition of AWS_ENDPOINT_URL: I guess you needed that for minio. At Neon, we use AWS_PROFILE for this sort of thing, so you'd use a custom AWS profile for the minio S3 endpoint. An example of how to configure the profile is given here. It would be nice if we could streamline your instructions in the README to just use AWS_PROFILE.
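As a concrete illustration of the AWS_PROFILE suggestion, a profile for a local minio endpoint might look like this (placeholder values; note that the profile-level endpoint_url key requires a reasonably recent AWS CLI/SDK):

```ini
# ~/.aws/config -- placeholder values for a local minio endpoint
[profile minio-local]
region = eu-central-1
endpoint_url = http://127.0.0.1:9000

# ~/.aws/credentials
[minio-local]
aws_access_key_id = minioadmin
aws_secret_access_key = minioadmin
```

With that in place, `AWS_PROFILE=minio-local` selects both the credentials and the endpoint without needing AWS_ENDPOINT_URL or explicit key variables.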

Again, sorry for the delayed reply. Let me know if you're still willing to pick up this PR.

@jbajic (Contributor, Author) commented Apr 30, 2024


No problem. Yes, I would like to pick it up and finish this test.

@problame (Contributor)

Thanks, go ahead and re-request review from me if that's possible, otherwise reply to this thread.

@jbajic (Contributor, Author) commented May 5, 2024

  • The ondemand_download_churn.rs should produce absolute numbers instead of the per-second average.

I have changed the output to look like this now:

{
  "downloads": 447,
  "evictions": 609,
  "runtime": {
    "nanos": 32640994,
    "secs": 30
  },
  "timeline_restarts": 1
}
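Since the output is now absolute counters plus a runtime, rates can still be derived downstream; a minimal sketch using the field names from the JSON above (the function itself is illustrative, not part of the PR):

```python
def throughput(result):
    """Derive per-second rates from the benchmark's absolute counters.

    `result` follows the JSON shape shown above: absolute `downloads`
    and `evictions` counts plus a `runtime` serialized as
    {"secs": ..., "nanos": ...}.
    """
    runtime = result["runtime"]
    # Reassemble the duration from its two integer components.
    seconds = runtime["secs"] + runtime["nanos"] / 1e9
    return {
        "downloads_per_sec": result["downloads"] / seconds,
        "evictions_per_sec": result["evictions"] / seconds,
    }
```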
  • The addition of AWS_ENDPOINT_URL: I guess you needed that for minio. At Neon, we use AWS_PROFILE for this sort of thing, so you'd use a custom AWS profile for the minio S3 endpoint. An example of how to configure the profile is given here. It would be nice if we could streamline your instructions in the README to just use AWS_PROFILE.

I have also added usage of AWS_PROFILE to replace AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Using endpoint_url in the profile still does not fix how the pageserver command is constructed, since the endpoint needs to be explicitly defined for the pageserver for it to work, as in:

./target/release/neon_local init --config=/tmp/tmphvpqctqc --pg-version 15 --pageserver-config-override=remote_storage={ bucket_name = "neon-bucket", bucket_region = "eu-central-1", prefix_in_bucket = "2eedf453-65aa-43b7-a1ed-379824d75180/test-download-churn-release-pg15-1-tokio-epoll-uring-30-/pageserver", endpoint = "http://127.0.0.1:9000" }
 --pageserver-config-override=page_cache_size=16384 --pageserver-config-override=max_file_descriptors=500000

and I have not found a nice way to extract an endpoint_url defined in the profile configuration and pass it on to pageserver startup.
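One possible approach (an assumption on my part, not something this PR implements): the AWS config file is INI-formatted, so the value could be read with the Python stdlib and forwarded into the pageserver config override:

```python
import configparser
import os

def profile_endpoint_url(profile, config_path=None):
    """Read endpoint_url for a named AWS profile from ~/.aws/config.

    The AWS config file is INI-formatted; non-default profiles live in
    sections named "profile <name>". Returns None when the profile or
    the key is absent. Illustrative workaround only.
    """
    path = config_path or os.path.expanduser("~/.aws/config")
    parser = configparser.ConfigParser()
    parser.read(path)
    # Try the "profile <name>" section first, then a bare section name.
    for section in ("profile " + profile, profile):
        if parser.has_section(section) and parser.has_option(section, "endpoint_url"):
            return parser.get(section, "endpoint_url")
    return None
```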

@jbajic jbajic requested a review from problame May 6, 2024 11:59
@problame (Contributor) commented May 8, 2024

The endpoint is optional:

/// A base URL to send S3 requests to.
/// By default, the endpoint is derived from a region name, assuming it's
/// an AWS S3 region name, erroring on wrong region name.
/// Endpoint provides a way to support other S3 flavors and their regions.
///
/// Example: `http://127.0.0.1:5000`
pub endpoint: Option<String>,

Also, this week, I merged a bunch of PRs that change how neon_local populates pageserver config; tl;dr: the config override stuff is gone, the [[pageservers]] section in .neon/config is gone, and .neon/pageserver_$i/pageserver.toml is only written once during neon_local init.

@problame (Contributor) commented May 8, 2024

I'll try out your instructions and changes and report back.

@problame (Contributor) commented May 8, 2024

@jbajic I took the time to write some more elaborate docs on test suite & remote storage: #7664

Can you try the instructions in that PR? Keep in mind that you need my changes from that PR, so, I recommend you rebase this PR on top of it.

@problame (Contributor) left a comment

Please re-request review once more when PR #7664 has been merged and this PR has been rebased 🙏
We're close to wrapping up!

@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label May 13, 2024
@vipvap vipvap mentioned this pull request May 13, 2024
github-actions bot commented May 13, 2024

3147 tests run: 3007 passed, 0 failed, 140 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_download_remote_layers_api: release

Postgres 14

  • test_basebackup_with_high_slru_count[github-actions-selfhosted-sequential-10-13-30]: release

Code coverage* (full report)

  • functions: 31.4% (6338 of 20195 functions)
  • lines: 47.3% (47946 of 101327 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
9e5cd29 at 2024-05-15T16:03:16.541Z ♻️

@problame (Contributor) left a comment

Pushed some final changes from my side.

Requesting review from @bayandin before merging this.

@problame problame requested a review from bayandin May 13, 2024 09:26
@problame problame added the approved-for-ci-run Changes are safe to trigger CI for the PR label May 13, 2024
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label May 13, 2024
@problame (Contributor) commented May 13, 2024

@bayandin I think this benchmark makes sense when run on EC2 against real S3.

Should we mark it skip until we have that?

@problame problame assigned bayandin and problame and unassigned problame May 13, 2024
@bayandin bayandin added the approved-for-ci-run Changes are safe to trigger CI for the PR label May 13, 2024
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label May 13, 2024
@bayandin (Member)

Merging main and doing a CI run to get benchmarks run in #7717 (I've added a label to that PR)

@jbajic (Contributor, Author) commented May 14, 2024

The last commit fixes this error:

ERROR test_runner/performance/pageserver/pagebench/test_ondemand_download_churn.py::test_download_churn[release-pg14-github-actions-selfhosted-100-std-fs-30] - AssertionError: Log errors on pageserver_1: (9559, '2024-05-13T14:09:35.888170Z  WARN request{method=DELETE path=/v1/tenant/f40d46f8ad8e8508aa35f94adfeb9048/timeline/7ee05d10dd8bd3b978e809763cb7f98d/layer/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000034CD1261-000000003567D801 request_id=7cf26d30-7520-460f-aff8-af55c710e2f8}: request was dropped before completing\n')
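The fix allowlists this benign shutdown-time warning in the test's log validation. A minimal stand-in for that kind of allowlist check (pure Python; not the actual neon test-suite helper, whose API may differ):

```python
import re

# Patterns the log check tolerates; mirrors allowlisting the
# shutdown-time warning from the log excerpt above.
ALLOWED_ERRORS = [
    r".*request was dropped before completing.*",
]

def is_allowed_error(log_line):
    # A WARN/ERROR line only counts as a test failure if it matches
    # none of the allowlisted patterns.
    return any(re.match(pattern, log_line) for pattern in ALLOWED_ERRORS)
```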

@problame problame added the approved-for-ci-run Changes are safe to trigger CI for the PR label May 15, 2024
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label May 15, 2024
@problame problame merged commit affc18f into neondatabase:main May 15, 2024
65 checks passed
@problame (Contributor)

It's done, @jbajic! Thanks for your patience!

@jbajic (Contributor, Author) commented May 15, 2024

> It's done, @jbajic! Thanks for your patience!

Thank you for the support and reviews!

a-masterov pushed a commit that referenced this pull request May 20, 2024
Add performance regress test for on-demand download throughput.

Closes #7146

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Labels
run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label
Development

Successfully merging this pull request may close these issues.

performance regression test for pagebench ondemand-download-churn
3 participants