
Add performance regress test_ondemand_download_churn.py #7242

Merged

Conversation

@jbajic (Contributor) commented Mar 26, 2024

Summary of changes

Add performance regress test test_ondemand_download_churn.py.

Closes #7146

Tagging @problame

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jbajic jbajic requested a review from a team as a code owner March 26, 2024 10:32
@jbajic jbajic requested a review from jcsp March 26, 2024 10:32
@jcsp jcsp requested review from problame and removed request for jcsp March 26, 2024 10:57
@problame (Contributor) left a comment

Thanks for working on this! A couple of thoughts:

  1. Latency stats are useful, and you copied the right pattern with the task-locals; but they need to be partitioned by the request type, i.e., is it an eviction or an on-demand download. Also, I'm afraid that the histogram range might be optimized for getpage-level latencies from the get-page-latest-lsn sub-benchmark, not on-demand download latencies. So, overall, I don't know whether it makes sense to have latency stats yet.
  2. Instead, the idea behind on-demand download churn was to determine max throughput.
  3. I don't think we need the n_tenants and pgbench_scale cruft that you likely copied over from the get-page-latest-lsn sub-benchmark. Please read the public Notion page I linked from the issue; it describes the setup: a single tenant initialized with pgbench using the parameters described in that Notion doc should be sufficient to max out throughput from real S3 to an i3en.3xlarge EC2 instance.
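On point 1, partitioning latencies by request type could look like this minimal pure-Python sketch (illustrative only; not the actual pagebench task-local implementation, and class/method names are my own):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class LatencyStats:
    """Latency samples partitioned by request type (download vs. eviction).

    A stand-in to illustrate partitioning per-request latencies by type,
    as suggested above; not the real benchmark code.
    """

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def measure(self, request_type):
        # Record wall-clock latency under the given request type.
        start = time.monotonic()
        try:
            yield
        finally:
            self.samples[request_type].append(time.monotonic() - start)

    def summary(self, request_type):
        # Simple order-statistics summary; returns None if no samples.
        xs = sorted(self.samples[request_type])
        if not xs:
            return None
        return {"count": len(xs), "p50": xs[len(xs) // 2], "max": xs[-1]}
```

Each request site would wrap its work in `with stats.measure("on_demand_download"): ...` or `with stats.measure("eviction"): ...`, so the two request types end up in separate histograms.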

Of course, if you don't want to spend the money on AWS, you can set up a local minio instance for testing.

I don't think we have good documentation for how to run the test suite locally with real S3.
Probably the best instructions for that are in:

# Run separate tests for real S3
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty
export REMOTE_STORAGE_S3_BUCKET=neon-github-ci-tests
export REMOTE_STORAGE_S3_REGION=eu-central-1
# Avoid `$CARGO_FEATURES` since there's no `testing` feature in the e2e tests now
${cov_prefix} cargo nextest run $CARGO_FLAGS -E 'package(remote_storage)' -E 'test(test_real_s3)'

Feel free to augment the "Useful environment variables" section of test_runner/README.md once you get it running locally.
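A test can gate itself on the same environment variables shown in the snippet above; a minimal stdlib sketch (the function name is illustrative, not the suite's actual helper):

```python
import os

# Variables from the real-S3 snippet above; all must be non-empty
# for the real-S3 code paths to be usable.
REAL_S3_ENV_VARS = [
    "ENABLE_REAL_S3_REMOTE_STORAGE",
    "REMOTE_STORAGE_S3_BUCKET",
    "REMOTE_STORAGE_S3_REGION",
]

def real_s3_configured():
    # True only when every required variable is set and non-empty.
    return all(os.getenv(var) for var in REAL_S3_ENV_VARS)
```

A test could then skip itself early when `real_s3_configured()` returns False instead of failing midway.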

Again, thanks for taking a stab at this. As described in the issue, it's a fairly self-contained task, but it requires some ramp-up with the test suite.

@problame (Contributor)

Please re-request review once you've made changes. I'll be unavailable until next Tuesday.

@jbajic jbajic requested a review from problame April 11, 2024 13:25
@jbajic (Contributor, Author) commented Apr 11, 2024

The tests with 10 and 100 concurrency_per_target fail at the end but still produce results. I will try to investigate it a bit further.

@problame problame self-assigned this Apr 15, 2024
@jbajic (Contributor, Author) commented Apr 19, 2024

> The tests with 10 and 100 concurrency_per_target fail at the end but still produce results. I will try to investigate it a bit further.

I have found the issue and fixed it with the last commit: the requests are dropped at the end of the test, and that was detected as an error in the logs.

@problame (Contributor) left a comment

Sorry for the late reply.

I merged from main and pushed some changes, mostly orthogonal to yours.

Couple of high-level remarks:

  • The ondemand_download_churn.rs should produce absolute numbers instead of the per-second average.
  • The addition of AWS_ENDPOINT_URL: I guess you needed that for minio. At Neon, we use AWS_PROFILE for this sort of thing, so you'd use a custom AWS profile for the minio S3 endpoint. An example of how to configure the profile is given here. It would be nice if we could streamline your instructions in the README to just use AWS_PROFILE.
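As a concrete illustration of the AWS_PROFILE suggestion, a profile for a local minio endpoint might look like this (placeholder values; note that the profile-level endpoint_url key requires a reasonably recent AWS CLI/SDK):

```ini
# ~/.aws/config -- placeholder values for a local minio endpoint
[profile minio-local]
region = eu-central-1
endpoint_url = http://127.0.0.1:9000

# ~/.aws/credentials
[minio-local]
aws_access_key_id = minioadmin
aws_secret_access_key = minioadmin
```

With that in place, `AWS_PROFILE=minio-local` selects both the credentials and the endpoint without needing AWS_ENDPOINT_URL or explicit key variables.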

Again, sorry for the delayed reply. Let me know if you're still willing to pick up this PR.

@jbajic (Contributor, Author) commented Apr 30, 2024


No problem. Yes, I would like to pick it up and finish this test.

@problame (Contributor)

Thanks, go ahead and re-request review from me if that's possible, otherwise reply to this thread.

@jbajic (Contributor, Author) commented May 5, 2024

  • The ondemand_download_churn.rs should produce absolute numbers instead of the per-second average.

I have changed the output to look like this now:

{
  "downloads": 447,
  "evictions": 609,
  "runtime": {
    "nanos": 32640994,
    "secs": 30
  },
  "timeline_restarts": 1
}
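Since the output is now absolute counters plus a runtime, rates can still be derived downstream; a minimal sketch using the field names from the JSON above (the function itself is illustrative, not part of the PR):

```python
def throughput(result):
    """Derive per-second rates from the benchmark's absolute counters.

    `result` follows the JSON shape shown above: absolute `downloads`
    and `evictions` counts plus a `runtime` serialized as
    {"secs": ..., "nanos": ...}.
    """
    runtime = result["runtime"]
    # Reassemble the duration from its two integer components.
    seconds = runtime["secs"] + runtime["nanos"] / 1e9
    return {
        "downloads_per_sec": result["downloads"] / seconds,
        "evictions_per_sec": result["evictions"] / seconds,
    }
```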
  • The addition of AWS_ENDPOINT_URL: I guess you needed that for minio. At Neon, we use AWS_PROFILE for this sort of thing, so you'd use a custom AWS profile for the minio S3 endpoint. An example of how to configure the profile is given here. It would be nice if we could streamline your instructions in the README to just use AWS_PROFILE.

I have also added usage of AWS_PROFILE to replace AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Using endpoint_url in the profile still does not fix how the pageserver command is constructed, since the endpoint needs to be explicitly defined for the pageserver for it to work, as in:

./target/release/neon_local init --config=/tmp/tmphvpqctqc --pg-version 15 --pageserver-config-override=remote_storage={ bucket_name = "neon-bucket", bucket_region = "eu-central-1", prefix_in_bucket = "2eedf453-65aa-43b7-a1ed-379824d75180/test-download-churn-release-pg15-1-tokio-epoll-uring-30-/pageserver", endpoint = "http://127.0.0.1:9000" }
 --pageserver-config-override=page_cache_size=16384 --pageserver-config-override=max_file_descriptors=500000

and I have not found a nice way to extract an endpoint_url defined in the profile configuration and pass it on to pageserver startup.
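One possible approach (an assumption on my part, not something this PR implements): the AWS config file is INI-formatted, so the value could be read with the Python stdlib and forwarded into the pageserver config override:

```python
import configparser
import os

def profile_endpoint_url(profile, config_path=None):
    """Read endpoint_url for a named AWS profile from ~/.aws/config.

    The AWS config file is INI-formatted; non-default profiles live in
    sections named "profile <name>". Returns None when the profile or
    the key is absent. Illustrative workaround only.
    """
    path = config_path or os.path.expanduser("~/.aws/config")
    parser = configparser.ConfigParser()
    parser.read(path)
    # Try the "profile <name>" section first, then a bare section name.
    for section in ("profile " + profile, profile):
        if parser.has_section(section) and parser.has_option(section, "endpoint_url"):
            return parser.get(section, "endpoint_url")
    return None
```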

@jbajic jbajic requested a review from problame May 6, 2024 11:59
@problame (Contributor) commented May 8, 2024

The endpoint is optional:

/// A base URL to send S3 requests to.
/// By default, the endpoint is derived from a region name, assuming it's
/// an AWS S3 region name, erroring on wrong region name.
/// Endpoint provides a way to support other S3 flavors and their regions.
///
/// Example: `http://127.0.0.1:5000`
pub endpoint: Option<String>,

Also, this week, I merged a bunch of PRs that change how neon_local populates pageserver config; tl;dr: the config override stuff is gone, the [[pageservers]] section in .neon/config is gone, and .neon/pageserver_$i/pageserver.toml is only written once during neon_local init.

@problame (Contributor) commented May 8, 2024

I'll try out your instructions and changes and report back.

@problame (Contributor) commented May 8, 2024

@jbajic I took the time to write some more elaborate docs on test suite & remote storage: #7664

Can you try the instructions in that PR? Keep in mind that you need my changes from that PR, so, I recommend you rebase this PR on top of it.

@problame (Contributor) left a comment

Please re-request review once more when PR #7664 has been merged and this PR has been rebased 🙏
We're close to wrapping up!

@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label May 13, 2024
@vipvap vipvap mentioned this pull request May 13, 2024
github-actions bot commented May 13, 2024

3147 tests run: 3007 passed, 0 failed, 140 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_download_remote_layers_api: release

Postgres 14

  • test_basebackup_with_high_slru_count[github-actions-selfhosted-sequential-10-13-30]: release

Code coverage* (full report)

  • functions: 31.4% (6338 of 20195 functions)
  • lines: 47.3% (47946 of 101327 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
9e5cd29 at 2024-05-15T16:03:16.541Z ♻️

@problame (Contributor) left a comment

Pushed some final changes from my side.

Requesting review from @bayandin before merging this.

@problame problame requested a review from bayandin May 13, 2024 09:26
@problame problame added the approved-for-ci-run Changes are safe to trigger CI for the PR label May 13, 2024
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label May 13, 2024
@problame (Contributor) commented May 13, 2024

@bayandin I think this benchmark makes sense when run on EC2 against real S3.

Should we mark it skip until we have that?

@problame problame assigned bayandin and problame and unassigned problame May 13, 2024
@bayandin bayandin added the approved-for-ci-run Changes are safe to trigger CI for the PR label May 13, 2024
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label May 13, 2024
@bayandin (Member)

Merging main and doing a CI run to get benchmarks run in #7717 (I've added a label to that PR)

@jbajic (Contributor, Author) commented May 14, 2024

The last commit fixes this error:

ERROR test_runner/performance/pageserver/pagebench/test_ondemand_download_churn.py::test_download_churn[release-pg14-github-actions-selfhosted-100-std-fs-30] - AssertionError: Log errors on pageserver_1: (9559, '2024-05-13T14:09:35.888170Z  WARN request{method=DELETE path=/v1/tenant/f40d46f8ad8e8508aa35f94adfeb9048/timeline/7ee05d10dd8bd3b978e809763cb7f98d/layer/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000034CD1261-000000003567D801 request_id=7cf26d30-7520-460f-aff8-af55c710e2f8}: request was dropped before completing\n')
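The fix allowlists this benign shutdown-time warning in the test's log validation. A minimal stand-in for that kind of allowlist check (pure Python; not the actual neon test-suite helper, whose API may differ):

```python
import re

# Patterns the log check tolerates; mirrors allowlisting the
# shutdown-time warning from the log excerpt above.
ALLOWED_ERRORS = [
    r".*request was dropped before completing.*",
]

def is_allowed_error(log_line):
    # A WARN/ERROR line only counts as a test failure if it matches
    # none of the allowlisted patterns.
    return any(re.match(pattern, log_line) for pattern in ALLOWED_ERRORS)
```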

@problame problame added the approved-for-ci-run Changes are safe to trigger CI for the PR label May 15, 2024
@github-actions github-actions bot removed the approved-for-ci-run Changes are safe to trigger CI for the PR label May 15, 2024
@problame problame merged commit affc18f into neondatabase:main May 15, 2024
65 checks passed
@problame (Contributor)

It's done, @jbajic! Thanks for your patience!

@jbajic (Contributor, Author) commented May 15, 2024

> It's done, @jbajic! Thanks for your patience!

Thank you for the support and reviews!

a-masterov pushed a commit that referenced this pull request May 20, 2024
Add performance regress test for on-demand download throughput.

Closes #7146

Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Labels
run-benchmarks Indicates to the CI that benchmarks should be run for PR marked with this label
Development

Successfully merging this pull request may close these issues.

performance regression test for pagebench ondemand-download-churn
3 participants