Test test_backpressure_received_lsn_lag does not pass when failpoint is properly enabled #1587

Open
Tracked by #2028
LizardWizzard opened this issue Apr 27, 2022 · 7 comments
Labels: a/tech_debt (Area: related to tech debt)

@LizardWizzard
Contributor

In #1571 I discovered that failpoints were not enabled, and after fixing the failpoint integration, test_backpressure_received_lsn_lag started to fail. Currently it fails because it hits statement_timeout. I tried increasing the timeout to 10 minutes, but it didn't help. This needs more investigation. @lubennikovaav could you please take a look?

@lubennikovaav lubennikovaav self-assigned this Apr 27, 2022
@stepashka
Member

This test confirms that backpressure works.
Backpressure ensures that when compute has a very write-intensive workload, we don't time out on reads.
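
A toy model may make the mechanism concrete. This is only a sketch under stated assumptions: the function `simulate` and its parameters (`write_rate`, `apply_rate`, `max_lag`) are hypothetical and are not taken from the compute or pageserver code. It models a writer that is paused whenever the amount of not-yet-applied WAL would exceed a limit, which is roughly what backpressure is meant to do.

```python
from typing import Optional, Tuple


def simulate(steps: int, write_rate: int, apply_rate: int,
             max_lag: Optional[int]) -> Tuple[int, int]:
    """Toy backpressure model (hypothetical, not the real implementation).

    Each step the writer tries to produce `write_rate` bytes and the
    consumer applies up to `apply_rate` bytes. If `max_lag` is set, the
    writer skips a step whenever writing would push the lag above it.
    Returns (peak_lag, throttled_steps).
    """
    written = applied = 0
    peak_lag = throttled_steps = 0
    for _ in range(steps):
        lag = written - applied
        if max_lag is None or lag + write_rate <= max_lag:
            written += write_rate
        else:
            throttled_steps += 1  # backpressure: the writer waits this step
        peak_lag = max(peak_lag, written - applied)
        applied += min(apply_rate, written - applied)
    return peak_lag, throttled_steps


if __name__ == "__main__":
    print("with limit:   ", simulate(100, 50, 10, max_lag=100))
    print("without limit:", simulate(100, 50, 10, max_lag=None))
```

With a limit, the lag stays bounded at the cost of throttled writes; without one, the lag grows for as long as the writer outpaces the consumer.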

@LizardWizzard
Contributor Author

Merging #596 into this one.

Cross-posting the idea:

Let's also test backpressure by applying CPU limits on the pageserver.

@LizardWizzard LizardWizzard added the a/tech_debt Area: related to tech debt label Feb 27, 2023
@kelvich
Contributor

kelvich commented Feb 27, 2023

What should that prove? If the pageserver is slow enough then back pressure kicks in and slows down the compute. What is the goal or suspected misbehavior?

@LizardWizzard
Contributor Author

Not sure I follow the last question.

> If the pageserver is slow enough then back pressure kicks in and slows down the compute.

Yes, we need to prove that it works as intended.

@kelvich
Contributor

kelvich commented Feb 27, 2023

> Yes, we need to prove that it works as intended.

Sure, but what would be the proof, or the test to validate that? My point is that if compute times out due to the failpoints (i.e. a slow pageserver), that means back pressure works as expected.

@LizardWizzard
Contributor Author

Yes, that's correct.

@jcsp jcsp closed this as completed Apr 26, 2024
@koivunej koivunej reopened this Aug 1, 2024
@koivunej
Member

koivunej commented Aug 1, 2024

The skip still exists, so I reopened this issue. The test configures its fail point through the long-unsupported psql failpoints command (since replaced with an HTTP API), and the fail point uses the pause action on a plain fail::fail_point!, which freezes the runtime thread.
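
To illustrate why a blocking pause on an async runtime thread is a problem, here is an analogy in Python's asyncio rather than the Rust code in question. It is a sketch only: `ticker`, `blocking_pause`, `cooperative_pause` and `run` are hypothetical names, and the real behavior of the fail point depends on the runtime configuration. A blocking sleep inside a coroutine stalls every other task on the same loop thread, while a cooperative sleep lets them keep running.

```python
import asyncio
import time


async def ticker(log, ticks, interval):
    # Stands in for unrelated work sharing the same runtime thread.
    for i in range(ticks):
        log.append(f"tick {i}")
        await asyncio.sleep(interval)


async def blocking_pause(log, seconds):
    log.append("pause start")
    time.sleep(seconds)  # blocks the event loop thread; no other task can run
    log.append("pause end")


async def cooperative_pause(log, seconds):
    log.append("pause start")
    await asyncio.sleep(seconds)  # yields to the loop; other tasks keep running
    log.append("pause end")


async def run(pause):
    log = []
    task = asyncio.create_task(ticker(log, 3, 0.01))
    await asyncio.sleep(0)  # let the ticker log its first tick
    await pause(log, 0.1)
    await task
    return log


def ticks_during_pause(log):
    start = log.index("pause start")
    end = log.index("pause end")
    return sum(1 for entry in log[start:end] if entry.startswith("tick"))


if __name__ == "__main__":
    print("blocking:   ", ticks_during_pause(asyncio.run(run(blocking_pause))))
    print("cooperative:", ticks_during_pause(asyncio.run(run(cooperative_pause))))
```

In the blocking variant no ticks are recorded while the pause is active, which mirrors the frozen runtime thread described above.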

While looking into whether we have any backpressure tests, I found:

  • test_runner/regress/test_timeline_size.py::test_timeline_size_quota
    • uses a bespoke wait_for_last_flush_lsn via neon.backpressure_lsns
    • unsure why we would even wait (it's a bootstrapped timeline, so we have accurate logical size)
  • test_runner/performance/test_wal_backpressure.py uses backpressure_lsns() to determine how far behind the pageserver is
    • this is reused in another generate series benchmark as well
  • test_runner/regress/test_sharding.py::test_sharding_backpressure

Cc: #7317


6 participants