Add simple test of pageserver recovery after crash. #1324

Merged: 1 commit into main, May 3, 2022

Conversation

@lubennikovaav (Contributor)

To cause a crash, use failpoints in the checkpointer.

Rebased version of #1043 by @knizhnik

This test requires a pageserver built with `cargo build --features fail/failpoints`; otherwise it will hang.
Is there a way to check this from pytest and throw a warning?
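
For context, a minimal sketch of how this kind of crash injection works with the `fail` crate; the failpoint name below is hypothetical, not necessarily the one added in this PR:

```rust
// In the checkpointer, at the point where a crash should be injectable
// (hypothetical failpoint name):
fail::fail_point!("checkpointer-pre-flush");

// In test setup code, arm the failpoint with the built-in "panic" action,
// so that reaching it aborts the pageserver and the test can then exercise
// recovery on restart:
fail::cfg("checkpointer-pre-flush", "panic").unwrap();
```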

@hlinnaka (Contributor)

> This test requires a pageserver built with `cargo build --features fail/failpoints`; otherwise it will hang.
> Is there a way to check this from pytest and throw a warning?

IMHO we should always build with failpoints enabled. There's little harm in compiling them in; we just won't use them in production.

@knizhnik (Contributor)

>> This test requires a pageserver built with `cargo build --features fail/failpoints`; otherwise it will hang.
>> Is there a way to check this from pytest and throw a warning?
>
> IMHO we should always build with failpoints enabled. There's little harm in compiling them in; we just won't use them in production.

I am not sure. The failpoints documentation says:

> Fail points are disabled by default and can be enabled via the failpoints feature. When failpoints are disabled, no code is generated by the macro.

So if we enable this feature, the `fail_point!` macro will generate some code that (I assume) checks whether any actions are associated with the failpoint. I do not know how expensive this check is (most likely a hash table lookup, protected by some mutex), but since failpoints may be inserted in performance-critical parts of the code, I prefer not to enable them in production, at least not without first measuring the performance penalty they introduce. I will check it tomorrow.
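
One way to get that measurement (a rough micro-benchmark sketch, not from this PR; assumes the binary is built with the failpoints feature enabled) is to time a tight loop over a failpoint that has no action configured, which still pays the runtime lookup cost:

```rust
use std::time::Instant;

fn main() {
    const ITERS: u32 = 10_000_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        // No action is configured for "bench-noop", so each call pays only
        // the bookkeeping cost: a registry lookup behind a lock.
        fail::fail_point!("bench-noop");
    }
    println!(
        "{:.1} ns per fail_point! call",
        start.elapsed().as_nanos() as f64 / f64::from(ITERS)
    );
}
```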

@hlinnaka (Contributor)

>> Fail points are disabled by default and can be enabled via the failpoints feature. When failpoints are disabled, no code is generated by the macro.
>
> So if we enable this feature, the `fail_point!` macro will generate some code that (I assume) checks whether any actions are associated with the failpoint. I do not know how expensive this check is (most likely a hash table lookup, protected by some mutex), but since failpoints may be inserted in performance-critical parts of the code, I prefer not to enable them in production, at least not without first measuring the performance penalty they introduce. I will check it tomorrow.

Hmm, looking at the code, it acquires an RwLock in read mode and then does a hash table lookup. Yeah, I wouldn't want to put that into any performance-sensitive loop. I wish it had a fast path for the case where no failpoints are set, by checking an atomic variable first or something...

The failpoints included here are not performance-sensitive, so we could always enable them for now and figure out how to make it cheaper later, if we want to put a failpoint in a more critical path.
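
For illustration, a minimal sketch of such a fast path (not part of this PR; `ANY_FAILPOINT_ARMED` and `cheap_fail_point!` are hypothetical names): a process-wide atomic flag consulted before falling through to the regular machinery.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical process-wide flag, flipped to true by whatever code path
// configures a failpoint (e.g. a wrapper around fail::cfg).
pub static ANY_FAILPOINT_ARMED: AtomicBool = AtomicBool::new(false);

// Hypothetical wrapper: one relaxed atomic load on the hot path; the full
// fail_point! machinery (RwLock read + hash lookup) runs only when armed.
macro_rules! cheap_fail_point {
    ($name:expr) => {
        if ANY_FAILPOINT_ARMED.load(Ordering::Relaxed) {
            fail::fail_point!($name);
        }
    };
}
```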

@funbringer (Contributor) commented on Feb 24, 2022

> Is there a way to check this from pytest and throw a warning?

We could use conditional compilation plus a command-line flag (say, --enabled-features) to have the binary report its enabled features, and then check that list from the test harness. Here's an example:

```toml
# Cargo.toml
[dependencies]
fail = "*"

[features]
failpoints = ["fail/failpoints"]
```

```rust
// main.rs
fn main() {
    // Each feature name is compiled into the list only when the
    // corresponding cfg flag is set at build time.
    let features: &[&str] = &[
        #[cfg(feature = "failpoints")]
        "failpoints",
    ];

    // TODO: hide this print behind a cmdline flag
    println!("available features: {:?}", features);
}
```
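
On the test-harness side, the fixture that starts the pageserver could then run the binary with that flag, parse the printed list, and skip or fail the failpoint tests when "failpoints" is absent; the flag name and output format above are, of course, only suggestions.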

Also, I'd throw a hard error instead of a warning. Nobody reads warnings if they don't abort CI anyway.

@stepashka added the a/test (Area: related to testing) and c/storage/pageserver (Component: storage: pageserver) labels on Mar 10, 2022
@lubennikovaav force-pushed the failpoints_rebased branch 2 times, most recently from 8042ea3 to 08dc656 on March 15, 2022
Review threads (outdated, now resolved) on:
pageserver/src/bin/pageserver.rs
test_runner/batch_others/test_recovery.py
test_runner/fixtures/zenith_fixtures.py
@lubennikovaav (Contributor, Author)

@funbringer, the rebased branch didn't compile with some of your changes; I've commented them out for now:

```text
error[E0599]: no function or associated item named `auth_failed` found for struct `auth::AuthError` in the current scope
  --> proxy/src/auth/credentials.rs:52:28
   |
52 |             Err(AuthError::auth_failed("failpoint triggered"))
   |                            ^^^^^^^^^^^ function or associated item not found in `auth::AuthError`
   |
  ::: proxy/src/auth.rs:62:1
```

@funbringer (Contributor)

@lubennikovaav Thanks! I guess you could drop this piece of code altogether. I forgot to remove it after testing.

@LizardWizzard (Contributor)

I'm doing some failpoints-related work in #1571. I can extract it and merge it sooner if that would help.

@lubennikovaav merged commit 2f9b17b into main on May 3, 2022
@lubennikovaav deleted the failpoints_rebased branch on May 3, 2022