Add simple test of pageserver recovery after crash. #1324

Merged: 1 commit into main, May 3, 2022

Conversation

@lubennikovaav (Contributor)

To cause a crash, use failpoints in the checkpointer.

Rebased version of #1043 by @knizhnik

This test requires a pageserver built with `cargo build --features fail/failpoints`; otherwise it will hang.
Is there a way to check this from pytest and throw a warning?
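
For context, a minimal sketch of how this kind of crash injection works with the `fail` crate; the failpoint name below is hypothetical, not necessarily the one added in this PR:

```rust
// In the checkpointer, at the point where a crash should be injectable
// (hypothetical failpoint name):
fail::fail_point!("checkpointer-pre-flush");

// In test setup code, arm the failpoint with the built-in "panic" action,
// so that reaching it aborts the pageserver and the test can then exercise
// recovery on restart:
fail::cfg("checkpointer-pre-flush", "panic").unwrap();
```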

@hlinnaka (Contributor)

> This test requires a pageserver built with `cargo build --features fail/failpoints`; otherwise it will hang.
> Is there a way to check this from pytest and throw a warning?

IMHO we should always build with failpoints enabled. There's little harm in compiling them in; we just won't use them in production.

@knizhnik (Contributor)

>> This test requires a pageserver built with `cargo build --features fail/failpoints`; otherwise it will hang.
>> Is there a way to check this from pytest and throw a warning?
>
> IMHO we should always build with failpoints enabled. There's little harm in compiling them in; we just won't use them in production.

I am not sure. The failpoints documentation says:

> Fail points are disabled by default and can be enabled via the failpoints feature. When failpoints are disabled, no code is generated by the macro.

So if we enable this feature, the `fail_point!` macro will generate some code that (I assume) checks whether any actions are associated with the failpoint. I do not know how expensive this check is (most likely a hash table lookup, protected by some mutex), but since failpoints may be inserted in performance-critical parts of the code, I prefer not to enable them in production, at least not without first measuring the performance penalty they introduce. I will check it tomorrow.
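
One way to get that measurement (a rough micro-benchmark sketch, not from this PR; assumes the binary is built with the failpoints feature enabled) is to time a tight loop over a failpoint that has no action configured, which still pays the runtime lookup cost:

```rust
use std::time::Instant;

fn main() {
    const ITERS: u32 = 10_000_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        // No action is configured for "bench-noop", so each call pays only
        // the bookkeeping cost: a registry lookup behind a lock.
        fail::fail_point!("bench-noop");
    }
    println!(
        "{:.1} ns per fail_point! call",
        start.elapsed().as_nanos() as f64 / f64::from(ITERS)
    );
}
```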

@hlinnaka (Contributor)

>> Fail points are disabled by default and can be enabled via the failpoints feature. When failpoints are disabled, no code is generated by the macro.
>
> So if we enable this feature, the `fail_point!` macro will generate some code that (I assume) checks whether any actions are associated with the failpoint. I do not know how expensive this check is (most likely a hash table lookup, protected by some mutex), but since failpoints may be inserted in performance-critical parts of the code, I prefer not to enable them in production, at least not without first measuring the performance penalty they introduce. I will check it tomorrow.

Hmm, looking at the code, it acquires an RwLock in read mode and then does a hash table lookup. Yeah, I wouldn't want to put that into any performance-sensitive loop. I wish it had a fast path for the case where no failpoints are set, by checking an atomic variable first or something...

The failpoints included here are not performance-sensitive, so we could always enable them for now and figure out how to make it cheaper later, if we want to put a failpoint in a more critical path.
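
For illustration, a minimal sketch of such a fast path (not part of this PR; `ANY_FAILPOINT_ARMED` and `cheap_fail_point!` are hypothetical names): a process-wide atomic flag consulted before falling through to the regular machinery.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical process-wide flag, flipped to true by whatever code path
// configures a failpoint (e.g. a wrapper around fail::cfg).
pub static ANY_FAILPOINT_ARMED: AtomicBool = AtomicBool::new(false);

// Hypothetical wrapper: one relaxed atomic load on the hot path; the full
// fail_point! machinery (RwLock read + hash lookup) runs only when armed.
macro_rules! cheap_fail_point {
    ($name:expr) => {
        if ANY_FAILPOINT_ARMED.load(Ordering::Relaxed) {
            fail::fail_point!($name);
        }
    };
}
```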

@funbringer (Contributor) commented on Feb 24, 2022

> Is there a way to check this from pytest and throw a warning?

We could use conditional compilation plus a command-line flag (say, --enabled-features) to have the binary report its enabled features, and then check that list from the test harness. Here's an example:

```toml
# Cargo.toml
[dependencies]
fail = "*"

[features]
failpoints = ["fail/failpoints"]
```

```rust
// main.rs
fn main() {
    // Each feature name is compiled into the list only when the
    // corresponding cfg flag is set at build time.
    let features: &[&str] = &[
        #[cfg(feature = "failpoints")]
        "failpoints",
    ];

    // TODO: hide this print behind a cmdline flag
    println!("available features: {:?}", features);
}
```
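
On the test-harness side, the fixture that starts the pageserver could then run the binary with that flag, parse the printed list, and skip or fail the failpoint tests when "failpoints" is absent; the flag name and output format above are, of course, only suggestions.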

Also, I'd throw a hard error instead of a warning. Nobody reads warnings if they don't abort CI anyway.

@stepashka added the a/test (Area: related to testing) and c/storage/pageserver (Component: storage: pageserver) labels on Mar 10, 2022
@lubennikovaav force-pushed the failpoints_rebased branch 2 times, most recently from 8042ea3 to 08dc656 on March 15, 2022
Review threads (outdated, now resolved) on:
pageserver/src/bin/pageserver.rs
test_runner/batch_others/test_recovery.py
test_runner/fixtures/zenith_fixtures.py
@lubennikovaav (Contributor, Author)

@funbringer, the rebased branch didn't compile with some of your changes; I've commented them out for now:

```text
error[E0599]: no function or associated item named `auth_failed` found for struct `auth::AuthError` in the current scope
  --> proxy/src/auth/credentials.rs:52:28
   |
52 |             Err(AuthError::auth_failed("failpoint triggered"))
   |                            ^^^^^^^^^^^ function or associated item not found in `auth::AuthError`
   |
  ::: proxy/src/auth.rs:62:1
```

@funbringer (Contributor)

@lubennikovaav Thanks! I guess you could drop this piece of code altogether. I forgot to remove it after testing.

@LizardWizzard (Contributor)

I'm doing some failpoints-related work in #1571. I can extract it and merge it sooner if that would help.

@lubennikovaav merged commit 2f9b17b into main on May 3, 2022
@lubennikovaav deleted the failpoints_rebased branch on May 3, 2022