testutils backtraces are usually not useful #1266

Open
brson opened this issue Nov 30, 2023 · 5 comments · May be fixed by #1349
Labels: bug (Something isn't working)

brson commented Nov 30, 2023

I have encountered this issue repeatedly when debugging fuzzing errors, but it is a bit hard to explain.

In various situations where an error is returned but not expected, the host calls reject_error, and in the testutils configuration this logs diagnostics and dumps a backtrace. That backtrace, though, does not locate the original source of the error - it only locates the call site of reject_error.

The backtrace is mostly useless: the most effective way to find the actual source of the error is to read the diagnostics and hope one of them contains a greppable string. And even if you do locate the source of the error, you still don't have the backtrace that led to it. I almost always end up instrumenting the runtime and contracts with printlns to figure out the call path to the error.
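
To illustrate the general failure mode outside of Soroban (a standalone toy example, not code from this repo): when a backtrace is captured at the point where an error gets re-wrapped, rather than where it originated, the trace names the wrapping site, not the origin.

    use std::backtrace::Backtrace;

    struct WrappedError {
        msg: &'static str,
        backtrace: Backtrace,
    }

    fn original_failure() -> Result<(), &'static str> {
        // This is the frame we'd actually want to see in the trace.
        Err("accessing no-longer-live entry")
    }

    fn reject_err(r: Result<(), &'static str>) -> Result<(), WrappedError> {
        // Capturing the backtrace here (analogous to the escalation path)
        // records this frame, not original_failure, which has already
        // returned by the time we get here.
        r.map_err(|msg| WrappedError {
            msg,
            backtrace: Backtrace::force_capture(),
        })
    }

    fn main() {
        if let Err(e) = reject_err(original_failure()) {
            // Prints a trace naming reject_err and main, but never
            // original_failure.
            println!("{}: {}", e.msg, e.backtrace);
        }
    }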


Example:

In some recent tests I triggered this error in Storage::extend:

        if old_live_until < ledger_seq {
            return Err(host.err(
                ScErrorType::Storage,
                ScErrorCode::InternalError,
                "accessing no-longer-live entry",
                &[old_live_until.into(), ledger_seq.into()],
            ));
        }

The end result is that I see this output:

thread '<unnamed>' panicked at /home/azureuser/data/stellar/rs-soroban-env/soroban-env-host/src/host.rs:684:9:
HostError: Error(Storage, InternalError)

Event log (newest first):
   0: [Diagnostic Event] topics:[error, Error(Storage, InternalError)], data:"escalating error to panic"
   1: [Diagnostic Event] topics:[error, Error(Storage, InternalError)], data:["contract call failed", allowance, [Address(Contract(dd222222262222222222ffffffffffffffffffffffffffffffffffffffffffff)), Address(Contract(dd222222262222222222ffffffffffffffffffffffffffffffffffffffffffff))]]
   2: [Failed Diagnostic Event (not emitted)] contract:0000000000000000000000000000000000000000000000000000000000000001, topics:[error, Error(Storage, InternalError)], data:"caught error from function"
   3: [Failed Diagnostic Event (not emitted)] contract:0000000000000000000000000000000000000000000000000000000000000001, topics:[error, Error(Storage, InternalError)], data:"escalating error to panic"
   4: [Failed Diagnostic Event (not emitted)] contract:0000000000000000000000000000000000000000000000000000000000000001, topics:[error, Error(Storage, InternalError)], data:["accessing no-longer-live entry", 4095, 8739]
   5: [Diagnostic Event] topics:[fn_call, Bytes(0000000000000000000000000000000000000000000000000000000000000001), allowance], data:[Address(Contract(dd222222262222222222ffffffffffffffffffffffffffffffffffffffffffff)), Address(Contract(dd222222262222222222ffffffffffffffffffffffffffffffffffffffffffff))]

Backtrace (newest first):
   0: soroban_env_host::budget::Budget::with_shadow_mode
             at /home/azureuser/data/stellar/rs-soroban-env/soroban-env-host/src/budget.rs:653:21
   1: soroban_env_host::host::Host::with_debug_mode
             at /home/azureuser/data/stellar/rs-soroban-env/soroban-env-host/src/host.rs:533:24
      soroban_env_host::host::error::<impl soroban_env_host::host::Host>::maybe_get_debug_info
             at /home/azureuser/data/stellar/rs-soroban-env/soroban-env-host/src/host/error.rs:271:13
      soroban_env_host::host::error::<impl soroban_env_host::host::Host>::error::{{closure}}
             at /home/azureuser/data/stellar/rs-soroban-env/soroban-env-host/src/host/error.rs:255:23
   2: soroban_env_host::budget::Budget::with_shadow_mode
             at /home/azureuser/data/stellar/rs-soroban-env/soroban-env-host/src/budget.rs:653:21
   3: soroban_env_host::host::Host::with_debug_mode
             at /home/azureuser/data/stellar/rs-soroban-env/soroban-env-host/src/host.rs:533:24
      soroban_env_host::host::error::<impl soroban_env_host::host::Host>::error
             at /home/azureuser/data/stellar/rs-soroban-env/soroban-env-host/src/host/error.rs:242:9
      <soroban_env_host::host::Host as soroban_env_common::env::EnvBase>::escalate_error_to_panic
             at /home/azureuser/data/stellar/rs-soroban-env/soroban-env-host/src/host.rs:683:26
   4: soroban_sdk::env::internal::reject_err::{{closure}}
             at /home/azureuser/data/stellar/rs-soroban-sdk/soroban-sdk/src/env.rs:52:23
   5: core::result::Result<T,E>::map_err
             at /rustc/9a66e4471f71283fd54d80ef8147630d34756332/library/core/src/result.rs:829:27
      soroban_sdk::env::internal::reject_err
             at /home/azureuser/data/stellar/rs-soroban-sdk/soroban-sdk/src/env.rs:52:9
      <soroban_sdk::env::Env as soroban_env_common::env::Env>::call
             at /home/azureuser/data/stellar/rs-soroban-sdk/soroban-sdk/src/env.rs:1618:13
      soroban_sdk::env::Env::invoke_contract
             at /home/azureuser/data/stellar/rs-soroban-sdk/soroban-sdk/src/env.rs:370:18
   6: soroban_sdk::token::TokenClient::allowance
             at /home/azureuser/data/stellar/rs-soroban-sdk/soroban-sdk/src/token.rs:31:1
   7: fuzz_target_1::_::__libfuzzer_sys_run
             at fuzz/fuzz_targets/fuzz_target_1.rs:316:29

The backtrace doesn't indicate where that storage error was, just that something happened during the call to the token contract.

Something interesting here is that the original error does appear to generate a backtrace that we never get to see: the error site calls Host::err, which produces a HostError containing a backtrace (captured in maybe_get_debug_info):

    pub fn err(&self, type_: ScErrorType, code: ScErrorCode, msg: &str, args: &[Val]) -> HostError {
        let error = Error::from_type_and_code(type_, code);
        self.error(error, msg, args)
    }

    /// At minimum constructs and returns a [HostError] build from the provided
    /// [Error], and when running in [DiagnosticMode::Debug] additionally
    /// records a diagnostic event with the provided `msg` and `args` and then
    /// enriches the returned [Error] with [DebugInfo] in the form of a
    /// [Backtrace] and snapshot of the [Events] buffer.
    pub fn error(&self, error: Error, msg: &str, args: &[Val]) -> HostError {
        let mut he = HostError::from(error);
        self.with_debug_mode(|| {
            // We _try_ to take a mutable borrow of the events buffer refcell
            // while building up the event we're going to emit into the events
            // log, failing gracefully (just emitting a no-debug-info
            // `HostError` wrapping the supplied `Error`) if we cannot acquire
            // the refcell. This is to handle the "double fault" case where we
            // get an error _while performing_ any of the steps needed to record
            // an error as an event, below.
            if let Ok(mut events_refmut) = self.0.events.try_borrow_mut() {
                self.record_err_diagnostics(events_refmut.deref_mut(), error, msg, args);
            }
            he = HostError {
                error,
                info: self.maybe_get_debug_info(),
            };
            Ok(())
        });
        he
    }

This HostError is not the one that ultimately gets printed. Instead, escalate_error_to_panic creates a new HostError and prints that:

    #[cfg(feature = "testutils")]
    fn escalate_error_to_panic(&self, e: Self::Error) -> ! {
        let _ = self.with_current_frame_opt(|f| {
            if let Some(Frame::TestContract(frame)) = f {
                if let Ok(mut panic) = frame.panic.try_borrow_mut() {
                    *panic = Some(e.error);
                }
            }
            Ok(())
        });
        // XXX discard a HostError and create a new one to panic/print XXX
        let escalation = self.error(e.error, "escalating error to panic", &[]);
        panic!("{:#?}", escalation);
    }

This new HostError generates the useless backtrace.


It's not obvious, though, how to thread the backtrace from the original error all the way to this escalate_error_to_panic function: the HostError received as input here is still not the original error, since errors go through multiple layers of possibly-lossy translation from HostError to ScError and back. Even so, reusing the input HostError's backtrace would be more useful than the one that is currently printed, and in some cases it may be exactly the desired backtrace.
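
One minimal direction along those lines is to panic with the incoming HostError rather than minting a new one (an untested sketch against the escalate_error_to_panic quoted above; note it would no longer record the "escalating error to panic" diagnostic that self.error currently emits):

    #[cfg(feature = "testutils")]
    fn escalate_error_to_panic(&self, e: Self::Error) -> ! {
        let _ = self.with_current_frame_opt(|f| {
            if let Some(Frame::TestContract(frame)) = f {
                if let Ok(mut panic) = frame.panic.try_borrow_mut() {
                    *panic = Some(e.error);
                }
            }
            Ok(())
        });
        // Panic with the original HostError: whatever backtrace it already
        // carries is what gets printed, instead of a fresh backtrace rooted
        // at this escalation site.
        panic!("{:#?}", e);
    }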

At the time the original error is generated, diagnostics are also logged, so there's an opportunity there to stash a backtrace in the diagnostics; then (somehow) figure out which particular diagnostic contains the backtrace of interest.

brson added the bug (Something isn't working) label Nov 30, 2023

graydon commented Nov 30, 2023

Hmm. This is sad. I would prefer to try stashing it in the panic slot in the native frame.

I'm sympathetic to this being quite frustrating and hard to follow. IME fixing these sorts of bugs takes a lot of careful tracing of the propagation paths before you find a change that'll do the trick.

Where are the ScError transitions happening?


brson commented Dec 4, 2023

Where are the ScError transitions happening?

I want to spend some more time debugging. I have looked at the code that is intended to propagate errors, and I can see that it could/should be able to propagate this backtrace, but I don't quite have the full picture.

In this case I am having the token client call a native-compiled contract, which makes an extend_ttl call (which is infallible from the perspective of the contract); the runtime creates a HostError, which gets escalated to a panic; then that tricky testutils code in call_n_internal catches the panic and propagates a new error, with the backtrace getting lost in the process.

(It's at this point that I think call_n_internal pulls an error payload out of the frame, but that error payload is not a HostError and doesn't have the original backtrace. I am not sure where the error in the frame was generated.)

There is surely some way to get escalate_error_to_panic and call_n_internal to cooperate to preserve the backtrace.

I think I can figure this out myself, just need some more time to poke at it.


brson commented Dec 4, 2023

Changing the TestContractFrame's panic slot to hold a HostError instead of an Error is sufficient to recover the backtrace after a panic and continue propagating it.

Here's a wip patch that does so: brson@7cdcbab
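
Roughly, the shape of the change (a simplified sketch, not an exact copy of the patch; the real TestContractFrame has other fields, and the slot may be wrapped differently, e.g. in an Rc):

    pub(crate) struct TestContractFrame {
        // Was a slot holding a bare Error; holding the full HostError keeps
        // the original DebugInfo (backtrace and event snapshot) alive across
        // the panic in escalate_error_to_panic and the catch in
        // call_n_internal.
        pub(crate) panic: RefCell<Option<HostError>>,
        // ...other fields elided...
    }

escalate_error_to_panic then stores the whole incoming HostError in the slot, and call_n_internal returns that stored HostError directly instead of rebuilding one from the bare Error.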


brson commented Dec 5, 2023

Where are the ScError transitions happening?

It wasn't actually converting to ScError, but rather to Error: escalate_error_to_panic stuffs the Error from the HostError into the frame, discards the original HostError, generates a new HostError, and panics with it; call_n_internal then catches the panic, pulls the Error out of the frame, and creates and returns a new HostError with yet another new backtrace.


brson commented Feb 12, 2024

I've done a little more work on this here: https://github.com/brson/rs-soroban-env/tree/backtraces2

I've found 3 cases where a secondary error event discards the backtrace from the initial error event. One of them I have captured in a test case.

brson added commits to brson/rs-soroban-env that referenced this issue Feb 17, 2024
brson linked a pull request Feb 17, 2024 that will close this issue