Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange on-target-test results on rust beta #823

Open
jannic opened this issue Jul 27, 2024 · 6 comments
Open

Strange on-target-test results on rust beta #823

jannic opened this issue Jul 27, 2024 · 6 comments

Comments

@jannic
Copy link
Member

jannic commented Jul 27, 2024

I tried this several times because I couldn't believe it, and it seems to be reproducible:

The command CARGO_TARGET_THUMBV6M_NONE_EABI_RUNNER="probe-rs run" cargo +beta test -p on-target-tests --no-fail-fast --test i2c_loopback --test i2c_loopback_async -- --chip rp2040 just hangs on the second test (i2c_loopback_async).

On stable, it works reliably: CARGO_TARGET_THUMBV6M_NONE_EABI_RUNNER="probe-rs run" cargo +stable test -p on-target-tests --no-fail-fast --test i2c_loopback --test i2c_loopback_async -- --chip rp2040.

If I only run i2c_loopback_async without running i2c_loopback first, it works on both stable and beta.

The hang vanishes if I add 4 cortex_m::asm::nop() calls in non_blocking::reset after the two free calls. 3 are not enough. A cortex_m::asm::delay(1000) in the same place neither so it's not purely a timing issue. More likely depends on the code size somehow?

Changing the i2c_loopback test (ie. the one I have to call first to make the second one fail) to only contain the embedded_hal test causes the call to i2c_loopback_async to return with:

ERROR panicked at on-target-tests/tests/i2c_tests/non_blocking.rs:36:93:
`async fn` resumed after completion
└─ panic_probe::print_defmt::print @ /home/jan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/panic-probe-0.3.2/src/lib.rs:104
<lvl> (1/3) running `embedded_hal`...
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:56
<lvl> (2/3) running `transactions_iter`...
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:62
<lvl> (3/3) running `i2c_write_iter`...
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:68
<lvl> all tests passed!
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:43

Tested on a raspberry pico with the documented test connections, pico-probe, probe-rs 0.24.0 on linux. Rust versions are 1.80.0 and 1.81.0-beta.2.

@jamesmunns
Copy link

Noting my comments from chat:

Flaky behavior like this is generally from one of three sources:

  1. Potential UB - could be from nostd_async, or within rp-hal: Soundness issues sammhicks/nostd_async#4
  2. Improper sync primitive usage, such as "check then wait", instead of "wait then check", I pointed to
    if self.tx_fifo_full() {
    // wait for more room in the fifo
    abort_reason = CPFn::new(
    self,
    Self::poll_tx_not_full,
    Self::unmask_tx_not_full,
    Self::cancel,
    )
    .await;
    if abort_reason.is_err() {
    break;
    }
    }
    , or
    let r = (is_ready)(periph);
    if r.is_pending() {
    Periph::waker().register(cx.waker());
    (setup_flags)(periph);
    } else {
    *done = true;
    }
    r
    as examples of this where it could be possible to race with the interrupt
  3. (distant third) rustc/LLVM miscompilations not unheard of but very rare (only once I can think of in the history of embedded rust being stable?)

@ithinuel
Copy link
Member

The unsoundnesses raised in sammhicks/nostd_async#4 while true, are not exercised in these examples.

They all stem from the fact that if tasks are stack allocated, they may be moved which would break the internally kept pointer to them.

In this tests/examples, the tests run and block that main thread until the task has completed.
By then, there’s no longer any pointer to the tasks kept anywhere (neither in nostd_async, nor in any Waker as there are no IRQ expected).

@jamesmunns I don’t quite understand your point 2, so for this answer I’ll assume you meant

in async_utils a race is possible if the interrupt fires after line 121 returned Pending but before the waker is even registered.

But the interrupts are expected to be masked (within the peripheral, but enabled in the NVIC), therefore the IRQ isn’t fired until (setup_flags)(periph) is called on line 124 after the waker has been registered.

There may be a need for more detailed documentation to explain the expected behaviour & responsibility, eg:

  • There must be NO IRQ source remaining unmasked when no IRQ is expected.
    As when one becomes expected, the IRQ handler may fire even before a waker is registered for it.
  • The IRQ handler must clear the flag that caused it (to avoid IRQ locking)
  • An IRQ may only be unmasked in the peripheral AFTER a waker has been registered for it

@jamesmunns
Copy link

@ithinuel I was more pointing to this part from dirbaio's comments:

cargo miri test and cargo miri run --example ... fail. I haven't looked deep in the code to know whether this is due to the above issues, or to other issues.

It's possible there is additional latent UB. Also UB doesn't necessarily need to be exercised to cause miscompilations.

You were correct at my second point, I'd definitely need to audit in more detail to come up with a specific indication!

@jannic
Copy link
Member Author

jannic commented Aug 4, 2024

Lacking any better idea, I did some bisecting: nightly-2024-06-13 is the first version where this happens, with nightly-2024-06-12 the on-target-tests succeed.

@sammhicks
Copy link

sammhicks commented Sep 16, 2024

Hi, I'm the developer of nostd_async.

There was definitely UB in nostd_async, which should now have been resolved.

Miri complains about version 0.6.1, but not 0.7.0 to 0.7.2.

However, I'm no longer developing it, so you may wish to use a different async runtime, such as embassy.

@thejpster
Copy link
Member

thejpster commented Sep 16, 2024

Does diffing the objdump turn up anything interesting? I like to use arm-none-eabi-objdump --no-addresses --no-show-raw-insn -dCt ./file.elf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants