Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cranelift: assertion left == right failed: the memory base pointer may be incorrect due to sharing memory #8652

Closed
whitequark opened this issue May 19, 2024 · 24 comments · Fixed by #8750
Labels
bug Incorrect behavior in the current implementation that needs fixing cranelift Issues related to the Cranelift code generator

Comments

@whitequark
Copy link
Contributor

whitequark commented May 19, 2024

Steps to Reproduce

I have a binary (wasm-ld, which is lld from the LLVM distribution) built with wasi-sdk-22.0 for the wasm32-wasi-threads target, with CFLAGS of -pthread and LDFLAGS of -Wl,--max-memory=4294967296. It is built with debug symbols using the normal LLVM mechanism (RelWithDebInfo). It is too big to include here but I believe the relevant excerpt is:

  (memory (;0;) 53 65536 shared)
  (export "memory" (memory 0))

I run the binary with RUST_BACKTRACE=full WASMTIME_BACKTRACE_DETAILS=1 wasmtime -S threads --dir . --dir /tmp -D debug-info -D address-map ./llvm-build/bin/wasm-ld <...arguments that cause a trap...>.

Expected Results

A backtrace with Wasm-level symbols is displayed.

Actual Results

An assertion failure in Cranelift.
thread 'main' panicked at /home/whitequark/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wasmtime-cranelift-20.0.2/src/compiler.rs:559:13:
assertion `left == right` failed: the memory base pointer may be incorrect due to sharing memory
  left: 1
 right: 0
stack backtrace:
   0:     0x55f64b236496 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h402f02d9b4df5fe1
   1:     0x55f64b262c4c - core::fmt::write::h8fddb4828840201f
   2:     0x55f64b23234f - std::io::Write::write_fmt::h38b9b1a6a63dbb61
   3:     0x55f64b236244 - std::sys_common::backtrace::print::h98a1789ae7f04118
   4:     0x55f64b237a3b - std::panicking::default_hook::{{closure}}::hbf9ed991ae15cec9
   5:     0x55f64b237789 - std::panicking::default_hook::h0e9a403389616b5a
   6:     0x55f64b237edd - std::panicking::rust_panic_with_hook::h47d6ea0dd5408232
   7:     0x55f64b237db2 - std::panicking::begin_panic_handler::{{closure}}::h92169a5d672aa955
   8:     0x55f64b236976 - std::sys_common::backtrace::__rust_end_short_backtrace::h78311f5de6c73a58
   9:     0x55f64b237ae4 - rust_begin_unwind
  10:     0x55f649eb0a15 - core::panicking::panic_fmt::h9fcc2d8bb7eb13a8
  11:     0x55f649eb0e8f - core::panicking::assert_failed_inner::h77eb1113bacb0f31
  12:     0x55f649e588af - core::panicking::assert_failed::he44b2686efc08dff
  13:     0x55f64abe2d30 - <wasmtime_cranelift::compiler::Compiler as wasmtime_environ::compile::Compiler>::append_dwarf::h14d11c20fa9e5f82
  14:     0x55f64aafb697 - wasmtime::compile::FunctionIndices::link_and_append_code::hf766901eb87a60b8
  15:     0x55f64aaf4c31 - wasmtime::compile::build_artifacts::h1809dbb6af98174d
  16:     0x55f64ab4360c - core::ops::function::FnOnce::call_once::h8f5620eee6192dd4
  17:     0x55f64ab6bdc5 - wasmtime_cache::ModuleCacheEntry::get_data_raw::h5918a4673179324b
  18:     0x55f64ab4f059 - wasmtime::compile::runtime::<impl wasmtime::compile::code_builder::CodeBuilder>::compile_module::h6cfb67fd1fb024ce
  19:     0x55f649ee4dec - wasmtime_cli::common::RunCommon::load_module::h0243b5cafcaca8ef
  20:     0x55f649fd2f2a - wasmtime_cli::commands::run::RunCommand::execute::h741dadf6ee6383ad
  21:     0x55f649eb9c01 - wasmtime::Wasmtime::execute::h9f6abe47b919fe59
  22:     0x55f649eb5088 - wasmtime::old_cli::main::hbf2a698a610c379d
  23:     0x55f649eb69a3 - std::sys_common::backtrace::__rust_begin_short_backtrace::h2d92647e84e7ad00
  24:     0x55f649eb69bd - std::rt::lang_start::{{closure}}::h3b525ebc181e5daf
  25:     0x55f64b228533 - std::rt::lang_start_internal::hb867f343d27fa61e
  26:     0x55f649ebb4a5 - main
  27:     0x7f85f8f3f24a - __libc_start_call_main
                               at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  28:     0x7f85f8f3f305 - __libc_start_main_impl
                               at ./csu/../csu/libc-start.c:360:3
  29:     0x55f649eb10c1 - _start
  30:                0x0 - <unknown>

Versions and Environment

Cranelift version or commit: wasmtime-cranelift-20.0.2

Operating system: Linux

Architecture: x86_64

@whitequark whitequark added bug Incorrect behavior in the current implementation that needs fixing cranelift Issues related to the Cranelift code generator labels May 19, 2024
@alexcrichton
Copy link
Member

The assertion triggered is this one and is more-or-less this issue where the quality of the current implementation could be better. It might be easy enough to fix this particular assertion, but I don't know enough about the DWARF bits to know.

In the meantime though dropping -D debug-info should get this further

@whitequark
Copy link
Contributor Author

Yeah so I looked at the assertion and couldn't figure out what that meant at all, so I decided to file a bug instead.

In the end I had to use wabt's wasm-objdump and the address to manually correlate the Wasm bytecode to the C source, which was fairly unpleasant. I still have no idea why addr2line doesn't work--I had to pass -O0 to wasm-ld (which, while we're at it, is an incredibly unintuitive flag--it took me literal years and the compiler driver source to find out that it removes the wasm-opt invocation; made worse by the fact that wasm-opt corrupts the debug info) and that gave me function names but still no line numbers.

@alexcrichton
Copy link
Member

When -D debug-info was omitted, WASMTIME_BACKTRACE_DETAILS=1 was passed, and the original source had DWARF, you didn't see filenames and/or line numbers? If so that's a bug because backtrace-based filenames/line numbers are unrelated to -D debug-info so shouldn't be affected by this assertion.

If it helps I added wasm-tools addr2line for this sort of use case awhile back, but if it doesn't work in Wasmtime it may not work there either.

@whitequark
Copy link
Contributor Author

That's correct, and I figured it's a bug. wasm-tools addr2line didn't work either, and neither did LLVM's addr2line that I had on my system (from LLVM 15). No idea why!

@cfallin
Copy link
Member

cfallin commented May 20, 2024

I had to pass -O0 to wasm-ld (which, while we're at it, is an incredibly unintuitive flag--it took me literal years and the compiler driver source to find out that it removes the wasm-opt invocation; made worse by the fact that wasm-opt corrupts the debug info)

FWIW, a number of us are frustrated at the default llvm behavior here to invoke wasm-opt automagically if it's found in the path (and only if -- nondeterministic builds!). You can find a (very) long thread of folks discovering this footgun and registering new reasons for it to be removed at llvm/llvm-project#55781 (nothing would be more welcome than to fix this imho; I've been bitten by it a handful of times too). In the meantime, a workaround a number of folks have found -- and even used in quasi-official places like WASI builds of SpiderMonkey in various SDKs -- is to prepend a fake wasm-opt to the path that is an empty shell script (with exit 0 or whatever). Perhaps we should document this footgun and workaround somewhere.

@alexcrichton
Copy link
Member

@whitequark would you be able to share either the wasm binary itself (perhaps not here if it's too large but via some other means) or instructions on how to build a local copy of it? It sounds like the dwarf info may be corrupted and that could range from a build process thing to an llvm bug to a tooling issue in wasmtime/wasm-tools.

@whitequark
Copy link
Contributor Author

would you be able to share either the wasm binary itself (perhaps not here if it's too large but via some other means) or instructions on how to build a local copy of it?

I can do both! The binary is about half a gig but it's built with a simple script. I'm going to double-check the build to make sure you can reproduce the crash and then upload. I don't envy you looking at half a GB of DWARF though...

@whitequark
Copy link
Contributor Author

FWIW, a number of us are frustrated at the default llvm behavior here to invoke wasm-opt automagically if it's found in the path (and only if -- nondeterministic builds!).

I found it baffling and frustrating, yes. I assumed that it was (uncharacteristically!) a decision driven by the Wasm folks. If it's not, should we not just fix that?

@whitequark
Copy link
Contributor Author

@alexcrichton This branch contains a reproducer--just run ./build.sh on Linux and it should produce a binary. Here's how it fails:

$ WASMTIME_BACKTRACE_DETAILS=1 wasmtime -S threads --dir . ./llvm-build/bin/wasm-ld test.o -o test.wasm --no-entry
Error: error while executing at wasm backtrace:
    0: 0x25c5755 - lld!wasi_thread_start

Caused by:
    0: memory fault at wasm address 0x374fc0 in linear memory of size 0x340000
    1: wasm trap: out of bounds memory access
$ wasm-tools objdump ./llvm-build/bin/wasm-ld
  types                                  |        0xb -      0x9af |      2468 bytes | 246 count
  imports                                |      0x9b2 -      0xdd9 |      1063 bytes | 28 count
  functions                              |      0xddd -    0x10e9f |     65730 bytes | 65506 count
  tables                                 |    0x10ea1 -    0x10eaa |         9 bytes | 1 count
  memories                               |    0x10eac -    0x10eb2 |         6 bytes | 1 count
  globals                                |    0x10eb4 -    0x10ed2 |        30 bytes | 5 count
  exports                                |    0x10ed4 -    0x10efd |        41 bytes | 3 count
  start                                  |    0x10eff -    0x10f00 |         1 bytes | 1 count
  elements                               |    0x10f04 -    0x1e359 |     54357 bytes | 1 count
  data count                             |    0x1e35b -    0x1e35c |         1 bytes | 1 count
  code                                   |    0x1e361 -  0x25c6e2a |  39488201 bytes | 65506 count
  data                                   |  0x25c6e2f -  0x28b122c |   3056637 bytes | 3 count
  custom ".debug_loc"                    |  0x28b123c -  0x666c7a6 |  64730474 bytes | 1 count
  custom ".debug_abbrev"                 |  0x666c7b9 -  0x6a04a8b |   3769042 bytes | 1 count
  custom ".debug_info"                   |  0x6a04a9d - 0x17802250 | 283105203 bytes | 1 count
  custom ".debug_ranges"                 | 0x17802263 - 0x18e53e6b |  23403528 bytes | 1 count
  custom ".debug_str"                    | 0x18e53e7b - 0x207465b2 | 126822199 bytes | 1 count
  custom ".debug_line"                   | 0x207465c3 - 0x23d9c125 |  56974178 bytes | 1 count
  custom "name"                          | 0x23d9c12f - 0x24b19d6f |  14146624 bytes | 1 count
  custom "producers"                     | 0x24b19d7c - 0x24b19e07 |       139 bytes | 1 count
  custom "target_features"               | 0x24b19e19 - 0x24b19e4b |        50 bytes | 1 count
$ wasm-tools addr2line ./llvm-build/bin/wasm-ld 0x25c5755
0x25c5755: no dwarf frames found for this address

Also here is an upload if you want to skip building it.

@cfallin
Copy link
Member

cfallin commented May 20, 2024

If it's not, should we not just fix that?

Feel free to add more data upstream! I think several folks have pushed on this (see above thread) but additional anecdotes might add some motivation or spawn new interest...

@whitequark
Copy link
Contributor Author

I mean, who's the code owner for wasm-ld? Should we not submit a patch and ping them? Adding an option to disable wasm-opt seems completely uncontroversial; making it on-by-default perhaps more so, but even that would make it more in line with other targets.

@alexcrichton
Copy link
Member

Ah I think everything's actually working as intended in the dwarf bits there, albeit unfortunately. The faulting address there is in wasi_thread_start which is defined here. The trapping instruction is this one.

It looks like the bug here has to do with the protocol of spawning a thread as the second instruction in the thread probably shouldn't be the one faulting.

I tested another address by running wasm-tools print -p llvm-build/bin/wasm-ld -o wasm-ld.wat and then within wasm-ld.wat I picked a random offset and got

$ wasm-tools addr2line ./llvm-build/bin/wasm-ld 0x1fb0b
0x1fb0b: llvm::raw_ostream::operator<<(llvm::StringRef) /home/alex/code/clang/llvm-src/llvm/include/llvm/Support/raw_ostream.h:216:25
        llvm::raw_ostream::operator<<(char const*) /home/alex/code/clang/llvm-src/llvm/include/llvm/Support/raw_ostream.h:244:18
        llvm::cl::basic_parser_impl::printOptionInfo(llvm::cl::Option const&, unsigned long) const /home/alex/code/clang/llvm-src/llvm/lib/Support/CommandLine.cpp:1939:14

I believe that the wasi_thread_start function doesn't have dwarf since it's inline asm. I also believe that wasi-libc isn't built with debug info so any symbols in libc also won't have dwarf for them. Finally you're also running into debugging limitations here where there's currently no great way to debug what spawned the thread and/or get a listing of all wasm threads.

@alexcrichton
Copy link
Member

I mean, who's the code owner for wasm-ld?

I'm personally under the impression it's Sam Clegg (not pinging here directly as it might be a bit out of context)

@whitequark
Copy link
Contributor Author

It looks like the bug here has to do with the protocol of spawning a thread as the second instruction in the thread probably shouldn't be the one faulting.

Yeah, it's left me quite confused. Also, previously I got a fault on a similarly non-memory-accessing instruction, it was on args->start_func() here which similarly didn't make sense (the backtrace pointed to call_indirect).

@alexcrichton
Copy link
Member

Ah ok yeah given that it's still in wasi-libc that makes sense that there's no filename/line number there either because you haven't made it into clang/lld yet. That being said that instruction should also not be faulting, so my hypothesis would be some sort of corruption/bug in the threads part of wasi-threads. I'm not sure how well tested the whole port is (with threads), so this isn't necessarily altogether surprising

@whitequark
Copy link
Contributor Author

I'm not sure how well tested the whole port is (with threads), so this isn't necessarily altogether surprising

By port here do you mean the wasi-libc port?

@whitequark
Copy link
Contributor Author

Ah ok yeah given that it's still in wasi-libc that makes sense that there's no filename/line number there either because you haven't made it into clang/lld yet.

Ah that explains it! Can we ship wasi-libc with debug symbols?

@alexcrichton
Copy link
Member

Ah sorry, but yes by port I mean the implementation in wasi-libc. The implementation of pthread-style APIs there I don't believe has seen a large amount of testing. WebAssembly/wasi-libc#405 is an example older issue where I think WebAssembly/wasi-libc#409 fixed some things but that thread mentions known open issues.

@alexcrichton
Copy link
Member

Er, I should keep reading before replaying, apologies!

Can we ship wasi-libc with debug symbols?

Personally I don't see any reason why not, they're easy enough to strip out and otherwise difficult to retrofit and very useful in situations like this. Changes to the builds of wasi-libc would go in wasi-sdk around the wasi-libc build rules, probably appending to CFLAGS for that invocation.

@alexcrichton
Copy link
Member

Actually if you run the exact example today from WebAssembly/wasi-libc#405 it is not fixed, and might be what's happening here.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Jun 3, 2024
This is more-or-less a prerequisite for bytecodealliance#8652 and extends the generated
dwarf with expressions to not only dereference owned memories but
additionally imported memories which involve some extra address
calculations to be emitted in the dwarf.
@alexcrichton
Copy link
Member

@whitequark btw if you hadn't seen already from WebAssembly/wasi-libc#405 this was enough for me to get the wasm-ld binary to run locally:

diff --git a/build.sh b/build.sh
index f99daf8..e659a9d 100755
--- a/build.sh
+++ b/build.sh
@@ -14,7 +14,7 @@ WASI_LDFLAGS="-O0" # "-flto -Wl,--strip-all"
 # LLVM doesn't build without <mutex>, etc, even with -DLLVM_ENABLE_THREADS=OFF.
 WASI_TARGET="${WASI_TARGET}-threads"
 WASI_CFLAGS="${WASI_CFLAGS} -pthread"
-WASI_LDFLAGS="${WASI_LDFLAGS} -Wl,--max-memory=4294967296"
+WASI_LDFLAGS="${WASI_LDFLAGS} -Wl,--max-memory=4294967296,--import-memory,--export-memory"
 # LLVM assumes the existence of mmap.
 WASI_CFLAGS="${WASI_CFLAGS} -D_WASI_EMULATED_MMAN"
 WASI_LDFLAGS="${WASI_LDFLAGS} -lwasi-emulated-mman"

@whitequark
Copy link
Contributor Author

btw if you hadn't seen already from WebAssembly/wasi-libc#405 this was enough for me to get the wasm-ld binary to run locally:

Ah that's interesting, I didn't realize that's the fix!

github-merge-queue bot pushed a commit that referenced this issue Jun 5, 2024
This is more-or-less a prerequisite for #8652 and extends the generated
dwarf with expressions to not only dereference owned memories but
additionally imported memories which involve some extra address
calculations to be emitted in the dwarf.
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Jun 5, 2024
This commit resolves an assert in the dwarf generating of core wasm
modules when the module has a defined linear memory which is flagged
`shared`. This is represented slightly differently in the `VMContext`
than owned memories that aren't `shared`, and looks more like an
imported memory. With support in bytecodealliance#8740 it's now much easier to support
this.

Closes bytecodealliance#8652
@alexcrichton
Copy link
Member

With #8750 the assert here is resolved. I'll note that our DWARF bits are pretty unoptimized so the final *.cwasm ends up 2G in size and takes quite some time (30+s) to compile. Once it's emitted though as-is with wasi-sdk I got:

(lldb) r
Process 1611674 launched: '/home/alex/code/wasmtime/target/release/wasmtime' (x86_64)
Process 1611674 stopped
* thread #35, name = 'wasi-thread-1', stop reason = signal SIGSEGV: invalid permissions for mapped object (fault address: 0x7ffa1036dc70)
    frame #0: 0x00007fff0137f4f2 JIT(0x7ffd90000010)`wasi_thread_start(var0=1, var1=3595376) at lld.wasm:39606101
(lldb) bt
* thread #35, name = 'wasi-thread-1', stop reason = signal SIGSEGV: invalid permissions for mapped object (fault address: 0x7ffa1036dc70)
  * frame #0: 0x00007fff0137f4f2 JIT(0x7ffd90000010)`wasi_thread_start(var0=1, var1=3595376) at lld.wasm:39606101
    frame #1: 0x00007fff0145f95e JIT(0x7ffd90000010)
    frame #2: 0x0000555556d559d5 wasmtime`wasmtime_setjmp_23_0_0 + 85
    frame #3: 0x00005555564ebf61 wasmtime`wasmtime::runtime::vm::traphandlers::catch_traps::h3210ee3123e12229 + 193
    frame #4: 0x0000555556436cc8 wasmtime`wasmtime::runtime::func::invoke_wasm_and_catch_traps::h6cd1b96ad0407a69 + 216
    frame #5: 0x00005555566055e5 wasmtime`wasmtime::runtime::func::typed::TypedFunc$LT$Params$C$Results$GT$::call::he7d0c04b09b99a3e + 309
    frame #6: 0x00005555564ef78d wasmtime`std::panicking::try::h85af750e37223f0d + 445
    frame #7: 0x0000555556654aa5 wasmtime`std::sys_common::backtrace::__rust_begin_short_backtrace::hdf15cda4de348ee3 + 117
    frame #8: 0x0000555556615720 wasmtime`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::he3448505790ef0a2 + 320
    frame #9: 0x00005555574e71b5 wasmtime`std::sys::pal::unix::thread::Thread::new::thread_start::h420dad5cf01a9f35 + 37
    frame #10: 0x00007ffff7c9ca94 libc.so.6`start_thread(arg=<unavailable>) at pthread_create.c:447:8
    frame #11: 0x00007ffff7d29c3c libc.so.6`__clone3 at clone3.S:78

also noting that it takes quite some time to also load the debug info into LLDB.

Using an updated wasi-sdk (from this CI build) I ended up getting the same thing. I think that's because this faulting function is in assembly which either isn't built with dwarf by default or doesn't get dwarf due to it being assembly.

github-merge-queue bot pushed a commit that referenced this issue Jun 6, 2024
This commit resolves an assert in the dwarf generating of core wasm
modules when the module has a defined linear memory which is flagged
`shared`. This is represented slightly differently in the `VMContext`
than owned memories that aren't `shared`, and looks more like an
imported memory. With support in #8740 it's now much easier to support
this.

Closes #8652
@whitequark
Copy link
Contributor Author

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior in the current implementation that needs fixing cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants