Cranelift: de-duplicate bounds checks in legalizations #5190

Merged: 7 commits merged into bytecodealliance:main from de-dupe-bounds-checks on Nov 15, 2022

Conversation

@fitzgen fitzgen (Member) commented Nov 3, 2022

When both (1) "dynamic" memories that need explicit bounds checks and (2)
spectre mitigations that perform bounds checks are enabled, reuse the same
bounds checks between the two legalizations.

This reduces the overhead of explicit bounds checks plus spectre mitigations, relative to
using virtual memory guard pages with spectre mitigations, from ~1.9-2.1x to ~1.6-1.8x.
That is roughly a 14-19% speedup when dynamic memories and spectre mitigations are enabled.
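
For intuition, here is a minimal sketch of the address computation before and after
de-duplication, written as plain Rust rather than the actual legalization code, with
hypothetical names, and assuming the out-of-bounds path is clamped to address 0 as
discussed in the review below:

```rust
/// Before: an explicit bounds-check trap *plus* a separate Spectre-guard select.
fn bounds_checked_addr_before(base: u64, heap_len: u64, index: u64, access_size: u64) -> u64 {
    let end = index.checked_add(access_size).expect("heap_oob"); // uadd_overflow_trap
    assert!(end <= heap_len, "heap_oob"); // explicit trap on out-of-bounds
    let addr = base.wrapping_add(index);
    // Spectre mitigation: a redundant comparison feeding a select that clamps
    // the address to 0 (an unmapped page) on the out-of-bounds path.
    if end > heap_len { 0 } else { addr } // select_spectre_guard
}

/// After (this PR): the Spectre select's comparison *is* the bounds check; an
/// out-of-bounds access yields address 0 and traps when actually loaded/stored.
fn bounds_checked_addr_after(base: u64, heap_len: u64, index: u64, access_size: u64) -> u64 {
    let end = index.checked_add(access_size).expect("heap_oob");
    let addr = base.wrapping_add(index);
    if end > heap_len { 0 } else { addr } // single, shared comparison
}
```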

virtual-memory-guards.so vs bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3422455129.47 ± 120159.49 (confidence = 99%)

  virtual-memory-guards.so is 2.09x to 2.09x faster than bounds-checks.so!

  [6563931659 6564063496.07 6564301535] bounds-checks.so
  [3141492675 3141608366.60 3141895249] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 338716136.87 ± 1.38 (confidence = 99%)

  virtual-memory-guards.so is 2.08x to 2.08x faster than bounds-checks.so!

  [651961494 651961495.47 651961497] bounds-checks.so
  [313245357 313245358.60 313245362] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 22742944.07 ± 331.73 (confidence = 99%)

  virtual-memory-guards.so is 1.87x to 1.87x faster than bounds-checks.so!

  [48841295 48841567.33 48842139] bounds-checks.so
  [26098439 26098623.27 26099479] virtual-memory-guards.so

virtual-memory-guards.so vs de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2465900207.27 ± 146476.61 (confidence = 99%)

  virtual-memory-guards.so is 1.78x to 1.78x faster than de-duped-bounds-checks.so!

  [5607275431 5607442989.13 5607838342] de-duped-bounds-checks.so
  [3141445345 3141542781.87 3141711213] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 234253620.20 ± 2.33 (confidence = 99%)

  virtual-memory-guards.so is 1.75x to 1.75x faster than de-duped-bounds-checks.so!

  [547498977 547498980.93 547498985] de-duped-bounds-checks.so
  [313245357 313245360.73 313245363] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 16605659.13 ± 315.78 (confidence = 99%)

  virtual-memory-guards.so is 1.64x to 1.64x faster than de-duped-bounds-checks.so!

  [42703971 42704284.40 42704787] de-duped-bounds-checks.so
  [26098432 26098625.27 26099234] virtual-memory-guards.so

bounds-checks.so vs de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 104462517.13 ± 7.32 (confidence = 99%)

  de-duped-bounds-checks.so is 1.19x to 1.19x faster than bounds-checks.so!

  [651961493 651961500.80 651961532] bounds-checks.so
  [547498981 547498983.67 547498989] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 956556982.80 ± 103034.59 (confidence = 99%)

  de-duped-bounds-checks.so is 1.17x to 1.17x faster than bounds-checks.so!

  [6563930590 6564019842.40 6564243651] bounds-checks.so
  [5607307146 5607462859.60 5607677763] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 6137307.87 ± 247.75 (confidence = 99%)

  de-duped-bounds-checks.so is 1.14x to 1.14x faster than bounds-checks.so!

  [48841303 48841472.93 48842000] bounds-checks.so
  [42703965 42704165.07 42704718] de-duped-bounds-checks.so

@fitzgen fitzgen requested a review from cfallin November 3, 2022 18:40
@cfallin cfallin (Member) left a comment

LGTM, thanks!

@@ -46,6 +47,8 @@ fn imm_const(pos: &mut FuncCursor, arg: Value, imm: Imm64, is_signed: bool) -> V
/// Perform a simple legalization by expansion of the function, without
/// platform-specific transforms.
pub fn simple_legalize(func: &mut ir::Function, cfg: &mut ControlFlowGraph, isa: &dyn TargetIsa) {
trace!("Pre-legalization function:\n{}", func.display());

@cfallin cfallin (Member) commented:

These trace!s seem a bit verbose to leave in -- big function bodies will be replicated two more times (in addition to the pre-opt, pre-lowering and lowered vcode forms). I don't feel too strongly about that, just a suggestion.

@fitzgen fitzgen (Member, Author) replied Nov 3, 2022

I'll remove the post-legalization logging (since you can see that as the input to the next pass) but I think that in general every pass should trace! log its input.

@github-actions bot added the cranelift label Nov 3, 2022
@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

Trevor and I investigated the failing test on aarch64 and this ultimately led us
to the conclusion that this patch shouldn't actually land.

Fundamentally, this PR changes the semantics of heap_addr from "calculate the
base address of the given memory access or trap if it is out of bounds" to
"calculate the base address of the given memory access or return an address
that will trap when accessed". And this semantic change only happens when
dynamic memories and spectre mitigations are enabled. In other configurations
heap_addr still has the old semantics.

This leads us into a bit of a mess:

  • If the result of an out of bounds heap_addr is never loaded from or stored
    to, then the trap never happens. Not even "eventually". This very much does
    not match heap_addr's documented semantics.

  • Piggybacking off the spectre mitigation, this PR uses 0 as the address that
    will trap when accessed. However, I'm not sure that this will always work in
    the face of large offsets? There is a bit of code emitted from
    cranelift-wasm that does a heap_addr to get a base address and then adds
    offsets to that base. Need more investigation here.

  • Also according to heap_addr's documented semantics today, it would be valid
    to mark a load as notrap when given the result of a heap_addr because if
    we got to the load then heap_addr already verified that the address is in
    bounds for us:

    v0 = heap_addr ...
    v1 = load.i32 little heap notrap v0
    

    The notrap flag would become incorrect under these semantic changes.

    That said, I think heap_addr does have roughly the "or return an address
    that will trap when accessed" semantics when we are using static memories? So
    maybe we actually can make this semantic change to heap_addr and then this
    would actually make it more consistent across dynamic and static memories? And
    then the notrap flag would be invalid here.

I'm not sure how to resolve these issues and still deduplicate the bounds checks
(without going full value range analysis). Given that last point, it might
actually be better to change heap_addr's semantics to "or return an address
that will trap when accessed". We would need to be careful to check that all
libcalls that take an address do explicit bounds checks on that address rather
than assuming that the JIT code already did one via its heap_addr. This is a
little bit scary since it means that valid assumptions code could have been
written under no longer hold. (Although I don't think we generally make this
assumption in the code base and usually we take wasm addresses rather than the
results of heap_addrs for libcalls; need to take a census to be more sure
about this).

Interested in folks' thoughts on this.


Separately, Trevor and I discovered the following issues (or not-necessarily-issues,
but things we should maybe tighten up):

  • validate_atomic_addr in libcalls.rs only validates a single byte address,
    not the full size of the access. This means that a bounds check could "succeed"
    when the first byte being accessed is the last byte of memory and the
    other bytes of the access are out of bounds. In practice we don't have this
    bug, because all callers call validate_atomic_addr with base + size_of_access,
    but this is a bit of a foot gun waiting to bite us. It would be nicer to pass
    the base and size_of_access into validate_atomic_addr, similar to the
    heap_addr instruction (see the sketch after this list).

  • The heap access spectre mitigation--at least for dynamic memories, I haven't
    looked at static memories yet--only checks the first byte of the access, not
    the last byte. This means that speculative execution could do an i64 load
    beginning at the last byte of memory and bring (the cache line for) seven out
    of bounds bytes into cache.

  • The heap_addr instruction should have .can_trap(true) in the meta crate,
    according to its documented semantics. It doesn't currently have
    this. However, that would also prevent heap_addrs from being GVN'd or
    LICM'd, which we definitely want to happen; we just want to make sure that any
    load of a heap_addr'd address is still heap_addr'd. Open to suggestions
    here.

    Although if we do change heap_addr's semantics to "or returns an address
    that will trap when accessed" we don't need to mark it can_trap(true) and
    this problem goes away.
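
To make the first bullet concrete, here is a minimal sketch of the suggested "validate the
whole access" shape; the function name, signature, and Trap type are hypothetical and not
the actual libcalls.rs API:

```rust
#[derive(Debug)]
enum Trap {
    HeapOutOfBounds,
}

/// Check that the whole range `addr .. addr + access_size` lies within the memory,
/// instead of only checking the first byte of the access.
fn validate_access(memory_len: u64, addr: u64, access_size: u64) -> Result<(), Trap> {
    match addr.checked_add(access_size) {
        // `checked_add` also rejects the `addr + access_size` overflow case.
        Some(end) if end <= memory_len => Ok(()),
        _ => Err(Trap::HeapOutOfBounds),
    }
}
```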

cc @cfallin @elliottt @jameysharp

@alexcrichton alexcrichton (Member) commented:

Fundamentally, this PR changes the semantics of heap_addr from "calculate the
base address of the given memory access or trap if it is out of bounds" to
"calculate the base address of the given memory access or return an address
that will trap when accessed".

I think it changes the semantics for dynamic heaps, but it doesn't change them for static heaps? In that sense all our libcalls should already be guarded against "this address is in bounds and should not segfault", I think?

@cfallin cfallin (Member) commented Nov 4, 2022

Thanks for digging into this; a few thoughts:

  • I think that, as you suggest, we are already relying on the trap happening when the address is accessed, in the static-memory case. It makes the most sense to me to update the documentation for heap_addr in that light.

  • I was somewhat concerned about this point:

    There is a bit of code emitted from
    cranelift-wasm that does a heap_addr to get a base address and then adds
    offsets to that base. Need more investigation here.

    and looked into it a bit; are you referring to the code in prepare_addr? If so I think that should generally be safe: its optimization (using heap_addr with the same offset of 1 for every load with an actual offset potentially larger) is triggered only when it knows that the heap is using a large-enough guard region.

    Arguably such an optimization should be a rewrite rule in Cranelift instead -- translate all heap_addrs on a heap-with-guard-region to use offset=0 so they can be GVN'd together -- but I don't think this will cause a correctness issue in the short term, unless I'm misunderstanding or missing some other code...

  • Re: this

    We would need to be careful to check that all
    libcalls that take an address do explicit bounds checks on that address rather
    than assuming that the JIT code already did one via its heap_addr. This is a
    little bit scary since it means that valid assumptions code could have been
    written under no longer hold.

    if the address will trap when executed, then we're fine safety-wise (this won't be a backdoor around the heap sandbox) but I agree it will cause problems in other ways -- specifically it will result in a SIGSEGV in the runtime rather than in guest code, which will (usually) kill the process. So I agree we should verify that libcalls take Wasm addresses.

    I think this is an issue today though, as (as you noted) the guard-region case already relies on "actual load/store traps, not heap_addr". (That increases the importance of doing the above verification, to be clear.)


Overall, I think I'm not as pessimistic on this: it basically changes heap_addr to work in the dynamic-memory case like it does in the static-memory case, and Wasmtime is already well-tested with its page-protection-based bounds checking. As long as the address-0 page is unmapped (as it is on all of our platforms), we are basically using it as a one-page-sized guard region.

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

  • are you referring to the code in prepare_addr? If so I think that should generally be safe: its optimization (using heap_addr with the same offset of 1 for every load with an actual offset potentially larger) is triggered only when it knows that the heap is using a large-enough guard region.

Yeah, and I agree it is safe, it just makes me a little nervous to produce CLIF that has heap_addrs that aren't immediately followed by the associated memory access. As you allude to, I would be more comfortable having this kind of thing done in the mid-end.

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

I guess the biggest unknown in my mind is how large of a NULL guard page region we can rely on? This PR bakes in the assumption (which the spectre guard already makes) that 0 + offset will always trap for any 32-bit Wasm offset immediate.

For example, given

(module
  (memory (export "memory") 1)
  (func (export "f") (param i32) (result i32)
    local.get 0
    i32.load offset=0xfffffff0))

We lower that Wasm to this clif:

v4 = heap_addr.i64 heap0, v2, 0x8000_0000
v5 = iadd_imm v4, 0xffff_fff0
v6 = load.i32 little heap v5

And then (with static memories, a smaller-than-4GiB guard region, and spectre guards) we legalize that clif to this clif:

;; check if the pointer is larger than the static memory size (including both
;; wasm memory and guard pages, I think?)
v7 = uextend.i64 v2
v8 = icmp_imm ugt v7, 0x8000_0000
trapnz v8, heap_oob
v9 = iconst.i64 0x8000_0000
;; gv4 is the memory base
v10 = global_value.i64 gv4
v11 = iadd v10, v7
v12 = iconst.i64 0
v13 = icmp ugt v7, v9  ; v9 = 0x8000_0000
v4 = select_spectre_guard v13, v12, v11  ; v12 = 0
v5 = iadd_imm v4, 0xffff_fff0
v6 = load.i32 little heap v5

Do we guarantee a 4GiB-sized NULL guard region? Because doing the spectre guard on the base (v11) rather than base plus offset (v5) makes that assumption. This doesn't seem like something we can rely on, especially across platforms?

This becomes even more problematic when we look at the legalization of that lowered Wasm's initial clif, this time configured with dynamic memories and spectre guards, and using this patch which removes the dynamic memory's bounds check in favor of the spectre guard's bounds check:

;; extend wasm i32 pointer to native i64 pointer
v7 = uextend.i64 v2
;; gv4 is the base of memory
v8 = global_value.i64 gv4
;; overflow check for pointer + offset
v9 = iconst.i64 0xffff_0000
v10 = uadd_overflow_trap v7, v9, heap_oob  ; v9 = 0xffff_0000
;; gv5 is the length of memory
v11 = global_value.i64 gv5
v12 = iadd v11, v7
v13 = iconst.i64 0
v14 = icmp ugt v10, v8
v4 = select_spectre_guard v14, v13, v12  ; v13 = 0
v5 = iadd_imm v4, 0xffff_fff0
v6 = load.i32 little heap v5

We're baking in that same assumption. Now we aren't just letting speculative execution access the hopefully-null-guard-page-protected region of memory in 0..4GiB, but regular Wasm execution as well.

I think we can fix all this ambiguity by effectively doing select_spectre_guard(base + offset) instead of select_spectre_guard(base) + offset?

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

I think we can fix all this ambiguity by effectively doing select_spectre_guard(base + offset) instead of select_spectre_guard(base) + offset?

This requires changes to heap_addr: it will need to be given not the accumulated access size (wasm offset + size_of(type)) but those individual components, so that it can return the native address of the translated wasm address, which we can then spectre-guard instead of the native base.
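
Roughly, the proposed shape, sketched as plain Rust with hypothetical names (the real
change is to the heap_addr CLIF instruction and its legalization, not a Rust function):

```rust
/// Given the separate `index` (dynamic operand) and `offset` (static immediate),
/// bounds-check the full access and return the fully translated native address,
/// so the Spectre guard covers `base + index + offset` rather than just `base + index`.
fn heap_addr(base: u64, heap_len: u64, index: u64, offset: u64, access_size: u64) -> u64 {
    let end = index
        .checked_add(offset)
        .and_then(|end| end.checked_add(access_size))
        .expect("heap_oob"); // corresponds to uadd_overflow_trap
    let addr = base.wrapping_add(index).wrapping_add(offset);
    // select_spectre_guard(base + index + offset): speculation can no longer
    // reach `base + 0..4GiB` through a large 32-bit offset immediate.
    if end > heap_len { 0 } else { addr }
}
```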

@bjorn3 bjorn3 (Contributor) commented Nov 4, 2022

I don't think we can guarantee a 4GB null guard. Memory may be mapped within this range, whether explicitly via mmap or because the executable has a fixed position within this range. Only 4096 bytes are guaranteed on Linux, as that is the default value of vm.mmap_min_addr.
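
For reference, the size of that guaranteed low unmapped region on a Linux host can be
read directly from procfs (a small, self-contained snippet; 4096 bytes is the common
default):

```rust
// Print the kernel's minimum mmap-able address; unprivileged code cannot map
// anything below it, so it is the only portable lower bound on a "NULL guard".
fn main() -> std::io::Result<()> {
    let raw = std::fs::read_to_string("/proc/sys/vm/mmap_min_addr")?;
    println!("vm.mmap_min_addr = {} bytes", raw.trim());
    Ok(())
}
```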

@cfallin cfallin (Member) commented Nov 4, 2022

If I understand the code correctly, I think we handle all of this already -- first, here we get offset_guard_size from the heap spec, and in the return-NULL-to-trap configuration we should have a guard size of 0 as with any dynamically-bounds-checked memory (or maybe it's a small nominal amount, I don't remember, but certainly not 4GiB). Then, here we include that offset on the heap_addr, and it will bounds-check appropriately taking the offset into consideration.

Or I guess all of this is a long way of saying: I don't think we actually lower into that CLIF if dynamic memory is configured. This is subtly different from

This becomes even more problematic when we look at the legalization of that lowered Wasm's initial clif, this time configured with dynamic memories and spectre guards

I think -- if I understand your methodology correctly, you took the CLIF generated by translation with static memories enabled, then compiled it with dynamic memories enabled? With the above-linked code I think doing the translation with dynamic memories enabled should result in code that does not assume a 4GiB guard at address 0.

@cfallin cfallin (Member) commented Nov 4, 2022

(And to answer the immediate question, no, we definitely can't assume a 4GiB guard region at address 0! That would be an un-portable mess, as you observe.)

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

I think -- if I understand your methodology correctly, you took the CLIF generated by translation with static memories enabled, then compiled it with dynamic memories enabled? With the above-linked code I think doing the translation with dynamic memories enabled should result in code that does not assume a 4GiB guard at address 0.

No, each different legalization example is generated from the original input Wasm with different flags passed to Wasmtime. I was not working with just the CLIF. Always using the original Wasm as input.

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

Then, here we include that offset on the heap_addr, and it will bounds-check appropriately taking the offset into consideration.

This still doesn't prevent speculative execution from reading 0..4GiB though, because the generated code ultimately does

spectre_guard(base) + offset

rather than

spectre_guard(base + offset)

@cfallin cfallin (Member) commented Nov 4, 2022

Ah, OK, I was misreading your concern, sorry. I had thought you were saying that the actual bounds-check (the non-speculative behavior that should result in a trap) was wrong, and was very confused because the result of the uadd_overflow_trap was fed into the icmp, so everything seemed to account for the offset.

But indeed, the select is picking the base prior to the offset addition. We should fix that by reordering the ops, as you suggest, and providing heap_addr with separate offset and size as you've noted.

Once we do that, are there any other issues with this approach?

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

I had thought you were saying that the actual bounds-check (the non-speculative behavior that should result in a trap) was wrong;

Today, actual execution is not affected by this.

After this PR as currently written, actual execution is affected by this when dynamic memories and spectre guards are both enabled.

But indeed, the select is picking the base prior to the offset addition. We should fix that by reordering the ops, as you suggest, and providing heap_addr with separate offset and size as you've noted.

Once we do that, are there any other issues with this approach?

Not that I am aware of.

Currently working on the heap_addr changes. Looks like this is maybe gonna touch more than I anticipated (just got all the errors from non-cranelift-codegen crates, such as the interpreter, out of rustc; we'll see how deep the rabbit hole goes).

@cfallin cfallin (Member) commented Nov 4, 2022

Today, actual execution is not affected by this.

After this PR as currently written, actual execution is affected by this when dynamic memories and spectre guards are both enabled.

Right, sorry, makes sense now; not sure why I missed that!

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

not sure why I missed that!

No worries, this stuff is pretty subtle, it has basically taken me all day just to write these comments and read through the code enough to understand what is happening here :-p

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Nov 4, 2022
This is a continuation of the thrust in bytecodealliance#5207 for reducing page faults
and lock contention when using the pooling allocator. To that end this
commit implements support for efficient memory management in the pooling
allocator when using wasm that is instrumented with bounds checks.

The `MemoryImageSlot` type now avoids unconditionally shrinking memory
back to its initial size during the `clear_and_remain_ready` operation,
instead deferring optional resizing of memory to the subsequent call to
`instantiate` when the slot is reused. The instantiation portion then
takes the "memory style" as an argument which dictates whether the
accessible memory must be precisely fit or whether it's allowed to
exceed the maximum. This in effect enables skipping a call to `mprotect`
to shrink the heap when dynamic memory checks are enabled.

In terms of page fault and contention this should improve the situation
by:

* Fewer calls to `mprotect` since once a heap grows it stays grown and
  it never shrinks. This means that a write lock is taken within the
  kernel much more rarely than before (only asymptotically now, not
  N-times-per-instance).

* Accessed memory after a heap growth operation will not fault if it was
  previously paged in by a prior instance and set to zero with `memset`.
  Unlike bytecodealliance#5207 which requires a 6.0 kernel to see this optimization this
  commit enables the optimization for any kernel.

The major cost of choosing this strategy is naturally the performance
hit of the wasm itself. This is being looked at in PRs such as bytecodealliance#5190 to
improve Wasmtime's story here.

This commit does not implement any new configuration options for
Wasmtime but instead reinterprets existing configuration options. The
pooling allocator no longer unconditionally sets
`static_memory_bound_is_maximum` and then implements support necessary
for this memory type. The other change in this commit is that the
`Tunables::static_memory_bound` configuration option no longer gates
the creation of a `MemoryPool`, and the pool will now appropriately size to
`instance_limits.memory_pages` if the `static_memory_bound` is too small.
This is done to accommodate fuzzing more easily, where the
`static_memory_bound` will become small during fuzzing and otherwise the
configuration would be rejected and require manual handling. The spirit
of the `MemoryPool` is one of large virtual address space reservations
anyway, so it seemed reasonable to interpret the configuration this way.
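
A rough model of that reuse strategy, in illustrative Rust only (Wasmtime's real
`MemoryImageSlot` has a different API and more state):

```rust
struct Slot {
    accessible: usize, // bytes currently mapped read/write in this pooled slot
}

enum MemoryStyle {
    /// Wasm code carries explicit bounds checks, so extra accessible bytes are harmless.
    Dynamic,
    /// Relies on guard pages, so the accessible region must fit the request exactly.
    Static,
}

impl Slot {
    /// Tear-down for reuse: zero the contents but do NOT shrink the mapping.
    fn clear_and_remain_ready(&mut self) {
        // memset(base, 0, self.accessible) in the real implementation; no mprotect here.
    }

    /// Reuse: only change protections if this instance's memory style requires it.
    fn instantiate(&mut self, requested: usize, style: MemoryStyle) {
        match style {
            // Already big enough and allowed to exceed the request: skip mprotect entirely.
            MemoryStyle::Dynamic if self.accessible >= requested => {}
            // Otherwise grow (or, for Static, shrink) the accessible region to `requested`.
            _ => self.accessible = requested,
        }
    }
}
```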
alexcrichton added a commit that referenced this pull request Nov 8, 2022
* Implement support for dynamic memories in the pooling allocator

* Skip zero memory_size cases

These are causing errors to happen when fuzzing and otherwise in theory
shouldn't be too interesting to optimize for anyway since they likely
aren't used in practice.
fitzgen added a commit to fitzgen/wasmtime that referenced this pull request Nov 8, 2022
Rather than return just the `base + index`.

(Note: I've chosen to use the nomenclature "index" for the dynamic operand and
"offset" for the static immediate.)

This moves the addition of the `offset` into `heap_addr`, instead of leaving it
for the subsequent memory operation, so that we can Spectre-guard the full
address, and not allow speculative execution to read the first 4GiB of memory.

Before this commit, we were effectively doing

    load(spectre_guard(base + index) + offset)

Now we are effectively doing

    load(spectre_guard(base + index + offset))

Finally, this also corrects `heap_addr`'s documented semantics to say that it
returns an address that will trap on access if `index + offset + access_size` is
out of bounds for the given heap, rather than saying that the `heap_addr` itself
will trap. This matches the implemented behavior for static memories, and after
bytecodealliance#5190 lands (which is blocked
on this commit) will also match the implemented behavior for dynamic memories.
fitzgen added a commit that referenced this pull request Nov 9, 2022
Cranelift: Make `heap_addr` return calculated `base + index + offset` (#5231)

* Cranelift: Make `heap_addr` return calculated `base + index + offset`

* Update heap_addr docs

* Factor out `offset + size` to a helper
@fitzgen fitzgen requested a review from cfallin November 10, 2022 16:59
@fitzgen fitzgen (Member, Author) commented Nov 10, 2022

@cfallin I've rebased this on top of main and the recent bounds-checking-related PRs I've landed; all tests are passing now and this should be good to land. Just wanted to flag you to see if you wanted to give it one last once-over.

When both (1) "dynamic" memories that need explicit bounds checks and (2)
spectre mitigations that perform bounds checks are enabled, reuse the same
bounds checks between the two legalizations.

This reduces the overhead of explicit bounds checks and spectre mitigations over
using virtual memory guard pages with spectre mitigations from ~1.9-2.1x
overhead to ~1.6-1.8x overhead. That is about a 14-19% speed up for when dynamic
memories and spectre mitigations are enabled.

<details>

```
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3422455129.47 ± 120159.49 (confidence = 99%)

  virtual-memory-guards.so is 2.09x to 2.09x faster than bounds-checks.so!

  [6563931659 6564063496.07 6564301535] bounds-checks.so
  [3141492675 3141608366.60 3141895249] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 338716136.87 ± 1.38 (confidence = 99%)

  virtual-memory-guards.so is 2.08x to 2.08x faster than bounds-checks.so!

  [651961494 651961495.47 651961497] bounds-checks.so
  [313245357 313245358.60 313245362] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 22742944.07 ± 331.73 (confidence = 99%)

  virtual-memory-guards.so is 1.87x to 1.87x faster than bounds-checks.so!

  [48841295 48841567.33 48842139] bounds-checks.so
  [26098439 26098623.27 26099479] virtual-memory-guards.so
```

</details>

<details>

```
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2465900207.27 ± 146476.61 (confidence = 99%)

  virtual-memory-guards.so is 1.78x to 1.78x faster than de-duped-bounds-checks.so!

  [5607275431 5607442989.13 5607838342] de-duped-bounds-checks.so
  [3141445345 3141542781.87 3141711213] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 234253620.20 ± 2.33 (confidence = 99%)

  virtual-memory-guards.so is 1.75x to 1.75x faster than de-duped-bounds-checks.so!

  [547498977 547498980.93 547498985] de-duped-bounds-checks.so
  [313245357 313245360.73 313245363] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 16605659.13 ± 315.78 (confidence = 99%)

  virtual-memory-guards.so is 1.64x to 1.64x faster than de-duped-bounds-checks.so!

  [42703971 42704284.40 42704787] de-duped-bounds-checks.so
  [26098432 26098625.27 26099234] virtual-memory-guards.so
```

</details>

<details>

```
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 104462517.13 ± 7.32 (confidence = 99%)

  de-duped-bounds-checks.so is 1.19x to 1.19x faster than bounds-checks.so!

  [651961493 651961500.80 651961532] bounds-checks.so
  [547498981 547498983.67 547498989] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 956556982.80 ± 103034.59 (confidence = 99%)

  de-duped-bounds-checks.so is 1.17x to 1.17x faster than bounds-checks.so!

  [6563930590 6564019842.40 6564243651] bounds-checks.so
  [5607307146 5607462859.60 5607677763] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 6137307.87 ± 247.75 (confidence = 99%)

  de-duped-bounds-checks.so is 1.14x to 1.14x faster than bounds-checks.so!

  [48841303 48841472.93 48842000] bounds-checks.so
  [42703965 42704165.07 42704718] de-duped-bounds-checks.so
```

</details>
@fitzgen fitzgen (Member, Author) commented Nov 11, 2022

Just realized @cfallin is going to be OoO for a few days, @jameysharp would you mind taking a look?

@fitzgen fitzgen requested review from jameysharp and removed request for cfallin November 11, 2022 00:22
@jameysharp jameysharp (Contributor) left a comment

Since #5231 is merged, sure, this makes enough sense to me to sign off on it.

It's unfortunate that the extensive comments in cranelift/filetests/filetests/isa/x64/heap.clif haven't been maintained since @abrown wrote them in #4841. If you have a chance to fix them up before merging this I think that'd be great, but I wouldn't block merging on that.

@fitzgen fitzgen merged commit c2a7ea7 into bytecodealliance:main Nov 15, 2022
@fitzgen fitzgen deleted the de-dupe-bounds-checks branch November 15, 2022 16:47
@fitzgen fitzgen (Member, Author) commented Nov 15, 2022

It's unfortunate that the extensive comments in cranelift/filetests/filetests/isa/x64/heap.clif haven't been maintained since @abrown wrote them in #4841. If you have a chance to fix them up before merging this I think that'd be great, but I wouldn't block merging on that.

I plan on fixing up a lot of heap_addr-related things; I'll add this to my list.

Labels: cranelift:wasm, cranelift