Cranelift: de-duplicate bounds checks in legalizations #5190

Merged: 7 commits merged into bytecodealliance:main from de-dupe-bounds-checks on Nov 15, 2022

Conversation

@fitzgen fitzgen (Member) commented Nov 3, 2022

When both (1) "dynamic" memories that need explicit bounds checks and (2)
spectre mitigations that perform bounds checks are enabled, reuse the same
bounds checks between the two legalizations.

This reduces the overhead of explicit bounds checks plus spectre mitigations, relative to
using virtual memory guard pages with spectre mitigations, from ~1.9-2.1x to ~1.6-1.8x.
That is roughly a 14-19% speedup when dynamic memories and spectre mitigations are enabled.
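
For intuition, here is a minimal sketch of the address computation before and after
de-duplication, written as plain Rust rather than the actual legalization code, with
hypothetical names, and assuming the out-of-bounds path is clamped to address 0 as
discussed in the review below:

```rust
/// Before: an explicit bounds-check trap *plus* a separate Spectre-guard select.
fn bounds_checked_addr_before(base: u64, heap_len: u64, index: u64, access_size: u64) -> u64 {
    let end = index.checked_add(access_size).expect("heap_oob"); // uadd_overflow_trap
    assert!(end <= heap_len, "heap_oob"); // explicit trap on out-of-bounds
    let addr = base.wrapping_add(index);
    // Spectre mitigation: a redundant comparison feeding a select that clamps
    // the address to 0 (an unmapped page) on the out-of-bounds path.
    if end > heap_len { 0 } else { addr } // select_spectre_guard
}

/// After (this PR): the Spectre select's comparison *is* the bounds check; an
/// out-of-bounds access yields address 0 and traps when actually loaded/stored.
fn bounds_checked_addr_after(base: u64, heap_len: u64, index: u64, access_size: u64) -> u64 {
    let end = index.checked_add(access_size).expect("heap_oob");
    let addr = base.wrapping_add(index);
    if end > heap_len { 0 } else { addr } // single, shared comparison
}
```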

virtual-memory-guards.so vs bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3422455129.47 ± 120159.49 (confidence = 99%)

  virtual-memory-guards.so is 2.09x to 2.09x faster than bounds-checks.so!

  [6563931659 6564063496.07 6564301535] bounds-checks.so
  [3141492675 3141608366.60 3141895249] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 338716136.87 ± 1.38 (confidence = 99%)

  virtual-memory-guards.so is 2.08x to 2.08x faster than bounds-checks.so!

  [651961494 651961495.47 651961497] bounds-checks.so
  [313245357 313245358.60 313245362] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 22742944.07 ± 331.73 (confidence = 99%)

  virtual-memory-guards.so is 1.87x to 1.87x faster than bounds-checks.so!

  [48841295 48841567.33 48842139] bounds-checks.so
  [26098439 26098623.27 26099479] virtual-memory-guards.so

virtual-memory-guards.so vs de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2465900207.27 ± 146476.61 (confidence = 99%)

  virtual-memory-guards.so is 1.78x to 1.78x faster than de-duped-bounds-checks.so!

  [5607275431 5607442989.13 5607838342] de-duped-bounds-checks.so
  [3141445345 3141542781.87 3141711213] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 234253620.20 ± 2.33 (confidence = 99%)

  virtual-memory-guards.so is 1.75x to 1.75x faster than de-duped-bounds-checks.so!

  [547498977 547498980.93 547498985] de-duped-bounds-checks.so
  [313245357 313245360.73 313245363] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 16605659.13 ± 315.78 (confidence = 99%)

  virtual-memory-guards.so is 1.64x to 1.64x faster than de-duped-bounds-checks.so!

  [42703971 42704284.40 42704787] de-duped-bounds-checks.so
  [26098432 26098625.27 26099234] virtual-memory-guards.so

bounds-checks.so vs de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 104462517.13 ± 7.32 (confidence = 99%)

  de-duped-bounds-checks.so is 1.19x to 1.19x faster than bounds-checks.so!

  [651961493 651961500.80 651961532] bounds-checks.so
  [547498981 547498983.67 547498989] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 956556982.80 ± 103034.59 (confidence = 99%)

  de-duped-bounds-checks.so is 1.17x to 1.17x faster than bounds-checks.so!

  [6563930590 6564019842.40 6564243651] bounds-checks.so
  [5607307146 5607462859.60 5607677763] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 6137307.87 ± 247.75 (confidence = 99%)

  de-duped-bounds-checks.so is 1.14x to 1.14x faster than bounds-checks.so!

  [48841303 48841472.93 48842000] bounds-checks.so
  [42703965 42704165.07 42704718] de-duped-bounds-checks.so

@fitzgen fitzgen requested a review from cfallin November 3, 2022 18:40
@cfallin cfallin (Member) left a comment

LGTM, thanks!

@@ -46,6 +47,8 @@ fn imm_const(pos: &mut FuncCursor, arg: Value, imm: Imm64, is_signed: bool) -> V
/// Perform a simple legalization by expansion of the function, without
/// platform-specific transforms.
pub fn simple_legalize(func: &mut ir::Function, cfg: &mut ControlFlowGraph, isa: &dyn TargetIsa) {
trace!("Pre-legalization function:\n{}", func.display());

@cfallin cfallin (Member) commented:

These trace!s seem a bit verbose to leave in -- big function bodies will be replicated two more times (in addition to the pre-opt, pre-lowering and lowered vcode forms). I don't feel too strongly about that, just a suggestion.

@fitzgen fitzgen (Member, Author) replied Nov 3, 2022

I'll remove the post-legalization logging (since you can see that as the input to the next pass) but I think that in general every pass should trace! log its input.

@github-actions bot added the cranelift label Nov 3, 2022
@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

Trevor and I investigated the failing test on aarch64 and this ultimately led us
to the conclusion that this patch shouldn't actually land.

Fundamentally, this PR changes the semantics of heap_addr from "calculate the
base address of the given memory access or trap if it is out of bounds" to
"calculate the base address of the given memory access or return an address
that will trap when accessed". And this semantic change only happens when
dynamic memories and spectre mitigations are enabled. In other configurations
heap_addr still has the old semantics.

This leads us into a bit of a mess:

  • If the result of an out of bounds heap_addr is never loaded from or stored
    to, then the trap never happens. Not even "eventually". This very much does
    not match heap_addr's documented semantics.

  • Piggybacking off the spectre mitigation, this PR uses 0 as the address that
    will trap when accessed. However, I'm not sure that this will always work in
    the face of large offsets? There is a bit of code emitted from
    cranelift-wasm that does a heap_addr to get a base address and then adds
    offsets to that base. Need more investigation here.

  • Also according to heap_addr's documented semantics today, it would be valid
    to mark a load as notrap when given the result of a heap_addr because if
    we got to the load then heap_addr already verified that the address is in
    bounds for us:

    v0 = heap_addr ...
    v1 = load.i32 little heap notrap v0
    

    The notrap flag would become incorrect under these semantic changes.

    That said, I think heap_addr does have roughly the "or return an address
    that will trap when accessed" semantics when we are using static memories? So
    maybe we actually can make this semantic change to heap_addr and then this
    would actually make it more consistent across dynamic and static memories? And
    then the notrap flag would be invalid here.

I'm not sure how to resolve these issues and still deduplicate the bounds checks
(without going full value range analysis). Given that last point, it might
actually be better to change heap_addr's semantics to "or return an address
that will trap when accessed". We would need to be careful to check that all
libcalls that take an address do explicit bounds checks on that address rather
than assuming that the JIT code already did one via its heap_addr. This is a
little bit scary since it means that valid assumptions code could have been
written under no longer hold. (Although I don't think we generally make this
assumption in the code base and usually we take wasm addresses rather than the
results of heap_addrs for libcalls; need to take a census to be more sure
about this).

Interested in folks' thoughts on this.


Separately, Trevor and I discovered the following issues (or not-necessarily-issues,
but things we should maybe tighten up):

  • validate_atomic_addr in libcalls.rs only validates a single byte address,
    not the full size of the access. This means that a bounds check could "succeed"
    when the first byte being accessed is the last byte of memory and the
    other bytes of the access are out of bounds. In practice we don't have this
    bug, because all callers call validate_atomic_addr with base + size_of_access,
    but this is a bit of a foot gun waiting to bite us. It would be nicer to pass
    the base and size_of_access into validate_atomic_addr, similar to the
    heap_addr instruction (see the sketch after this list).

  • The heap access spectre mitigation--at least for dynamic memories, I haven't
    looked at static memories yet--only checks the first byte of the access, not
    the last byte. This means that speculative execution could do an i64 load
    beginning at the last byte of memory and bring (the cache line for) seven out
    of bounds bytes into cache.

  • The heap_addr instruction should have .can_trap(true) in the meta crate,
    according to its documented semantics. It doesn't currently have
    this. However, that would also prevent heap_addrs from being GVN'd or
    LICM'd, which we definitely want to happen; we just want to make sure that any
    load of a heap_addr'd address is still heap_addr'd. Open to suggestions
    here.

    Although if we do change heap_addr's semantics to "or returns an address
    that will trap when accessed" we don't need to mark it can_trap(true) and
    this problem goes away.
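
To make the first bullet concrete, here is a minimal sketch of the suggested "validate the
whole access" shape; the function name, signature, and Trap type are hypothetical and not
the actual libcalls.rs API:

```rust
#[derive(Debug)]
enum Trap {
    HeapOutOfBounds,
}

/// Check that the whole range `addr .. addr + access_size` lies within the memory,
/// instead of only checking the first byte of the access.
fn validate_access(memory_len: u64, addr: u64, access_size: u64) -> Result<(), Trap> {
    match addr.checked_add(access_size) {
        // `checked_add` also rejects the `addr + access_size` overflow case.
        Some(end) if end <= memory_len => Ok(()),
        _ => Err(Trap::HeapOutOfBounds),
    }
}
```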

cc @cfallin @elliottt @jameysharp

@alexcrichton alexcrichton (Member) commented:

Fundamentally, this PR changes the semantics of heap_addr from "calculate the
base address of the given memory access or trap if it is out of bounds" to
"calculate the base address of the given memory access or return an address
that will trap when accessed".

I think it changes the semantics for dynamic heaps, but it doesn't change them for static heaps? In that sense all our libcalls should already be guarded against "this address is in bounds and should not segfault", I think?

@cfallin cfallin (Member) commented Nov 4, 2022

Thanks for digging into this; a few thoughts:

  • I think that, as you suggest, we are already relying on the trap happening when the address is accessed, in the static-memory case. It makes the most sense to me to update the documentation for heap_addr in that light.

  • I was somewhat concerned about this point:

    There is a bit of code emitted from
    cranelift-wasm that does a heap_addr to get a base address and then adds
    offsets to that base. Need more investigation here.

    and looked into it a bit; are you referring to the code in prepare_addr? If so I think that should generally be safe: its optimization (using heap_addr with the same offset of 1 for every load with an actual offset potentially larger) is triggered only when it knows that the heap is using a large-enough guard region.

    Arguably such an optimization should be a rewrite rule in Cranelift instead -- translate all heap_addrs on a heap-with-guard-region to use offset=0 so they can be GVN'd together -- but I don't think this will cause a correctness issue in the short term, unless I'm misunderstanding or missing some other code...

  • Re: this

    We would need to be careful to check that all
    libcalls that take an address do explicit bounds checks on that address rather
    than assuming that the JIT code already did one via its heap_addr. This is a
    little bit scary since it means that valid assumptions code could have been
    written under no longer hold.

    if the address will trap when executed, then we're fine safety-wise (this won't be a backdoor around the heap sandbox) but I agree it will cause problems in other ways -- specifically it will result in a SIGSEGV in the runtime rather than in guest code, which will (usually) kill the process. So I agree we should verify that libcalls take Wasm addresses.

    I think this is an issue today though, as (as you noted) the guard-region case already relies on "actual load/store traps, not heap_addr". (That increases the importance of doing the above verification, to be clear.)


Overall, I think I'm not as pessimistic on this: it basically changes heap_addr to work in the dynamic-memory case like it does in the static-memory case, and Wasmtime is already well-tested with its page-protection-based bounds checking. As long as the address-0 page is unmapped (as it is on all of our platforms), we are basically using it as a one-page-sized guard region.

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

  • are you referring to the code in prepare_addr? If so I think that should generally be safe: its optimization (using heap_addr with the same offset of 1 for every load with an actual offset potentially larger) is triggered only when it knows that the heap is using a large-enough guard region.

Yeah, and I agree it is safe, it just makes me a little nervous to produce CLIF that has heap_addrs that aren't immediately followed by the associated memory access. As you allude to, I would be more comfortable having this kind of thing done in the mid-end.

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

I guess the biggest unknown in my mind is how large of a NULL guard page region we can rely on? This PR bakes in the assumption (which the spectre guard already makes) that 0 + offset will always trap for any 32-bit Wasm offset immediate.

For example, given

(module
  (memory (export "memory") 1)
  (func (export "f") (param i32) (result i32)
    local.get 0
    i32.load offset=0xfffffff0))

We lower that Wasm to this clif:

v4 = heap_addr.i64 heap0, v2, 0x8000_0000
v5 = iadd_imm v4, 0xffff_fff0
v6 = load.i32 little heap v5

And then (with static memories, a smaller-than-4GiB guard region, and spectre guards) we legalize that clif to this clif:

;; check if the pointer is larger than the static memory size (including both
;; wasm memory and guard pages, I think?)
v7 = uextend.i64 v2
v8 = icmp_imm ugt v7, 0x8000_0000
trapnz v8, heap_oob
v9 = iconst.i64 0x8000_0000
;; gv4 is the memory base
v10 = global_value.i64 gv4
v11 = iadd v10, v7
v12 = iconst.i64 0
v13 = icmp ugt v7, v9  ; v9 = 0x8000_0000
v4 = select_spectre_guard v13, v12, v11  ; v12 = 0
v5 = iadd_imm v4, 0xffff_fff0
v6 = load.i32 little heap v5

Do we guarantee a 4GiB-sized NULL guard region? Because doing the spectre guard on the base (v11) rather than base plus offset (v5) makes that assumption. This doesn't seem like something we can rely on, especially across platforms?

This becomes even more problematic when we look at the legalization of that lowered Wasm's initial clif, this time configured with dynamic memories and spectre guards, and using this patch which removes the dynamic memory's bounds check in favor of the spectre guard's bounds check:

;; extend wasm i32 pointer to native i64 pointer
v7 = uextend.i64 v2
;; gv4 is the base of memory
v8 = global_value.i64 gv4
;; overflow check for pointer + offset
v9 = iconst.i64 0xffff_0000
v10 = uadd_overflow_trap v7, v9, heap_oob  ; v9 = 0xffff_0000
;; gv5 is the length of memory
v11 = global_value.i64 gv5
v12 = iadd v11, v7
v13 = iconst.i64 0
v14 = icmp ugt v10, v8
v4 = select_spectre_guard v14, v13, v12  ; v13 = 0
v5 = iadd_imm v4, 0xffff_fff0
v6 = load.i32 little heap v5

We're baking in that same assumption. Now we aren't just letting speculative execution access the hopefully-null-guard-page-protected region of memory in 0..4GiB, but regular Wasm execution as well.

I think we can fix all this ambiguity by effectively doing select_spectre_guard(base + offset) instead of select_spectre_guard(base) + offset?

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

I think we can fix all this ambiguity by effectively doing select_spectre_guard(base + offset) instead of select_spectre_guard(base) + offset?

This requires changes to heap_addr: it will need to be given not the accumulated access size (wasm offset + size_of(type)) but those individual components, so that it can return the native address of the translated wasm address, which we can then spectre-guard instead of the native base.
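
Roughly, the proposed shape, sketched as plain Rust with hypothetical names (the real
change is to the heap_addr CLIF instruction and its legalization, not a Rust function):

```rust
/// Given the separate `index` (dynamic operand) and `offset` (static immediate),
/// bounds-check the full access and return the fully translated native address,
/// so the Spectre guard covers `base + index + offset` rather than just `base + index`.
fn heap_addr(base: u64, heap_len: u64, index: u64, offset: u64, access_size: u64) -> u64 {
    let end = index
        .checked_add(offset)
        .and_then(|end| end.checked_add(access_size))
        .expect("heap_oob"); // corresponds to uadd_overflow_trap
    let addr = base.wrapping_add(index).wrapping_add(offset);
    // select_spectre_guard(base + index + offset): speculation can no longer
    // reach `base + 0..4GiB` through a large 32-bit offset immediate.
    if end > heap_len { 0 } else { addr }
}
```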

@bjorn3 bjorn3 (Contributor) commented Nov 4, 2022

I don't think we can guarantee a 4GB null guard. Memory may be mapped within this range, whether explicitly via mmap or because the executable has a fixed position within this range. Only 4096 bytes are guaranteed on Linux, as that is the default value of vm.mmap_min_addr.
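
For reference, the size of that guaranteed low unmapped region on a Linux host can be
read directly from procfs (a small, self-contained snippet; 4096 bytes is the common
default):

```rust
// Print the kernel's minimum mmap-able address; unprivileged code cannot map
// anything below it, so it is the only portable lower bound on a "NULL guard".
fn main() -> std::io::Result<()> {
    let raw = std::fs::read_to_string("/proc/sys/vm/mmap_min_addr")?;
    println!("vm.mmap_min_addr = {} bytes", raw.trim());
    Ok(())
}
```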

@cfallin cfallin (Member) commented Nov 4, 2022

If I understand the code correctly, I think we handle all of this already -- first, here we get offset_guard_size from the heap spec, and in the return-NULL-to-trap configuration we should have a guard size of 0 as with any dynamically-bounds-checked memory (or maybe it's a small nominal amount, I don't remember, but certainly not 4GiB). Then, here we include that offset on the heap_addr, and it will bounds-check appropriately taking the offset into consideration.

Or I guess all of this is a long way of saying: I don't think we actually lower into that CLIF if dynamic memory is configured. This is subtly different from

This becomes even more problematic when we look at the legalization of that lowered Wasm's initial clif, this time configured with dynamic memories and spectre guards

I think -- if I understand your methodology correctly, you took the CLIF generated by translation with static memories enabled, then compiled it with dynamic memories enabled? With the above-linked code I think doing the translation with dynamic memories enabled should result in code that does not assume a 4GiB guard at address 0.

@cfallin cfallin (Member) commented Nov 4, 2022

(And to answer the immediate question, no, we definitely can't assume a 4GiB guard region at address 0! That would be an un-portable mess, as you observe.)

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

I think -- if I understand your methodology correctly, you took the CLIF generated by translation with static memories enabled, then compiled it with dynamic memories enabled? With the above-linked code I think doing the translation with dynamic memories enabled should result in code that does not assume a 4GiB guard at address 0.

No, each different legalization example is generated from the original input Wasm with different flags passed to Wasmtime. I was not working with just the CLIF. Always using the original Wasm as input.

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

Then, here we include that offset on the heap_addr, and it will bounds-check appropriately taking the offset into consideration.

This still doesn't prevent speculative execution from reading 0..4GiB though, because the generated code ultimately does

spectre_guard(base) + offset

rather than

spectre_guard(base + offset)

@cfallin cfallin (Member) commented Nov 4, 2022

Ah, OK, I was misreading your concern, sorry. I had thought you were saying that the actual bounds-check (the non-speculative behavior that should result in a trap) was wrong, and was very confused because the result of the uadd_overflow_trap was fed into the icmp, so everything seemed to account for the offset.

But indeed, the select is picking the base prior to the offset addition. We should fix that by reordering the ops, as you suggest, and providing heap_addr with separate offset and size as you've noted.

Once we do that, are there any other issues with this approach?

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

I had thought you were saying that the actual bounds-check (the non-speculative behavior that should result in a trap) was wrong;

Today, actual execution is not affected by this.

After this PR as currently written, actual execution is affected by this when dynamic memories and spectre guards are both enabled.

But indeed, the select is picking the base prior to the offset addition. We should fix that by reordering the ops, as you suggest, and providing heap_addr with separate offset and size as you've noted.

Once we do that, are there any other issues with this approach?

Not that I am aware of.

Currently working on the heap_addr changes. Looks like this is maybe gonna touch more than I anticipated (just got all the errors from non-cranelift-codegen crates, such as the interpreter, out of rustc; we'll see how deep the rabbit hole goes).

@cfallin cfallin (Member) commented Nov 4, 2022

Today, actual execution is not affected by this.

After this PR as currently written, actual execution is affected by this when dynamic memories and spectre guards are both enabled.

Right, sorry, makes sense now; not sure why I missed that!

@fitzgen fitzgen (Member, Author) commented Nov 4, 2022

not sure why I missed that!

No worries, this stuff is pretty subtle, it has basically taken me all day just to write these comments and read through the code enough to understand what is happening here :-p

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Nov 4, 2022
This is a continuation of the thrust in bytecodealliance#5207 for reducing page faults
and lock contention when using the pooling allocator. To that end this
commit implements support for efficient memory management in the pooling
allocator when using wasm that is instrumented with bounds checks.

The `MemoryImageSlot` type now avoids unconditionally shrinking memory
back to its initial size during the `clear_and_remain_ready` operation,
instead deferring optional resizing of memory to the subsequent call to
`instantiate` when the slot is reused. The instantiation portion then
takes the "memory style" as an argument which dictates whether the
accessible memory must be precisely fit or whether it's allowed to
exceed the maximum. This in effect enables skipping a call to `mprotect`
to shrink the heap when dynamic memory checks are enabled.

In terms of page fault and contention this should improve the situation
by:

* Fewer calls to `mprotect` since once a heap grows it stays grown and
  it never shrinks. This means that a write lock is taken within the
  kernel much more rarely than before (only asymptotically now, not
  N-times-per-instance).

* Accessed memory after a heap growth operation will not fault if it was
  previously paged in by a prior instance and set to zero with `memset`.
  Unlike bytecodealliance#5207 which requires a 6.0 kernel to see this optimization this
  commit enables the optimization for any kernel.

The major cost of choosing this strategy is naturally the performance
hit of the wasm itself. This is being looked at in PRs such as bytecodealliance#5190 to
improve Wasmtime's story here.

This commit does not implement any new configuration options for
Wasmtime but instead reinterprets existing configuration options. The
pooling allocator no longer unconditionally sets
`static_memory_bound_is_maximum` and then implements support necessary
for this memory type. The other change in this commit is that the
`Tunables::static_memory_bound` configuration option no longer gates
the creation of a `MemoryPool`, and the pool will now appropriately size to
`instance_limits.memory_pages` if the `static_memory_bound` is too small.
This is done to accommodate fuzzing more easily, where the
`static_memory_bound` will become small during fuzzing and otherwise the
configuration would be rejected and require manual handling. The spirit
of the `MemoryPool` is one of large virtual address space reservations
anyway, so it seemed reasonable to interpret the configuration this way.
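
A rough model of that reuse strategy, in illustrative Rust only (Wasmtime's real
`MemoryImageSlot` has a different API and more state):

```rust
struct Slot {
    accessible: usize, // bytes currently mapped read/write in this pooled slot
}

enum MemoryStyle {
    /// Wasm code carries explicit bounds checks, so extra accessible bytes are harmless.
    Dynamic,
    /// Relies on guard pages, so the accessible region must fit the request exactly.
    Static,
}

impl Slot {
    /// Tear-down for reuse: zero the contents but do NOT shrink the mapping.
    fn clear_and_remain_ready(&mut self) {
        // memset(base, 0, self.accessible) in the real implementation; no mprotect here.
    }

    /// Reuse: only change protections if this instance's memory style requires it.
    fn instantiate(&mut self, requested: usize, style: MemoryStyle) {
        match style {
            // Already big enough and allowed to exceed the request: skip mprotect entirely.
            MemoryStyle::Dynamic if self.accessible >= requested => {}
            // Otherwise grow (or, for Static, shrink) the accessible region to `requested`.
            _ => self.accessible = requested,
        }
    }
}
```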
alexcrichton added a commit that referenced this pull request Nov 8, 2022
* Implement support for dynamic memories in the pooling allocator

* Skip zero memory_size cases

These are causing errors to happen when fuzzing and otherwise in theory
shouldn't be too interesting to optimize for anyway since they likely
aren't used in practice.
fitzgen added a commit to fitzgen/wasmtime that referenced this pull request Nov 8, 2022
Rather than return just the `base + index`.

(Note: I've chosen to use the nomenclature "index" for the dynamic operand and
"offset" for the static immediate.)

This moves the addition of the `offset` into `heap_addr`, instead of leaving it
for the subsequent memory operation, so that we can Spectre-guard the full
address, and not allow speculative execution to read the first 4GiB of memory.

Before this commit, we were effectively doing

    load(spectre_guard(base + index) + offset)

Now we are effectively doing

    load(spectre_guard(base + index + offset))

Finally, this also corrects `heap_addr`'s documented semantics to say that it
returns an address that will trap on access if `index + offset + access_size` is
out of bounds for the given heap, rather than saying that the `heap_addr` itself
will trap. This matches the implemented behavior for static memories, and after
bytecodealliance#5190 lands (which is blocked
on this commit) will also match the implemented behavior for dynamic memories.
fitzgen added a commit that referenced this pull request Nov 9, 2022
Cranelift: Make `heap_addr` return calculated `base + index + offset` (#5231)

* Cranelift: Make `heap_addr` return calculated `base + index + offset`

* Update heap_addr docs

* Factor out `offset + size` to a helper
@fitzgen fitzgen requested a review from cfallin November 10, 2022 16:59
@fitzgen fitzgen (Member, Author) commented Nov 10, 2022

@cfallin I've rebased this on top of main and the recent bounds-checking-related PRs I've landed; all tests are passing now and this should be good to land. Just wanted to flag you to see if you wanted to give it one last once-over.

When both (1) "dynamic" memories that need explicit bounds checks and (2)
spectre mitigations that perform bounds checks are enabled, reuse the same
bounds checks between the two legalizations.

This reduces the overhead of explicit bounds checks and spectre mitigations over
using virtual memory guard pages with spectre mitigations from ~1.9-2.1x
overhead to ~1.6-1.8x overhead. That is about a 14-19% speed up for when dynamic
memories and spectre mitigations are enabled.

<details>

```
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3422455129.47 ± 120159.49 (confidence = 99%)

  virtual-memory-guards.so is 2.09x to 2.09x faster than bounds-checks.so!

  [6563931659 6564063496.07 6564301535] bounds-checks.so
  [3141492675 3141608366.60 3141895249] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 338716136.87 ± 1.38 (confidence = 99%)

  virtual-memory-guards.so is 2.08x to 2.08x faster than bounds-checks.so!

  [651961494 651961495.47 651961497] bounds-checks.so
  [313245357 313245358.60 313245362] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 22742944.07 ± 331.73 (confidence = 99%)

  virtual-memory-guards.so is 1.87x to 1.87x faster than bounds-checks.so!

  [48841295 48841567.33 48842139] bounds-checks.so
  [26098439 26098623.27 26099479] virtual-memory-guards.so
```

</details>

<details>

```
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2465900207.27 ± 146476.61 (confidence = 99%)

  virtual-memory-guards.so is 1.78x to 1.78x faster than de-duped-bounds-checks.so!

  [5607275431 5607442989.13 5607838342] de-duped-bounds-checks.so
  [3141445345 3141542781.87 3141711213] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 234253620.20 ± 2.33 (confidence = 99%)

  virtual-memory-guards.so is 1.75x to 1.75x faster than de-duped-bounds-checks.so!

  [547498977 547498980.93 547498985] de-duped-bounds-checks.so
  [313245357 313245360.73 313245363] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 16605659.13 ± 315.78 (confidence = 99%)

  virtual-memory-guards.so is 1.64x to 1.64x faster than de-duped-bounds-checks.so!

  [42703971 42704284.40 42704787] de-duped-bounds-checks.so
  [26098432 26098625.27 26099234] virtual-memory-guards.so
```

</details>

<details>

```
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 104462517.13 ± 7.32 (confidence = 99%)

  de-duped-bounds-checks.so is 1.19x to 1.19x faster than bounds-checks.so!

  [651961493 651961500.80 651961532] bounds-checks.so
  [547498981 547498983.67 547498989] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 956556982.80 ± 103034.59 (confidence = 99%)

  de-duped-bounds-checks.so is 1.17x to 1.17x faster than bounds-checks.so!

  [6563930590 6564019842.40 6564243651] bounds-checks.so
  [5607307146 5607462859.60 5607677763] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 6137307.87 ± 247.75 (confidence = 99%)

  de-duped-bounds-checks.so is 1.14x to 1.14x faster than bounds-checks.so!

  [48841303 48841472.93 48842000] bounds-checks.so
  [42703965 42704165.07 42704718] de-duped-bounds-checks.so
```

</details>
@fitzgen fitzgen (Member, Author) commented Nov 11, 2022

Just realized @cfallin is going to be OoO for a few days, @jameysharp would you mind taking a look?

@fitzgen fitzgen requested review from jameysharp and removed request for cfallin November 11, 2022 00:22
@jameysharp jameysharp (Contributor) left a comment

Since #5231 is merged, sure, this makes enough sense to me to sign off on it.

It's unfortunate that the extensive comments in cranelift/filetests/filetests/isa/x64/heap.clif haven't been maintained since @abrown wrote them in #4841. If you have a chance to fix them up before merging this I think that'd be great, but I wouldn't block merging on that.

@fitzgen fitzgen merged commit c2a7ea7 into bytecodealliance:main Nov 15, 2022
@fitzgen fitzgen deleted the de-dupe-bounds-checks branch November 15, 2022 16:47
@fitzgen fitzgen (Member, Author) commented Nov 15, 2022

It's unfortunate that the extensive comments in cranelift/filetests/filetests/isa/x64/heap.clif haven't been maintained since @abrown wrote them in #4841. If you have a chance to fix them up before merging this I think that'd be great, but I wouldn't block merging on that.

I plan on fixing up a lot of heap_addr-related things; I'll add this to my list.

Labels: cranelift:wasm, cranelift