Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compiler-builtins: Int trait functions are not inlined on wasm #73135

Closed
CryZe opened this issue Jun 8, 2020 · 5 comments · Fixed by #73136
Closed

compiler-builtins: Int trait functions are not inlined on wasm #73135

CryZe opened this issue Jun 8, 2020 · 5 comments · Fixed by #73136
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. O-wasm Target: WASM (WebAssembly), http://webassembly.org/ P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another.

Comments

@CryZe
Copy link
Contributor

CryZe commented Jun 8, 2020

I'm currently doing some benchmarks and I've seen that __multi3 is taking a really long time to calculate on wasm. Turns out that none of the helper functions it uses are inlined at all, so it does a ton of unnecessary calls and misses a lot of potential optimizations:

(func $compiler_builtins::int::mul::__multi3::h134a3ac9dcea74e0 (type $t39) (param $p0 i32) (param $p1 i64) (param $p2 i64) (param $p3 i64) (param $p4 i64)
    (local $l5 i32) (local $l6 i64) (local $l7 i64) (local $l8 i64)
    global.get $g0
    i32.const 16
    i32.sub
    local.tee $l5
    global.set $g0
    local.get $p1
    local.get $p2
    call $<i128_as_compiler_builtins::int::LargeInt>::low::h654cf3febd0eecf8
    i64.const 4294967295
    i64.and
    local.get $p3
    local.get $p4
    call $<i128_as_compiler_builtins::int::LargeInt>::low::h654cf3febd0eecf8
    i64.const 4294967295
    i64.and
    call $<i64_as_compiler_builtins::int::Int>::wrapping_mul::h558368f41dd93d09
    local.set $l6
    local.get $p1
    local.get $p2
    call $<i128_as_compiler_builtins::int::LargeInt>::low::h654cf3febd0eecf8
    i64.const 32
    i64.shr_u
    local.get $p3
    local.get $p4
    call $<i128_as_compiler_builtins::int::LargeInt>::low::h654cf3febd0eecf8
    i64.const 4294967295
    i64.and
    call $<i64_as_compiler_builtins::int::Int>::wrapping_mul::h558368f41dd93d09
    local.get $l6
    i64.const 32
    i64.shr_u
    i64.add
    local.tee $l7
    i64.const 32
    i64.shr_u
    call $<i64_as_compiler_builtins::int::Int>::from_unsigned::h7efafe9ef60e0310
    local.set $l8
    local.get $l5
    local.get $p3
    local.get $p4
    call $<i128_as_compiler_builtins::int::LargeInt>::low::h654cf3febd0eecf8
    i64.const 32
    i64.shr_u
    local.get $p1
    local.get $p2
    call $<i128_as_compiler_builtins::int::LargeInt>::low::h654cf3febd0eecf8
    i64.const 4294967295
    i64.and
    call $<i64_as_compiler_builtins::int::Int>::wrapping_mul::h558368f41dd93d09
    local.get $l7
    i64.const 4294967295
    i64.and
    i64.add
    local.tee $l7
    i64.const 32
    i64.shl
    local.get $l6
    i64.const 4294967295
    i64.and
    i64.or
    local.get $l8
    local.get $l7
    i64.const 32
    i64.shr_u
    call $<i64_as_compiler_builtins::int::Int>::from_unsigned::h7efafe9ef60e0310
    i64.add
    local.get $p1
    local.get $p2
    call $<i128_as_compiler_builtins::int::LargeInt>::low::h654cf3febd0eecf8
    i64.const 32
    i64.shr_u
    local.get $p3
    local.get $p4
    call $<i128_as_compiler_builtins::int::LargeInt>::low::h654cf3febd0eecf8
    i64.const 32
    i64.shr_u
    call $<i64_as_compiler_builtins::int::Int>::wrapping_mul::h558368f41dd93d09
    call $<i64_as_compiler_builtins::int::Int>::from_unsigned::h7efafe9ef60e0310
    i64.add
    local.get $p1
    local.get $p2
    call $<i128_as_compiler_builtins::int::LargeInt>::high::hc703bd80007672ce
    local.get $p3
    local.get $p4
    call $<i128_as_compiler_builtins::int::LargeInt>::low::h654cf3febd0eecf8
    call $<i64_as_compiler_builtins::int::Int>::from_unsigned::h7efafe9ef60e0310
    call $<i64_as_compiler_builtins::int::Int>::wrapping_mul::h558368f41dd93d09
    call $<i64_as_compiler_builtins::int::Int>::wrapping_add::h9472fc521f362e13
    local.get $p1
    local.get $p2
    call $<i128_as_compiler_builtins::int::LargeInt>::low::h654cf3febd0eecf8
    call $<i64_as_compiler_builtins::int::Int>::from_unsigned::h7efafe9ef60e0310
    local.get $p3
    local.get $p4
    call $<i128_as_compiler_builtins::int::LargeInt>::high::hc703bd80007672ce
    call $<i64_as_compiler_builtins::int::Int>::wrapping_mul::h558368f41dd93d09
    call $<i64_as_compiler_builtins::int::Int>::wrapping_add::h9472fc521f362e13
    call $<i128_as_compiler_builtins::int::LargeInt>::from_parts::h8ea13ebf67bbf227
    local.get $l5
    i64.load
    local.set $p3
    local.get $p0
    local.get $l5
    i32.const 8
    i32.add
    i64.load
    i64.store offset=8
    local.get $p0
    local.get $p3
    i64.store
    local.get $l5
    i32.const 16
    i32.add
    global.set $g0)

The same is probably true for the Float trait.

@CryZe
Copy link
Contributor Author

CryZe commented Jun 8, 2020

It seems like rust-lang/compiler-builtins#349 initially had a lot more inline annotations but removed them later on. That may have regressed the performance on wasm, but I'm not entirely sure.

This makes me wonder if compiler-builtins is even compiled with LTO at all as it supposedly is meant to be?!

@CryZe
Copy link
Contributor Author

CryZe commented Jun 8, 2020

Looks like this is indeed a regression that started with Rust 1.44, which switched from compiler-builtins 0.1.25 to 0.1.27, which is where this PR is part of. These are the instructions in 1.43:

(func $__multi3 (type $t40) (param $p0 i32) (param $p1 i64) (param $p2 i64) (param $p3 i64) (param $p4 i64)
    (local $l5 i64) (local $l6 i64)
    local.get $p0
    local.get $p3
    i64.const 32
    i64.shr_u
    local.tee $l5
    local.get $p1
    i64.const 32
    i64.shr_u
    local.tee $l6
    i64.mul
    local.get $p3
    local.get $p2
    i64.mul
    i64.add
    local.get $p4
    local.get $p1
    i64.mul
    i64.add
    local.get $p3
    i64.const 4294967295
    i64.and
    local.tee $p3
    local.get $p1
    i64.const 4294967295
    i64.and
    local.tee $p1
    i64.mul
    local.tee $p4
    i64.const 32
    i64.shr_u
    local.get $p3
    local.get $l6
    i64.mul
    i64.add
    local.tee $p3
    i64.const 32
    i64.shr_u
    i64.add
    local.get $p3
    i64.const 4294967295
    i64.and
    local.get $l5
    local.get $p1
    i64.mul
    i64.add
    local.tee $p3
    i64.const 32
    i64.shr_u
    i64.add
    i64.store offset=8
    local.get $p0
    local.get $p3
    i64.const 32
    i64.shl
    local.get $p4
    i64.const 4294967295
    i64.and
    i64.or
    i64.store)

cc @tmiasko

@alexcrichton
Copy link
Member

There's been a few various changes around codegen units handling and compiler-builtins recently, and it looks like this is caused by compiling this crate with -C codegen-units=1. The compiler ignores this setting for compiler-builtins but elsewhere in the compiler disables ThinLTO with only one CGU.

@alexcrichton alexcrichton transferred this issue from rust-lang/compiler-builtins Jun 8, 2020
@alexcrichton alexcrichton changed the title Int trait functions are not inlined on wasm compiler-builtins: Int trait functions are not inlined on wasm Jun 8, 2020
@alexcrichton alexcrichton added O-wasm Target: WASM (WebAssembly), http://webassembly.org/ regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Jun 8, 2020
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Jun 8, 2020
@alexcrichton
Copy link
Member

Ok I've transferred this to the rust-lang/rust repository since this isn't a bug with compiler-builtins itself I believe but rather how we build compiler-builtins.

I believe this is an accidental regression from #70846 because only part of the compiler knows that compiler-builtins is built with more than one CGU, the rest of the compiler thinks it's one CGU (respecting CLI flags). I'll be opening a PR shortly to fix this.

@LeSeulArtichaut LeSeulArtichaut added I-slow Issue: Problems and improvements with respect to performance of generated code. A-codegen Area: Code generation labels Jun 9, 2020
CryZe added a commit to CryZe/livesplit-core that referenced this issue Jun 10, 2020
We use a Vec here now because a HashMap would require hashing the
comparison and then comparing the comparison with the string at the
index calculated from the hash. This means at least two full iterations
over the string are necessary, with one of them being somewhat expensive
due to the hashing. Most of the time it is faster to just iterate the
few comparisons we have and compare them directly. Most will be rejected
right away as the first byte doesn't even match, so in the end you'll
end up with less than two full iterations over the string. In fact most
of the time Personal Best will be the first in the list and that's the
one we most often want to look up anyway.

One additional reason for doing this is that the ahash that was
calculated for the HashMap uses 128-bit multiplications which regressed
a lot in Rust 1.44 for targets where the `compiler-builtins` helpers
were used. rust-lang/rust#73135

We could potentially look into interning our comparisons in the future
which could yield even better performance.
CryZe added a commit to CryZe/livesplit-core that referenced this issue Jun 10, 2020
We use a Vec here now because a HashMap would require hashing the
comparison and then comparing the comparison with the string at the
index calculated from the hash. This means at least two full iterations
over the string are necessary, with one of them being somewhat expensive
due to the hashing. Most of the time it is faster to just iterate the
few comparisons we have and compare them directly. Most will be rejected
right away as the first byte doesn't even match, so in the end you'll
end up with less than two full iterations over the string. In fact most
of the time Personal Best will be the first in the list and that's the
one we most often want to look up anyway.

One additional reason for doing this is that the ahash that was
calculated for the HashMap uses 128-bit multiplications which regressed
a lot in Rust 1.44 for targets where the `compiler-builtins` helpers
were used. rust-lang/rust#73135

We could potentially look into interning our comparisons in the future
which could yield even better performance.
@spastorino
Copy link
Member

Assigning P-medium as discussed as part of the Prioritization Working Group process and removing I-prioritize.

@spastorino spastorino added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Jun 10, 2020
@crlf0710 crlf0710 added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jun 11, 2020
Manishearth added a commit to Manishearth/rust that referenced this issue Jun 16, 2020
…ins, r=Mark-Simulacrum

Change how compiler-builtins gets many CGUs

This commit intends to fix an accidental regression from rust-lang#70846. The
goal of rust-lang#70846 was to build compiler-builtins with a maximal number of
CGUs to ensure that each module in the source corresponds to an object
file. This high degree of control for compiler-builtins is desirable to
ensure that there's at most one exported symbol per CGU, ideally
enabling compiler-builtins to not conflict with the system libgcc as
often.

In rust-lang#70846, however, only part of the compiler understands that
compiler-builtins is built with many CGUs. The rest of the compiler
thinks it's building with `sess.codegen_units()`. Notably the
calculation of `sess.lto()` consults `sess.codegen_units()`, which when
there's only one CGU it disables ThinLTO. This means that
compiler-builtins is built without ThinLTO, which is quite harmful to
performance! This is the root of the cause from rust-lang#73135 where intrinsics
were found to not be inlining trivial functions.

The fix applied in this commit is to remove the special-casing of
compiler-builtins in the compiler. Instead the build system is now
responsible for special-casing compiler-builtins. It doesn't know
exactly how many CGUs will be needed but it passes a large number that
is assumed to be much greater than the number of source-level modules
needed. After reading the various locations in the compiler source, this
seemed like the best solution rather than adding more and more special
casing in the compiler for compiler-builtins.

Closes rust-lang#73135
Manishearth added a commit to Manishearth/rust that referenced this issue Jun 16, 2020
…ins, r=Mark-Simulacrum

Change how compiler-builtins gets many CGUs

This commit intends to fix an accidental regression from rust-lang#70846. The
goal of rust-lang#70846 was to build compiler-builtins with a maximal number of
CGUs to ensure that each module in the source corresponds to an object
file. This high degree of control for compiler-builtins is desirable to
ensure that there's at most one exported symbol per CGU, ideally
enabling compiler-builtins to not conflict with the system libgcc as
often.

In rust-lang#70846, however, only part of the compiler understands that
compiler-builtins is built with many CGUs. The rest of the compiler
thinks it's building with `sess.codegen_units()`. Notably the
calculation of `sess.lto()` consults `sess.codegen_units()`, which when
there's only one CGU it disables ThinLTO. This means that
compiler-builtins is built without ThinLTO, which is quite harmful to
performance! This is the root of the cause from rust-lang#73135 where intrinsics
were found to not be inlining trivial functions.

The fix applied in this commit is to remove the special-casing of
compiler-builtins in the compiler. Instead the build system is now
responsible for special-casing compiler-builtins. It doesn't know
exactly how many CGUs will be needed but it passes a large number that
is assumed to be much greater than the number of source-level modules
needed. After reading the various locations in the compiler source, this
seemed like the best solution rather than adding more and more special
casing in the compiler for compiler-builtins.

Closes rust-lang#73135
Manishearth added a commit to Manishearth/rust that referenced this issue Jun 16, 2020
…ins, r=Mark-Simulacrum

Change how compiler-builtins gets many CGUs

This commit intends to fix an accidental regression from rust-lang#70846. The
goal of rust-lang#70846 was to build compiler-builtins with a maximal number of
CGUs to ensure that each module in the source corresponds to an object
file. This high degree of control for compiler-builtins is desirable to
ensure that there's at most one exported symbol per CGU, ideally
enabling compiler-builtins to not conflict with the system libgcc as
often.

In rust-lang#70846, however, only part of the compiler understands that
compiler-builtins is built with many CGUs. The rest of the compiler
thinks it's building with `sess.codegen_units()`. Notably the
calculation of `sess.lto()` consults `sess.codegen_units()`, which when
there's only one CGU it disables ThinLTO. This means that
compiler-builtins is built without ThinLTO, which is quite harmful to
performance! This is the root of the cause from rust-lang#73135 where intrinsics
were found to not be inlining trivial functions.

The fix applied in this commit is to remove the special-casing of
compiler-builtins in the compiler. Instead the build system is now
responsible for special-casing compiler-builtins. It doesn't know
exactly how many CGUs will be needed but it passes a large number that
is assumed to be much greater than the number of source-level modules
needed. After reading the various locations in the compiler source, this
seemed like the best solution rather than adding more and more special
casing in the compiler for compiler-builtins.

Closes rust-lang#73135
RalfJung added a commit to RalfJung/rust that referenced this issue Jun 19, 2020
…ins, r=Mark-Simulacrum

Change how compiler-builtins gets many CGUs

This commit intends to fix an accidental regression from rust-lang#70846. The
goal of rust-lang#70846 was to build compiler-builtins with a maximal number of
CGUs to ensure that each module in the source corresponds to an object
file. This high degree of control for compiler-builtins is desirable to
ensure that there's at most one exported symbol per CGU, ideally
enabling compiler-builtins to not conflict with the system libgcc as
often.

In rust-lang#70846, however, only part of the compiler understands that
compiler-builtins is built with many CGUs. The rest of the compiler
thinks it's building with `sess.codegen_units()`. Notably the
calculation of `sess.lto()` consults `sess.codegen_units()`, which when
there's only one CGU it disables ThinLTO. This means that
compiler-builtins is built without ThinLTO, which is quite harmful to
performance! This is the root of the cause from rust-lang#73135 where intrinsics
were found to not be inlining trivial functions.

The fix applied in this commit is to remove the special-casing of
compiler-builtins in the compiler. Instead the build system is now
responsible for special-casing compiler-builtins. It doesn't know
exactly how many CGUs will be needed but it passes a large number that
is assumed to be much greater than the number of source-level modules
needed. After reading the various locations in the compiler source, this
seemed like the best solution rather than adding more and more special
casing in the compiler for compiler-builtins.

Closes rust-lang#73135
@bors bors closed this as completed in d6156e8 Jun 19, 2020
Mark-Simulacrum pushed a commit to Mark-Simulacrum/rust that referenced this issue Jul 10, 2020
This commit intends to fix an accidental regression from rust-lang#70846. The
goal of rust-lang#70846 was to build compiler-builtins with a maximal number of
CGUs to ensure that each module in the source corresponds to an object
file. This high degree of control for compiler-builtins is desirable to
ensure that there's at most one exported symbol per CGU, ideally
enabling compiler-builtins to not conflict with the system libgcc as
often.

In rust-lang#70846, however, only part of the compiler understands that
compiler-builtins is built with many CGUs. The rest of the compiler
thinks it's building with `sess.codegen_units()`. Notably the
calculation of `sess.lto()` consults `sess.codegen_units()`, which when
there's only one CGU it disables ThinLTO. This means that
compiler-builtins is built without ThinLTO, which is quite harmful to
performance! This is the root of the cause from rust-lang#73135 where intrinsics
were found to not be inlining trivial functions.

The fix applied in this commit is to remove the special-casing of
compiler-builtins in the compiler. Instead the build system is now
responsible for special-casing compiler-builtins. It doesn't know
exactly how many CGUs will be needed but it passes a large number that
is assumed to be much greater than the number of source-level modules
needed. After reading the various locations in the compiler source, this
seemed like the best solution rather than adding more and more special
casing in the compiler for compiler-builtins.

Closes rust-lang#73135
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. O-wasm Target: WASM (WebAssembly), http://webassembly.org/ P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants