Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pathological time spent in register allocation for large function #3523

Closed
girazoki opened this issue Nov 15, 2021 · 13 comments
Closed

Pathological time spent in register allocation for large function #3523

girazoki opened this issue Nov 15, 2021 · 13 comments
Labels
cranelift:goal:compile-time Focus area: how fast Cranelift can compile or how much memory it uses. performance

Comments

@girazoki
Copy link

girazoki commented Nov 15, 2021

Original Issue

We have implemented this repo https://github.com/girazoki/wasmtime-try to try something that has been happening us with some latest code change we have done at moonbeam. The code change is moonbeam-foundation/moonbeam#989.

Without the PR, wasmtime takes around 15 seconds to compile the wasm. However, if we include the change (removal of the trait) wasmtime takes less than a second to compile. We wonder how these differences can be so big with such a small code change.

Both runtimes are in https://github.com/girazoki/wasmtime-try to try them out
Any help would be appreciated, thanks!

Current status

Modules with way-too-long-compile-times:

@alexcrichton
Copy link
Member

Running the script from your repository this file doesn't exist, but when corrected to moonbase_fast_runtime.wasm both modules compile in ~16s for me locally.

Can you detail a bit more what you're doing, what's fast, and what's slow?

@peterhuene
Copy link
Member

peterhuene commented Nov 15, 2021

The repro has a bug where it isn't loading the "fast" wasm file (named moonbase_fast_runtime.wasm instead of moonbase_runtime_fast.wasm).

With that fixed, I see:

Importing ./moonbase_runtime_slow.wasm...
imported in 16830 ms
Importing ./moonbase_runtime_fast.wasm...
imported in 16952 ms

@girazoki
Copy link
Author

Sorry it was my bad, updated fast. Now you should be able to see the difference, I updated the fast one.

The context is: We have the moonbase runtime https://github.com/PureStake/moonbeam/blob/master/runtime/moonbase/src/lib.rs that adds https://github.com/PureStake/moonbeam/blob/57fdb3f18fe0a48471bfcad2c37d89708a3a274f/runtime/moonbase/src/lib.rs#L1023. This is moonbase_slow

When we inserted this change, we realized an increase on the wasm compilation time. So we are trying to monitor the reason why this happens. For that reason why created this PR moonbeam-foundation/moonbeam#989, with the code commented, compiled it and measured its wasm compilation time. This is moonbase_fast

For me the first takes 16 secs, while the latter takes around 1. And I am not able to picture the reason of the difference

@notlesh
Copy link

notlesh commented Nov 15, 2021

FYI, some debugging I've done indicates that we get into one invocation of compile_function() (here) which takes the entire duration of the compilation (that is, if it takes 22 seconds to compile the entire binary, this one job takes 22 seconds itself).

@alexcrichton
Copy link
Member

Thanks for the update, I can indeed reproduce the large discrepancy in compile times. What's happening here is that something pathological is being hit in Cranelift, and it's definitely something that we should fix!

Specifically in this wasm file the <xcm_executor::XcmExecutor<Config> as xcm::v2::traits::ExecuteXcm<<Config as xcm_executor::config::Config>::Call>>::execute_xcm_in_credit function takes 16s to compile. In that wasm file it's function index 1198 and the mangled name is _ZN155_$LT$xcm_executor..XcmExecutor$LT$Config$GT$$u20$as$u20$xcm..v2..traits..ExecuteXcm$LT$$LT$Config$u20$as$u20$xcm_executor..config..Config$GT$..Call$GT$$GT$21execute_xcm_in_credit17hf41c2b7a319ac14dE.

Extracting just that one function to a wasm module yields extract.wasm.gz and this can be reproduced locally with wamtime compile extract.wasm taking 16-17s locally. The breakdown I see is:

$ RUST_LOG=wasmtime_cranelift ./target/release/wasmtime compile extract.wasm 
 DEBUG wasmtime_cranelift::compiler > FuncIndex(989) translated in 16.363414899s
 TRACE wasmtime_cranelift::compiler > FuncIndex(989) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.320    0.320  Translate WASM function
  16.039    0.025  Compilation passes
   0.001    0.001  Control flow graph
   0.002    0.002  Dominator tree
   0.000    0.000  Loop analysis
   0.001    0.001  Pre-legalization rewriting
   0.004    0.004  Dead code elimination
   0.012    0.012  Global value numbering
   0.036    0.027  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.246    0.246  Remove constant phi-nodes
   0.188    0.188  VCode lowering
   0.022    0.022  VCode post-register allocation finalization
   0.026    0.026  VCode emission
   0.000    0.000  VCode emission finalization
  15.478   15.478  Register allocation
   0.004    0.004  Binary machine code emission
   0.008    0.008  Layout full renumbering
======== ========  ==================================

There are known pitfalls with our register allocator at this time and plans to move to a better register allocator. My guess is that this is likely to fall into that category of "probably will get fixed", but @cfallin do you think it's worthwhile digging in more here to figure out why something is triggering the slowdown here?

@girazoki and @notlesh in the meantime for your specific use case if you'd like to avoid this slowdown I'd recommend taking a look at <xcm_executor::XcmExecutor<Config> as xcm::v2::traits::ExecuteXcm<<Config as xcm_executor::config::Config>::Call>>::execute_xcm_in_credit and trying to simplify it a bit. It's a huge function in the wasm and probably has a ton of stuff #[inline]'d into it which greatly increases the size of the function. It may be the case that adding #[inline(never)] or otherwise removing #[inline] annotations in a few places could improve wasmtime compile times. Nevertheless we should still fix this in Wasmtime, but that's at least a possibility for fixing things in the near future if you're interested.

@alexcrichton
Copy link
Member

alexcrichton commented Nov 15, 2021

This is a slightly smaller minimization with data strings and import strings all removed as well which makes navigating the *.wat a bit easier, and still takes ~16s to compile the one function within.

@alexcrichton alexcrichton changed the title Huge increase in wasmtime compilation Pathological time spent in register allocation for large function Nov 15, 2021
@cfallin
Copy link
Member

cfallin commented Nov 15, 2021

@alexcrichton That's a good question; ordinarily we'd definitely want to chase something like this, but if the RA2 transition happens soon-ish then the payoff of the time it would take to understand the issue and update the old regalloc, and be confident in the fix, doesn't seem worth it. Especially as many of the details of regalloc.rs have fallen out of my mental L2 cache by now...

These sorts of things do put pressure on the transition, though, so hopefully the moving pieces needed for that will come together soon!

@alexcrichton
Copy link
Member

@adamrk given your interest on this issue you might be interested in this one as well (detailed here) but if not no worries!

@crystalin
Copy link

crystalin commented Nov 16, 2021

Not sure if that would help, but here are the perf (you can open in https://www.speedscope.app/ ) for when it is loading the wasm: https://drive.google.com/file/d/1LSid_h9vs8LGtbaQ8z41CD3vn4-y4g2U/view?usp=sharing

(Here is where most of the work goes it seems)
image (1)

@alexcrichton alexcrichton added cranelift:goal:compile-time Focus area: how fast Cranelift can compile or how much memory it uses. performance labels Nov 19, 2021
@adamrk
Copy link
Contributor

adamrk commented Nov 26, 2021

Sorry, just noticed this. Yeah, I'll take a look at it.

@JoshOrndorff
Copy link

@alexcrichton We have hit what appears to be the same issue in the Moonbeam project again. This time it is in this wasm file.

Following your example, @girazoki, explored this:

RUST_LOG=wasmtime_cranelift wasmtime compile moonbase_runtime.wasm

...

 DEBUG wasmtime_cranelift::compiler > FuncIndex(1276) translated in 13.571936365s
 TRACE wasmtime_cranelift::compiler > FuncIndex(1276) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.510    0.510  Translate WASM function
  13.056    0.030  Compilation passes
   0.001    0.001  Control flow graph
   0.002    0.002  Dominator tree
   0.000    0.000  Loop analysis
   0.001    0.001  Pre-legalization rewriting
   0.005    0.005  Dead code elimination
   0.009    0.009  Global value numbering
   0.044    0.033  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.210    0.210  Remove constant phi-nodes
   0.222    0.222  VCode lowering
   0.060    0.060  VCode post-register allocation finalization
   0.029    0.029  VCode emission
   0.000    0.000  VCode emission finalization
  12.443   12.443  Register allocation
   0.006    0.006  Binary machine code emission
   0.009    0.009  Layout full renumbering
======== ========  ==================================

How did you figure out the human-readale or even the mangled name of the function to know where to put the #[inline(never)]?

@alexcrichton
Copy link
Member

Oh nice, thanks for that! I was able to find the function-in-question with:

RUST_LOG=wasmtime_cranelift=debug ./target/release/wasmtime compile moonbase_runtime.compact.wasm

which prints out:

...
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1776) translated in 70.436892ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(679) translated in 69.31585ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(2050) translated in 61.266995ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1574) translated in 68.276648ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1622) translated in 75.956103ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1983) translated in 85.31272ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1347) translated in 282.235891ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1276) translated in 19.01582367s

which points to function 1276 as the offender here.

Using wasm-opt I got:

$ wasm-opt --extract-function-index=1276 ./moonbase_runtime.compact.wasm -o extract.wasm
extracting <xcm_executor::XcmExecutor<Config>\20as\20xcm::v2::traits::ExecuteXcm<<Config\20as\20xcm_executor::config::Config>::Call>>::execute_xcm_in_credit::hf77207caf74f0f13

which I can confirm the extracted wasm file takes ~20s to compile, quite too long!

@alexcrichton
Copy link
Member

I'm going to close this given the discussion on #4060. The new regalloc probably improves things here but these sorts of outliers are inevitable with Cranelift's design goals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:goal:compile-time Focus area: how fast Cranelift can compile or how much memory it uses. performance
Projects
None yet
Development

No branches or pull requests

8 participants