Pathological time spent in register allocation for large function #3523

girazoki · 2021-11-15T19:16:25Z

Original Issue

We have implemented this repo https://github.com/girazoki/wasmtime-try to try something that has been happening us with some latest code change we have done at moonbeam. The code change is moonbeam-foundation/moonbeam#989.

Without the PR, wasmtime takes around 15 seconds to compile the wasm. However, if we include the change (removal of the trait) wasmtime takes less than a second to compile. We wonder how these differences can be so big with such a small code change.

Both runtimes are in https://github.com/girazoki/wasmtime-try to try them out
Any help would be appreciated, thanks!

Current status

Modules with way-too-long-compile-times:

alexcrichton · 2021-11-15T19:27:35Z

Running the script from your repository this file doesn't exist, but when corrected to moonbase_fast_runtime.wasm both modules compile in ~16s for me locally.

Can you detail a bit more what you're doing, what's fast, and what's slow?

peterhuene · 2021-11-15T19:27:55Z

The repro has a bug where it isn't loading the "fast" wasm file (named moonbase_fast_runtime.wasm instead of moonbase_runtime_fast.wasm).

With that fixed, I see:

Importing ./moonbase_runtime_slow.wasm...
imported in 16830 ms
Importing ./moonbase_runtime_fast.wasm...
imported in 16952 ms

girazoki · 2021-11-15T19:54:21Z

Sorry it was my bad, updated fast. Now you should be able to see the difference, I updated the fast one.

The context is: We have the moonbase runtime https://github.com/PureStake/moonbeam/blob/master/runtime/moonbase/src/lib.rs that adds https://github.com/PureStake/moonbeam/blob/57fdb3f18fe0a48471bfcad2c37d89708a3a274f/runtime/moonbase/src/lib.rs#L1023. This is moonbase_slow

When we inserted this change, we realized an increase on the wasm compilation time. So we are trying to monitor the reason why this happens. For that reason why created this PR moonbeam-foundation/moonbeam#989, with the code commented, compiled it and measured its wasm compilation time. This is moonbase_fast

For me the first takes 16 secs, while the latter takes around 1. And I am not able to picture the reason of the difference

notlesh · 2021-11-15T20:03:35Z

FYI, some debugging I've done indicates that we get into one invocation of compile_function() (here) which takes the entire duration of the compilation (that is, if it takes 22 seconds to compile the entire binary, this one job takes 22 seconds itself).

alexcrichton · 2021-11-15T21:05:05Z

Thanks for the update, I can indeed reproduce the large discrepancy in compile times. What's happening here is that something pathological is being hit in Cranelift, and it's definitely something that we should fix!

Specifically in this wasm file the <xcm_executor::XcmExecutor<Config> as xcm::v2::traits::ExecuteXcm<<Config as xcm_executor::config::Config>::Call>>::execute_xcm_in_credit function takes 16s to compile. In that wasm file it's function index 1198 and the mangled name is _ZN155_$LT$xcm_executor..XcmExecutor$LT$Config$GT$$u20$as$u20$xcm..v2..traits..ExecuteXcm$LT$$LT$Config$u20$as$u20$xcm_executor..config..Config$GT$..Call$GT$$GT$21execute_xcm_in_credit17hf41c2b7a319ac14dE.

Extracting just that one function to a wasm module yields extract.wasm.gz and this can be reproduced locally with wamtime compile extract.wasm taking 16-17s locally. The breakdown I see is:

$ RUST_LOG=wasmtime_cranelift ./target/release/wasmtime compile extract.wasm 
 DEBUG wasmtime_cranelift::compiler > FuncIndex(989) translated in 16.363414899s
 TRACE wasmtime_cranelift::compiler > FuncIndex(989) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.320    0.320  Translate WASM function
  16.039    0.025  Compilation passes
   0.001    0.001  Control flow graph
   0.002    0.002  Dominator tree
   0.000    0.000  Loop analysis
   0.001    0.001  Pre-legalization rewriting
   0.004    0.004  Dead code elimination
   0.012    0.012  Global value numbering
   0.036    0.027  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.246    0.246  Remove constant phi-nodes
   0.188    0.188  VCode lowering
   0.022    0.022  VCode post-register allocation finalization
   0.026    0.026  VCode emission
   0.000    0.000  VCode emission finalization
  15.478   15.478  Register allocation
   0.004    0.004  Binary machine code emission
   0.008    0.008  Layout full renumbering
======== ========  ==================================

There are known pitfalls with our register allocator at this time and plans to move to a better register allocator. My guess is that this is likely to fall into that category of "probably will get fixed", but @cfallin do you think it's worthwhile digging in more here to figure out why something is triggering the slowdown here?

@girazoki and @notlesh in the meantime for your specific use case if you'd like to avoid this slowdown I'd recommend taking a look at <xcm_executor::XcmExecutor<Config> as xcm::v2::traits::ExecuteXcm<<Config as xcm_executor::config::Config>::Call>>::execute_xcm_in_credit and trying to simplify it a bit. It's a huge function in the wasm and probably has a ton of stuff #[inline]'d into it which greatly increases the size of the function. It may be the case that adding #[inline(never)] or otherwise removing #[inline] annotations in a few places could improve wasmtime compile times. Nevertheless we should still fix this in Wasmtime, but that's at least a possibility for fixing things in the near future if you're interested.

alexcrichton · 2021-11-15T21:10:59Z

This is a slightly smaller minimization with data strings and import strings all removed as well which makes navigating the *.wat a bit easier, and still takes ~16s to compile the one function within.

cfallin · 2021-11-15T21:12:29Z

@alexcrichton That's a good question; ordinarily we'd definitely want to chase something like this, but if the RA2 transition happens soon-ish then the payoff of the time it would take to understand the issue and update the old regalloc, and be confident in the fix, doesn't seem worth it. Especially as many of the details of regalloc.rs have fallen out of my mental L2 cache by now...

These sorts of things do put pressure on the transition, though, so hopefully the moving pieces needed for that will come together soon!

alexcrichton · 2021-11-15T21:16:02Z

@adamrk given your interest on this issue you might be interested in this one as well (detailed here) but if not no worries!

crystalin · 2021-11-16T14:11:15Z

Not sure if that would help, but here are the perf (you can open in https://www.speedscope.app/ ) for when it is loading the wasm: https://drive.google.com/file/d/1LSid_h9vs8LGtbaQ8z41CD3vn4-y4g2U/view?usp=sharing

(Here is where most of the work goes it seems)

adamrk · 2021-11-26T22:57:04Z

Sorry, just noticed this. Yeah, I'll take a look at it.

JoshOrndorff · 2021-12-15T11:31:59Z

@alexcrichton We have hit what appears to be the same issue in the Moonbeam project again. This time it is in this wasm file.

Following your example, @girazoki, explored this:

RUST_LOG=wasmtime_cranelift wasmtime compile moonbase_runtime.wasm

...

 DEBUG wasmtime_cranelift::compiler > FuncIndex(1276) translated in 13.571936365s
 TRACE wasmtime_cranelift::compiler > FuncIndex(1276) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.510    0.510  Translate WASM function
  13.056    0.030  Compilation passes
   0.001    0.001  Control flow graph
   0.002    0.002  Dominator tree
   0.000    0.000  Loop analysis
   0.001    0.001  Pre-legalization rewriting
   0.005    0.005  Dead code elimination
   0.009    0.009  Global value numbering
   0.044    0.033  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.210    0.210  Remove constant phi-nodes
   0.222    0.222  VCode lowering
   0.060    0.060  VCode post-register allocation finalization
   0.029    0.029  VCode emission
   0.000    0.000  VCode emission finalization
  12.443   12.443  Register allocation
   0.006    0.006  Binary machine code emission
   0.009    0.009  Layout full renumbering
======== ========  ==================================

How did you figure out the human-readale or even the mangled name of the function to know where to put the #[inline(never)]?

alexcrichton · 2021-12-15T17:49:20Z

Oh nice, thanks for that! I was able to find the function-in-question with:

RUST_LOG=wasmtime_cranelift=debug ./target/release/wasmtime compile moonbase_runtime.compact.wasm

which prints out:

...
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1776) translated in 70.436892ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(679) translated in 69.31585ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(2050) translated in 61.266995ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1574) translated in 68.276648ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1622) translated in 75.956103ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1983) translated in 85.31272ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1347) translated in 282.235891ms
 DEBUG wasmtime_cranelift::compiler > FuncIndex(1276) translated in 19.01582367s

which points to function 1276 as the offender here.

Using wasm-opt I got:

$ wasm-opt --extract-function-index=1276 ./moonbase_runtime.compact.wasm -o extract.wasm
extracting <xcm_executor::XcmExecutor<Config>\20as\20xcm::v2::traits::ExecuteXcm<<Config\20as\20xcm_executor::config::Config>::Call>>::execute_xcm_in_credit::hf77207caf74f0f13

which I can confirm the extracted wasm file takes ~20s to compile, quite too long!

alexcrichton · 2022-04-26T15:01:28Z

I'm going to close this given the discussion on #4060. The new regalloc probably improves things here but these sorts of outliers are inevitable with Cranelift's design goals.

alexcrichton changed the title ~~Huge increase in wasmtime compilation~~ Pathological time spent in register allocation for large function Nov 15, 2021

This was referenced Nov 16, 2021

Prohibit inline functions in XCM VM moonbeam-foundation/polkadot#17

Closed

Prohibit inline XCM VM functions moonbeam-foundation/polkadot#18

Merged

alexcrichton added cranelift:goal:compile-time Focus area: how fast Cranelift can compile or how much memory it uses. performance labels Nov 19, 2021

JoshOrndorff mentioned this issue Dec 14, 2021

Nimbus 0.9 and Sign blocks in dev service moonbeam-foundation/moonbeam#958

Merged

alexcrichton mentioned this issue Apr 21, 2022

Timeout in regalloc2 with large fuzz input and epochs enabled #4060

Closed

alexcrichton closed this as completed Apr 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pathological time spent in register allocation for large function #3523

Pathological time spent in register allocation for large function #3523

girazoki commented Nov 15, 2021 •

edited by alexcrichton

Loading

alexcrichton commented Nov 15, 2021

peterhuene commented Nov 15, 2021 •

edited

Loading

girazoki commented Nov 15, 2021

notlesh commented Nov 15, 2021

alexcrichton commented Nov 15, 2021

alexcrichton commented Nov 15, 2021 •

edited

Loading

cfallin commented Nov 15, 2021

alexcrichton commented Nov 15, 2021

crystalin commented Nov 16, 2021 •

edited

Loading

adamrk commented Nov 26, 2021

JoshOrndorff commented Dec 15, 2021

alexcrichton commented Dec 15, 2021

alexcrichton commented Apr 26, 2022

Pathological time spent in register allocation for large function #3523

Pathological time spent in register allocation for large function #3523

Comments

girazoki commented Nov 15, 2021 • edited by alexcrichton Loading

Original Issue

Current status

alexcrichton commented Nov 15, 2021

peterhuene commented Nov 15, 2021 • edited Loading

girazoki commented Nov 15, 2021

notlesh commented Nov 15, 2021

alexcrichton commented Nov 15, 2021

alexcrichton commented Nov 15, 2021 • edited Loading

cfallin commented Nov 15, 2021

alexcrichton commented Nov 15, 2021

crystalin commented Nov 16, 2021 • edited Loading

adamrk commented Nov 26, 2021

JoshOrndorff commented Dec 15, 2021

alexcrichton commented Dec 15, 2021

alexcrichton commented Apr 26, 2022

girazoki commented Nov 15, 2021 •

edited by alexcrichton

Loading

peterhuene commented Nov 15, 2021 •

edited

Loading

alexcrichton commented Nov 15, 2021 •

edited

Loading

crystalin commented Nov 16, 2021 •

edited

Loading