Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seemingly-quadratic behavior compiling function with many empty loops #3441

Closed
alexcrichton opened this issue Oct 11, 2021 · 4 comments · Fixed by bytecodealliance/regalloc.rs#129 or #3466

Comments

@alexcrichton
Copy link
Member

A fuzz-generated test recently found times out during compilation. This single-function module looks like:

(module 
  (func
    loop end
    loop end
    loop end
    ;; ... ~20k more times
    loop end
    loop end))

Compiling this module yields:

 DEBUG wasmtime_cranelift::compiler > FuncIndex(0) translated in 1.248188995s
 TRACE wasmtime_cranelift::compiler > FuncIndex(0) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.006    0.006  Translate WASM function
   0.624    0.007  Compilation passes
   0.003    0.003  Control flow graph
   0.003    0.003  Dominator tree
   0.000    0.000  Loop analysis
   0.000    0.000  Pre-legalization rewriting
   0.000    0.000  Dead code elimination
   0.001    0.001  Global value numbering
   0.002    0.000  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.001    0.001  Remove constant phi-nodes
   0.005    0.005  VCode lowering
   0.000    0.000  VCode post-register allocation finalization
   0.270    0.270  VCode emission
   0.000    0.000  VCode emission finalization
   0.333    0.333  Register allocation
   0.000    0.000  Binary machine code emission
======== ========  ==================================

Something here seems accidentally quadratic perhaps? Whatever the reason this function probably shouldn't take upwards of a second to compile.

@adamrk
Copy link
Contributor

adamrk commented Oct 19, 2021

Took a look at this and it seems like there is quadratic behavior coming from regalloc here. When calculating the loop depth it's iterating over each block and then traversing the entire dominance tree each time. Opened this PR which I think fixes that issue.

Here are the results I see from that.
Baseline:

 DEBUG wasmtime_cranelift::compiler > FuncIndex(0) translated in 1.950180269s
 TRACE wasmtime_cranelift::compiler > FuncIndex(0) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.014    0.014  Translate WASM function
   1.936    0.017  Compilation passes
   0.008    0.008  Control flow graph
   0.007    0.007  Dominator tree
   0.001    0.001  Loop analysis
   0.001    0.001  Pre-legalization rewriting
   0.000    0.000  Dead code elimination
   0.001    0.001  Global value numbering
   0.004    0.000  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.001    0.001  Remove constant phi-nodes
   0.014    0.014  VCode lowering
   0.002    0.002  VCode post-register allocation finalization
   0.621    0.621  VCode emission
   0.000    0.000  VCode emission finalization
   1.264    1.264  Register allocation
   0.000    0.000  Binary machine code emission
======== ========  ==================================

With the change:

 DEBUG wasmtime_cranelift::compiler > FuncIndex(0) translated in 632.089765ms
 TRACE wasmtime_cranelift::compiler > FuncIndex(0) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.013    0.013  Translate WASM function
   0.619    0.015  Compilation passes
   0.007    0.007  Control flow graph
   0.008    0.008  Dominator tree
   0.001    0.001  Loop analysis
   0.001    0.001  Pre-legalization rewriting
   0.000    0.000  Dead code elimination
   0.001    0.001  Global value numbering
   0.005    0.000  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.001    0.001  Remove constant phi-nodes
   0.011    0.011  VCode lowering
   0.002    0.002  VCode post-register allocation finalization
   0.554    0.554  VCode emission
   0.000    0.000  VCode emission finalization
   0.018    0.018  Register allocation
   0.000    0.000  Binary machine code emission
======== ========  ==================================

@cfallin
Copy link
Member

cfallin commented Oct 20, 2021

Ah, GitHub interpreted my "fixes ..." over on the regalloc.rs side over-optimistically; this isn't actually fixed in Cranelift until we update to regalloc.rs 0.0.32, so I'll reopen this.

@cfallin cfallin reopened this Oct 20, 2021
cfallin added a commit to cfallin/wasmtime that referenced this issue Oct 20, 2021
It appears that some allocation heuristics have changed slightly since
0.0.31, so some of the golden-output filetests are updated as well.
Ideally we would rely more on runtests rather than golden-compilation
tests; but for now this is sufficient. (I'm not sure exactly what in
regalloc.rs changed to alter these heuristics; it's actually been almost
a year since the 0.0.31 release with several refactorings and tweaks
merged since then.)

Fixes bytecodealliance#3441.
@alexcrichton
Copy link
Member Author

Thanks @adamrk for the fix here!

@cfallin I was also slightly worried on this that vcode emission is a pretty big time sink of 600ms in these timings compared to 13ms creation of the clif, do you think that's also worth looking into and/or opening a separate issue for?

@cfallin
Copy link
Member

cfallin commented Oct 21, 2021

@alexcrichton I spent some time profiling this and it appears that there is indeed some suboptimal behavior in the branch simplification; specifically all the empty loops cause thousands of labels to alias and this wreaks havoc on things. I'll create an issue and see if I can add a limiter of some sort.

cfallin added a commit to cfallin/wasmtime that referenced this issue Oct 21, 2021
Fixes bytecodealliance#3468.

If a program has many instances of the pattern "goto next; next:" in a
row (i.e., no-op branches to the fallthrough address), the branch
simplification in `MachBuffer` would remove them all, as expected.
However, in order to work correctly, the algorithm needs to track all
labels that alias the current buffer tail, so that they can be adjusted
later if another branch chomp occurs.

When many thousands of this branch-to-next pattern occur, many thousands
of labels will reference the current buffer tail, and this list of
thousands of labels will be shuffled between the branch metadata struct
and the "labels at tail" struct as branches are appended and then
chomped immediately.

It's possible that with smarter data structure design, we could somehow
share the list of labels -- e.g., a single array of all labels, in order
they are bound, with ranges of indices in this array used to represent
lists of labels (actually, that seems like a better design in general);
but let's leave that to future optimization work.

For now, we can avoid the quadratic behavior by just "giving up" if the
list is too long; it's always valid to not optimize a branch. It is very
unlikely that the "normal" case will have more than 100 "goto next"
branches in a row, so this should not have any perf impact; if it does,
we will leave 1 out of every 100 such branches un-optimized in a long
sequence of thousands.

This takes total compilation time down on my machine from ~300ms to
~72ms for the `foo.wasm` case in bytecodealliance#3441. For reference, the old backend
(now removed), built from arbitrarily-chosen-1-year-old commit
`c7fcc344`, takes 158ms, so we're ~twice as fast, which is what I would
expect.
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Nov 15, 2021
This commit disables the `MachBuffer::check_label_branch_invariants`
debug check on the fuzzers due to it causing timeouts with the test case
from bytecodealliance#3441. Fuzzing leads to a 20-30x slowdown of executed code and
locally the fuzz time it takes to instantiate bytecodealliance#3441 drops from 3 minutes
to 6 seconds disabling this function. Note that this should still be
executed during our testing on CI since it's still enabled for debug
assertions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants