Seemingly-quadratic behavior compiling function with many empty loops #3441

alexcrichton · 2021-10-11T16:59:32Z

A fuzz-generated test recently found times out during compilation. This single-function module looks like:

(module 
  (func
    loop end
    loop end
    loop end
    ;; ... ~20k more times
    loop end
    loop end))

Compiling this module yields:

 DEBUG wasmtime_cranelift::compiler > FuncIndex(0) translated in 1.248188995s
 TRACE wasmtime_cranelift::compiler > FuncIndex(0) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.006    0.006  Translate WASM function
   0.624    0.007  Compilation passes
   0.003    0.003  Control flow graph
   0.003    0.003  Dominator tree
   0.000    0.000  Loop analysis
   0.000    0.000  Pre-legalization rewriting
   0.000    0.000  Dead code elimination
   0.001    0.001  Global value numbering
   0.002    0.000  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.001    0.001  Remove constant phi-nodes
   0.005    0.005  VCode lowering
   0.000    0.000  VCode post-register allocation finalization
   0.270    0.270  VCode emission
   0.000    0.000  VCode emission finalization
   0.333    0.333  Register allocation
   0.000    0.000  Binary machine code emission
======== ========  ==================================

Something here seems accidentally quadratic perhaps? Whatever the reason this function probably shouldn't take upwards of a second to compile.

The text was updated successfully, but these errors were encountered:

adamrk · 2021-10-19T20:17:01Z

Took a look at this and it seems like there is quadratic behavior coming from regalloc here. When calculating the loop depth it's iterating over each block and then traversing the entire dominance tree each time. Opened this PR which I think fixes that issue.

Here are the results I see from that.
Baseline:

 DEBUG wasmtime_cranelift::compiler > FuncIndex(0) translated in 1.950180269s
 TRACE wasmtime_cranelift::compiler > FuncIndex(0) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.014    0.014  Translate WASM function
   1.936    0.017  Compilation passes
   0.008    0.008  Control flow graph
   0.007    0.007  Dominator tree
   0.001    0.001  Loop analysis
   0.001    0.001  Pre-legalization rewriting
   0.000    0.000  Dead code elimination
   0.001    0.001  Global value numbering
   0.004    0.000  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.001    0.001  Remove constant phi-nodes
   0.014    0.014  VCode lowering
   0.002    0.002  VCode post-register allocation finalization
   0.621    0.621  VCode emission
   0.000    0.000  VCode emission finalization
   1.264    1.264  Register allocation
   0.000    0.000  Binary machine code emission
======== ========  ==================================

With the change:

 DEBUG wasmtime_cranelift::compiler > FuncIndex(0) translated in 632.089765ms
 TRACE wasmtime_cranelift::compiler > FuncIndex(0) timing info
======== ========  ==================================
   Total     Self  Pass
-------- --------  ----------------------------------
   0.013    0.013  Translate WASM function
   0.619    0.015  Compilation passes
   0.007    0.007  Control flow graph
   0.008    0.008  Dominator tree
   0.001    0.001  Loop analysis
   0.001    0.001  Pre-legalization rewriting
   0.000    0.000  Dead code elimination
   0.001    0.001  Global value numbering
   0.005    0.000  Loop invariant code motion
   0.000    0.000  Remove unreachable blocks
   0.001    0.001  Remove constant phi-nodes
   0.011    0.011  VCode lowering
   0.002    0.002  VCode post-register allocation finalization
   0.554    0.554  VCode emission
   0.000    0.000  VCode emission finalization
   0.018    0.018  Register allocation
   0.000    0.000  Binary machine code emission
======== ========  ==================================

cfallin · 2021-10-20T21:57:36Z

Ah, GitHub interpreted my "fixes ..." over on the regalloc.rs side over-optimistically; this isn't actually fixed in Cranelift until we update to regalloc.rs 0.0.32, so I'll reopen this.

It appears that some allocation heuristics have changed slightly since 0.0.31, so some of the golden-output filetests are updated as well. Ideally we would rely more on runtests rather than golden-compilation tests; but for now this is sufficient. (I'm not sure exactly what in regalloc.rs changed to alter these heuristics; it's actually been almost a year since the 0.0.31 release with several refactorings and tweaks merged since then.) Fixes bytecodealliance#3441.

alexcrichton · 2021-10-21T14:11:45Z

Thanks @adamrk for the fix here!

@cfallin I was also slightly worried on this that vcode emission is a pretty big time sink of 600ms in these timings compared to 13ms creation of the clif, do you think that's also worth looking into and/or opening a separate issue for?

cfallin · 2021-10-21T18:47:03Z

@alexcrichton I spent some time profiling this and it appears that there is indeed some suboptimal behavior in the branch simplification; specifically all the empty loops cause thousands of labels to alias and this wreaks havoc on things. I'll create an issue and see if I can add a limiter of some sort.

Fixes bytecodealliance#3468. If a program has many instances of the pattern "goto next; next:" in a row (i.e., no-op branches to the fallthrough address), the branch simplification in `MachBuffer` would remove them all, as expected. However, in order to work correctly, the algorithm needs to track all labels that alias the current buffer tail, so that they can be adjusted later if another branch chomp occurs. When many thousands of this branch-to-next pattern occur, many thousands of labels will reference the current buffer tail, and this list of thousands of labels will be shuffled between the branch metadata struct and the "labels at tail" struct as branches are appended and then chomped immediately. It's possible that with smarter data structure design, we could somehow share the list of labels -- e.g., a single array of all labels, in order they are bound, with ranges of indices in this array used to represent lists of labels (actually, that seems like a better design in general); but let's leave that to future optimization work. For now, we can avoid the quadratic behavior by just "giving up" if the list is too long; it's always valid to not optimize a branch. It is very unlikely that the "normal" case will have more than 100 "goto next" branches in a row, so this should not have any perf impact; if it does, we will leave 1 out of every 100 such branches un-optimized in a long sequence of thousands. This takes total compilation time down on my machine from ~300ms to ~72ms for the `foo.wasm` case in bytecodealliance#3441. For reference, the old backend (now removed), built from arbitrarily-chosen-1-year-old commit `c7fcc344`, takes 158ms, so we're ~twice as fast, which is what I would expect.

This commit disables the `MachBuffer::check_label_branch_invariants` debug check on the fuzzers due to it causing timeouts with the test case from bytecodealliance#3441. Fuzzing leads to a 20-30x slowdown of executed code and locally the fuzz time it takes to instantiate bytecodealliance#3441 drops from 3 minutes to 6 seconds disabling this function. Note that this should still be executed during our testing on CI since it's still enabled for debug assertions.

adamrk mentioned this issue Oct 19, 2021

Speed up the dominance check in calc_loop_depths bytecodealliance/regalloc.rs#128

Merged

cfallin mentioned this issue Oct 20, 2021

Release 0.0.32. bytecodealliance/regalloc.rs#129

Merged

cfallin closed this as completed in bytecodealliance/regalloc.rs#129 Oct 20, 2021

cfallin reopened this Oct 20, 2021

cfallin mentioned this issue Oct 20, 2021

Update to regalloc.rs 0.0.32. #3466

Merged

cfallin closed this as completed in #3466 Oct 20, 2021

cfallin mentioned this issue Oct 21, 2021

Suboptimal behavior in branch simplification (MachBuffer) when thousands of labels alias the same offset #3468

Closed

cfallin mentioned this issue Oct 21, 2021

Avoid quadratic behavior in pathological label-alias case in MachBuffer. #3469

Merged

This was referenced Nov 15, 2021

Disable check_label_branch_invariants in fuzzing #3520

Merged

Pathological time spent in register allocation for large function #3523

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seemingly-quadratic behavior compiling function with many empty loops #3441

Seemingly-quadratic behavior compiling function with many empty loops #3441

alexcrichton commented Oct 11, 2021

adamrk commented Oct 19, 2021

cfallin commented Oct 20, 2021

alexcrichton commented Oct 21, 2021

cfallin commented Oct 21, 2021

Seemingly-quadratic behavior compiling function with many empty loops #3441

Seemingly-quadratic behavior compiling function with many empty loops #3441

Comments

alexcrichton commented Oct 11, 2021

adamrk commented Oct 19, 2021

cfallin commented Oct 20, 2021

alexcrichton commented Oct 21, 2021

cfallin commented Oct 21, 2021