Enable MemorySSA in MemCpyOpt #82806

nikic · 2021-03-05T17:17:14Z

LLVM 12 ships with an implementation of MemCpyOpt which is based on MSSA instead of MDA. This implementation can eliminate memcpys across blocks, and as such fixes many (but not all) failures to eliminate redundant memcpys for Rust code. Unfortunately this was only enabled by default shortly after LLVM 12 was cut. This backports the enablement to our LLVM fork.

Perf results: https://perf.rust-lang.org/compare.html?start=8fd946c63a6c3aae9788bd459d278cb2efa77099&end=0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8

There are improvements on check and debug builds, which indicate that rustc itself has become faster. For opt builds this is, on average, a very minor improvement as well, although there is one significant outlier with deep-vector-opt. This benchmark creates ~140000 zero stores, which are now coalesced into a memset slightly later, resulting in longer compile-time for intermediate passes.

nikic · 2021-03-05T17:17:24Z

@bors try @rust-timer queue

rust-timer · 2021-03-05T17:17:25Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2021-03-05T17:17:37Z

⌛ Trying commit 91b2cf71dd4a87c341bc07409a345857ac687981 with merge dd2f66f571385888e89e2f017de490bbc855c499...

bors · 2021-03-05T18:05:51Z

☀️ Try build successful - checks-actions
Build commit: dd2f66f571385888e89e2f017de490bbc855c499 (dd2f66f571385888e89e2f017de490bbc855c499)

rust-timer · 2021-03-05T18:05:53Z

Queued dd2f66f571385888e89e2f017de490bbc855c499 with parent 8fd946c, future comparison URL.

rust-timer · 2021-03-05T19:39:41Z

Finished benchmarking try commit (dd2f66f571385888e89e2f017de490bbc855c499): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

nikic · 2021-03-05T19:56:27Z

@bors try @rust-timer queue

rust-timer · 2021-03-05T19:56:29Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2021-03-05T19:56:36Z

⌛ Trying commit b022c9dbfa16eb4c64ff19b23e7daf85112bb8ae with merge 0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8...

bors · 2021-03-05T20:57:02Z

☀️ Try build successful - checks-actions
Build commit: 0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8 (0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8)

rust-timer · 2021-03-05T20:57:03Z

Queued 0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8 with parent 8fd946c, future comparison URL.

rust-timer · 2021-03-05T23:32:34Z

Finished benchmarking try commit (0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

nikic · 2021-03-05T23:53:32Z

LLVM 12 ships with an implementation of MemCpyOpt which is based on MSSA instead of MDA. This implementation can eliminate memcpys across blocks, and as such fixes many (but not all) failures to eliminate redundant memcpys for Rust code. Unfortunately this was only enabled by default shortly after LLVM 12 was cut. I think it may be worthwhile to backport the enablement for our purposes. There are two ways to do this, with corresponding perf results:

Enable using -enable-memcpyopt-memoryssa: https://perf.rust-lang.org/compare.html?start=8fd946c63a6c3aae9788bd459d278cb2efa77099&end=dd2f66f571385888e89e2f017de490bbc855c499

Enable by backporting upstream change: https://perf.rust-lang.org/compare.html?start=8fd946c63a6c3aae9788bd459d278cb2efa77099&end=0628b91ce17035fb5b6a1a99a4f2ab9ab69be7a8

The difference is that the first enables MSSA but keeps the current position of the pass in the pipeline. This has compile-time cost, because it requires an extra MSSA calculation in most cases. The second one (which is how it was done upstream) also slightly adjusts the position where MemCpyOpt is run, so an existing MSSA calculation can be reused.

In terms of results, we can see improvements on check/debug builds in both cases, which indicates that rustc itself is faster. For opt builds the second version is faster, which is most clearly visible in the bootstrap timings, although there is one outlier with deep-vector-opt.

r? @nagisa

nagisa · 2021-03-06T00:03:49Z

The message should probably be copied over to the first message so that it ends up in the merge commit description.

The results all look pretty good. Sadly the deep-vector-opt has regressed significantly in wall-time (by ~12%, roughly matching the regression in instruction counts) too, so it'd be good to look into how representative the test case is of typical Rust code. I suspect it's going to have a huge number of memcpys or humongous basic blocks or very many of them.

I'm also curious how large the improvement in the runtime of the code is. We see from the compiler time improvements that there is some, but the compiler isn't always a very representative workload. Significant improvements in e.g. even the deep-vector-opt runtime could justify the compile time regressions. Alas I don't think this is a piece of information we can get out of perf :(

nikic · 2021-03-06T10:35:47Z

I've looked into what happens in the deep-vector-opt case: The input file is https://github.com/rust-lang/rustc-perf/blob/master/collector/benchmarks/deep-vector/src/main.rs which basically compiles down to 140000 GEPs + stores of zero. The reason for the compile-time regression in this case is that we now have an additional InstCombine run prior to MemCpyOpt, which tries (futilely) to fold all those instructions. Previously the stores got combined into a memset a bit earlier, reducing the amount of work InstCombine does. The end result is the same in both cases (the function is optimized away entirely). I'm personally not particularly bothered by this, as this is a degenerate input, and the regression is a non-pathological one (linear).
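For readers who have not opened the benchmark, here is a scaled-down sketch of the shape of input described above. It is an illustration only: the function name `zeros`, the element type, and the 16-element length are assumptions, and the real file uses a far longer literal. Per the comment above, each element of such a literal ends up as a GEP plus a store of zero until the stores are coalesced into a memset.

```rust
// Hypothetical, scaled-down stand-in for the deep-vector benchmark input.
// The real benchmark uses a literal with on the order of 140000 zeros;
// 16 elements are enough to show the shape.
fn zeros() -> Vec<i32> {
    vec![
        0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0,
    ]
}

fn main() {
    let v = zeros();
    // The contents never influence anything observable beyond this line,
    // which is why an optimizer is free to simplify the initialization.
    println!("{} elements, all zero: {}", v.len(), v.iter().all(|&x| x == 0));
}
```

Because the cost described in the comment grows linearly with the number of elements, a short literal like this one would not reproduce the compile-time regression; it only shows what kind of code triggers it.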

As to runtime impact, I believe that unnecessary memcpys are a pretty common problem in Rust, but I don't have a way to quantify it beyond the check build results. @jrmuizel might have some insight here, I believe he reported a lot of memcpy optimization problems affecting servo.

jrmuizel · 2021-03-06T14:05:08Z

https://github.com/jrmuizel/memcpy-find should be a useful tool for evaluating this. I'll try to do some before and after on some projects with this branch if I can find the time.

jrmuizel · 2021-03-06T14:10:29Z

I confirmed, as expected, that this does fix #56172, which was a reduced test case from a relatively hot path in WebRender.

nagisa · 2021-03-06T15:17:29Z

> compiles down to 140000 GEPs + stores of zero

Yeah, that's exactly what I had hoped for. I think we're pretty comfortable taking a comptime perf hit here given the runtime perf improvements we're hoping to get here, even if they can't be quantified too well.

I still think it might make sense to track this kind of thing, but I'm not super concerned about it either.

I'm happy to r+ this once the PR is adjusted to the point where it can be landed.

jrmuizel · 2021-03-06T19:19:31Z

Building wrench with this branch brings the number of memcpys down from 4206 to 3611 and the sum of their sizes from 167776 to 133643.
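To put those numbers in relative terms, here is a small worked calculation. The helper `percent_reduction` is a hypothetical name introduced for this sketch; only the four input figures come from the comment above.

```rust
/// Percentage reduction going from `before` to `after`.
/// Hypothetical helper, written only to work through the figures quoted above.
fn percent_reduction(before: u64, after: u64) -> f64 {
    (before - after) as f64 / before as f64 * 100.0
}

fn main() {
    // 4206 -> 3611 memcpys: 595 fewer calls.
    println!("memcpy count reduced by {:.1}%", percent_reduction(4206, 3611));
    // 167776 -> 133643 summed size: 34133 less.
    println!("summed size reduced by {:.1}%", percent_reduction(167_776, 133_643));
}
```

That works out to roughly 14% fewer memcpys and roughly 20% less data copied, so the copies that were eliminated appear to be somewhat larger than average.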

nikic · 2021-03-10T13:09:16Z

Upstream report: https://bugs.llvm.org/show_bug.cgi?id=49509

This updates the LLVM submodule to pick up a backported patch to enable MemorySSA-based MemCpyOpt, which is capable of optimizing away memcpy's across basic blocks.

nikic · 2021-03-11T13:30:42Z

Picked up the fix for the PowerPC hang. This also includes the fix for #80810 now, as it landed in the submodule in the meantime.

@bors r=nagisa

bors · 2021-03-11T13:30:44Z

📌 Commit 623ca84 has been approved by nagisa

nagisa · 2021-03-11T18:00:01Z

@bors p=1 Let's get this out of the pipeline sooner so that we can start landing other LLVM backports for miscompiles we've encountered.

bors · 2021-03-11T18:15:03Z

⌛ Testing commit 623ca84 with merge 4a8b6f7...

bors · 2021-03-11T20:55:57Z

☀️ Test successful - checks-actions
Approved by: nagisa
Pushing 4a8b6f7 to master...

glandium · 2021-03-15T08:46:02Z

While the issues that have been closed are indeed fixed to some extent, I'm not particularly convinced they're intrinsically fixed. It feels like you could still be unlucky with the optimizer. And that's not mentioning that the problems still exist with unoptimized builds. Or with cranelift, I suppose.

bstrie · 2021-03-15T20:07:19Z

@glandium What more could be done? Indeed, you might still get unlucky with the optimizer, but the closed issues in question can also be generally characterized as "I got unlucky with the optimizer". I'm not sure that a bulletproof resolution exists.

nikic · 2021-03-15T20:36:08Z

@bstrie There are two dimensions to this problem: the optimization problem, which is mostly resolved by this change, and the language semantics problem, which is not. Rust has neither placement new nor guaranteed NRVO, both of which may be desirable to provide hard guarantees that certain copies do not occur (e.g. that if you box a value, no stack allocation will be created even in completely unoptimized builds). I believe the closed issues all refer to optimization problems and we have separate issues to track the language semantics.
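A minimal sketch of the boxing case mentioned above, with the constant `N` and the function name `boxed_zeros` chosen for illustration. Semantically, the array is a value that is constructed first and then moved into the heap allocation, so whether it ever exists as a stack temporary is left to the optimizer rather than guaranteed by the language.

```rust
// Illustrative size (an assumption for this sketch); the linked issues
// involve much larger values, where a stack temporary can overflow the stack.
const N: usize = 64 * 1024;

// The array expression is evaluated as an ordinary value and then moved
// into the Box. Nothing in the language requires it to be written directly
// into the heap allocation, so an unoptimized build may materialize it on
// the stack first and then copy it.
fn boxed_zeros() -> Box<[u8; N]> {
    Box::new([0u8; N])
}

fn main() {
    let b = boxed_zeros();
    println!("boxed {} bytes", b.len());
}
```

The result is the same either way; only the intermediate copy and stack usage differ, which is why this is tracked as a language semantics question rather than an optimizer bug.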

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 5, 2021

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Mar 5, 2021

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 5, 2021

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 5, 2021

rust-highfive assigned nagisa Mar 5, 2021

This was referenced Mar 10, 2021

Manually initialize GcBox contents post-allocation to reduce memory copying ruffle-rs/gc-arena#2

Closed

Manually initialize GcBox contents post-allocation to reduce memory copying kyren/gc-arena#14

Closed

Enable MemorySSA-based MemCpyOpt

623ca84

This updates the LLVM submodule to pick up a backported patch to enable MemorySSA-based MemCpyOpt, which is capable of optimizing away memcpy's across basic blocks.

nikic force-pushed the memcpyopt-mssa branch from 2c7a450 to 623ca84 Compare March 11, 2021 13:29

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Mar 11, 2021

bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 11, 2021

bors merged commit 4a8b6f7 into rust-lang:master Mar 11, 2021

rustbot added this to the 1.52.0 milestone Mar 11, 2021

bors mentioned this pull request Mar 11, 2021

Allow qualified paths in struct construction (both expressions and patterns) #80080

Merged

cuviper mentioned this pull request Mar 12, 2021

Cross-compiling Rust to s390x yields a faulty toolchain #80810

Closed

nikic mentioned this pull request Mar 12, 2021

An extra memcpy with -Zmir-opt-level=2 #77613

Closed

jrmuizel mentioned this pull request Mar 12, 2021

Unnecessary memcpy caused by ordering of unwrap #56172

Closed

This was referenced Mar 12, 2021

Huge stack allocation is generated when assigning a huge piece of memory to a reference #72211

Open

Box::new(uninitialized) creates a big alloca #58201

Closed

This was referenced Mar 12, 2021

Unnecessary memcpy when returning a struct #57077

Closed

Stack overflow with Boxed array #53827

Open

Uninitialized bytes get copied to the buffer when creating an uninitialized tuple #83087

Closed

nikic mentioned this pull request Mar 13, 2021

Bad codegen for cloning a boxed array #41160

Closed

erikdesjardins mentioned this pull request Mar 13, 2021

Rust will memmov |self| when passed by value to inline function #42763

Open

hudson-ayers mentioned this pull request Mar 16, 2021

RFC: Fixing unnecessary stack use on all platforms tock/tock#2425

Closed
