Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cranelift: Delete scalar {u,s}loadNN and storeNN instructions #8785

Closed
wants to merge 5 commits into from

Conversation

afonso360
Copy link
Contributor

@afonso360 afonso360 commented Jun 12, 2024

👋 Hey,

This is a WIP PR to delete the scalar load+extend instructions. I'm still working on cleaning up the s390x backend, but I saw there was some discussion on today's meeting regarding this. (Unfortunately these now clash with some other stuff and I'm unable to attend)

I haven't done any benchmarking, and I just noticed there are some cases where we are not correctly fusing the load+extend. Once that is working correctly I'll try to run sightglass on it.

CC: #6056

Instead of using the `istore8` special store opcodes, perform a ireduce and then store.
These instructions are essentially a combination of a `ireduce+store`, so remove the special store, and start recognizing that idiom in the backends.

Since `ireduce` is a no-op from the point of view of the backends, we can simply delete those lowering rules. they don't add any novel instructions other than just separately lowering both `ireduce` and `store`.

There are a lot of s390x tests delete here, I've checked and there also exist equivalent `store` based tests for those, so no point in keeping them around.
These are going to be removed in follow up commits
@afonso360 afonso360 added the cranelift Issues related to the Cranelift code generator label Jun 12, 2024
@afonso360 afonso360 requested review from a team as code owners June 12, 2024 18:42
@afonso360 afonso360 requested review from elliottt and removed request for a team June 12, 2024 18:42
@afonso360 afonso360 marked this pull request as draft June 12, 2024 18:42
@alexcrichton
Copy link
Member

For context I brought this up in the Cranelift meeting today as I was curious about the genesis of the instructions. My assumption was that these were added under the assumption that the pattern of load+extend or ireduce+store would be so common that having fused ops in Cranelift would reduce the size of the resident IR during compilation and perhaps have other various memory/compile-time benefits. The conclusion though was that while this was probably the predicted purpose of the instructions no one was aware of any benchmarking one way or another to show the impact.

I think it'd be worthwhile to put this through sightglass to ensure there's not, for example, a 10% slowdown compiling spidermonkey, but otherwise I think we're all in the abstract all for cleanups that simplify the IR.

@afonso360
Copy link
Contributor Author

Ran some quick benchmarks on this, doesn't seem to make too much of a difference, except for spidermonkey where there is a slight compile time regression.

I should also note that this version suffers from the issue described in #8787, but it looks like that doesn't affect larger programs too much? It does affect the test cases which slightly bothers me.

Sightglass results
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.10s
     Running `target/debug/sightglass-cli benchmark --engine ./main.so --engine ./delete-special-mem.so --iterations-per-process 10 --processes 2 -- benchmarks/spidermonkey/benchmark.wasm ./benchmarks/pulldown-cmark/benchmark.wasm benchmarks/bz2/benchmark.wasm benchmarks/regex/benchmark.wasm`
.

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 250476630.15 ± 157064794.89 (confidence = 99%)

  main.so is 1.01x to 1.03x faster than delete-special-mem.so!

  [14521885186 14938969886.45 15446520641] delete-special-mem.so
  [14558278944 14688493256.30 15396563040] main.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [528672 687515.20 1726784] delete-special-mem.so
  [528352 761440.00 2136608] main.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [200704 226676.80 266368] delete-special-mem.so
  [196128 242470.40 326496] main.so

instantiation :: cycles :: ./benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [317088 364465.60 469152] delete-special-mem.so
  [314976 380755.20 560736] main.so

compilation :: cycles :: ./benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [590328960 645099580.55 788561440] delete-special-mem.so
  [603675135 673634577.60 887706080] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [124808576 134569985.05 199907890] delete-special-mem.so
  [122697376 130622858.45 157331714] main.so

instantiation :: cycles :: benchmarks/regex/benchmark.wasm

  No difference in performance.

  [513856 599899.20 1005408] delete-special-mem.so
  [459328 615880.00 1086016] main.so

compilation :: cycles :: benchmarks/regex/benchmark.wasm

  No difference in performance.

  [1356884384 1438567822.70 1639913438] delete-special-mem.so
  [1371897536 1467483544.05 1656487360] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [260681040 299165913.20 481644877] delete-special-mem.so
  [261796512 294465843.00 491744160] main.so

execution :: cycles :: ./benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [9724864 10999916.80 13853504] delete-special-mem.so
  [10061056 10919562.45 12255328] main.so

execution :: cycles :: benchmarks/regex/benchmark.wasm

  No difference in performance.

  [284849472 298366723.55 319585712] delete-special-mem.so
  [281981504 296739732.85 332067262] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [1437751471 1491664753.05 1615952144] delete-special-mem.so
  [1434106066 1487088603.50 1577413794] main.so

@cfallin
Copy link
Member

cfallin commented Jun 13, 2024

Thanks for running those benchmarks! IMHO, a 1-3% compile time regression is unfortunately a bit too significant to take for a pure "cleanliness win" change (others may disagree, happy to discuss; in the CL meeting this morning I gave a clean 1% as an example of a number I'd personally be fine with, 10% as an example clearly not, 3% is somewhere in the middle).

I wonder if it gets any better if we do handle merging better, along the lines of #8787 -- the compile slowdown could arise because of larger VCode on average rather than larger CLIF, as well. Worth trying to resolve that first then coming back to benchmark this again perhaps?

@github-actions github-actions bot added cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. cranelift:wasm labels Jun 13, 2024
@afonso360
Copy link
Contributor Author

IMHO, a 1-3% compile time regression is unfortunately a bit too significant to take for a pure "cleanliness win" change

Yeah, I don't think it's worth it if we have this regression.

I wonder if it gets any better if we do handle merging better, along the lines of #8787 -- the compile slowdown could arise because of larger VCode on average rather than larger CLIF, as well. Worth trying to resolve that first then coming back to benchmark this again perhaps?

Maybe, I'm not too familiar with how we do elaboration on egraphs so it might be slightly harder for me to pick that up.

Reading #6154 got me interested in the jump-threading pass idea, even though I think moving the extends next to the loads would be more effective in this case. I probably won't have time to look at it now but maybe later.

@afonso360
Copy link
Contributor Author

By the way, I was doing some unrelated benchmarking and found out that sightglass is not consistent enough on my machine to give me confidence that the result above is real / meaningful.

I'm going to try to run the benchmarks again on a more stable setup when I get some time, although I suspect it's probably going to amount to the same.

@fitzgen
Copy link
Member

fitzgen commented Jun 17, 2024

@afonso360 you could try measuring instructions retired instead of cycles, pass something like -m insts-retired to sightglass. It should be much more consistent, and this change isn't the kind of thing where we instructions retired is a poor proxy for performance (as opposed to when measuring changes that are intended to improve cache locality, for example).

@afonso360
Copy link
Contributor Author

afonso360 commented Jun 20, 2024

I re-ran the benchmarks with -m insts-retired and some of the steps in cpu-isolation.md. It's still not 100% reproducible across runs, but it's a lot better.

However the results are now completely different from the results above.

Sightglass Results
afonso@fedora:~/git/sightglass$ taskset --cpu-list 3 cargo run -- benchmark --engine ./delete-special-mem-main.so --engine ./delete-special-mem.so --iterations-per-process 10 --processes 1 -m insts-retired -- benchmarks/spidermonkey/benchmark.wasm ./benchmarks/pulldown-cmark/benchmark.wasm benchmarks/bz2/benchmark.wasm benchmarks/regex/benchmark.wasm
warning: virtual workspace defaulting to `resolver = "1"` despite one or more workspace members being on edition 2021 which implies `resolver = "2"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
note: for more details see https://doc.rust-lang.org/cargo/reference/resolver.html#resolver-versions
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.37s
     Running `target/debug/sightglass-cli benchmark --engine ./delete-special-mem-main.so --engine ./delete-special-mem.so --iterations-per-process 10 --processes 1 -m insts-retired -- benchmarks/spidermonkey/benchmark.wasm ./benchmarks/pulldown-cmark/benchmark.wasm benchmarks/bz2/benchmark.wasm benchmarks/regex/benchmark.wasm`

execution :: instructions-retired :: benchmarks/regex/benchmark.wasm

  Δ = 53721614.20 ± 3668.44 (confidence = 99%)

  -main.so is 1.07x to 1.07x faster than .so!

  [737583027 737586621.20 737590770] -main.so
  [791305114 791308235.40 791313975] .so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 15985756.30 ± 2.10 (confidence = 99%)

  -main.so is 1.07x to 1.07x faster than .so!

  [227672476 227672479.20 227672481] -main.so
  [243658234 243658235.50 243658239] .so

instantiation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 2971.00 ± 2002.10 (confidence = 99%)

  -main.so is 1.02x to 1.11x faster than .so!

  [43088 45173.90 49561] -main.so
  [46621 48144.90 49943] .so

execution :: instructions-retired :: ./benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 976271.20 ± 336.81 (confidence = 99%)

  -main.so is 1.05x to 1.05x faster than .so!

  [19771797 19771883.20 19772465] -main.so
  [20747941 20748154.40 20748599] .so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 29346100.10 ± 77173.30 (confidence = 99%)

  -main.so is 1.01x to 1.01x faster than .so!

  [2660295835 2660354378.00 2660420572] -main.so
  [2689612009 2689700478.10 2689780017] .so

compilation :: instructions-retired :: benchmarks/regex/benchmark.wasm

  Δ = 80572.40 ± 75315.03 (confidence = 99%)

  -main.so is 1.00x to 1.00x faster than .so!

  [37695341 37800338.60 37888633] -main.so
  [37747142 37880911.00 37960870] .so

instantiation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [46462 71324.50 168586] -main.so
  [46932 47315.60 47940] .so

instantiation :: instructions-retired :: ./benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [37923 60559.60 69184] -main.so
  [63367 68729.70 72940] .so

instantiation :: instructions-retired :: benchmarks/regex/benchmark.wasm

  No difference in performance.

  [35647 36528.00 37098] -main.so
  [35728 41068.30 60408] .so

compilation :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [4079658 4110764.30 4134315] -main.so
  [4079041 4097484.70 4118287] .so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [374729187 375180105.70 375606520] -main.so
  [375041594 375590062.10 376019380] .so

compilation :: instructions-retired :: ./benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [15772628 15798740.50 15873683] -main.so
  [15784192 15808475.30 15841131] .so

Weirdly there is now no longer any compilation difference, but there are execution differences.

I'm going to close this for now, however I'm planning on redoing these benchmarks when we have a jump threading pass, which I'm currently working on (slowly).

@afonso360 afonso360 closed this Jun 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. cranelift:wasm cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants