riscv64: Improve codegen for icmp #7203

Merged 20 commits into bytecodealliance:main on Oct 16, 2023

Conversation

alexcrichton (Member)

This PR is another stab at improving `icmp` along the lines of #6112 and #7113. It tries to draw on both of those PRs to end up with something that's usable by the majority of the backend, notably for branches.

I've attempted to break things up piecemeal here so the changes are understandable commit-by-commit. This doesn't implement every single optimization I could think of, mostly just a few that seemed like low-hanging fruit.

Skip generating extra sign-extension instructions in this case because
the materialization of a constant will implicitly sign-extend into the
entire register.
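
For context on why that's safe on RV64 (an illustrative Rust sketch, not code from the PR): the usual materialization sequences such as `lui`+`addiw` write a sign-extended 32-bit result into the 64-bit destination register, so the constant already occupies the full register in sign-extended form.

/// Minimal model of RV64 constant materialization via `lui`+`addiw`:
/// both instructions produce a sign-extended 32-bit result, so the
/// register ends up holding the constant sign-extended to 64 bits.
fn materialize_i32_rv64(c: i32) -> u64 {
    let hi20 = ((c as i64) + 0x800) >> 12; // `lui` operand (adjusted for the signed low 12 bits)
    let lo12 = (c as i64) - (hi20 << 12);  // `addiw` immediate, in -2048..=2047
    // The final `as i32 as i64` models `addiw` sign-extending its 32-bit result.
    (((hi20 << 12) + lo12) as i32 as i64) as u64
}

fn main() {
    assert_eq!(materialize_i32_rv64(-1), u64::MAX);
    assert_eq!(materialize_i32_rv64(i32::MIN), 0xFFFF_FFFF_8000_0000);
    assert_eq!(materialize_i32_rv64(0x7FFF_FFFF), 0x7FFF_FFFF);
}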
Try to reserve `lower_*` for helpers that take a value and produce a `*Reg`.
Rename this helper to `is_nonzero_cmp` to represent how it's testing for
nonzero and producing a comparison.
* `FCmp` -> `FloatCompare`
* `emit_fcmp` -> `fcmp_to_float_compare`
* `lower_fcmp` -> `lower_float_compare`

Make some room for upcoming integer comparison functions.
This is only used by one lowering so inline its definition directly.
This commit is the first in a few steps to reimplement bytecodealliance#6112 and bytecodealliance#7113.
The `Icmp` pseudo-instruction is removed here and the necessary
functionality is all pushed into ISLE lowerings. This enables deleting
the `lower_br_icmp` helper in `emit.rs`, as it's no longer necessary,
meaning that all conditional branches should ideally be generated during
lowering rather than via pseudo-instructions, so they benefit from the
various optimizations there.

Currently the lowering is the bare-bones minimum to get things working.
This involved creating a helper to lower an `IntegerCompare` into an
`XReg` via negation/swapping args/etc. In generated code this removes
branches and makes `icmp` a straight-line instruction for non-128-bit
arguments.
Use the `x0` register, which is always zero, where possible and avoid
unnecessary `xor`s against it.
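
Taken together, the two commits above boil down to the standard RV64 technique: only `slt`/`sltu` (and their immediate forms) produce a comparison result in a register, so the remaining conditions come from swapping operands and/or flipping the result, with `x0` supplying a free zero operand. A rough Rust model of that mapping (illustrative only, not the PR's ISLE rules):

enum IntCC { Eq, Ne, Slt, Sle, Sgt, Sge, Ult, Ule, Ugt, Uge }

/// Produce an icmp result as 0/1 without branches, using only
/// slt/sltu-style operations plus operand swaps and an `xori rd, rd, 1`
/// to negate the result.
fn icmp_as_straight_line(cc: IntCC, a: i64, b: i64) -> u64 {
    let slt = |x: i64, y: i64| (x < y) as u64;                    // slt
    let sltu = |x: i64, y: i64| ((x as u64) < (y as u64)) as u64; // sltu
    match cc {
        IntCC::Slt => slt(a, b),
        IntCC::Sgt => slt(b, a),         // swap operands
        IntCC::Sge => slt(a, b) ^ 1,     // negate
        IntCC::Sle => slt(b, a) ^ 1,     // swap + negate
        IntCC::Ult => sltu(a, b),
        IntCC::Ugt => sltu(b, a),
        IntCC::Uge => sltu(a, b) ^ 1,
        IntCC::Ule => sltu(b, a) ^ 1,
        // eq/ne: xor the operands and compare the result against x0 (zero).
        IntCC::Ne => sltu(0, a ^ b),     // snez: sltu rd, x0, rs
        IntCC::Eq => sltu(0, a ^ b) ^ 1, // negated snez (real code could use seqz, i.e. sltiu rd, rs, 1)
    }
}

fn main() {
    assert_eq!(icmp_as_straight_line(IntCC::Uge, -1, 0), 1); // -1 is u64::MAX unsigned
    assert_eq!(icmp_as_straight_line(IntCC::Eq, 7, 7), 1);
}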
@alexcrichton alexcrichton requested a review from a team as a code owner October 10, 2023 10:26
@alexcrichton alexcrichton requested review from cfallin and removed request for a team October 10, 2023 10:26
@github-actions github-actions bot added the "cranelift" (Issues related to the Cranelift code generator) and "isle" (Related to the ISLE domain-specific language) labels on Oct 10, 2023

Matches the sign-extension that happens at the hardware layer.
@afonso360 (Contributor) left a comment

This is very nice! Thank you! ❤️ Having the commits separated neatly really helped review this.

It all looks correct to me. But I'm not entirely sure I didn't miss anything, so a second pair of eyes would be helpful!

I'm also going to run the fuzzer on this for a few days to check if anything comes up (although I don't think that should prevent merging this).

@@ -2136,6 +2145,9 @@
(decl pure partial val_already_extended (ExtendOp Value) bool)
(rule 0 (val_already_extended _ v @ (value_type $I64)) $true)

;; Immediates are always sign extended to their full 64-bits
(rule 1 (val_already_extended (ExtendOp.Signed) (iconst _)) $true)
@afonso360 (Contributor) commented Oct 10, 2023

I don't think this is entirely correct, but it probably doesn't matter.

Due to #6922 we don't generate all 32-bit constants from `lui`+`addi`, so some fall back to the general load-from-constant-pool path.

That path uses the `ld` instruction to load the full 8 bytes, but we never sign-extend that value while lowering. So there's a chance that some of the 32-bit constants we miss end up not sign-extended.

I've built a small test to check which constants fail to generate this way, and the lowest one is 0x7FFF_F801, so all the negatives are covered. So I don't think there's anything wrong here.

It might be worth ensuring that the value is sign-extended before emitting it to the constant pool anyway, or switching to `lw`, which loads only 4 bytes and sign-extends the result, just to make sure this doesn't happen in the future.
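
For reference, the `ld`-vs-`lw` difference being described, as a small Rust model (assuming a pool entry written without extension; illustrative only, not the backend's code):

/// `ld` loads all 8 bytes of the constant-pool slot verbatim, so whatever
/// was emitted above bit 31 lands in the register unchanged.
fn ld(pool_slot: u64) -> u64 {
    pool_slot
}

/// `lw` loads only the low 4 bytes and sign-extends the 32-bit value, so
/// the register always ends up holding the sign-extended constant.
fn lw(pool_slot: u64) -> u64 {
    (pool_slot as u32) as i32 as i64 as u64
}

fn main() {
    // A 32-bit constant with bit 31 set, emitted into an 8-byte slot
    // without being sign-extended first:
    let slot = 0x0000_0000_8765_4321u64;
    assert_eq!(ld(slot), 0x0000_0000_8765_4321); // high half stays zero
    assert_eq!(lw(slot), 0xFFFF_FFFF_8765_4321); // sign-extended by the load
}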

@alexcrichton (Member, Author)

Thanks for the good eye here. I'll dig into this a bit more and possibly omit this rule, now that there are more constant-related optimizations for comparisons anyway. I'll probably leave it out of the initial attempt.

@afonso360 (Contributor)

Fuzzing has turned up the following testcase:

test interpret
test run
target riscv64gc

function %a() -> i8 {
block0:
    v9 = iconst.i8 128
    v11 = iconst.i8 0
    v17 = icmp uge v9, v11  ; v9 = 128, v11 = 0
    return v17
}

; run: %a() == 1

@alexcrichton (Member, Author)

Do you have native RISC-V hardware that you're fuzzing on, or are you running the fuzzer through QEMU emulation? If the latter, I'll try to throw some compute at this as well to help weed out more issues.

@afonso360 (Contributor) commented Oct 11, 2023

I've been running this natively, but honestly it's not much faster than running it on QEMU on my desktop, so extra compute might help. This example took about 8 hours to show up.

For comparison, natively I'm getting about 50 execs/s per core across 4 cores (with `-max_len=128`).

Edit: also, if you're going to run the fuzzer, you might want to pull in #7208; that one triggered fairly quickly for me.

@alexcrichton (Member, Author)

Ok, so I tried a bit of fuzzing locally and it found a number of crashes, but I was unfortunately unable to reproduce them outside of the fuzzer to diagnose further. The test cases all reproduce on main, however, so I don't think they're related to this PR, and at least one of them reproduced after I applied the addi4spn fix too. Reproduction has been difficult because these tests use libcalls, I believe. One issue was that `clif-util test` doesn't support libcalls in its `test interpret` mode; while that was easy enough to fix, I then hit a panic where relocations failed to be applied, I assume because the libcall target was >4G away, since an assertion for `Reloc::RiscvCallPlt` was tripped.

In any case, I removed the iconst-is-always-extended logic, and otherwise I'll likely be relying on the fuzzing you're doing, as I've not been too successful locally. It did run for a few hours without finding other bugs though, which seems good!

Additionally, I removed the rules that adjust an immediate by +1, since those were the cause of the issue fuzzing found above. I'll need to look at that more later, but I figure this may be good to get in without blocking on it.
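
(The removed rules aren't shown in the thread, but the general hazard with comparison rewrites that adjust an immediate by one is wrap-around at the edge of the type's range. A Rust sketch of that pitfall, illustrative only and not the actual ISLE rule: rewriting an unsigned `x <= c` as `x < c + 1` is valid except when `c + 1` wraps.)

/// Reference semantics for unsigned `x <= c` on an 8-bit value.
fn ule_ref(x: u8, c: u8) -> bool {
    x <= c
}

/// The tempting rewrite `x <= c` => `x < c + 1`: fine away from the edge,
/// but at `c == u8::MAX` the immediate wraps to 0 and the comparison
/// becomes `x < 0`, which is always false.
fn ule_rewritten(x: u8, c: u8) -> bool {
    x < c.wrapping_add(1)
}

fn main() {
    assert_eq!(ule_ref(5, 9), ule_rewritten(5, 9));
    assert_ne!(ule_ref(200, u8::MAX), ule_rewritten(200, u8::MAX)); // the rewrite disagrees here
}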

@afonso360 (Contributor)

Wow, that's interesting. I had noticed that `clif-util test` can't run libcalls, but had forgotten about it since it doesn't cause issues for me that often. But I'm going to try to investigate the other ones.

In any case, I'm going to start fuzzing these latest changes.

@alexcrichton (Member, Author)

crashes.tar.gz contains the crashes that fuzzing found locally for me.

@afonso360 (Contributor)

This one showed up today:

test interpret
test run
target riscv64gc

function %a() -> i8  {
block0:
    v11 = iconst.i8 196
    v17 = icmp ugt v11, v11
    return v17
}

; run: %a() == 0
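
One plausible way this class of bug arises (an illustration of the general hazard, not necessarily the exact root cause or fix in the PR): an unsigned compare on a narrow type needs both operands widened the same way, and mixing a sign-extended register value with the raw 8-bit immediate makes `ugt v, v` come out true, as in this Rust sketch:

fn main() {
    let v: u8 = 196; // the i8 constant from the testcase; bit 7 is set

    // Correct: extend both sides identically (zero-extension here) before
    // the unsigned compare; x > x is never true.
    let zext = v as u64;
    assert!(!(zext > zext));

    // Buggy mixture: one operand sign-extended into the 64-bit register,
    // the other treated as the raw immediate 196.
    let sext = v as i8 as i64 as u64; // 0xFFFF_FFFF_FFFF_FFC4
    let imm = v as u64;               // 0x0000_0000_0000_00C4
    assert!(sext > imm); // `ugt v, v` would incorrectly evaluate to 1
}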

@alexcrichton (Member, Author)

Ooh, nice catch: more signed-vs-unsigned handling that needed fixing (fuzzing is great!)

@afonso360 (Contributor)

This has been fuzzing continuously for the past 36 hours without any crashes 🎉 I'm going to consider that good enough.

@afonso360 afonso360 added this pull request to the merge queue Oct 16, 2023
Merged via the queue into bytecodealliance:main with commit 4f49393 Oct 16, 2023
25 checks passed
@alexcrichton alexcrichton deleted the rv64-icmp branch October 16, 2023 14:03