Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AArch64: Add test cases for callee-saved SIMD & FP registers #2228

Merged
merged 1 commit into from
Oct 2, 2020

Conversation

akirilov-arm
Copy link
Contributor

This PR adds a test case for a function that saves clobbered vector registers on the stack.

Its purpose is to demonstrate that Cranelift deviates from the Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64). The AAPCS64 specifies that a callee is only responsible for saving the bottom 64 bits of a vector register, so it is not possible to preserve a vector value across a function call by keeping it in a callee-saved register (let's ignore the Scalable Vector Extension for the time being, but keep in mind that it makes the rules more complicated).

Due to another deviation from AAPCS64 (the whole vector register is saved instead of only the bottom half), Cranelift effectively implements its own custom calling convention, so there are no correctness issues as long as the generated code calls only functions with the same behaviour.

The wider question is - are we striving for strict AAPCS64 compliance? That's in the non-baldrdash case, of course. I can also open an issue if a longer discussion would be necessary.

cc @cfallin @julian-seward1

@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Sep 24, 2020
@bjorn3
Copy link
Contributor

bjorn3 commented Sep 24, 2020

The default call call conv for aarch64 is systemv, not aapcs. There are some differences between the two in the float handling on at least arm32. For example https://github.com/rust-lang/rust/blob/85fbf49ce0e2274d0acf798f6e703747674feec3/compiler/rustc_target/src/abi/call/arm.rs#L91. I don't know if this is also the case on aarch64 and if so if this is one example.

@github-actions
Copy link

Subscribe to Label Action

cc @bnjbvr

This issue or pull request has been labeled: "cranelift"

Thus the following users have been cc'd because of the following labels:

  • bnjbvr: cranelift

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@akirilov-arm
Copy link
Contributor Author

AArch32 is a completely different case, in which there are many ABI variants indeed.

@cfallin
Copy link
Member

cfallin commented Sep 24, 2020

Thanks @akirilov-arm -- it seems you're right that we could save half of our stack space and memory traffic for FP/vec clobber-saves. This should be a straightforward change in the ABI code; I want to verify first that this won't break SpiderMonkey (I think not, as every register is caller-save, IIRC) then I'll create a PR.

@akirilov-arm
Copy link
Contributor Author

@cfallin I might have misunderstood you, but let me rephrase: There are 2 issues - one on the caller and one on the callee side. You are talking about the latter, but I am more concerned about the former. Consider the test case in the PR - the compiler is stashing the vectors (and they are full, 128-bit vectors) before the call to %g1 in v8, v9, and v10, respectively, which is not going to work in the general case, assuming AAPCS64 compliance.

For comparison, here is what GCC and LLVM are doing in an equivalent situation.

@cfallin
Copy link
Member

cfallin commented Sep 26, 2020

Ah, right, this is a bigger issue with respect to callsites; I had missed that half of it, sorry. So because we don't reason about half-clobbers currently (or overlapping registers), we need to treat all vector registers as clobbers on the caller side. I'll create a patch for this first thing Monday.

@cfallin
Copy link
Member

cfallin commented Sep 29, 2020

So on further consideration (and after doing the one-line patch suggested and studying its effect), the solution is actually more complex than I had suggested above:

  • We need to consider the "half caller-saves" as caller-saves when we generate calls, or else we incorrectly use registers that will be clobbered, as you say.
  • At the same time, we need to consider the "half caller-saves" as callee-saves when we generate prologues and epilogues, because if we modify the low half we really do need to save them per the ABI.

The problem is how these interact: any function call at all now forces a bunch of clobber saves/restores in the caller's prologue and epilogue, because the half-clobbers from the call become full-clobbers (due to the imprecision of not reasoning about half-registers).

In essence, we need a way to "pass through" the half-clobbers from callsite to prologue/epilogue, as long as callee and caller have the same ABI; only true clobbers from the rest of the function body should alter the prologue/epilogue.

I can think of a few hacks, e.g. add the full-clobbers at callsites (as defs) to avoid regalloc using the registers, but compute our own clobber-set by scanning over every instruction except callsites; but this feels very error-prone and fragile compared to simply doing the right thing and reasoning through overlapping registers.

@julian-seward1, thoughts on this re: regalloc and how difficult it would be to support half-defs / half-clobbers?

@akirilov-arm
Copy link
Contributor Author

@cfallin I realized that I should expand my changes and add test cases with f32 and f64 values, so do not merge anything, please.

In essence, we need a way to "pass through" the half-clobbers from callsite to prologue/epilogue, as long as callee and caller have the same ABI...

Do you mean that the caller lowering should pass somehow the clobber information to the callee lowering? I thought that lowering passes for different functions were more or less independent, and indeed happened in different worker threads...

Unfortunately I can't say anything the rest of your comments because I am missing some information - in particular, your quick fix and its effects (expanding the test cases should help with the latter).

@cfallin
Copy link
Member

cfallin commented Sep 29, 2020

Do you mean that the caller lowering should pass somehow the clobber information to the callee lowering? I thought that lowering passes for different functions were more or less independent, and indeed happened in different worker threads...

No, what I mean is more conceptual: if the callee clobbers some set of registers C, then by calling it, the caller clobbers at least C as well (that's what I meant by "pass through"; it's something that intrinsically happens, not something we are doing; sorry, it was unclear). So if the caller and callee have the same ABI, then I think we can effectively hack around this limitation by:

  • Adopting conservative definitions of caller- and callee-save registers at callsite generation and prologue generation respectively, so the half-and-half vector registers in question are both caller- and callee-save, BUT

  • Ignoring the call instructions' defs/clobbers when computing the saved-clobbers list for prologue generation, because (if same ABI) anything it clobbers, we are also allowed to clobber without saving.

So this will require a small change to regalloc.rs, namely a single bit per instruction to indicate "exclude mods/defs from clobbers". @julian-seward1 and @bnjbvr, thoughts on this?

@akirilov-arm akirilov-arm changed the title AArch64: Add a test case for vector callee-saved registers AArch64: Add test cases for callee-saved SIMD & FP registers Sep 30, 2020
@akirilov-arm
Copy link
Contributor Author

@cfallin This PR is now ready. IMHO we should merge it (provided that there are no suggestions for improvement) because it should make it a little bit easier to experiment with the ABI and register allocator implementations while trying to devise a solution for the wider issue.

@cfallin
Copy link
Member

cfallin commented Oct 2, 2020

Yes, agreed; I'll do that and then create a separate issue to track the fix (conforming with AAPCS half-caller/half-callee-save convention).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants