
Allow multiple kmask registers to be allocated and cleanup some codegen around them #89059

Merged (19 commits) on Jul 25, 2023

Conversation

@tannergooding tannergooding commented Jul 17, 2023

Various pessimizations were found in the kmask handling that were causing suboptimal performance numbers for algorithms updated to support V512.

To address this, the PR ensures that kmask registers are properly supported in the register allocator and that the core patterns necessary are recognized around them to get efficient codegen.

The asmdiffs, particularly for Windows x64, are very positive.

The TP impact, however, is not great and comes from LSRA needing additional checks in many paths to handle the new register kind. I don't think there is a trivial fix for this as an additional register file simply requires more checks/handling.

The Arm64 TP impact comes from the fix to calleeSaveRegs/callerSaveRegs called out here: https://github.com/dotnet/runtime/pull/89059/files#r1269918053. Namely, they were doing an IsType check rather than a UsesReg check and were technically returning the wrong CALLEE_SAVED set for TYP_STRUCT. It's unclear why this is viewed as more expensive given that the two are doing similar work (IsType does (varTypeClassification[TypeGet(vt)] & VTF_INT) != 0, while UsesReg does varTypeRegister[TypeGet(vt)] == VTR_INT). This is really just the difference between test; jcc vs cmp; jcc.
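As a rough standalone sketch (the table contents and enum values here are illustrative stand-ins, not the JIT's actual definitions), the two check shapes compare like this:

#include <cstdint>

// Illustrative stand-ins only; the real JIT tables and flag values differ.
enum VarTypeFlags : uint8_t { VTF_INT = 0x01, VTF_FLT = 0x02 };
enum VarTypeReg   : uint8_t { VTR_INT = 0, VTR_FLOAT = 1 };

static const uint8_t varTypeClassification[] = { VTF_INT, VTF_FLT };
static const uint8_t varTypeRegister[]       = { VTR_INT, VTR_FLOAT };

// IsType-style check: a flag test, which typically lowers to test; jcc
bool usesIntFlags(int vt)
{
    return (varTypeClassification[vt] & VTF_INT) != 0;
}

// UsesReg-style check: an equality test, which typically lowers to cmp; jcc
bool usesIntReg(int vt)
{
    return varTypeRegister[vt] == VTR_INT;
}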

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 17, 2023
@ghost ghost assigned tannergooding Jul 17, 2023
Comment on lines +1992 to +2004
#if defined(TARGET_XARCH)
// xarch has mask registers available when EVEX is supported

if (compiler->canUseEvexEncoding())
{
for (unsigned int i = 0; i < lsraRegOrderMskSize; i++)
{
regNumber reg = lsraRegOrderMsk[i];
RegRecord* curr = &physRegs[reg];
curr->regOrder = (unsigned char)i;
}
}
#endif // TARGET_XARCH
Member Author

Much of the TP regression comes from this. LSRA is very sensitive to the additional cost from building these extra 8 registers.

Comment on lines 896 to 898
#if defined(TARGET_XARCH)
killMask &= ~RBM_MSK_CALLEE_TRASH;
#endif // TARGET_XARCH
Member Author

Perhaps interestingly, having this vs. not having it makes a difference to which integer registers get allocated.

That seems odd in general; one would expect that whether the kill mask includes or excludes floating-point/mask registers shouldn't impact how integer registers are allocated (particularly when there are no floating-point or mask registers actually in use).

@@ -92,7 +92,9 @@
#define REG_MASK_FIRST REG_K0
#define REG_MASK_LAST REG_K7

#define RBM_ALLMASK RBM_K1
#define RBM_ALLMASK_INIT (0)
#define RBM_ALLMASK_EVEX (RBM_K1 | RBM_K2 | RBM_K3 | RBM_K4 | RBM_K5 | RBM_K6 | RBM_K7)
Member Author

Notably this doesn't include K0 because it is a bit special.

In particular, while K0 can be used as an input or an output of many of the kmask instructions, it cannot be used as a predicate to many of the SIMD instructions because it instead represents "no predication".

We should likely ensure K0 can still be used longer term, but for this PR it's unnecessary.

Comment on lines +10774 to +10783
case NI_AVX512F_Add:
case NI_AVX512BW_Add:
case NI_AVX512F_And:
case NI_AVX512DQ_And:
case NI_AVX512F_AndNot:
case NI_AVX512DQ_AndNot:
case NI_AVX512F_Or:
case NI_AVX512DQ_Or:
case NI_AVX512F_Xor:
case NI_AVX512DQ_Xor:
Member Author

This covers the core bitwise patterns around kmask registers. It notably doesn't factor in knot, kshift, or kxnor, but those are somewhat less common and can be handled in the future.
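As a rough standalone illustration (a plain uint8_t stands in for a k-register and the mask values are made up), this is the kind of data flow that can now stay in mask form:

#include <cstdint>

// Masks produced by two hypothetical 8-element vector comparisons; each bit is
// set where the corresponding elements matched.
uint8_t eqMaskA = 0b10101010;
uint8_t eqMaskB = 0b11001100;

// With the patterns above recognized, And/Or/Xor over two mask-producing nodes
// can lower to a kmask instruction (kand/kor/kxor) instead of converting both
// masks to vectors, combining them with vpand/vpor/vpxor, and converting back.
uint8_t combined = uint8_t(eqMaskA & eqMaskB); // 0b10001000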

Comment on lines +11846 to +11847
// We want to ensure that we get a TYP_MASK local to
// ensure the relevant optimizations can kick in
Member Author (Jul 20, 2023)

This ensures we don't typically end up with a local hiding the ConvertMaskToVector and results in overall better codegen since most cases allow it to be elided.

Member

And this is not needed for ConvertVectorToMask?

Member Author

Right. We just care about seeing the TYP_MASK node since it needs special handling. TYP_SIMD nodes are already well handled throughout.

@tannergooding tannergooding marked this pull request as ready for review July 21, 2023 20:09
@tannergooding (Member Author)

> Do you mind sending a separate PR for this change? It is unrelated to the kmask register change and it will be easier to review and spot that asmdiff/TP is not impacted for arm64 because of this change.

I extracted it to an isolated PR, but it's worth noting it is not completely unrelated to the kmask register change. We need to check for and handle the case where the register type is TYP_MASK in these methods.

@kunalspathak (Member) left a comment

Just to confirm, the TP regression is coming from creating the mask registers (as you pointed out), but its impact should be similar on Minopts vs. Regular. Do we know why MinOpts shows a higher TP regression? I also don't understand why replacing the varTypeIsFloating with varTypeUsesIntReg would cause it. Did you find out the assembly difference between the two?

(numbers are similar for linux/x64):

[image: TP diff summary]

Also, do we know why the asmdiffs for linux/x64 are way smaller (858 bytes) than for windows/x64 (11K bytes)?

Could you confirm why clang shows TP regression on arm64?

[image: arm64 TP diff summary]

DEF_TP(MASK ,"mask" , TYP_MASK, 8, 8, 8, 2, 8, VTR_MASK, availableMaskRegs, VTF_ANY)
DEF_TP(SIMD32 ,"simd32" , TYP_SIMD32, 32,32, 32, 8,16, VTR_FLOAT, availableDoubleRegs, RBM_FLT_CALLEE_SAVED, RBM_FLT_CALLEE_TRASH, VTF_S|VTF_VEC)
DEF_TP(SIMD64 ,"simd64" , TYP_SIMD64, 64,64, 64, 16,16, VTR_FLOAT, availableDoubleRegs, RBM_FLT_CALLEE_SAVED, RBM_FLT_CALLEE_TRASH, VTF_S|VTF_VEC)
DEF_TP(MASK ,"mask" , TYP_MASK, 8, 8, 8, 2, 8, VTR_MASK, availableMaskRegs, RBM_MSK_CALLEE_SAVED, RBM_MSK_CALLEE_TRASH, VTF_S)
Member

Earlier, was this mistakenly marked as VTF_ANY instead of VTF_S?

Member Author

Yes. It is strictly a struct and should be involved in struct-like copying and other optimizations.

src/coreclr/jit/hwintrinsiccodegenxarch.cpp
@@ -1927,11 +2002,16 @@ GenTree* Lowering::LowerHWIntrinsicCmpOp(GenTreeHWIntrinsic* node, genTreeOps cm

default:
{
unreached();
maskIntrinsicId = NI_AVX512F_NotMask;
Member

Can you elaborate on how all the default cases set NI_AVX512F_NotMask? Can we add an assert to make sure that we are setting this for the correct intrinsic id?

Member Author

There are already a couple of lengthy comments above this code elaborating on the overall needs and the needs of this path in particular.

For the cases where we have a partial mask, we need to invert the comparison. This is because matching elements produce a 1 bit and everything else is 0, including any bits that are unused by the comparison.

Consider, for example, Vector128<int>: we have 4 bits in the mask, but the mask instruction always checks at least 8 bits. If we did CompareEqual and all elements matched, we'd still get 0b0000_1111, so we couldn't easily check "are all matching". However, if we invert the comparison and do CompareNotEqual, we get 0b0000_0000 when all elements match and can then trivially check "are all matching" against zero.
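A small standalone sketch of that Vector128<int> example (a plain uint8_t standing in for the 8-bit k-register):

#include <cstdint>

// Vector128<int> has 4 elements, but the k-register holds at least 8 bits.
uint8_t eqMask  = 0b00001111; // CompareEqual, all 4 elements equal
uint8_t neqMask = 0b00000000; // CompareNotEqual of the same inputs

// "did all elements match?"
bool allMatchViaEq  = (eqMask == 0b00001111); // needs an element-count-specific constant
bool allMatchViaNeq = (neqMask == 0);         // a plain compare-against-zero (e.g. via kortest)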

This is meant to handle the general case of "other mask intrinsic" where we don't have a well-known inverse. Notably, I think I forgot a bit of code here: we need to either invert just the lower n bits or always set the upper 4 bits. Will fix to ensure that is happening.

src/coreclr/jit/lowerxarch.cpp (outdated)
src/coreclr/jit/lsrabuild.cpp (outdated)
@@ -3968,6 +3998,12 @@ int LinearScan::BuildReturn(GenTree* tree)
{
buildInternalIntRegisterDefForNode(tree, dstRegMask);
}
#if defined(TARGET_XARCH) && defined(FEATURE_SIMD)
else if (varTypeUsesMaskReg(dstType))
Member

Should this check be guarded to AVX512 platforms only, to reduce the TP impact on non-AVX512 xarch platforms?

Member Author

varTypeUsesMaskReg is basically doing a dstType == TYP_MASK check (and we could actually optimize it to exactly that, given we only have the one mask type today).

Checking comp->canUseEvexEncoding would just add an additional branch and not really save us much (we'd be trading one comparison/branch for another). It may even slow down actual execution due to more branches being in a small code window, which can pessimize the branch predictor.

Member

Sure, that's what I thought, because it is still a runtime check and will cost us. In LSRA, I think this check would have increased the TP regression.

Member Author

Right. It will increase the cost to get a floating-point internal register. But that cost basically has to exist; we're either going to have else if (comp->canUseEvexEncoding && varTypeUsesMaskReg(dstType)), or else if (varTypeUsesMaskReg(dstType)), or else if (varTypeUsesFloatReg(dstType)).

The first is a memory access, branch, comparison, and branch.

The second is currently just a comparison and branch.

The third is currently a memory access, comparison, and branch.

So what we have currently is the cheapest we can make it.
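Spelled out as a standalone sketch (mock stand-ins for the compiler and LSRA helpers; the commented-out helper names are assumed for illustration), the three shapes look like this:

#include <cstdint>

enum var_types : uint8_t { TYP_INT, TYP_FLOAT, TYP_MASK };

struct MockCompiler { bool canUseEvexEncoding; };

static const uint8_t varTypeRegister[] = { 0 /* VTR_INT */, 1 /* VTR_FLOAT */, 2 /* VTR_MASK */ };

inline bool varTypeUsesMaskReg(var_types vt)  { return vt == TYP_MASK; }           // comparison + branch
inline bool varTypeUsesFloatReg(var_types vt) { return varTypeRegister[vt] == 1; } // load + comparison + branch

void buildReturnSketch(MockCompiler* comp, var_types dstType)
{
    (void)comp; // only needed for the first option below

    // Option 1: memory access + branch + comparison + branch
    //   else if (comp->canUseEvexEncoding && varTypeUsesMaskReg(dstType)) { ... }

    // Option 2 (what the PR does): comparison + branch
    if (varTypeUsesMaskReg(dstType))
    {
        // buildInternalMaskRegisterDefForNode(...) -- name assumed for illustration
    }
    // Option 3: memory access + comparison + branch
    else if (varTypeUsesFloatReg(dstType))
    {
        // buildInternalFloatRegisterDefForNode(...) -- name assumed for illustration
    }
}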

Member Author

We could potentially have a small lookup table that maps the type to a fnptr representing the buildInternalReg call. However, that may well be more expensive in terms of actual execution cost.

I'd think there are better ways to win the TP back instead.
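For illustration only, a standalone sketch of that lookup-table idea with mock types (the real LSRA build helpers and their signatures differ):

#include <cstdio>

enum VarTypeReg { VTR_INT = 0, VTR_FLOAT = 1, VTR_MASK = 2 };

struct MockLsra
{
    void buildInternalIntReg()   { std::puts("int internal reg"); }
    void buildInternalFloatReg() { std::puts("float internal reg"); }
    void buildInternalMaskReg()  { std::puts("mask internal reg"); }
};

using BuildFn = void (MockLsra::*)();

// one entry per register file, indexed by the type's register classification
static const BuildFn buildFns[] = {
    &MockLsra::buildInternalIntReg,   // VTR_INT
    &MockLsra::buildInternalFloatReg, // VTR_FLOAT
    &MockLsra::buildInternalMaskReg,  // VTR_MASK
};

int main()
{
    MockLsra lsra;
    VarTypeReg regKind = VTR_MASK;

    // one indexed load plus an indirect call, instead of an if/else chain
    (lsra.*buildFns[regKind])();
    return 0;
}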

@ghost ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Jul 24, 2023
@tannergooding (Member Author)

> Just to confirm, the TP regression is coming from creating the mask registers (as you pointed out), but its impact should be similar on Minopts vs. Regular. Do we know why MinOpts shows a higher TP regression? I also don't understand why replacing the varTypeIsFloating with varTypeUsesIntReg would cause it. Did you find out the assembly difference between the two?

They won't and shouldn't be similar, as the number of total instructions executed for minopts vs. fullopts has never been close. fullopts itself executes a lot more overall code to do optimizations, CSE, folding, inlining, etc., so the "new" instructions introduced to LSRA represent a smaller percentage of the overall work done.

You can see the following, for example, where fullopts is executing 183.66x the number of instructions:

Context  | Collection                            | Base # instructions | Diff # instructions | PDIFF
minopts  | libraries.pmi.windows.x64.checked.mch |       1,446,462,124 |       1,465,926,185 | +1.35%
fullopts | libraries.pmi.windows.x64.checked.mch |     265,659,020,750 |     266,361,349,309 | +0.26%

@tannergooding (Member Author)

> Also, do we know why the asmdiffs for linux/x64 are way smaller (858 bytes) than for windows/x64 (11K bytes)?

Linux x64 doesn't include aspnet while Windows x64 does. For Windows this represents -10,803 bytes of asm.

> Could you confirm why clang shows TP regression on arm64?

TP is a measurement of instructions executed, not strictly a measurement of perf. Clang prefers to emit more instructions by default, generating code that it believes will take less time to execute, so it will almost always show a worse TP regression than MSVC; in practice the code will take a similar amount of time to execute end to end. There are a great number of scenarios where longer codegen (more instructions) is better than shorter codegen (fewer instructions), as well as many scenarios where a change may not be on a hot path, may get removed via PGO, or may actually execute more code but result in faster execution time. The only way to know for sure whether a change like this is good or bad is to actually measure the average execution time for various minopts vs. fullopts scenarios.

}

// Not all of the callee trash values are constant, so don't declare this as a method local static
@kunalspathak (Member) Jul 24, 2023

Not for this PR, but should we refactor this in other places as well, if a similar situation exists?

Member Author

The other places are currently all constant, from what I saw at least. If there are any other non-constant places introduced in this PR, then we should update them.

@kunalspathak (Member)

> so it will almost always show a worse TP regression than MSVC

I meant that for arm64, this should have shown an improvement regardless of MSVC or Clang, and I wanted to see the diff (if possible) to understand why arm64 has a regression only under Clang.

@tannergooding (Member Author)

> I meant that for arm64, this should have shown an improvement regardless of MSVC or Clang, and I wanted to see the diff (if possible) to understand why arm64 has a regression only under Clang.

That assumes Clang will generate the same code as MSVC, when they can frequently differ. In this case it was because varTypeCalleeTrashRegs wasn't constant (RBM_FLT_CALLEE_TRASH and RBM_MSK_CALLEE_TRASH aren't constant), and the number of instructions required to support that is significantly higher when compiling with Clang.

This should be resolved now that I've changed it to be an array that's initialized when the LSRA is created instead.
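A standalone sketch of the difference (mock types; computeCalleeTrash is a made-up stand-in for building the set from the non-constant RBM_* values):

#include <cstdint>

using regMaskTP = uint64_t;

// stand-in for computing a trash set from the non-constant RBM_* values
static regMaskTP computeCalleeTrash(int typeIndex)
{
    return 0xFF00ull << typeIndex;
}

// Before (sketch): a function-local static with a non-constant initializer
// needs a thread-safe initialization guard that is checked on every call, and
// Clang expands that guard into noticeably more instructions.
regMaskTP calleeTrashViaLocalStatic(int typeIndex)
{
    static const regMaskTP table[] = {
        computeCalleeTrash(0), computeCalleeTrash(1), computeCalleeTrash(2)
    };
    return table[typeIndex];
}

// After (sketch): a member array filled in once when the LSRA object is
// created, so each lookup is just an indexed load.
struct MockLsra
{
    regMaskTP varTypeCalleeTrashRegs[3];

    MockLsra()
    {
        for (int i = 0; i < 3; i++)
        {
            varTypeCalleeTrashRegs[i] = computeCalleeTrash(i);
        }
    }
};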

@kunalspathak (Member)

Seems there is a new test failure:

Assert failure(PID 4303 [0x000010cf], Thread: 4380 [0x111c]): Assertion failed '(reinterpret_cast<size_t>(addr) & (byteSize - 1)) == 0' in 'System.Linq.Enumerable:Sum[long,long](System.Collections.Generic.IEnumerable`1[long]):long' during 'Emit code' (IL size 92; hash 0x698aae1b; Tier1)

    File: /__w/1/s/src/coreclr/jit/emitxarch.cpp Line: 14085
    Image: /datadisks/disk1/work/A3DE08FB/p/dotnet

@tannergooding (Member Author)

> Seems there is a new test failure:

Yes, I'm looking into it, but I'm not convinced it's related. It also showed up on a previously passing commit, and there hasn't been any change to the constant handling for the bit of code in question (the only change was from using vpand to using vptest, both of which expect a 32-byte aligned constant).

@tannergooding (Member Author)

Found the issue. There was an edge case under DPGO where we were getting an op_equality(and(x, cns), zero) and where the and(x, cns) had been transformed into and(x, broadcast(cns)) to take advantage of embedded broadcast. In that scenario, we shouldn't transform to ptest(x, broadcast(cns)) but should rather keep it as tmp = and(x, cns); ptest(tmp, tmp) (as that is better than tmp = broadcast(cns); ptest(x, tmp)).

@kunalspathak (Member) left a comment

LGTM
