
Change Vector2/3/4, Quaternion, Plane, Vector<T>, and Vector64/128/256/512<T> to be implemented in managed where trivially possible #102301

Merged Jun 5, 2024 (11 commits)

Conversation

tannergooding
Member

@tannergooding tannergooding commented May 16, 2024

This is a basic prototype for #102275 and covers all the vectorized types where the handling can be done trivially in managed code (such as by deferring to another intrinsic API).

As part of that, it also removes MethodImpl(MethodImplOptions.AggressiveInlining) from various APIs where the method's IL size is approximately below the always-inline threshold for RyuJIT (this was estimated by examining a few functions and finding that 2 or fewer calls with no complex logic typically fit within it).
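As a minimal sketch of the "defer to another intrinsic API" shape described above (illustrative only, not the actual corelib source; the extension methods and operator it forwards to are real public API):

```csharp
// Sketch: a managed Vector4 helper that forwards to the Vector128 intrinsic
// surface. The body is small enough that RyuJIT's inliner handles it without
// an explicit MethodImpl(MethodImplOptions.AggressiveInlining) attribute.
// Illustrative only; VectorSketch is a hypothetical name, not corelib code.
using System.Numerics;
using System.Runtime.Intrinsics;

internal static class VectorSketch
{
    public static Vector4 Add(Vector4 left, Vector4 right)
    {
        // Reinterpret as Vector128<float> and use its intrinsic operator,
        // which the JIT expands to a single SIMD add where supported.
        return (left.AsVector128() + right.AsVector128()).AsVector4();
    }
}
```

The managed body stays a thin forwarder, so the JIT only needs to recognize the one intrinsic it defers to.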

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 16, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding tannergooding changed the title Change Vector4, Quaternion, and Plane to be implemented entirely in managed Change Vector2/3/4, Quaternion, Plane, Vector<T>, and Vector64/128/256/512<T> to be implemented in managed where trivially possible May 16, 2024
@tannergooding
Member Author

Reopened now that #102702, #102827, and #102973 have gone in. We should no longer see pessimizations caused by needing constants during import, so we should see fewer diffs overall and ideally be in a position to consider getting this merged and building on it further.

@build-analysis build-analysis bot mentioned this pull request Jun 3, 2024
@tannergooding
Member Author

Things are already looking much better than before.

minopts size regressions are expected, since getting the optimized codegen now requires inlining. The overall impact still needs to be discussed, but I expect it's fine since it's Tier 0 code.

fullopts has almost no changes, and we are doing the right thing for the most part. There are some more complex methods where we run into the limits of the default inlining heuristics, given that we no longer get the [Intrinsic] profitability boost. We could solve this by still treating APIs in System.Runtime.Intrinsics as [Intrinsic], regardless of whether or not they're annotated. This gives us the profitability boost without needing to also have the implementation in managed.

There is an assert around V512.CreateScalar that needs to be looked at, and some Mono failures which at a glance look to be a legitimate bug in Mono.

/// <returns><paramref name="value" /> reinterpreted as a new <see cref="Plane" />.</returns>
internal static Plane AsPlane(this Vector4 value)
{
#if MONO
Member
Is this a perf problem or a functionality problem?

Either way, it is likely going to be fixed as a side effect of #102988.

Member Author

Possibly a functionality problem that I've not dug into yet. The previously tested commit had seen some failures related to Quaternion/Vector4/Plane and their equality tests, but only on Mono.

My current guess is that there's something subtly incorrect in the Mono handling that breaks for SIMD here and it will need a fix before BitCast can be used, but I'd like to try and ensure CoreCLR is clean without regressions before I do any more in depth Mono changes.

@tannergooding
Member Author

@MihuBot

@tannergooding
Member Author

tannergooding commented Jun 5, 2024

One regression is in System.Drawing.RectangleF:ToVector4()

-       vmovups  xmm0, xmmword ptr [rdi]
+       vmovss   xmm0, dword ptr [rdi]
+       vinsertps xmm0, xmm0, dword ptr [rdi+0x04], 16
+       vinsertps xmm0, xmm0, dword ptr [rdi+0x08], 32
+       vinsertps xmm0, xmm0, dword ptr [rdi+0x0C], 48

This one is because while we have fgMorphCombineSIMDFieldStores, we don't have an fgMorphCombineSIMDFieldLoads; instead, we only had some logic as part of NI_Vector4_Create that looked for consecutive field accesses. This shouldn't be a blocker and can be handled fairly easily in a follow-up PR.
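The C# shape behind the diff above can be sketched with a hypothetical struct mirroring RectangleF's field layout (RectF and its members are illustrative names, not the System.Drawing source):

```csharp
// Hypothetical struct mirroring System.Drawing.RectangleF's field layout.
// The four float fields are adjacent in memory, so the four scalar loads
// produced by the Vector4 constructor could, in principle, be folded into a
// single 16-byte vector load by an fgMorphCombineSIMDFieldLoads-style pass.
using System.Numerics;

internal struct RectF
{
    public float X;      // offset 0x00
    public float Y;      // offset 0x04
    public float Width;  // offset 0x08
    public float Height; // offset 0x0C

    // Loads four consecutive fields; per the diff above, this currently
    // emits vmovss plus three vinsertps rather than one vmovups.
    public Vector4 ToVector4() => new Vector4(X, Y, Width, Height);
}
```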


There's a small regression in System.IO.Hashing.Crc32:UpdateVectorized:

 G_M1352_IG02:
        mov      rax, rsi
        mov      ecx, edx
-       vmovups  xmm0, xmmword ptr [rax]
        cmp      ecx, 128
        jge      SHORT G_M1352_IG04
- 						;; size=17 bbWeight=1 PerfScore 5.75
+						;; size=13 bbWeight=1 PerfScore 1.75
 G_M1352_IG03:
-       vmovd    xmm1, rdi
-       vpxor    xmm0, xmm1, xmm0
+       vmovd    xmm0, rdi
+       vpxor    xmm0, xmm0, xmmword ptr [rax]
        jmp      G_M1352_IG07
        align    [0 bytes for IG05]
- 						;; size=13 bbWeight=0.50 PerfScore 2.17
+						;; size=13 bbWeight=0.50 PerfScore 3.50
 G_M1352_IG04:
+       vmovups  xmm0, xmmword ptr [rsi]
        vmovups  xmm1, xmmword ptr [rsi+0x10]

Nothing major, just a place where we don't share a load anymore.


In general, the JIT ends up needing to create many additional impAppendStmt, Inlining Arg, spilled call-like call argument, and Inline ldloca(s) first use temp locals. Most of these end up zero-ref, so there's potential for the JIT to optimize such scenarios better and avoid additional or unnecessary overhead (throughput-wise).

Most of the corelib regressions are in the xplat APIs that don't have any accelerated implementation today, like Dot&lt;short&gt; (there's no handling for multiply, since it would have required fairly complex IR to support). Others are cases like Vector512.OnesComplement, which would be fixed if we had it defer to op_OnesComplement rather than Vector256.OnesComplement (an easy fix in a follow-up PR).
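The easy Vector512.OnesComplement fix mentioned above could be sketched as follows (illustrative only; the operator it defers to is real public API, but this is not the corelib source):

```csharp
// Sketch of the follow-up fix: defer Vector512.OnesComplement to the type's
// own op_OnesComplement (~) rather than splitting the work into two
// Vector256.OnesComplement calls, letting the JIT emit a single 512-bit op
// where the hardware supports it. Vector512Sketch is a hypothetical name.
using System.Runtime.Intrinsics;

internal static class Vector512Sketch
{
    public static Vector512<T> OnesComplement<T>(Vector512<T> vector)
        => ~vector; // op_OnesComplement on Vector512<T>
}
```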

All in all, I think we are in a position where taking this is feasible; it doesn't look to regress any meaningful scenarios, only the already-unaccelerated edge cases. Mono is being left as is for now, since its inliner isn't as robust as RyuJIT's, but we should be able to independently investigate removing the Mono support for the xplat APIs as well, so it only needs to support the platform-specific APIs.

CC. @dotnet/jit-contrib

Member

@EgorBo EgorBo left a comment

Nice clean up in jit! 👍

@lewing
Member

lewing commented Jun 11, 2024

wasm regressions here dotnet/perf-autofiling-issues#36108 (aot) and improvements dotnet/perf-autofiling-issues#36093 (interp)

cc @radekdoulik

@tannergooding
Member Author

Likely regressions:

These should be resolved with #103177

wasm regressions here dotnet/perf-autofiling-issues#36108 (aot)

A little surprised that wasm AOT regressed, since the Mono handling hadn't been touched and the AOT support should already have been accelerated for these APIs. Is the support for PackedSIMD separate from the mainline Mono support around MonoLLVM?

improvements dotnet/perf-autofiling-issues#36093 (interp)

👍, fairly significant improvements at that. I know that the interp hadn't accelerated the Vector2/3/4 APIs but had accelerated the Vector128 APIs, so this lets it do less work and still get the benefits.

@lewing
Member

lewing commented Jun 12, 2024

Likely regressions:

These should be resolved with #103177

wasm regressions here dotnet/perf-autofiling-issues#36108 (aot)

A little surprised that wasm AOT regressed, since the Mono handling hadn't been touched and the AOT support should already have been accelerated for these APIs. Is the support for PackedSIMD separate from the mainline Mono support around MonoLLVM?

I find it surprising too. @radekdoulik, can you take a look at our codegen here, please?

improvements dotnet/perf-autofiling-issues#36093 (interp)

👍, fairly significant improvements at that. I know that the interp hadn't accelerated the Vector2/3/4 APIs but had accelerated the Vector128 APIs, so this lets it do less work and still get the benefits.

Yes, nice improvements

@matouskozak
Member

matouskozak commented Jun 14, 2024

Possibly related: noticeable size improvements on Mono iOS HelloWorld, dotnet/perf-autofiling-issues#35768

@github-actions github-actions bot locked and limited conversation to collaborators Jul 21, 2024