Improve SpanHelpers.ClearWithReferences for arm64 #93346

Closed
wants to merge 6 commits

Conversation

@EgorBo (Member) commented Oct 11, 2023

Follow-up to #93214, now for Span<T>.Clear() where T contains GC references.

I noticed that the existing implementation is slightly suboptimal on arm64: it relies on complex addressing modes, and the stores are not 16-byte aligned. It turns out that 16-byte alignment makes a lot of sense even when we don't use SIMD: stp (a single instruction performing two 8-byte stores) benefits from it as well.
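Roughly, the new shape is the following (an illustrative raw-pointer sketch, not the PR's exact code; the real implementation works on managed refs and keeps every pointer-sized store atomic for the GC):

```csharp
using System;

static unsafe class AlignedClearSketch
{
    static void ClearAligned(ulong* p, nuint count) // count = pointer-sized slots
    {
        if (count == 0)
            return;

        // Zero the (possibly unaligned) first slot, then round p up to the
        // next 16-byte boundary; overlapping zero-stores are harmless.
        *p = 0;
        ulong* aligned = (ulong*)(((nuint)p + 15) & ~(nuint)15);
        count -= (nuint)(aligned - p);

        // Main loop: 64 bytes per iteration from a 16-byte-aligned base, so
        // each adjacent pair of 8-byte stores can be emitted as a single stp.
        while (count >= 8)
        {
            aligned[0] = 0; aligned[1] = 0;
            aligned[2] = 0; aligned[3] = 0;
            aligned[4] = 0; aligned[5] = 0;
            aligned[6] = 0; aligned[7] = 0;
            aligned += 8;
            count -= 8;
        }

        // Tail: clear the remaining 0..7 slots.
        for (nuint i = 0; i < count; i++)
            aligned[i] = 0;
    }
}
```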

Standalone benchmark (for simpler tests)
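A minimal BenchmarkDotNet harness of this kind looks roughly like the sketch below (an assumption about the harness shape; comparing Clear_Base vs. Clear_PR requires either copying both ClearWithReferences variants into the benchmark or running this against two runtime builds):

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ClearBench
{
    [Params(1, 5, 8, 16, 64, 512, 1024, 4096)]
    public int len;

    private object?[] _arr = null!;

    [GlobalSetup]
    public void Setup() => _arr = new object?[len];

    // Span<object>.Clear() dispatches to SpanHelpers.ClearWithReferences.
    [Benchmark]
    public void Clear() => _arr.AsSpan().Clear();

    public static void Main() => BenchmarkRunner.Run<ClearBench>();
}
```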

Apple M2 Max (also tested on Ampere):

| Method     | len  | Mean       | Ratio |
|----------- |----- |-----------:|------:|
| Clear_Base | 1    |   1.199 ns |  1.00 |
| Clear_PR   | 1    |   1.186 ns |  0.99 |
|            |      |            |       |
| Clear_Base | 5    |   1.215 ns |  1.00 |
| Clear_PR   | 5    |   1.215 ns |  1.00 |
|            |      |            |       |
| Clear_Base | 8    |   1.617 ns |  1.00 |
| Clear_PR   | 8    |   1.269 ns |  0.78 |
|            |      |            |       |
| Clear_Base | 16   |   2.405 ns |  1.00 |
| Clear_PR   | 16   |   2.061 ns |  0.86 |
|            |      |            |       |
| Clear_Base | 64   |   8.727 ns |  1.00 |
| Clear_PR   | 64   |   5.415 ns |  0.62 |
|            |      |            |       |
| Clear_Base | 512  |  70.081 ns |  1.00 |
| Clear_PR   | 512  |  35.482 ns |  0.51 |
|            |      |            |       |
| Clear_Base | 1024 | 140.080 ns |  1.00 |
| Clear_PR   | 1024 |  72.254 ns |  0.52 |
|            |      |            |       |
| Clear_Base | 4096 | 559.551 ns |  1.00 |
| Clear_PR   | 4096 | 333.439 ns |  0.60 |

Ryzen 7950X:

| Method     | len  | Mean        | Ratio |
|----------- |----- |------------:|------:|
| Clear_Base | 1    |   0.3854 ns |  1.00 |
| Clear_PR   | 1    |   0.3798 ns |  0.99 |
|            |      |             |       |
| Clear_Base | 5    |   0.5310 ns |  1.00 |
| Clear_PR   | 5    |   0.3907 ns |  0.74 |
|            |      |             |       |
| Clear_Base | 8    |   0.8995 ns |  1.00 |
| Clear_PR   | 8    |   0.5264 ns |  0.59 |
|            |      |             |       |
| Clear_Base | 16   |   1.4398 ns |  1.00 |
| Clear_PR   | 16   |   1.2039 ns |  0.84 |
|            |      |             |       |
| Clear_Base | 64   |   5.1603 ns |  1.00 |
| Clear_PR   | 64   |   5.1023 ns |  0.99 |
|            |      |             |       |
| Clear_Base | 512  |  46.0601 ns |  1.00 |
| Clear_PR   | 512  |  45.9100 ns |  1.00 |
|            |      |             |       |
| Clear_Base | 1024 |  94.9471 ns |  1.00 |
| Clear_PR   | 1024 |  94.7812 ns |  1.00 |
|            |      |             |       |
| Clear_Base | 4096 | 380.4915 ns |  1.00 |
| Clear_PR   | 4096 | 375.3246 ns |  0.99 |

On x64 this slightly improves small inputs. It seems that if we pinned the input and used SIMD we could see a >2x improvement for large inputs, but I was not sure about doing the pinning here (from the GC's point of view, atomicity is fine), so I left it as is. cc @jkotas
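Roughly, the pinned variant would look something like this (a sketch of the idea only, not this PR's code; NativeMemory.Clear stands in for whatever SIMD loop would actually be used):

```csharp
using System;
using System.Runtime.InteropServices;

static unsafe class PinnedClearSketch
{
    static void ClearPinned(ref IntPtr ip, nuint pointerSizeLength)
    {
        // Pinning stops the GC from relocating the memory, so
        // non-reference-aware (SIMD/memset-style) stores become safe;
        // zeroing cannot produce a torn non-null reference, which is why
        // atomicity is fine from the GC's point of view.
        fixed (IntPtr* p = &ip)
        {
            NativeMemory.Clear(p, pointerSizeLength * (nuint)sizeof(IntPtr));
        }
    }
}
```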

@ghost commented Oct 11, 2023

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.


Author: EgorBo
Assignees: EgorBo
Labels: area-System.Memory
Milestone: -


Review thread on the new switch-based implementation:

Write1:
Debug.Assert(pointerSizeLength >= 1);
// This switch is expected to be lowered to a jump table
jkotas (Member):

This works great for microbenchmarks, but it works poorly for real workloads where the jump target is poorly predicted. We used to have a switch table like this in the managed memcpy originally, but it was later removed.

EgorBo (Member, Author):

> This works great for microbenchmarks, but it works poorly for real workloads where the jump target is poorly predicted. We used to have a switch table like this in the managed memcpy originally, but it was later removed.

Should I revert to the previous shape? I was mostly interested in aligning to 16 bytes (that's the reason I'm seeing the nice results on arm64) and in a slightly different addressing mode in the main loop, to make it foldable into stp.

EgorBo (Member, Author) commented Oct 11, 2023:

Btw, presumably, PGO could re-shuffle the switch (or convert it to if-else) based on the profile. However, in the real world it likely has a mixed profile.

jkotas (Member):

> Should I revert to the previous shape?

Yes, unless you can prove that the switch with a jump table is better for real workloads.

For reference, dotnet/coreclr#9786 is the PR that got rid of the switch from memory copy. It went through extensive validation.

@tannergooding (Member) commented Oct 13, 2023:

> This works great for microbenchmarks, but it works poorly for real workloads where the jump target is poorly predicted. We used to have a switch table like this in the managed memcpy originally, but it was later removed.

@jkotas, I'm not sure what you mean here, could you elaborate?

The hardware optimization manuals and the underlying native memcpy algorithms across all three major implementations (MSVC, GCC, Clang; for x64 and Arm64) use or recommend a jump table explicitly for this reason: an unconditional but unpredictable branch is better than a tight tree of conditional, still-unpredictable branches.

The premise is that a fallthrough jump table allows a tight, unrolled version of the handling for the trailing elements to exist. This then requires one mispredictable branch to decide which of the immediately following, equidistant switch cases to jump to. These cases are often already in the instruction/decode cache because they sit in the same or the following 32-byte code window.
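In C# terms the pattern looks roughly like this (a hypothetical sketch over plain ulong slots; the real code writes object references):

```csharp
using System.Runtime.CompilerServices;

static class TailClearSketch
{
    // Clears the trailing 0..7 pointer-sized slots: one indirect jump into a
    // block of tightly packed, equidistant cases instead of a branch tree.
    static void ClearTail(ref ulong dst, uint slots)
    {
        switch (slots) // expected to be lowered to a jump table
        {
            case 7: Unsafe.Add(ref dst, 6) = 0; goto case 6;
            case 6: Unsafe.Add(ref dst, 5) = 0; goto case 5;
            case 5: Unsafe.Add(ref dst, 4) = 0; goto case 4;
            case 4: Unsafe.Add(ref dst, 3) = 0; goto case 3;
            case 3: Unsafe.Add(ref dst, 2) = 0; goto case 2;
            case 2: Unsafe.Add(ref dst, 1) = 0; goto case 1;
            case 1: dst = 0; break;
            case 0: break;
        }
    }
}
```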

jkotas (Member):

dotnet/coreclr#9786 was done by Intel engineers to address bottlenecks observed in real workloads. Yes, it goes contrary to the guidance you have quoted.

Possible explanation: conditional jumps and indirect jumps use two different predictors, and the relative performance of conditional vs. indirect jumps depends on the relative pressure on each predictor. .NET code tends to have more indirect jumps than average code, which means the indirect-jump predictor sees more pressure in .NET, so it is less profitable to depend on the indirect-jump predictor in the .NET memcpy implementation.

@EgorBo (Member, Author) commented Oct 11, 2023:

Codegen on arm64:

New codegen:
; Method Prog:ClearWithReferences2(byref,ulong) (FullOpts)
G_M64195_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
						;; size=8 bbWeight=8 PerfScore 12.00

G_M64195_IG02:  ;; offset=0x0008
            mov     x2, x1
            cmp     x2, #8
            bhi     G_M64195_IG12
						;; size=12 bbWeight=8 PerfScore 16.00

G_M64195_IG03:  ;; offset=0x0014
            cmp     w2, #8
            bhi     G_M64195_IG12
            mov     w1, w2
            adr     x2, [@RWD00]
            ldr     w2, [x2, x1, LSL #2]
            adr     x3, [G_M64195_IG02]
            add     x2, x2, x3
            br      x2
						;; size=32 bbWeight=4 PerfScore 30.00

G_M64195_IG04:  ;; offset=0x0034
            str     xzr, [x0]
            b       G_M64195_IG15
            align   [0 bytes for IG13]
            align   [0 bytes]
            align   [0 bytes]
            align   [0 bytes]
						;; size=8 bbWeight=0.50 PerfScore 1.00

G_M64195_IG05:  ;; offset=0x003C
            stp     xzr, xzr, [x0]
            b       G_M64195_IG15
						;; size=8 bbWeight=0.50 PerfScore 1.00

G_M64195_IG06:  ;; offset=0x0044
            stp     xzr, xzr, [x0]
            str     xzr, [x0, #0x10]
            b       G_M64195_IG15
						;; size=12 bbWeight=0.50 PerfScore 1.50

G_M64195_IG07:  ;; offset=0x0050
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            b       G_M64195_IG15
						;; size=12 bbWeight=0.50 PerfScore 1.50

G_M64195_IG08:  ;; offset=0x005C
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            str     xzr, [x0, #0x20]
            b       G_M64195_IG15
						;; size=16 bbWeight=0.50 PerfScore 2.00

G_M64195_IG09:  ;; offset=0x006C
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            stp     xzr, xzr, [x0, #0x20]
            b       G_M64195_IG15
						;; size=16 bbWeight=0.50 PerfScore 2.00

G_M64195_IG10:  ;; offset=0x007C
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            stp     xzr, xzr, [x0, #0x20]
            str     xzr, [x0, #0x30]
            b       G_M64195_IG15
						;; size=20 bbWeight=0.50 PerfScore 2.50

G_M64195_IG11:  ;; offset=0x0090
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            stp     xzr, xzr, [x0, #0x20]
            stp     xzr, xzr, [x0, #0x30]
            b       G_M64195_IG15
						;; size=20 bbWeight=0.50 PerfScore 2.50

G_M64195_IG12:  ;; offset=0x00A4
            lsr     x2, x1, #3
						;; size=4 bbWeight=4 PerfScore 4.00

G_M64195_IG13:  ;; offset=0x00A8
            stp     xzr, xzr, [x0]
            stp     xzr, xzr, [x0, #0x10]
            stp     xzr, xzr, [x0, #0x20]
            stp     xzr, xzr, [x0, #0x30]
            add     x0, x0, #64
            sub     x2, x2, #1
            cbnz    x2, G_M64195_IG13
						;; size=28 bbWeight=32 PerfScore 192.00

G_M64195_IG14:  ;; offset=0x00C4
            and     x1, x1, #7
            b       G_M64195_IG02
						;; size=8 bbWeight=4 PerfScore 6.00

G_M64195_IG15:  ;; offset=0x00CC
            ldp     fp, lr, [sp], #0x10
            ret     lr
						;; size=8 bbWeight=1 PerfScore 2.00
RWD00  	dd	000000C4h ; case G_M64195_IG15
       	dd	0000002Ch ; case G_M64195_IG04
       	dd	00000034h ; case G_M64195_IG05
       	dd	0000003Ch ; case G_M64195_IG06
       	dd	00000048h ; case G_M64195_IG07
       	dd	00000054h ; case G_M64195_IG08
       	dd	00000064h ; case G_M64195_IG09
       	dd	00000074h ; case G_M64195_IG10
       	dd	00000088h ; case G_M64195_IG11
; Total bytes of code: 212
Previous codegen:
; Method Prog:ClearWithReferences(byref,ulong) (FullOpts)
G_M45361_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
						;; size=8 bbWeight=1 PerfScore 1.50

G_M45361_IG02:  ;; offset=0x0008
            cmp     x1, #8
            blo     G_M45361_IG04
            align   [0 bytes for IG03]
            align   [0 bytes]
            align   [0 bytes]
            align   [0 bytes]
						;; size=8 bbWeight=1 PerfScore 1.50

G_M45361_IG03:  ;; offset=0x0010
            lsl     x2, x1, #3
            add     x3, x0, x2
            str     xzr, [x3, #-0x08]
            add     x3, x0, x2
            str     xzr, [x3, #-0x10]
            add     x3, x0, x2
            str     xzr, [x3, #-0x18]
            add     x3, x0, x2
            str     xzr, [x3, #-0x20]
            add     x3, x0, x2
            str     xzr, [x3, #-0x28]
            add     x3, x0, x2
            str     xzr, [x3, #-0x30]
            add     x3, x0, x2
            str     xzr, [x3, #-0x38]
            add     x2, x0, x2
            str     xzr, [x2, #-0x40]
            sub     x1, x1, #8
            cmp     x1, #8
            bhs     G_M45361_IG03
						;; size=80 bbWeight=4 PerfScore 60.00

G_M45361_IG04:  ;; offset=0x0060
            cmp     x1, #4
            bhs     G_M45361_IG07
						;; size=8 bbWeight=1 PerfScore 1.50

G_M45361_IG05:  ;; offset=0x0068
            cmp     x1, #2
            bhs     G_M45361_IG08
            cbnz    x1, G_M45361_IG09
						;; size=12 bbWeight=0.50 PerfScore 1.25

G_M45361_IG06:  ;; offset=0x0074
            ldp     fp, lr, [sp], #0x10
            ret     lr
						;; size=8 bbWeight=0.50 PerfScore 1.00

G_M45361_IG07:  ;; offset=0x007C
            stp     xzr, xzr, [x0, #0x10]
            lsl     x2, x1, #3
            add     x3, x0, x2
            str     xzr, [x3, #-0x18]
            add     x2, x0, x2
            str     xzr, [x2, #-0x10]
						;; size=24 bbWeight=0.50 PerfScore 2.50

G_M45361_IG08:  ;; offset=0x0094
            str     xzr, [x0, #0x08]
            lsl     x2, x1, #3
            add     x1, x0, x2
            str     xzr, [x1, #-0x08]
						;; size=16 bbWeight=0.50 PerfScore 1.75

G_M45361_IG09:  ;; offset=0x00A4
            str     xzr, [x0]
						;; size=4 bbWeight=0.50 PerfScore 0.50

G_M45361_IG10:  ;; offset=0x00A8
            ldp     fp, lr, [sp], #0x10
            ret     lr
						;; size=8 bbWeight=0.50 PerfScore 1.00
; Total bytes of code: 176

I can tune the JIT to fold

stp     xzr, xzr, [x0]
stp     xzr, xzr, [x0, #..]

into a single SIMD store in the newly added LowerIndirCoalesceStore, but it seems to be fast as is.

@EgorBo EgorBo marked this pull request as draft October 11, 2023 22:59
@EgorBo (Member, Author) commented Oct 13, 2023:

I'm seeing mixed results with this change. I'll wait for fixes for #93382 (needed for the remainder) and #76067 (comment) (needed for the main loop) and try again afterwards.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 12, 2023