
JIT: extend forward sub to handle adjacent assignments #75565

Closed

Conversation

AndyAyersMS (Member)

Allow forward sub for multiple use/def locals if we can prove that a given def has just one adjacent use, e.g. the first def of `X` in:

```
X = A
X = X + B
```
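
As a concrete (hypothetical) illustration of the pattern, the first def of `x` below has exactly one use, in the immediately following statement, so forward sub can now fold the two statements into a single tree:

```csharp
static int M(int a, int b)
{
    int x = a * 2;  // first def of x; its only use is in the next statement
    x = x + b;      // adjacent use: forward sub rewrites this to x = a * 2 + b
    return x;
}
```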
@ghost ghost assigned AndyAyersMS Sep 13, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 13, 2022
@AndyAyersMS AndyAyersMS reopened this Sep 13, 2022
@AndyAyersMS (Member, Author)

@dotnet/jit-contrib FYI

Meant to open this as a draft, but apparently there's no way to convert it once you make the wrong choice.

This addresses #75206 and more.

There are some sizeable regressions where we fold a sequence of calls into a tree and then have a hard time generating decent code. For example (from coreclr_tests.pmi):

```
;; testout1:Main():int

 from BB01
    [000037]:  [000038] is only use of [000036] (V00)  -- fwd subbing [000035]; new next stmt is
STMT00008 ( 0x042[E-] ... 0x049 )
               [000042] -ACXG------                         *  ASG       double
               [000041] D------N---                         +--*  LCL_VAR   double V00 loc0         
               [000040] --CXG------                         \--*  ADD       double
               [000035] --CXG------                            +--*  ADD       double
               [000030] --CXG------                            |  +--*  ADD       double
               [000025] --CXG------                            |  |  +--*  ADD       double
               [000020] --CXG------                            |  |  |  +--*  ADD       double
               [000015] --CXG------                            |  |  |  |  +--*  ADD       double
               [000010] --CXG------                            |  |  |  |  |  +--*  ADD       double
               [000005] --CXG------                            |  |  |  |  |  |  +--*  ADD       double
               [000000] -----------                            |  |  |  |  |  |  |  +--*  CNS_DBL   double 0.0000000000000000
               [000004] --CXG------                            |  |  |  |  |  |  |  \--*  CALL      double testout1.Sub_Funclet_0
               [000009] --CXG------                            |  |  |  |  |  |  \--*  CALL      double testout1.Sub_Funclet_1
               [000014] --CXG------                            |  |  |  |  |  \--*  CALL      double testout1.Sub_Funclet_2
               [000019] --CXG------                            |  |  |  |  \--*  CALL      double testout1.Sub_Funclet_3
               [000024] --CXG------                            |  |  |  \--*  CALL      double testout1.Sub_Funclet_4
               [000029] --CXG------                            |  |  \--*  CALL      double testout1.Sub_Funclet_5
               [000034] --CXG------                            |  \--*  CALL      double testout1.Sub_Funclet_6
               [000039] --CXG------                            \--*  CALL      double testout1.Sub_Funclet_7
```

FwdSub will stop growing trees after a while, as we have complexity limits in place. But we probably need to find a way to throttle this better, both from a CQ standpoint and because it has quadratic complexity: each pair of trees is walked each time.

Or, if the current cutoffs actually do prevent really bad TP regressions, we should look deeper into what it is downstream that can't cope.
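
As a rough sketch of the shape of such a budget check (this is illustrative C#, not the actual JIT's `gtComplexityExceeds`, and `Node` is a stand-in for a JIT tree node):

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a JIT tree node.
sealed class Node
{
    public List<Node> Children = new List<Node>();
}

static class TreeBudget
{
    // Returns true as soon as the tree has more than `limit` nodes, so the
    // walk itself stays bounded even when the tree has grown very large.
    public static bool ComplexityExceeds(Node root, int limit)
    {
        var stack = new Stack<Node>();
        stack.Push(root);
        int count = 0;
        while (stack.Count > 0)
        {
            Node n = stack.Pop();
            if (++count > limit)
                return true;            // over budget; stop walking
            foreach (Node child in n.Children)
                stack.Push(child);
        }
        return false;
    }
}
```

Note that a budget like this caps the cost of each walk, but not the number of walks; that is where the quadratic behavior mentioned above comes from.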

@AndyAyersMS (Member, Author)

TP impact of ~0.5% is high for the small benefit this currently brings.

I'm a bit surprised, since the first early-out check is for consecutive assignments to the same local; I would not expect us to actually be looking more deeply at very many cases.

@jakobbotsch (Member) commented Sep 14, 2022

> TP impact of ~0.5% is high for the small benefit this currently brings.
>
> I'm a bit surprised, since the first early-out check is for consecutive assignments to the same local; I would not expect us to actually be looking more deeply at very many cases.

Don't we end up calling the expensive `gtComplexityExceeds` and `gtUpdateStmtSideEffects` now for many more trees, whereas before they were only called for trees assigning locals with ref count 2? Should the `mustProveSingleUse` case be moved earlier?

@AndyAyersMS (Member, Author)

> TP impact of ~0.5% is high for the small benefit this currently brings. I'm a bit surprised, since the first early-out check is for consecutive assignments to the same local; I would not expect us to actually be looking more deeply at very many cases.
>
> Don't we end up calling the expensive `gtComplexityExceeds` and `gtUpdateStmtSideEffects` now for many more trees, whereas before they were only called for trees assigning locals with ref count 2? Should the `mustProveSingleUse` case be moved earlier?

Good point; moving the bail-out earlier seems to mitigate the TP impact. It also gives me ideas for further reduction (we are redundantly running `gtUpdateStmtSideEffects` when we backtrack).
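
A sketch of what "moving the bail out earlier" amounts to (illustrative names, not the actual forward-sub code): run the cheap single-adjacent-use screen before any expensive tree analysis.

```csharp
using System;

// Hypothetical sketch: order the checks from cheap to expensive so the
// common "can't forward sub" cases bail before any costly tree walks run.
static bool CanForwardSub(
    bool mustProveSingleUse,            // the local has multiple defs/uses
    Func<bool> hasSingleAdjacentUse,    // cheap: inspect only the next statement
    Func<bool> complexityExceeds)       // expensive: walks the whole tree
{
    if (mustProveSingleUse && !hasSingleAdjacentUse())
        return false;                   // cheap bail-out, taken most often
    if (complexityExceeds())
        return false;                   // expensive check, reached rarely
    return true;
}
```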

I'm still trying to figure out where things go south with some of the regressions. It seems like forward subbing a call into a call argument can create physreg conflicts that are hard for morph/LSRA to sort out.

@AndyAyersMS (Member, Author)

Here's a simple example where problems arise:

```csharp
using System;
using System.Runtime.CompilerServices;

public class X
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static double F() => 3.0;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static double Left(double x)
    {
        x = F() + x;
        x = F() + x;
        x = F() + x;
        x = F() + x;
        return x;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static double Right(double x)
    {
        x = x + F();
        x = x + F();
        x = x + F();
        x = x + F();
        return x;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static double Middle(double x)
    {
        x += F();
        x += F();
        x += F();
        x += F();
        return x;
    }

    public static int Main()
    {
        double l = Left(1.0);
        double r = Right(2.0);
        double m = Middle(3.0);

        bool ok = (l == 13.0) && (r == 14.0) && (m == 15.0);

        return ok ? 100 : -1;
    }
}
```

The current jit produces the same (or similar) code for all three methods:

```
; Assembly listing for method X:Right(double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] ( 11, 11   )  double  ->  mm6        
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 56

G_M8440_IG01:              ;; offset=0000H
       4883EC38             sub      rsp, 56
       C5F877               vzeroupper 
       C5F829742420         vmovaps  qword ptr [rsp+20H], xmm6
       C5F828F0             vmovaps  xmm6, xmm0
						;; size=17 bbWeight=1    PerfScore 3.50
G_M8440_IG02:              ;; offset=0011H
       FF15216B1C00         call     [X:F():double]
       C5FB58F6             vaddsd   xmm6, xmm0, xmm6
       FF15176B1C00         call     [X:F():double]
       C5FB58F6             vaddsd   xmm6, xmm0, xmm6
       FF150D6B1C00         call     [X:F():double]
       C5FB58F6             vaddsd   xmm6, xmm0, xmm6
       FF15036B1C00         call     [X:F():double]
       C5FB58F6             vaddsd   xmm6, xmm0, xmm6
       C5F828C6             vmovaps  xmm0, xmm6
						;; size=44 bbWeight=1    PerfScore 24.25
G_M8440_IG03:              ;; offset=003DH
       C5F828742420         vmovaps  xmm6, qword ptr [rsp+20H]
       4883C438             add      rsp, 56
       C3                   ret
```

But with the current PR, the latter two end up with spills:

```
; Assembly listing for method X:Right(double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  5,  5   )  double  ->  mm6        
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  TEMP_01                                  double  ->  [rsp+0x28]
;
; Lcl frame size = 72

G_M35683_IG01:              ;; offset=0000H
       4883EC48             sub      rsp, 72
       C5F877               vzeroupper 
       C5F829742430         vmovaps  qword ptr [rsp+30H], xmm6
       C5F828F0             vmovaps  xmm6, xmm0
						;; size=17 bbWeight=1    PerfScore 3.50
G_M35683_IG02:              ;; offset=0011H
       FF15816A1C00         call     [X:F():double]
       C5FB58C6             vaddsd   xmm0, xmm0, xmm6
       C5FB11442428         vmovsd   qword ptr [rsp+28H], xmm0
       FF15716A1C00         call     [X:F():double]
       C5FB104C2428         vmovsd   xmm1, qword ptr [rsp+28H]
       C5F358C0             vaddsd   xmm0, xmm1, xmm0
       C5FB11442428         vmovsd   qword ptr [rsp+28H], xmm0
       FF155B6A1C00         call     [X:F():double]
       C5FB104C2428         vmovsd   xmm1, qword ptr [rsp+28H]
       C5F358C0             vaddsd   xmm0, xmm1, xmm0
       C5FB11442428         vmovsd   qword ptr [rsp+28H], xmm0
       FF15456A1C00         call     [X:F():double]
       C5FB104C2428         vmovsd   xmm1, qword ptr [rsp+28H]
       C5F358F0             vaddsd   xmm6, xmm1, xmm0
       C5F828C6             vmovaps  xmm0, xmm6
						;; size=80 bbWeight=1    PerfScore 36.25
G_M35683_IG03:              ;; offset=0061H
       C5F828742430         vmovaps  xmm6, qword ptr [rsp+30H]
       4883C448             add      rsp, 72
       C3                   ret
```

@AndyAyersMS (Member, Author)

The current jit will produce similar-looking code with spills for these methods, where we are given the trees from the get-go rather than growing them ourselves:

```csharp
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static double A(double x)
    {
        return F() + F() + F() + F() + x;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static double Z(double x)
    {
        return F() + F() + F() + F() + x;
    }
```

For example:

```
; Assembly listing for method X:A(double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )  double  ->  [rsp+30H]   single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  double  ->  zero-ref    "non-inline candidate call"
;* V03 tmp2         [V03    ] (  0,  0   )  double  ->  zero-ref    "non-inline candidate call"
;* V04 tmp3         [V04    ] (  0,  0   )  double  ->  zero-ref    "non-inline candidate call"
;  TEMP_01                                  double  ->  [rsp+0x20]
;
; Lcl frame size = 40

G_M61538_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40
       C5F877               vzeroupper
       C5FB11442430         vmovsd   qword ptr [rsp+30H], xmm0
                                                ;; size=13 bbWeight=1    PerfScore 2.25
G_M61538_IG02:              ;; offset=000DH
       FF1565691C00         call     [X:F():double]
       C5FB11442420         vmovsd   qword ptr [rsp+20H], xmm0
       FF1559691C00         call     [X:F():double]
       C5FB104C2420         vmovsd   xmm1, qword ptr [rsp+20H]
       C5F358C0             vaddsd   xmm0, xmm1, xmm0
       C5FB11442420         vmovsd   qword ptr [rsp+20H], xmm0
       FF1543691C00         call     [X:F():double]
       C5FB104C2420         vmovsd   xmm1, qword ptr [rsp+20H]
       C5F358C0             vaddsd   xmm0, xmm1, xmm0
       C5FB11442420         vmovsd   qword ptr [rsp+20H], xmm0
       FF152D691C00         call     [X:F():double]
       C5FB104C2420         vmovsd   xmm1, qword ptr [rsp+20H]
       C5F358C0             vaddsd   xmm0, xmm1, xmm0
       C5FB58442430         vaddsd   xmm0, xmm0, qword ptr [rsp+30H]
                                                ;; size=78 bbWeight=1    PerfScore 38.00
G_M61538_IG03:              ;; offset=005BH
       4883C428             add      rsp, 40
       C3                   ret
                                                ;; size=5 bbWeight=1    PerfScore 1.25
```

This looks like a missing case in LSRA preferencing -- we have tree edges that are live across a call, and we do not preference them to callee-saves.

cc @kunalspathak -- might be some LSRA low-hanging fruit?

@AndyAyersMS (Member, Author)

> This looks like a missing case in LSRA preferencing -- we have tree edges that are live across a call, and we do not preference them to callee-saves.
>
> cc @kunalspathak -- might be some LSRA low-hanging fruit?

With the current jit, a similar example using ints:

```csharp
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Right(int x)
    {
        x = x + F();
        x = x + F();
        x = x + F();
        x = x + F();
        return x;
    }
```

gives us

```
; Assembly listing for method X:Right(int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] ( 11, 11   )     int  ->  rsi        
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 32

G_M30435_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC20             sub      rsp, 32
       8BF1                 mov      esi, ecx
						;; size=7 bbWeight=1    PerfScore 1.50
G_M30435_IG02:              ;; offset=0007H
       FF15FB6A1C00         call     [X:F():int]
       03F0                 add      esi, eax
       FF15F36A1C00         call     [X:F():int]
       03C6                 add      eax, esi
       8BF0                 mov      esi, eax
       FF15E96A1C00         call     [X:F():int]
       03C6                 add      eax, esi
       8BF0                 mov      esi, eax
       FF15DF6A1C00         call     [X:F():int]
       03C6                 add      eax, esi
       8BF0                 mov      esi, eax
       8BC6                 mov      eax, esi
						;; size=40 bbWeight=1    PerfScore 14.00
G_M30435_IG03:              ;; offset=002FH
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret
```

The current PR "fixes" this....

@kunalspathak (Member)

> The current PR "fixes" this....

Can you paste the new code from this PR?

@AndyAyersMS (Member, Author)

> The current PR "fixes" this....
>
> Can you paste the new code from this PR?

```
; Assembly listing for method X:Right(int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  5,  5   )     int  ->  rsi        
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 32

G_M30435_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC20             sub      rsp, 32
       8BF1                 mov      esi, ecx
						;; size=7 bbWeight=1    PerfScore 1.50
G_M30435_IG02:              ;; offset=0007H
       FF15FB6A1C00         call     [X:F():int]
       03F0                 add      esi, eax
       FF15F36A1C00         call     [X:F():int]
       03F0                 add      esi, eax
       FF15EB6A1C00         call     [X:F():int]
       03F0                 add      esi, eax
       FF15E36A1C00         call     [X:F():int]
       03F0                 add      esi, eax
       8BC6                 mov      eax, esi
						;; size=34 bbWeight=1    PerfScore 13.25
G_M30435_IG03:              ;; offset=0029H
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      
						;; size=6 bbWeight=1    PerfScore 1.75
```

This is pretty good given the ABI constraints. The very last add could be commuted to put the sum into eax directly... I'm not sure when/how we'd do something like this; maybe in codegen?

@BruceForstall (Member)

> Meant to open this as a draft, but apparently there's no way to convert it once you make the wrong choice.

Look on the right-hand side, under Reviewers, for "Still in progress? Convert to draft"

@AndyAyersMS AndyAyersMS marked this pull request as draft September 30, 2022 02:22
@AndyAyersMS (Member, Author)

> Meant to open this as a draft, but apparently there's no way to convert it once you make the wrong choice.
>
> Look on the right-hand side, under Reviewers, for "Still in progress? Convert to draft"

Thanks, converted.

I'm not sure this is viable yet. Perhaps the work @jakobbotsch is considering on call args ordering will clean some of these problem cases up magically.

@AndyAyersMS (Member, Author)

Possibly worth re-evaluating once #76671 is in?

@kunalspathak (Member)

> Possibly worth re-evaluating once #76671 is in?

I spent some time understanding the double scenario. When assigning a register to the intermediate result of the addition, we always have more than one register we could assign, and as a result we pick the first one based on the REGORDER heuristics. However, I noticed that intervals that interfere with calls should be marked `preferCalleeSave` (more on that below). Today we only mark the intervals that represent local vars as `preferCalleeSave`, while other intervals (like this one) are not, and we make a bad choice for them. From what I understand, if these intervals get assigned to any of the callee-trash registers, they will always get spilled to the stack before the call and reloaded at the next use, which is what happens in this case.

Here `Interval 3` is SDSU (single def, single use) with its definition at #34 before the call and its use at #50 after the call. Although it interferes with the call and doesn't have any particular preference (see `Preferences=[allFloat]`), since it is not related to a local var we do not prefer callee-saved registers for it.

```
Interval  3: double RefPositions {#34@16 #50@23} physReg:NA Preferences=[allFloat]
...
...

  N015.                    ADD      
                               Use:<I1>(#32) *
                               Use:<I2>(#33) *
        Def:<I3>(#34)
  N017.                    CNS_INT(h) 0xd1ffab1e ftn
  N019.                    IND      
  N021.                    CALL     
        Kill: rax rcx rdx r8 r9 r10 r11 mm0 mm1 mm2 mm3 mm4 mm5 
        Def:<I4>(#49) mm0
  N023.                    ADD      
                               Use:<I3>(#50) *
                               Use:<I4>(#51) *
```

I am working on a prototype where we iterate through all the intervals and aggressively mark them `preferCalleeSave` if they interfere with a call. We could be conservative and mark only the intervals that are SDSU, but I will have to experiment and check the asmdiffs.

Callee-saved registers can be seen from two perspectives. When the method being compiled is itself a callee, assigning too many callee-saved registers means we have to save/restore them in the prolog/epilog. But when the method acts as a caller and has multiple calls inside it, we can (and should prefer to) use these registers, because we know they will be restored when each call returns, so we don't have to save/restore/spill them around the calls. If we don't use them and instead prefer callee-trash registers, we end up saving/restoring before/after every call site.

In my opinion, a save/restore in the prolog/epilog is cheaper than a save/restore around every call (and again, we could make a heuristic decision here to only do this if there are XXX calls in a method). Lastly, for cases where we do have to save/restore in the prolog/epilog, we could improve small methods that have a single call by doing the save/restore just around the call instead of in the prolog/epilog. That way, if the call is inside an if-condition, we can skip the save/restore that would otherwise execute in the prolog/epilog.
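
As a minimal sketch of that trade-off, under a made-up cost model (neither the numbers nor the names come from LSRA): the prolog/epilog save/restore pair is paid once, while a callee-trash assignment pays a spill and a reload around every call the value lives across.

```csharp
// Hypothetical cost model: prefer a callee-saved register when the one-time
// prolog/epilog save+restore is cheaper than spilling and reloading the
// value around every call it lives across.
static bool PreferCalleeSaved(int callsLivedAcross)
{
    const int prologEpilogCost = 2;   // one save + one restore, paid once
    const int perCallCost      = 2;   // one spill + one reload per call site
    return callsLivedAcross * perCallCost > prologEpilogCost;
}
```

With these constants, callee-saved wins as soon as the value lives across more than one call, matching the intuition above.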

@kunalspathak (Member)

> I am working on a prototype where we iterate through all the intervals and aggressively mark them `preferCalleeSave` if they interfere with a call. We could be conservative and mark only the intervals that are SDSU, but I will have to experiment and check the asmdiffs.

With that we get this:

```
; Assembly listing for method System.Memory.TestClass:A(double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )  double  ->  [rsp+40H]   single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  double  ->  zero-ref    "non-inline candidate call"
;* V03 tmp2         [V03    ] (  0,  0   )  double  ->  zero-ref    "non-inline candidate call"
;* V04 tmp3         [V04    ] (  0,  0   )  double  ->  zero-ref    "non-inline candidate call"
;
; Lcl frame size = 56

G_M29446_IG01:
       sub      rsp, 56
       vzeroupper
       vmovaps  qword ptr [rsp+20H], xmm6
       vmovsd   qword ptr [rsp+40H], xmm0
                                                ;; size=19 bbWeight=1    PerfScore 4.25
G_M29446_IG02:
       call     [hackishMethodName]
       vmovaps  xmm6, xmm0
       call     [hackishMethodName]
       vaddsd   xmm6, xmm6, xmm0
       call     [hackishMethodName]
       vaddsd   xmm6, xmm6, xmm0
       call     [hackishMethodName]
       vaddsd   xmm0, xmm6, xmm0
       vaddsd   xmm0, xmm0, qword ptr [rsp+40H]
                                                ;; size=46 bbWeight=1    PerfScore 26.25
G_M29446_IG03:
       vmovaps  xmm6, qword ptr [rsp+20H]
       add      rsp, 56
       ret
                                                ;; size=11 bbWeight=1    PerfScore 5.25
```

@kunalspathak (Member)

My quick fix, updating this for all intervals inside `BuildKillPositionsForNode()`, is over-aggressive and leads to a lot of regressions. I need to wait until we have built all the intervals/refpositions, and then iterate over the intervals and set `preferCalleeSave` only for those which interfere with a call.

```cpp
// Keeps track of the RefPositions that are calls.
std::list<RefPosition*> callRelatedRefPositions = ...;

for (Interval* interval : intervals)
{
    RefPosition* defRefPos     = interval->firstRefPosition;
    RefPosition* lastUseRefPos = interval->lastRefPosition;

    // Mark the interval as preferring a callee-saved register if its
    // lifetime spans any call.
    for (RefPosition* callRefPosition : callRelatedRefPositions)
    {
        if ((defRefPos->location < callRefPosition->location) &&
            (lastUseRefPos->location > callRefPosition->location))
        {
            interval->preferCalleeSave = true;
            break;
        }
    }
}
```

@kunalspathak (Member)

> only for those which interfere with a call.

Here is what we get:

```
G_M29446_IG01:
       sub      rsp, 56
       vzeroupper 
       vmovaps  qword ptr [rsp+20H], xmm6
       vmovsd   qword ptr [rsp+40H], xmm0
						;; size=19 bbWeight=1    PerfScore 4.25
G_M29446_IG02:
       call     [hackishMethodName]
       vmovaps  xmm6, xmm0
       call     [hackishMethodName]
       vaddsd   xmm6, xmm6, xmm0
       call     [hackishMethodName]
       vaddsd   xmm6, xmm6, xmm0
       call     [hackishMethodName]
       vaddsd   xmm0, xmm6, xmm0
       vaddsd   xmm0, xmm0, qword ptr [rsp+40H]
						;; size=46 bbWeight=1    PerfScore 26.25
G_M29446_IG03:
       vmovaps  xmm6, qword ptr [rsp+20H]
       add      rsp, 56
       ret      
						;; size=11 bbWeight=1    PerfScore 5.25
```

Here is a sample diff from the benchmarks. I will put up a PR.

(screenshot of benchmark asm diffs not reproduced here)

@AndyAyersMS (Member, Author)

Not going to be able to get back to this any time soon, so will close.

@AndyAyersMS AndyAyersMS closed this Nov 3, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Dec 3, 2022