-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: extend forward sub to handle adjacent assignments #75565
Conversation
Allow forward sub for multiple use/def locals if we can prove that a given def has just one adjacent use, eg the first def of `X` in: ``` X = A X = X + B ```
@dotnet/jit-contrib FYI Meant to open this as a draft, but apparently there's no way to convert it once you make the wrong choice. This addresses #75206 and more. Some sizeable regressions where we fold up a sequence of calls into a tree, then have a hard time generating decent code. For example (from coreclr_tests.pmi)
FwdSub will stop growing trees after a while as we have complexity limits in place. But we probably need to find a way to throttle this better, both from a CQ standpoint and also because it has quadratic complexity as each pair of trees is walked each time. Or if the current cutoffs actually do prevent really bad TP regression, look deeper into what it is downstream that can't cope. |
TP impact ~0.5% is high for the small benefit this currently brings. I'm a bit surprised since the first early out check is for consecutive assignments to the same local—would not expect us to actually be looking more deeply at very many cases. |
Don't we end up calling expensive |
Good point, moving the bail out earlier seems to mitigate the TP impact. Also gives me ideas for further reduction (we are redundantly running Trying to figure out where things go south with some of the regressions. Seems like forward subbing a call into a call argument can create physreg conflicts that are hard for morph/lsra to sort out. |
Here's a simple example where problems arise: using System;
using System.Runtime.CompilerServices;
public class X
{
[MethodImpl(MethodImplOptions.NoInlining)]
static double F() => 3.0;
[MethodImpl(MethodImplOptions.NoInlining)]
public static double Left(double x)
{
x = F() + x;
x = F() + x;
x = F() + x;
x = F() + x;
return x;
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static double Right(double x)
{
x = x + F();
x = x + F();
x = x + F();
x = x + F();
return x;
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static double Middle(double x)
{
x += F();
x += F();
x += F();
x += F();
return x;
}
public static int Main()
{
double l = Left(1.0);
double r = Right(2.0);
double m = Middle(3.0);
bool ok = (l == 13.0) && (r == 14.0) && (m == 15.0);
return ok ? 100 : -1;
}
} Current jit produces the same (or similar) code for all 3 methods: ; Assembly listing for method X:Right(double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 11, 11 ) double -> mm6
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 56
G_M8440_IG01: ;; offset=0000H
4883EC38 sub rsp, 56
C5F877 vzeroupper
C5F829742420 vmovaps qword ptr [rsp+20H], xmm6
C5F828F0 vmovaps xmm6, xmm0
;; size=17 bbWeight=1 PerfScore 3.50
G_M8440_IG02: ;; offset=0011H
FF15216B1C00 call [X:F():double]
C5FB58F6 vaddsd xmm6, xmm0, xmm6
FF15176B1C00 call [X:F():double]
C5FB58F6 vaddsd xmm6, xmm0, xmm6
FF150D6B1C00 call [X:F():double]
C5FB58F6 vaddsd xmm6, xmm0, xmm6
FF15036B1C00 call [X:F():double]
C5FB58F6 vaddsd xmm6, xmm0, xmm6
C5F828C6 vmovaps xmm0, xmm6
;; size=44 bbWeight=1 PerfScore 24.25
G_M8440_IG03: ;; offset=003DH
C5F828742420 vmovaps xmm6, qword ptr [rsp+20H]
4883C438 add rsp, 56
C3 ret but with the current PR the latter two end up with spills ; Assembly listing for method X:Right(double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 5, 5 ) double -> mm6
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; TEMP_01 double -> [rsp+0x28]
;
; Lcl frame size = 72
G_M35683_IG01: ;; offset=0000H
4883EC48 sub rsp, 72
C5F877 vzeroupper
C5F829742430 vmovaps qword ptr [rsp+30H], xmm6
C5F828F0 vmovaps xmm6, xmm0
;; size=17 bbWeight=1 PerfScore 3.50
G_M35683_IG02: ;; offset=0011H
FF15816A1C00 call [X:F():double]
C5FB58C6 vaddsd xmm0, xmm0, xmm6
C5FB11442428 vmovsd qword ptr [rsp+28H], xmm0
FF15716A1C00 call [X:F():double]
C5FB104C2428 vmovsd xmm1, qword ptr [rsp+28H]
C5F358C0 vaddsd xmm0, xmm1, xmm0
C5FB11442428 vmovsd qword ptr [rsp+28H], xmm0
FF155B6A1C00 call [X:F():double]
C5FB104C2428 vmovsd xmm1, qword ptr [rsp+28H]
C5F358C0 vaddsd xmm0, xmm1, xmm0
C5FB11442428 vmovsd qword ptr [rsp+28H], xmm0
FF15456A1C00 call [X:F():double]
C5FB104C2428 vmovsd xmm1, qword ptr [rsp+28H]
C5F358F0 vaddsd xmm6, xmm1, xmm0
C5F828C6 vmovaps xmm0, xmm6
;; size=80 bbWeight=1 PerfScore 36.25
G_M35683_IG03: ;; offset=0061H
C5F828742430 vmovaps xmm6, qword ptr [rsp+30H]
4883C448 add rsp, 72
C3 ret |
Current jit will produce similar looking code with spills for these methods where we are given trees from the get-go, rather than growing them ourselves: [MethodImpl(MethodImplOptions.NoInlining)]
public static double A(double x)
{
return F() + F() + F() + F() + x;
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static double Z(double x)
{
return F() + F() + F() + F() + x;
} for example Assembly listing for method X:A(double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) double -> [rsp+30H] single-def
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) double -> zero-ref "non-inline candidate call"
;* V03 tmp2 [V03 ] ( 0, 0 ) double -> zero-ref "non-inline candidate call"
;* V04 tmp3 [V04 ] ( 0, 0 ) double -> zero-ref "non-inline candidate call"
; TEMP_01 double -> [rsp+0x20]
;
; Lcl frame size = 40
G_M61538_IG01: ;; offset=0000H
4883EC28 sub rsp, 40
C5F877 vzeroupper
C5FB11442430 vmovsd qword ptr [rsp+30H], xmm0
;; size=13 bbWeight=1 PerfScore 2.25
G_M61538_IG02: ;; offset=000DH
FF1565691C00 call [X:F():double]
C5FB11442420 vmovsd qword ptr [rsp+20H], xmm0
FF1559691C00 call [X:F():double]
C5FB104C2420 vmovsd xmm1, qword ptr [rsp+20H]
C5F358C0 vaddsd xmm0, xmm1, xmm0
C5FB11442420 vmovsd qword ptr [rsp+20H], xmm0
FF1543691C00 call [X:F():double]
C5FB104C2420 vmovsd xmm1, qword ptr [rsp+20H]
C5F358C0 vaddsd xmm0, xmm1, xmm0
C5FB11442420 vmovsd qword ptr [rsp+20H], xmm0
FF152D691C00 call [X:F():double]
C5FB104C2420 vmovsd xmm1, qword ptr [rsp+20H]
C5F358C0 vaddsd xmm0, xmm1, xmm0
C5FB58442430 vaddsd xmm0, xmm0, qword ptr [rsp+30H]
;; size=78 bbWeight=1 PerfScore 38.00
G_M61538_IG03: ;; offset=005BH
4883C428 add rsp, 40
C3 ret
;; size=5 bbWeight=1 PerfScore 1.25 This looks like a missing case in LSRA preferencing -- we have live-across call tree edges and we do not preference them to callee saves. cc @kunalspathak -- might be some LSRA low-hanging fruit? |
With current jit, a similar example using ints: [MethodImpl(MethodImplOptions.NoInlining)]
public static int Right(int x)
{
x = x + F();
x = x + F();
x = x + F();
x = x + F();
return x;
} gives us ; Assembly listing for method X:Right(int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 11, 11 ) int -> rsi
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 32
G_M30435_IG01: ;; offset=0000H
56 push rsi
4883EC20 sub rsp, 32
8BF1 mov esi, ecx
;; size=7 bbWeight=1 PerfScore 1.50
G_M30435_IG02: ;; offset=0007H
FF15FB6A1C00 call [X:F():int]
03F0 add esi, eax
FF15F36A1C00 call [X:F():int]
03C6 add eax, esi
8BF0 mov esi, eax
FF15E96A1C00 call [X:F():int]
03C6 add eax, esi
8BF0 mov esi, eax
FF15DF6A1C00 call [X:F():int]
03C6 add eax, esi
8BF0 mov esi, eax
8BC6 mov eax, esi
;; size=40 bbWeight=1 PerfScore 14.00
G_M30435_IG03: ;; offset=002FH
4883C420 add rsp, 32
5E pop rsi
C3 ret The current PR "fixes" this.... |
Can you paste the new code from this PR? |
; Assembly listing for method X:Right(int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 5, 5 ) int -> rsi
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 32
G_M30435_IG01: ;; offset=0000H
56 push rsi
4883EC20 sub rsp, 32
8BF1 mov esi, ecx
;; size=7 bbWeight=1 PerfScore 1.50
G_M30435_IG02: ;; offset=0007H
FF15FB6A1C00 call [X:F():int]
03F0 add esi, eax
FF15F36A1C00 call [X:F():int]
03F0 add esi, eax
FF15EB6A1C00 call [X:F():int]
03F0 add esi, eax
FF15E36A1C00 call [X:F():int]
03F0 add esi, eax
8BC6 mov eax, esi
;; size=34 bbWeight=1 PerfScore 13.25
G_M30435_IG03: ;; offset=0029H
4883C420 add rsp, 32
5E pop rsi
C3 ret
;; size=6 bbWeight=1 PerfScore 1.75 This is pretty good given the ABI constraints. The very last add could be commuted to put the sum into |
Look on the right-hand side, under Reviewers, for "Still in progress? Convert to draft" |
Thanks, converted. Not sure this is viable yet. Perhaps the work @jakobbotsch is considering on call args ordering might clean some of these problem cases up magically. |
Possibly worth re-evaluating once #76671 is in? |
I spent some time understanding the Here
I am working on a prototype where we iterate through all the intervals and mark them
|
With that we get this: ; Assembly listing for method System.Memory.TestClass:A(double):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) double -> [rsp+40H] single-def
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) double -> zero-ref "non-inline candidate call"
;* V03 tmp2 [V03 ] ( 0, 0 ) double -> zero-ref "non-inline candidate call"
;* V04 tmp3 [V04 ] ( 0, 0 ) double -> zero-ref "non-inline candidate call"
;
; Lcl frame size = 56
G_M29446_IG01:
sub rsp, 56
vzeroupper
vmovaps qword ptr [rsp+20H], xmm6
vmovsd qword ptr [rsp+40H], xmm0
;; size=19 bbWeight=1 PerfScore 4.25
G_M29446_IG02:
call [hackishMethodName]
vmovaps xmm6, xmm0
call [hackishMethodName]
vaddsd xmm6, xmm6, xmm0
call [hackishMethodName]
vaddsd xmm6, xmm6, xmm0
call [hackishMethodName]
vaddsd xmm0, xmm6, xmm0
vaddsd xmm0, xmm0, qword ptr [rsp+40H]
;; size=46 bbWeight=1 PerfScore 26.25
G_M29446_IG03:
vmovaps xmm6, qword ptr [rsp+20H]
add rsp, 56
ret
;; size=11 bbWeight=1 PerfScore 5.25 |
My quick fix to update this for all intervals inside // Keep tracks of refpositions that are calls
list<RefPositions> callRelatedRefPositions = ...
for (interval : intervals) {
defRefPos = interval->firstRefPosition;
lastUseRefPos = interval->lastRefPosition;
for (callRefPosition : callRelatedRefPositions) {
if ((defRefPos->location < callRefPosition->location) && (lastUseRefPos->location > callRefPosition->location)) {
interval->preferCalleeSave = true;
break;
}
}
}
|
Here is what we get. Here, we G_M29446_IG01:
sub rsp, 56
vzeroupper
vmovaps qword ptr [rsp+20H], xmm6
vmovsd qword ptr [rsp+40H], xmm0
;; size=19 bbWeight=1 PerfScore 4.25
G_M29446_IG02:
call [hackishMethodName]
vmovaps xmm6, xmm0
call [hackishMethodName]
vaddsd xmm6, xmm6, xmm0
call [hackishMethodName]
vaddsd xmm6, xmm6, xmm0
call [hackishMethodName]
vaddsd xmm0, xmm6, xmm0
vaddsd xmm0, xmm0, qword ptr [rsp+40H]
;; size=46 bbWeight=1 PerfScore 26.25
G_M29446_IG03:
vmovaps xmm6, qword ptr [rsp+20H]
add rsp, 56
ret
;; size=11 bbWeight=1 PerfScore 5.25 Here is sample diff from benchmark. I will put a PR. |
Not going to be able to get back to this any time soon, so will close. |
Allow forward sub for multiple use/def locals if we can prove that a given def has just one adjacent use, eg the first def of
X
in: