Minor performance regression going to dotnet core 3.1 from .NET 4.8 #31613

Open
mrange opened this issue Feb 1, 2020 · 12 comments

@mrange

mrange commented Feb 1, 2020

Hi.

I saw a performance regression going from .NET 4.8 to dotnet core 3.1. It's small, so in practice it might not hurt most users, but I thought it better to create an issue than to keep mum about it.

I noticed it when discussing my other issue, #2191, so the code will be similar, although I don't think this is tail-call related; I don't know for sure, of course.

When setting up a simple push stream pipeline:

// Minimalistic PushStream
//  A PushStream accepts a receiver function that will be called
//  with each value in the PushStream
type 'T PushStream = ('T -> unit) -> unit

module PushStream =
  let inline zero      ()       = LanguagePrimitives.GenericZero
  let inline push      r v      = r v

  // Creates a PushStream with all integers from b to e (inclusive)
  let inline fromRange b e    r = for i = b to e do push r i
  // Maps all values in ps using mapping function f
  let inline map       f   ps r = ps (fun v -> push r (f v))
  // Filters all values in ps using filter function f
  let inline filter    f   ps r = ps (fun v -> if f v then push r v)
  // Sums all values in ps
  let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

[<DisassemblyDiagnoser>]
type Benchmarks () =
  [<Params (10000, 100)>] 
  member val public Count = 100 with get, set

  [<Benchmark>]
  member x.SimplePushStreamTest () =
    PushStream.fromRange  0 x.Count
    |> PushStream.map     int64
    |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
    |> PushStream.map     ((+) 1L)
    |> PushStream.sum
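
For reference (my sketch, not part of the benchmark): the pipeline above computes the same value as this plain loop; the benchmark deliberately routes the work through the closure-based PushStream stages instead.

// Equivalent plain loop (sketch only): sum of (v + 1) over every even v in 0 .. count
let expected count =
  let mutable s = 0L
  for i = 0 to count do
    let v = int64 i
    if (v &&& 1L) = 0L then s <- s + (v + 1L)
  s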

BenchmarkDotNet reports:

$ dotnet run -c Release -f netcoreapp3.1 --filter '*' --runtimes net48 netcoreapp3.1
...
|               Method |       Runtime |     Toolchain | Count |        Mean |     Error |    StdDev | Ratio | RatioSD | Code Size |
|--------------------- |-------------- |-------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|
| SimplePushStreamTest |      .NET 4.8 |         net48 |   100 |    400.6 ns |   3.92 ns |   3.67 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | .NET Core 3.1 | netcoreapp3.1 |   100 |    439.3 ns |   4.35 ns |   4.07 ns |  1.10 |    0.02 |     273 B |
|                      |               |               |       |             |           |           |       |         |           |
| SimplePushStreamTest |      .NET 4.8 |         net48 | 10000 | 33,542.5 ns | 143.25 ns | 133.99 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | .NET Core 3.1 | netcoreapp3.1 | 10000 | 39,449.8 ns | 259.08 ns | 242.35 ns |  1.18 |    0.01 |     273 B |

.NET 4.8 performs between 10% and 20% faster than dotnet core 3.1.

I dug a bit into the jitted assembly and found the following differences:

--- dotnetcore.asm
+++ net48.asm
@@ -1,4 +1,4 @@
-; dotnet core 3.1
+; .net v48
 
 ; PushStream.fromRange  0 x.Count
 LOOP:
@@ -12,7 +12,6 @@
 jne     LOOP
 
 ; PushStream.map     int64
-nop     dword ptr [rax+rax]
 mov     rcx,qword ptr [rcx+8]
 movsxd  rdx,edx
 mov     rax,qword ptr [rcx]
@@ -21,8 +20,7 @@
 jmp     rax
 
 ; PushStream.filter  (fun v -> (v &&& 1L) = 0L)
-nop     dword ptr [rax+rax]
-mov     eax,edx
+mov     rax,rdx
 test    al,1
 jne     BAILOUT
 mov     rcx,qword ptr [rcx+8]
@@ -35,7 +33,6 @@
 ret
 
 ; PushStream.map     ((+) 1L)
-nop     dword ptr [rax+rax]
 mov     rcx,qword ptr [rcx+8]
 inc     rdx
 mov     rax,qword ptr [rcx]
@@ -44,11 +41,9 @@
 jmp     rax
 
 ; PushStream.sum
-nop     dword ptr [rax+rax]
 mov     rax,qword ptr [rcx+8]
 mov     rcx,rax
 add     rdx,qword ptr [rax+8]
 mov     qword ptr [rcx+8],rdx
 xor     eax,eax
 ret
-

It seems that in dotnet core there's an extra nop at the start of each method. I suspected tiered compilation, but after much messing about trying to disable it, either this is unrelated or I wasn't able to disable tiered compilation successfully.
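
For reference (my addition, not part of the original runs): on .NET Core 3.1, tiered compilation is normally switched off either by setting the COMPlus_TieredCompilation environment variable before launching the benchmark (the BenchmarkDotNet child processes inherit it), or with <TieredCompilation>false</TieredCompilation> in the project file.

$ export COMPlus_TieredCompilation=0   # on Windows: set COMPlus_TieredCompilation=0
$ dotnet run -c Release -f netcoreapp3.1 --filter '*' --runtimes net48 netcoreapp3.1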

It surprises me that the nop adds this much overhead, but I can't spot anything else of significance.

The code is here: https://github.com/mrange/TryNewDisassembler/tree/fsharpPerformanceRegression

And here:

module PerformanceRegression =
  open System
  open System.Linq
  open System.Diagnostics

  // Minimalistic PushStream
  //  A PushStream accepts a receiver function that will be called
  //  with each value in the PushStream
  type 'T PushStream = ('T -> unit) -> unit

  module PushStream =
    let inline zero      ()       = LanguagePrimitives.GenericZero
    let inline push      r v      = r v

    // Creates a PushStream with all integers from b to e (inclusive)
    let inline fromRange b e    r = for i = b to e do push r i
    // Maps all values in ps using mapping function f
    let inline map       f   ps r = ps (fun v -> push r (f v))
    // Filters all values in ps using filter function f
    let inline filter    f   ps r = ps (fun v -> if f v then push r v)
    // Sums all values in ps
    let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

  module Tests =
    open BenchmarkDotNet.Attributes
    open BenchmarkDotNet.Configs
    open BenchmarkDotNet.Jobs
    open BenchmarkDotNet.Horology
    open BenchmarkDotNet.Running
    open BenchmarkDotNet.Diagnostics.Windows.Configs

    [<DisassemblyDiagnoser>]
    type Benchmarks () =
      [<Params (10000, 100)>] 
      member val public Count = 100 with get, set

      [<Benchmark>]
      member x.SimplePushStreamTest () =
        PushStream.fromRange  0 x.Count
        |> PushStream.map     int64
        |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
        |> PushStream.map     ((+) 1L)
        |> PushStream.sum

    let run argv = 
      let job = Job.Default
                    .WithWarmupCount(30)
                    .WithIterationTime(TimeInterval.FromMilliseconds(250.0)) // the default is 0.5s per iteration, which is slightly too much for us
                    .WithMinIterationCount(15)
                    .WithMaxIterationCount(20)
                    .AsDefault()
      let config = DefaultConfig.Instance.AddJob(job)
      let b = BenchmarkSwitcher [|typeof<Benchmarks>|]
      let summary = b.Run(argv, config)
      printfn "%A" summary

// Run with: dotnet run -c Release -f netcoreapp3.1 --filter '*' --runtimes net48 netcoreapp3.1
[<EntryPoint>]
let main argv =
  PerformanceRegression.Tests.run argv
  0

category:cq
theme:optimization
skill-level:intermediate
cost:medium

mrange changed the title from "Performance regression between dotnet core 3.1 from .NET 4.8" to "Minor performance regression going to dotnet core 3.1 from .NET 4.8" on Feb 1, 2020
jkotas added the tenet-performance (Performance related issue) label on Feb 1, 2020
@jkotas
Member

jkotas commented Feb 1, 2020

The extra no-ops should be gone in current master after dotnet/coreclr#26740.

@jkotas
Member

jkotas commented Feb 1, 2020

cc @adamsitnik

jkotas added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) label on Feb 1, 2020
@AndyAyersMS
Member

We should verify perf is recovered. Let's take a look as part of .NET 5.

AndyAyersMS added this to the 5.0 milestone on May 6, 2020
@BruceForstall
Member

@erozenfeld Can you verify if any work remains here?

@AndyAyersMS
Member

I still see a regression with a fairly recent P8 build:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20359.6
  [Host]     : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT DEBUG
  Job-DTXTLF : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  Job-XPSVFA : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
  Job-KQQRIJ : .NET Core 5.0.0 (CoreCLR 5.0.20.35809, CoreFX 5.0.20.35809), X64 RyuJIT

MaxIterationCount=20  MinIterationCount=15  WarmupCount=30

|               Method |        Job |       Runtime |     Toolchain | Count |        Mean |       Error |      StdDev | Ratio | RatioSD |
|--------------------- |----------- |-------------- |-------------- |------ |------------:|------------:|------------:|------:|--------:|
| SimplePushStreamTest | Job-DTXTLF |      .NET 4.8 |         net48 |   100 |    372.9 ns |     4.34 ns |     4.06 ns |  1.00 |    0.00 |
| SimplePushStreamTest | Job-XPSVFA | .NET Core 3.1 | netcoreapp3.1 |   100 |    406.1 ns |     9.33 ns |     9.58 ns |  1.09 |    0.03 |
| SimplePushStreamTest | Job-KQQRIJ | .NET Core 5.0 |  netcoreapp50 |   100 |    439.1 ns |    23.10 ns |    26.60 ns |  1.18 |    0.08 |
|                      |            |               |               |       |             |             |             |       |         |
| SimplePushStreamTest | Job-DTXTLF |      .NET 4.8 |         net48 | 10000 | 29,770.6 ns |   531.09 ns |   470.80 ns |  1.00 |    0.00 |
| SimplePushStreamTest | Job-XPSVFA | .NET Core 3.1 | netcoreapp3.1 | 10000 | 30,609.8 ns |   402.64 ns |   356.93 ns |  1.03 |    0.02 |
| SimplePushStreamTest | Job-KQQRIJ | .NET Core 5.0 |  netcoreapp50 | 10000 | 37,312.9 ns | 1,136.50 ns | 1,167.10 ns |  1.25 |    0.04 |

@AndyAyersMS
Member

@briansull if you haven't started looking at this, I'll take a look.

@briansull
Contributor

Sure, go ahead and take a look.

AndyAyersMS self-assigned this on Aug 7, 2020
@AndyAyersMS
Member

Suspect this is some residual overhead from tiered compilation. Disabling that gives:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20359.6
  [Host]     : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT DEBUG
  Job-FZFFZC : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  Job-FIZZFG : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
  Job-HDIAOD : .NET Core 5.0.0 (CoreCLR 5.0.20.35809, CoreFX 5.0.20.35809), X64 RyuJIT

MaxIterationCount=20  MinIterationCount=15  WarmupCount=30

|               Method |        Job |       Runtime |     Toolchain | Count |        Mean |     Error |    StdDev | Ratio | RatioSD | Code Size |
|--------------------- |----------- |-------------- |-------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|
| SimplePushStreamTest | Job-FZFFZC |      .NET 4.8 |         net48 |   100 |    375.1 ns |   4.80 ns |   4.71 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | Job-FIZZFG | .NET Core 3.1 | netcoreapp3.1 |   100 |    388.9 ns |  15.25 ns |  15.66 ns |  1.03 |    0.04 |     272 B |
| SimplePushStreamTest | Job-HDIAOD | .NET Core 5.0 | netcoreapp5.0 |   100 |    389.1 ns |   7.26 ns |   6.79 ns |  1.04 |    0.02 |     273 B |
|                      |            |               |               |       |             |           |           |       |         |           |
| SimplePushStreamTest | Job-FZFFZC |      .NET 4.8 |         net48 | 10000 | 31,360.3 ns | 545.98 ns | 510.71 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | Job-FIZZFG | .NET Core 3.1 | netcoreapp3.1 | 10000 | 30,618.3 ns | 217.74 ns | 193.02 ns |  0.98 |    0.02 |     272 B |
| SimplePushStreamTest | Job-HDIAOD | .NET Core 5.0 | netcoreapp5.0 | 10000 | 31,787.3 ns | 412.13 ns | 385.51 ns |  1.01 |    0.02 |     273 B |

Differential profiles of 4.8 vs 5.0 show identical codegen in all the hot methods, e.g.:

;; 4.8
;; Program+PerformanceRegression+Tests+SimplePushStreamTest@42-6.Invoke(Int64)

00007ff8`f2816040 488b4908        mov     rcx,qword ptr [rcx+8]
00007ff8`f2816044 48ffc2          inc     rdx
00007ff8`f2816047 488b01          mov     rax,qword ptr [rcx]
00007ff8`f281604a 488b4040        mov     rax,qword ptr [rax+40h]
00007ff8`f281604e 488b4020        mov     rax,qword ptr [rax+20h]
00007ff8`f2816052 48ffe0          jmp     rax

;; 5.0
;; Assembly listing for method SimplePushStreamTest@42-6:Invoke(long):Unit:this

G_M35068_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M35068_IG02:
       mov      rcx, gword ptr [rcx+8]
       inc      rdx
       mov      rax, qword ptr [rcx]
       mov      rax, qword ptr [rax+64]
       mov      rax, qword ptr [rax+32]
						;; bbWeight=1    PerfScore 8.25
G_M35068_IG03:
       rex.jmp  rax

cc @kouvel

Going to relabel this as VM.

AndyAyersMS added the area-VM-coreclr label and removed the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) label on Aug 8, 2020
@kouvel
Member

kouvel commented Aug 11, 2020

This seems to have something to do with JIT timing. I have seen cases before where different loop alignment led to noticeable regressions when tiering is enabled. It could just as well turn out better by chance, but naturally the regressions get noticed more. Adjusting the timing of rejits by changing the call count threshold seems to change the perf significantly. Code locality may also be relevant: I noticed that using R2R'ed runtime binaries versus IL-only runtime binaries also affects perf significantly, regardless of tiering. Tiering timings have changed between 3.1 and 5.0, typically for the better, though there may be benchmarks where slight differences in timing show up as larger differences in performance. I suspect this is not a regression due to a bug but rather a regression due to chance. It would need more investigation to determine the root causes and, if my theory is right, to determine how to reduce the chance factor and make it more deterministic.
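
For anyone reproducing these timing experiments, the knobs mentioned above map to runtime configuration variables roughly as follows (my sketch; these are the COMPlus_ names as I understand them for .NET Core 3.1/5.0, so double-check the exact spellings before relying on them):

$ export COMPlus_TieredCompilation=0        # disable tiering entirely
$ export COMPlus_TC_CallCountThreshold=100  # change the call count that triggers promotion to tier 1
$ export COMPlus_ReadyToRun=0               # ignore the precompiled (R2R) code in the runtime binaries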

kouvel modified the milestones: 5.0.0, 6.0.0, Future on Aug 11, 2020
@kouvel
Member

kouvel commented Aug 11, 2020

I see large differences in time spent in a function when the function's code crosses a cache line boundary. For small functions it may help to align them such that they would fit within a cache line.
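
To illustrate (a sketch, not part of the measurements): with 64-byte cache lines, a small body only avoids the extra cost when it fits in the bytes remaining after its start offset.

// Sketch: does a body of `size` bytes whose code starts `offset` bytes into
// a 64-byte cache line spill over into the next line?
let crossesCacheLine (offset : int) (size : int) =
  (offset % 64) + size > 64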

kouvel modified the milestones: Future, 6.0.0 on Aug 11, 2020
@AndyAyersMS
Member

#2249 added 32-byte alignment for tier1 methods with loops; it might be interesting to try this for all tier1 methods.

@kouvel
Member

kouvel commented Aug 11, 2020

I see, I was also thinking something like this:

  • If 0 < size <= 16 then align at any 16-byte offset in a cache line
  • If 16 < size <= 32 then align at offsets 0, 16, or 32, but not at 48
  • If 32 < size <= 48 then align at offsets 0 or 16
  • If 48 < size <= 64 then align at offset 0 only

Might allow for a bit more density, but #2249 also mentioned that crossing a 32-byte boundary has perf issues for loops, so 32-byte alignment may work better in some cases.
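
A tiny sketch of that placement rule (illustration only, assuming 64-byte lines and 16-byte slots as above): the candidate start offsets are exactly the 16-byte slots from which a body of the given size does not cross the line.

// Candidate 16-byte start offsets within a 64-byte cache line for a method
// body of `size` bytes (1..64), following the scheme above
let candidateOffsets (size : int) =
  [ 0; 16; 32; 48 ] |> List.filter (fun offset -> offset + size <= 64)

// candidateOffsets 20 returns [0; 16; 32]  (the "16 < size <= 32" case)
// candidateOffsets 60 returns [0]          (the "48 < size <= 64" case)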

mangod9 modified the milestones: 6.0.0, Future on Jul 26, 2021