unpeel group by 3 ways to enable vectorization #7949

@richardstartin richardstartin commented Dec 22, 2021

This rearranges the group key assignment in two steps. First, it splits the loop that computes the group ids from the loop that marks the existence of the groups in a block. Second, it specialises the group id calculation loop for the cases with 2 and 3 group by expressions respectively. This simplifies the group id calculation loop to the extent that C2 can vectorize it on JDK11. In a microbenchmark small enough to examine the generated code, this speeds the code up by roughly 5x or more compared to a loop which can handle any number of group by expressions:

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.ThreadLocalRandom;

public class GroupByBenchmark {


  @State(Scope.Benchmark)
  public static class OneState {

    @Param({"10", "100"})
    int cardinality0;

    @Param({"1024", "10000"})
    int numDocs;

    Generic generic;
    One one;

    int[] outGroupIds;
    int[] dictIds0;
    int[][] dictIds;

    @Setup(Level.Trial)
    public void setup() {
      one = new One(cardinality0);
      generic = new Generic(new int[]{cardinality0});
      outGroupIds = new int[numDocs];
      dictIds0 = new int[numDocs];
      for (int i = 0; i < numDocs; i++) {
        dictIds0[i] = ThreadLocalRandom.current().nextInt(cardinality0);
      }
      dictIds = new int[][] {dictIds0};
    }
  }

  @State(Scope.Benchmark)
  public static class TwoState {

    @Param({"10", "100"})
    int cardinality0;

    @Param({"3", "30"})
    int cardinality1;

    @Param("1024")
    int numDocs;

    Generic generic;
    Two two;

    int[] outGroupIds;
    int[] dictIds0;
    int[] dictIds1;
    int[][] dictIds;

    @Setup(Level.Trial)
    public void setup() {
      two = new Two(cardinality0, cardinality1);
      generic = new Generic(new int[]{cardinality0, cardinality1});
      outGroupIds = new int[numDocs];
      dictIds0 = new int[numDocs];
      dictIds1 = new int[numDocs];
      for (int i = 0; i < numDocs; i++) {
        dictIds0[i] = ThreadLocalRandom.current().nextInt(cardinality0);
        dictIds1[i] = ThreadLocalRandom.current().nextInt(cardinality1);
      }
      dictIds = new int[][] {dictIds0, dictIds1};
    }
  }

  @State(Scope.Benchmark)
  public static class ThreeState {

    @Param({"10", "100"})
    int cardinality0;

    @Param({"3", "30"})
    int cardinality1;

    @Param({"5", "50"})
    int cardinality2;

    @Param({"1024", "10000"})
    int numDocs;

    Generic generic;
    Three three;

    int[] outGroupIds;
    int[] dictIds0;
    int[] dictIds1;
    int[] dictIds2;
    int[][] dictIds;

    @Setup(Level.Trial)
    public void setup() {
      three = new Three(cardinality0, cardinality1, cardinality2);
      generic = new Generic(new int[]{cardinality0, cardinality1, cardinality2});
      outGroupIds = new int[numDocs];
      dictIds0 = new int[numDocs];
      dictIds1 = new int[numDocs];
      dictIds2 = new int[numDocs];
      for (int i = 0; i < numDocs; i++) {
        dictIds0[i] = ThreadLocalRandom.current().nextInt(cardinality0);
        dictIds1[i] = ThreadLocalRandom.current().nextInt(cardinality1);
        dictIds2[i] = ThreadLocalRandom.current().nextInt(cardinality2);
      }
      dictIds = new int[][] {dictIds0, dictIds1, dictIds2};
    }
  }

  public static class One {
    private int numKeys;
    private int groupCountUpperBound;
    private final boolean[] groups;

    public One(int cardinality) {
      this.groupCountUpperBound = cardinality;
      this.groups = new boolean[groupCountUpperBound];
    }
    public void apply(int numDocs, int[] dictIds, int[] outGroupIds) {
      System.arraycopy(dictIds, 0, outGroupIds, 0, numDocs);
      for (int i = 0; i < numDocs && numKeys < groupCountUpperBound; i++) {
        if (!groups[outGroupIds[i]]) {
          numKeys++;
          groups[outGroupIds[i]] = true;
        }
      }
    }
  }

  public static class Two {
    private final int groupCountUpperBound;
    private final boolean[] groups;
    private final int cardinality0;
    private int numKeys;


    public Two(int cardinality0, int cardinality1) {
      this.groupCountUpperBound = cardinality0 * cardinality1;
      this.groups = new boolean[groupCountUpperBound];
      this.cardinality0 = cardinality0;
    }

    public void apply(int numDocs, int[] dictIds0, int[] dictIds1, int[] outGroupIds) {
      for (int i = 0; i < numDocs; i++) {
        outGroupIds[i] = dictIds1[i] * cardinality0 + dictIds0[i];
      }
      for (int i = 0; i < numDocs && numKeys < groupCountUpperBound; i++) {
        if (!groups[outGroupIds[i]]) {
          numKeys++;
          groups[outGroupIds[i]] = true;
        }
      }
    }
  }

  public static class Three {
    private final int groupCountUpperBound;
    private final boolean[] groups;
    private final int cardinality0;
    private final int cardinality1;
    private int numKeys;


    public Three(int cardinality0, int cardinality1, int cardinality2) {
      this.groupCountUpperBound = cardinality0 * cardinality1 * cardinality2;
      this.groups = new boolean[groupCountUpperBound];
      this.cardinality0 = cardinality0;
      this.cardinality1 = cardinality1;
    }

    public void apply(int numDocs, int[] dictIds0, int[] dictIds1, int[] dictIds2, int[] outGroupIds) {
      for (int i = 0; i < numDocs; i++) {
        outGroupIds[i] = (dictIds2[i] * cardinality1 + dictIds1[i]) * cardinality0 + dictIds0[i];
      }
      for (int i = 0; i < numDocs && numKeys < groupCountUpperBound; i++) {
        if (!groups[outGroupIds[i]]) {
          numKeys++;
          groups[outGroupIds[i]] = true;
        }
      }
    }
  }

  public static class Generic {
    private final int groupCountUpperBound;
    private final int numExpressions;
    private final int[] cardinalities;
    private final boolean[] groups;
    int numKeys;

    public Generic(int[] cardinalities) {
      this.numExpressions = cardinalities.length;
      int count = 1;
      for (int cardinality : cardinalities) {
        count *= cardinality;
      }
      this.groupCountUpperBound = count;
      this.groups = new boolean[groupCountUpperBound];
      this.cardinalities = cardinalities;
    }

    public void apply(int numDocs, int[][] dictIds, int[] outGroupIds) {
      for (int i = 0; i < numDocs; i++) {
        int groupId = 0;
        for (int j = numExpressions - 1; j >= 0; j--) {
          groupId = groupId * cardinalities[j] + dictIds[j][i];
        }
        outGroupIds[i] = groupId;
        // if the flag is false, then increase the key num
        if (!groups[groupId]) {
          numKeys++;
        }
        groups[groupId] = true;
      }

    }
  }

  @Benchmark
  public int[] one(OneState state) {
    state.one.apply(state.numDocs, state.dictIds0, state.outGroupIds);
    return state.outGroupIds;
  }

  @Benchmark
  public int[] oneGeneric(OneState state) {
    state.generic.apply(state.numDocs, state.dictIds, state.outGroupIds);
    return state.outGroupIds;
  }

  @Benchmark
  public int[] two(TwoState state) {
    state.two.apply(state.numDocs, state.dictIds0, state.dictIds1, state.outGroupIds);
    return state.outGroupIds;
  }

  @Benchmark
  public int[] twoGeneric(TwoState state) {
    state.generic.apply(state.numDocs, state.dictIds, state.outGroupIds);
    return state.outGroupIds;
  }

  @Benchmark
  public int[] three(ThreeState state) {
    state.three.apply(state.numDocs, state.dictIds0, state.dictIds1, state.dictIds2, state.outGroupIds);
    return state.outGroupIds;
  }

  @Benchmark
  public int[] threeGeneric(ThreeState state) {
    state.generic.apply(state.numDocs, state.dictIds, state.outGroupIds);
    return state.outGroupIds;
  }
}
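
As a sanity check on the specialisation, the sketch below compares the generic nested-loop group id with the unrolled 3-way formula used by `Three`, and decodes an id back into its dictionary ids. The class and method names (`GroupIdEquivalence`, `genericGroupId`, `threeWayGroupId`, `decode`) are hypothetical and not part of this PR; it only illustrates that both forms are the same mixed-radix encoding.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of the PR: checks that the specialised
// 3-way formula matches the generic loop for random dictionary ids.
final class GroupIdEquivalence {

  // Generic form: Horner-style evaluation over any number of expressions,
  // mirroring the inner loop of Generic.apply.
  static int genericGroupId(int[] cardinalities, int[][] dictIds, int doc) {
    int groupId = 0;
    for (int j = cardinalities.length - 1; j >= 0; j--) {
      groupId = groupId * cardinalities[j] + dictIds[j][doc];
    }
    return groupId;
  }

  // Specialised form for exactly three expressions, mirroring Three.apply.
  // The cardinality of the last expression is not needed.
  static int threeWayGroupId(int c0, int c1, int d0, int d1, int d2) {
    return (d2 * c1 + d1) * c0 + d0;
  }

  // Inverse of the encoding: recovers {d0, d1, d2} from a group id.
  static int[] decode(int groupId, int c0, int c1) {
    int d0 = groupId % c0;
    int rest = groupId / c0;
    return new int[] {d0, rest % c1, rest / c1};
  }

  public static void main(String[] args) {
    int[] cardinalities = {100, 30, 50};
    int numDocs = 1024;
    Random random = new Random(42);
    int[][] dictIds = new int[3][numDocs];
    for (int j = 0; j < 3; j++) {
      for (int i = 0; i < numDocs; i++) {
        dictIds[j][i] = random.nextInt(cardinalities[j]);
      }
    }
    for (int i = 0; i < numDocs; i++) {
      int generic = genericGroupId(cardinalities, dictIds, i);
      int special = threeWayGroupId(cardinalities[0], cardinalities[1],
          dictIds[0][i], dictIds[1][i], dictIds[2][i]);
      if (generic != special) {
        throw new AssertionError("mismatch at doc " + i);
      }
      int[] decoded = decode(special, cardinalities[0], cardinalities[1]);
      int[] expected = {dictIds[0][i], dictIds[1][i], dictIds[2][i]};
      if (!Arrays.equals(decoded, expected)) {
        throw new AssertionError("decode mismatch at doc " + i);
      }
    }
    System.out.println("generic and specialised group ids agree");
  }
}
```

For example, with cardinalities 100 x 30 x 50 and dictionary ids (7, 12, 3), both forms give (3 * 30 + 12) * 100 + 7 = 10207.
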
Benchmark                       (cardinality0)  (cardinality1)  (cardinality2)  (numDocs)  Mode  Cnt      Score       Error  Units
GroupByBenchmark.one                        10             N/A             N/A       1024  avgt    5     87.888 ±     0.035  ns/op
GroupByBenchmark.one                       100             N/A             N/A       1024  avgt    5     85.826 ±     0.090  ns/op
GroupByBenchmark.oneGeneric                 10             N/A             N/A       1024  avgt    5   3968.017 ±     4.523  ns/op
GroupByBenchmark.oneGeneric                100             N/A             N/A       1024  avgt    5   3967.884 ±     2.931  ns/op
GroupByBenchmark.three                      10               3               5       1024  avgt    5    180.661 ±     0.162  ns/op
GroupByBenchmark.three                      10               3               5      10000  avgt    5   2519.999 ±    15.816  ns/op
GroupByBenchmark.three                      10               3              50       1024  avgt    5    788.203 ±     0.850  ns/op
GroupByBenchmark.three                      10               3              50      10000  avgt    5   8369.957 ±    14.328  ns/op
GroupByBenchmark.three                      10              30               5       1024  avgt    5    789.272 ±     3.107  ns/op
GroupByBenchmark.three                      10              30               5      10000  avgt    5   8352.994 ±    25.510  ns/op
GroupByBenchmark.three                      10              30              50       1024  avgt    5    791.969 ±     0.755  ns/op
GroupByBenchmark.three                      10              30              50      10000  avgt    5   8456.677 ±    39.868  ns/op
GroupByBenchmark.three                     100               3               5       1024  avgt    5    788.635 ±     1.158  ns/op
GroupByBenchmark.three                     100               3               5      10000  avgt    5   8473.065 ±    43.604  ns/op
GroupByBenchmark.three                     100               3              50       1024  avgt    5    792.696 ±     1.293  ns/op
GroupByBenchmark.three                     100               3              50      10000  avgt    5   8400.931 ±    39.148  ns/op
GroupByBenchmark.three                     100              30               5       1024  avgt    5    792.267 ±     0.815  ns/op
GroupByBenchmark.three                     100              30               5      10000  avgt    5   8362.967 ±    46.329  ns/op
GroupByBenchmark.three                     100              30              50       1024  avgt    5   1009.376 ±     3.741  ns/op
GroupByBenchmark.three                     100              30              50      10000  avgt    5  11784.628 ±    75.056  ns/op
GroupByBenchmark.threeGeneric               10               3               5       1024  avgt    5   5168.556 ±    23.392  ns/op
GroupByBenchmark.threeGeneric               10               3               5      10000  avgt    5  55207.217 ± 26107.977  ns/op
GroupByBenchmark.threeGeneric               10               3              50       1024  avgt    5   5203.942 ±     7.881  ns/op
GroupByBenchmark.threeGeneric               10               3              50      10000  avgt    5  55742.896 ± 24792.893  ns/op
GroupByBenchmark.threeGeneric               10              30               5       1024  avgt    5   5163.989 ±     2.379  ns/op
GroupByBenchmark.threeGeneric               10              30               5      10000  avgt    5  55189.596 ± 26087.251  ns/op
GroupByBenchmark.threeGeneric               10              30              50       1024  avgt    5   5208.037 ±     9.829  ns/op
GroupByBenchmark.threeGeneric               10              30              50      10000  avgt    5  55353.802 ± 25864.541  ns/op
GroupByBenchmark.threeGeneric              100               3               5       1024  avgt    5   5163.058 ±     3.719  ns/op
GroupByBenchmark.threeGeneric              100               3               5      10000  avgt    5  55229.799 ± 26282.820  ns/op
GroupByBenchmark.threeGeneric              100               3              50       1024  avgt    5   5165.277 ±     1.592  ns/op
GroupByBenchmark.threeGeneric              100               3              50      10000  avgt    5  55250.657 ± 26038.710  ns/op
GroupByBenchmark.threeGeneric              100              30               5       1024  avgt    5   5248.230 ±    10.805  ns/op
GroupByBenchmark.threeGeneric              100              30               5      10000  avgt    5  55216.541 ± 26101.864  ns/op
GroupByBenchmark.threeGeneric              100              30              50       1024  avgt    5   5169.316 ±     7.386  ns/op
GroupByBenchmark.threeGeneric              100              30              50      10000  avgt    5  55558.478 ± 25691.232  ns/op
GroupByBenchmark.two                        10               3             N/A       1024  avgt    5    112.575 ±     0.119  ns/op
GroupByBenchmark.two                        10              30             N/A       1024  avgt    5    904.502 ±     0.849  ns/op
GroupByBenchmark.two                       100               3             N/A       1024  avgt    5    901.040 ±     3.153  ns/op
GroupByBenchmark.two                       100              30             N/A       1024  avgt    5    902.323 ±     0.614  ns/op
GroupByBenchmark.twoGeneric                 10               3             N/A       1024  avgt    5   4370.175 ±    21.262  ns/op
GroupByBenchmark.twoGeneric                 10              30             N/A       1024  avgt    5   4396.317 ±   129.928  ns/op
GroupByBenchmark.twoGeneric                100               3             N/A       1024  avgt    5   4372.822 ±    16.521  ns/op
GroupByBenchmark.twoGeneric                100              30             N/A       1024  avgt    5   4371.179 ±    21.749  ns/op
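
The speedups implied by the table can be read off by dividing the generic score by the specialised score for the same parameters. The sketch below does this for a few rows, with the scores copied from the table above; `SpeedupFromTable` and `speedup` are hypothetical names used only for illustration, and the error bars are ignored.

```java
import java.util.Locale;

// Hypothetical helper, not part of the PR: computes generic/specialised
// ratios for a few rows of the JMH results (scores in ns/op).
final class SpeedupFromTable {

  static double speedup(double genericNsPerOp, double specialisedNsPerOp) {
    return genericNsPerOp / specialisedNsPerOp;
  }

  public static void main(String[] args) {
    // oneGeneric vs one, cardinality0 = 10, numDocs = 1024
    System.out.printf(Locale.ROOT, "one:   %.2fx%n", speedup(3968.017, 87.888));
    // twoGeneric vs two, cardinalities 100 x 30, numDocs = 1024
    System.out.printf(Locale.ROOT, "two:   %.2fx%n", speedup(4371.179, 902.323));
    // threeGeneric vs three, cardinalities 100 x 30 x 50, numDocs = 1024
    System.out.printf(Locale.ROOT, "three: %.2fx%n", speedup(5169.316, 1009.376));
  }
}
```

The ratio varies a lot with the parameters, so it is worth computing it per row rather than quoting a single figure.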

The difference in code generation can be seen with perfasm (for a 3D group by, cardinalities 100 x 30 x 50):

before:

....[Hottest Region 1]..............................................................................
c2, level 4, groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub, version 497 (183 bytes) 

                 0x00007ff78b6eef95: lea    (%r12,%r11,8),%rbp
                 0x00007ff78b6eef99: vmovd  %r11d,%xmm1
                 0x00007ff78b6eef9e: vmovd  %xmm2,%r11d
                 0x00007ff78b6eefa3: lea    (%r12,%r11,8),%rax
                 0x00007ff78b6eefa7: xor    %ecx,%ecx
         ╭       0x00007ff78b6eefa9: jmp    0x00007ff78b6eefdf
         │↗      0x00007ff78b6eefab: xor    %r8d,%r8d          ;*iflt {reexecute=0 rethrow=0 return_oop=0}
         ││                                                    ; - groupby.GroupByBenchmark$Generic::apply@22 (line 213)
         ││                                                    ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
         ││                                                    ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  5.10%  ││   ↗  0x00007ff78b6eefae: mov    %r8d,0x10(%rbp,%rcx,4)  ;*iastore {reexecute=0 rethrow=0 return_oop=0}
         ││   │                                                ; - groupby.GroupByBenchmark$Generic::apply@56 (line 216)
         ││   │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
         ││   │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.73%  ││   │  0x00007ff78b6eefb3: cmp    %r14d,%r8d
         ││   │  0x00007ff78b6eefb6: jae    0x00007ff78b6ef0d0
  0.09%  ││   │  0x00007ff78b6eefbc: movslq %r8d,%r11
  1.00%  ││   │  0x00007ff78b6eefbf: movzbl 0x10(%rax,%r11,1),%edx  ;*baload {reexecute=0 rethrow=0 return_oop=0}
         ││   │                                                ; - groupby.GroupByBenchmark$Generic::apply@63 (line 218)
         ││   │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
         ││   │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
 28.57%  ││   │  0x00007ff78b6eefc5: test   %edx,%edx
         ││   │  0x00007ff78b6eefc7: je     0x00007ff78b6ef13c  ;*ifne {reexecute=0 rethrow=0 return_oop=0}
         ││   │                                                ; - groupby.GroupByBenchmark$Generic::apply@64 (line 218)
         ││   │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
         ││   │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  5.57%  ││   │  0x00007ff78b6eefcd: movb   $0x1,0x10(%rax,%r11,1)  ;*bastore {reexecute=0 rethrow=0 return_oop=0}
         ││   │                                                ; - groupby.GroupByBenchmark$Generic::apply@84 (line 221)
         ││   │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
         ││   │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.22%  ││   │  0x00007ff78b6eefd3: inc    %ecx               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
         ││   │                                                ; - groupby.GroupByBenchmark$Generic::apply@85 (line 211)
         ││   │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
         ││   │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.31%  ││   │  0x00007ff78b6eefd5: cmp    0x70(%rsp),%ecx
  0.02%  ││   │  0x00007ff78b6eefd9: jge    0x00007ff78b6eeeab  ;*iconst_0 {reexecute=0 rethrow=0 return_oop=0}
         ││   │                                                ; - groupby.GroupByBenchmark$Generic::apply@9 (line 212)
         ││   │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
         ││   │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  1.51%  ↘│   │  0x00007ff78b6eefdf: test   %r10d,%r10d
          ╰   │  0x00007ff78b6eefe2: jl     0x00007ff78b6eefab  ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Generic::apply@22 (line 213)
                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  5.23%       │  0x00007ff78b6eefe4: mov    %r10d,%edx
  0.40%       │  0x00007ff78b6eefe7: xor    %r8d,%r8d          ;*iload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Generic::apply@25 (line 214)
                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.20%    ↗  │  0x00007ff78b6eefea: mov    0x10(%r9,%rdx,4),%esi  ;*aaload {reexecute=0 rethrow=0 return_oop=0}
           │  │                                                ; - groupby.GroupByBenchmark$Generic::apply@38 (line 214)
           │  │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
           │  │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  1.76%    │  │  0x00007ff78b6eefef: mov    0xc(%r12,%rsi,8),%r11d  ; implicit exception: dispatches to 0x00007ff78b6ef328
  5.68%    │  │  0x00007ff78b6eeff4: imul   0x10(%rbx,%rdx,4),%r8d  ;*imul {reexecute=0 rethrow=0 return_oop=0}
           │  │                                                ; - groupby.GroupByBenchmark$Generic::apply@34 (line 214)
           │  │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
           │  │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.18%    │  │  0x00007ff78b6eeffa: lea    (%r12,%rsi,8),%rdi  ;*aaload {reexecute=0 rethrow=0 return_oop=0}
           │  │                                                ; - groupby.GroupByBenchmark$Generic::apply@38 (line 214)
           │  │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
           │  │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.09%    │  │  0x00007ff78b6eeffe: cmp    %r11d,%ecx
           │  │  0x00007ff78b6ef001: jae    0x00007ff78b6ef09c
  1.02%    │  │  0x00007ff78b6ef007: add    0x10(%rdi,%rcx,4),%r8d  ;*iadd {reexecute=0 rethrow=0 return_oop=0}
           │  │                                                ; - groupby.GroupByBenchmark$Generic::apply@42 (line 214)
           │  │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
           │  │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  6.06%    │  │  0x00007ff78b6ef00c: dec    %edx               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
           │  │                                                ; - groupby.GroupByBenchmark$Generic::apply@45 (line 213)
           │  │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
           │  │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.18%    │  │  0x00007ff78b6ef00e: cmp    %r13d,%edx
           ╰  │  0x00007ff78b6ef011: jg     0x00007ff78b6eefea  ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Generic::apply@22 (line 213)
                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.18%       │  0x00007ff78b6ef013: test   %edx,%edx
            ╭ │  0x00007ff78b6ef015: jle    0x00007ff78b6ef065
  0.80%     │ │  0x00007ff78b6ef017: nopw   0x0(%rax,%rax,1)   ;*iload {reexecute=0 rethrow=0 return_oop=0}
            │ │                                                ; - groupby.GroupByBenchmark$Generic::apply@25 (line 214)
            │ │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
            │ │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  5.66%     │↗│  0x00007ff78b6ef020: mov    0x10(%r9,%rdx,4),%esi  ;*aaload {reexecute=0 rethrow=0 return_oop=0}
            │││                                                ; - groupby.GroupByBenchmark$Generic::apply@38 (line 214)
            │││                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
            │││                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.18%     │││  0x00007ff78b6ef025: mov    0xc(%r12,%rsi,8),%r11d  ; implicit exception: dispatches to 0x00007ff78b6ef328
  0.24%     │││  0x00007ff78b6ef02a: imul   0x10(%rbx,%rdx,4),%r8d  ;*imul {reexecute=0 rethrow=0 return_oop=0}
            │││                                                ; - groupby.GroupByBenchmark$Generic::apply@34 (line 214)
            │││                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
            │││                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  1.43%     │││  0x00007ff78b6ef030: lea    (%r12,%rsi,8),%rdi
  5.79%     │││  0x00007ff78b6ef034: cmp    %r11d,%ecx
            │││  0x00007ff78b6ef037: jae    0x00007ff78b6ef09c
  0.24%     │││  0x00007ff78b6ef039: mov    0xc(%r9,%rdx,4),%r11d
  0.11%     │││  0x00007ff78b6ef03e: mov    0xc(%r12,%r11,8),%esi  ; implicit exception: dispatches to 0x00007ff78b6ef328
  1.09%     │││  0x00007ff78b6ef043: add    0x10(%rdi,%rcx,4),%r8d
  5.86%     │││  0x00007ff78b6ef048: lea    (%r12,%r11,8),%rdi  ;*aaload {reexecute=0 rethrow=0 return_oop=0}
            │││                                                ; - groupby.GroupByBenchmark$Generic::apply@38 (line 214)
            │││                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
            │││                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.13%     │││  0x00007ff78b6ef04c: mov    0xc(%rbx,%rdx,4),%r11d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
            │││                                                ; - groupby.GroupByBenchmark$Generic::apply@33 (line 214)
            │││                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
            │││                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.22%     │││  0x00007ff78b6ef051: imul   %r11d,%r8d         ;*imul {reexecute=0 rethrow=0 return_oop=0}
            │││                                                ; - groupby.GroupByBenchmark$Generic::apply@34 (line 214)
            │││                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
            │││                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  1.54%     │││  0x00007ff78b6ef055: cmp    %esi,%ecx
  0.02%     │││  0x00007ff78b6ef057: jae    0x00007ff78b6ef09a
  5.61%     │││  0x00007ff78b6ef059: add    0x10(%rdi,%rcx,4),%r8d  ;*iadd {reexecute=0 rethrow=0 return_oop=0}
            │││                                                ; - groupby.GroupByBenchmark$Generic::apply@42 (line 214)
            │││                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
            │││                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.27%     │││  0x00007ff78b6ef05e: add    $0xfffffffe,%edx   ;*iinc {reexecute=0 rethrow=0 return_oop=0}
            │││                                                ; - groupby.GroupByBenchmark$Generic::apply@45 (line 213)
            │││                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
            │││                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  0.27%     │││  0x00007ff78b6ef061: test   %edx,%edx
            │╰│  0x00007ff78b6ef063: jg     0x00007ff78b6ef020  ;*iflt {reexecute=0 rethrow=0 return_oop=0}
            │ │                                                ; - groupby.GroupByBenchmark$Generic::apply@22 (line 213)
            │ │                                                ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
            │ │                                                ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
  1.22%     ↘ │  0x00007ff78b6ef065: cmp    $0xffffffff,%edx
              ╰  0x00007ff78b6ef068: jle    0x00007ff78b6eefae
                 0x00007ff78b6ef06e: xchg   %ax,%ax            ;*iload {reexecute=0 rethrow=0 return_oop=0}
                                                               ; - groupby.GroupByBenchmark$Generic::apply@25 (line 214)
                                                               ; - groupby.GroupByBenchmark::threeGeneric@16 (line 300)
                                                               ; - groupby.generated.GroupByBenchmark_threeGeneric_jmhTest::threeGeneric_avgt_jmhStub@19 (line 237)
                 0x00007ff78b6ef070: mov    0x10(%r9,%rdx,4),%esi  ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                               ; - groupby.GroupByBenchmark$Generic::apply@38 (line 214)
....................................................................................................
 94.79%  <total for region 1>

After the change, the vectorized group id assignment is so fast that it is no longer the bottleneck: nearly 70% of the time is now spent in group marking, and only 20% computing group ids. The marking can probably be improved, but it is harder to parallelise than computing group ids.

....[Hottest Region 1]..............................................................................
c2, level 4, groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub, version 470 (189 bytes) 

            0x00007ff5076f08f3: test   %r8d,%r8d
            0x00007ff5076f08f6: je     0x00007ff5076f0b88  ;*ifne {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - groupby.GroupByBenchmark$Three::apply@75 (line 184)
                                                          ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                          ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
            0x00007ff5076f08fc: mov    %r13d,%r11d
            0x00007ff5076f08ff: add    $0xfffffffd,%r11d
            0x00007ff5076f0903: cmp    %r11d,%r13d
            0x00007ff5076f0906: mov    $0x80000000,%r8d
            0x00007ff5076f090c: cmovl  %r8d,%r11d
  0.04%     0x00007ff5076f0910: cmp    $0x1,%r11d
            0x00007ff5076f0914: jle    0x00007ff5076f0b77
            0x00007ff5076f091a: mov    $0x1,%ebx
  0.04%     0x00007ff5076f091f: nop                       ;*aload_0 {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - groupby.GroupByBenchmark$Three::apply@54 (line 183)
                                                          ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                          ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  3.78%  ↗  0x00007ff5076f0920: mov    0x10(%r9,%rbx,4),%r8d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@73 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.09%  │  0x00007ff5076f0925: cmp    %esi,%r8d
         │  0x00007ff5076f0928: jae    0x00007ff5076f09d9
  5.21%  │  0x00007ff5076f092e: movzbl 0x10(%rdx,%r8,1),%r8d  ;*baload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@74 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  3.43%  │  0x00007ff5076f0934: test   %r8d,%r8d
         │  0x00007ff5076f0937: je     0x00007ff5076f0a18  ;*ifne {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@75 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  3.56%  │  0x00007ff5076f093d: mov    0x14(%r9,%rbx,4),%r8d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@73 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.30%  │  0x00007ff5076f0942: mov    %ebx,%ecx
  3.78%  │  0x00007ff5076f0944: inc    %ecx               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@99 (line 183)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
         │  0x00007ff5076f0946: cmp    %esi,%r8d
         │  0x00007ff5076f0949: jae    0x00007ff5076f09db
  3.60%  │  0x00007ff5076f094f: movzbl 0x10(%rdx,%r8,1),%r8d  ;*baload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@74 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  5.34%  │  0x00007ff5076f0955: test   %r8d,%r8d
         │  0x00007ff5076f0958: je     0x00007ff5076f0a1a  ;*ifne {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@75 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  4.73%  │  0x00007ff5076f095e: mov    0x18(%r9,%rbx,4),%r8d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@73 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.04%  │  0x00007ff5076f0963: mov    %ebx,%ecx
  2.99%  │  0x00007ff5076f0965: add    $0x2,%ecx          ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@99 (line 183)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.04%  │  0x00007ff5076f0968: cmp    %esi,%r8d
         │  0x00007ff5076f096b: jae    0x00007ff5076f09db
  4.95%  │  0x00007ff5076f096d: movzbl 0x10(%rdx,%r8,1),%r8d  ;*baload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@74 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  4.04%  │  0x00007ff5076f0973: test   %r8d,%r8d
         │  0x00007ff5076f0976: je     0x00007ff5076f0a1a  ;*ifne {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@75 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  4.38%  │  0x00007ff5076f097c: mov    0x1c(%r9,%rbx,4),%r8d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@73 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.04%  │  0x00007ff5076f0981: mov    %ebx,%ecx
  4.64%  │  0x00007ff5076f0983: add    $0x3,%ecx          ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@99 (line 183)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
         │  0x00007ff5076f0986: cmp    %esi,%r8d
         │  0x00007ff5076f0989: jae    0x00007ff5076f09db
  3.52%  │  0x00007ff5076f098b: movzbl 0x10(%rdx,%r8,1),%r8d  ;*baload {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@74 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  5.03%  │  0x00007ff5076f0991: test   %r8d,%r8d
         │  0x00007ff5076f0994: je     0x00007ff5076f0a1a  ;*ifne {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@75 (line 184)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  5.12%  │  0x00007ff5076f099a: add    $0x4,%ebx          ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - groupby.GroupByBenchmark$Three::apply@99 (line 183)
                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.04%  │  0x00007ff5076f099d: cmp    %r11d,%ebx
         ╰  0x00007ff5076f09a0: jl     0x00007ff5076f0920  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - groupby.GroupByBenchmark$Three::apply@51 (line 183)
                                                          ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                          ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
            0x00007ff5076f09a6: cmp    %r13d,%ebx
            0x00007ff5076f09a9: jge    0x00007ff5076f0641
            0x00007ff5076f09af: nop                       ;*aload_0 {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - groupby.GroupByBenchmark$Three::apply@54 (line 183)
                                                          ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                          ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.09%     0x00007ff5076f09b0: mov    0x10(%r9,%rbx,4),%r8d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - groupby.GroupByBenchmark$Three::apply@73 (line 184)
                                                          ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                          ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
            0x00007ff5076f09b5: cmp    %esi,%r8d
            0x00007ff5076f09b8: jae    0x00007ff5076f0b81
  0.09%     0x00007ff5076f09be: movzbl 0x10(%rdx,%r8,1),%r8d  ;*baload {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - groupby.GroupByBenchmark$Three::apply@74 (line 184)
                                                          ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                          ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
            0x00007ff5076f09c4: test   %r8d,%r8d
            0x00007ff5076f09c7: je     0x00007ff5076f0b8f  ;*ifne {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - groupby.GroupByBenchmark$Three::apply@75 (line 184)
                                                          ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                          ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.17%     0x00007ff5076f09cd: inc    %ebx               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - groupby.GroupByBenchmark$Three::apply@99 (line 183)
                                                          ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                          ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
            0x00007ff5076f09cf: cmp    %r13d,%ebx
            0x00007ff5076f09d2: jl     0x00007ff5076f09b0  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                          ; - groupby.GroupByBenchmark$Three::apply@51 (line 183)
                                                          ; - groupby.GroupByBenchmark::three@24 (line 294)
....................................................................................................
 69.10%  <total for region 1>

....[Hottest Region 2]..............................................................................
c2, level 4, groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub, version 470 (273 bytes) 

                0x00007ff5076f079d: vmovd  %ebx,%xmm2
                0x00007ff5076f07a1: vpshufd $0x0,%xmm2,%xmm2
                0x00007ff5076f07a6: vinserti128 $0x1,%xmm2,%ymm2,%ymm2
                0x00007ff5076f07ac: vmovd  %esi,%xmm3
                0x00007ff5076f07b0: vpshufd $0x0,%xmm3,%xmm3
                0x00007ff5076f07b5: vinserti128 $0x1,%xmm3,%ymm3,%ymm3
                0x00007ff5076f07bb: nopl   0x0(%rax,%rax,1)   ;*aload {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - groupby.GroupByBenchmark$Three::apply@9 (line 181)
                                                              ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                              ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.26%  ↗      0x00007ff5076f07c0: vpmulld 0x10(%r8,%rdx,4),%ymm2,%ymm0
  0.78%  │      0x00007ff5076f07c7: vpaddd 0x10(%r10,%rdx,4),%ymm0,%ymm0
  0.61%  │      0x00007ff5076f07ce: vpmulld %ymm3,%ymm0,%ymm0
  3.43%  │      0x00007ff5076f07d3: vpaddd 0x10(%r11,%rdx,4),%ymm0,%ymm0
  1.74%  │      0x00007ff5076f07da: vmovdqu %ymm0,0x10(%r9,%rdx,4)
  0.35%  │      0x00007ff5076f07e1: movslq %edx,%rdi
  0.22%  │      0x00007ff5076f07e4: vpmulld 0x30(%r8,%rdi,4),%ymm2,%ymm0
  0.87%  │      0x00007ff5076f07eb: vpaddd 0x30(%r10,%rdi,4),%ymm0,%ymm0
  0.30%  │      0x00007ff5076f07f2: vpmulld %ymm3,%ymm0,%ymm0
  0.56%  │      0x00007ff5076f07f7: vpaddd 0x30(%r11,%rdi,4),%ymm0,%ymm0
  0.74%  │      0x00007ff5076f07fe: vmovdqu %ymm0,0x30(%r9,%rdi,4)
  0.04%  │      0x00007ff5076f0805: vpmulld 0x50(%r8,%rdi,4),%ymm2,%ymm0
  1.39%  │      0x00007ff5076f080c: vpaddd 0x50(%r10,%rdi,4),%ymm0,%ymm0
  1.09%  │      0x00007ff5076f0813: vpmulld %ymm3,%ymm0,%ymm0
  2.56%  │      0x00007ff5076f0818: vpaddd 0x50(%r11,%rdi,4),%ymm0,%ymm0
  1.48%  │      0x00007ff5076f081f: vmovdqu %ymm0,0x50(%r9,%rdi,4)
  0.48%  │      0x00007ff5076f0826: vpmulld 0x70(%r8,%rdi,4),%ymm2,%ymm0
  0.65%  │      0x00007ff5076f082d: vpaddd 0x70(%r10,%rdi,4),%ymm0,%ymm0
  0.61%  │      0x00007ff5076f0834: vpmulld %ymm3,%ymm0,%ymm0
  0.39%  │      0x00007ff5076f0839: vpaddd 0x70(%r11,%rdi,4),%ymm0,%ymm0
  1.13%  │      0x00007ff5076f0840: vmovdqu %ymm0,0x70(%r9,%rdi,4)  ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                                                    ; - groupby.GroupByBenchmark$Three::apply@38 (line 181)
                                                    ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                    ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.43%  │      0x00007ff5076f0847: add    $0x20,%edx         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                    ; - groupby.GroupByBenchmark$Three::apply@39 (line 180)
                                                    ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                    ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.30%  │      0x00007ff5076f084a: cmp    %ecx,%edx
         ╰      0x00007ff5076f084c: jl     0x00007ff5076f07c0  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - groupby.GroupByBenchmark$Three::apply@6 (line 180)
                                                              ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                              ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
                0x00007ff5076f0852: mov    %r13d,%ecx
                0x00007ff5076f0855: add    $0xfffffff9,%ecx
                0x00007ff5076f0858: cmp    %ecx,%r13d
                0x00007ff5076f085b: mov    $0x80000000,%edi
                0x00007ff5076f0860: cmovl  %edi,%ecx
                0x00007ff5076f0863: cmp    %ecx,%edx
          ╭     0x00007ff5076f0865: jge    0x00007ff5076f0890
          │     0x00007ff5076f0867: nop                       ;*aload {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - groupby.GroupByBenchmark$Three::apply@9 (line 181)
                                                   ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                   ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.13%   │↗    0x00007ff5076f0868: vpmulld 0x10(%r8,%rdx,4),%ymm2,%ymm0
  0.04%   ││    0x00007ff5076f086f: vpaddd 0x10(%r10,%rdx,4),%ymm0,%ymm0
  0.04%   ││    0x00007ff5076f0876: vpmulld %ymm3,%ymm0,%ymm0
  0.17%   ││    0x00007ff5076f087b: vpaddd 0x10(%r11,%rdx,4),%ymm0,%ymm0
  0.04%   ││    0x00007ff5076f0882: vmovdqu %ymm0,0x10(%r9,%rdx,4)  ;*iastore {reexecute=0 rethrow=0 return_oop=0}
          ││                                                  ; - groupby.GroupByBenchmark$Three::apply@38 (line 181)
          ││                                                  ; - groupby.GroupByBenchmark::three@24 (line 294)
          ││                                                  ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.04%   ││    0x00007ff5076f0889: add    $0x8,%edx          ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ││                                                  ; - groupby.GroupByBenchmark$Three::apply@39 (line 180)
          ││                                                  ; - groupby.GroupByBenchmark::three@24 (line 294)
          ││                                                  ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
          ││    0x00007ff5076f088c: cmp    %ecx,%edx
          │╰    0x00007ff5076f088e: jl     0x00007ff5076f0868  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - groupby.GroupByBenchmark$Three::apply@6 (line 180)
                                                   ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                   ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
          ↘     0x00007ff5076f0890: cmp    %r13d,%edx
            ╭   0x00007ff5076f0893: jge    0x00007ff5076f08b9
            │   0x00007ff5076f0895: data16 xchg %ax,%ax       ;*aload {reexecute=0 rethrow=0 return_oop=0}
                                                 ; - groupby.GroupByBenchmark$Three::apply@9 (line 181)
                                                 ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                 ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.04%     │↗  0x00007ff5076f0898: mov    0x10(%r8,%rdx,4),%ecx  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
            ││                                                ; - groupby.GroupByBenchmark$Three::apply@17 (line 181)
            ││                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
            ││                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
            ││  0x00007ff5076f089d: imul   %ebx,%ecx
            ││  0x00007ff5076f08a0: add    0x10(%r10,%rdx,4),%ecx
            ││  0x00007ff5076f08a5: imul   %esi,%ecx
  0.04%     ││  0x00007ff5076f08a8: add    0x10(%r11,%rdx,4),%ecx
  0.04%     ││  0x00007ff5076f08ad: mov    %ecx,0x10(%r9,%rdx,4)  ;*iastore {reexecute=0 rethrow=0 return_oop=0}
            ││                                                ; - groupby.GroupByBenchmark$Three::apply@38 (line 181)
            ││                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
            ││                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.04%     ││  0x00007ff5076f08b2: inc    %edx               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
            ││                                                ; - groupby.GroupByBenchmark$Three::apply@39 (line 180)
            ││                                                ; - groupby.GroupByBenchmark::three@24 (line 294)
            ││                                                ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
            ││  0x00007ff5076f08b4: cmp    %r13d,%edx
            │╰  0x00007ff5076f08b7: jl     0x00007ff5076f0898  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                 ; - groupby.GroupByBenchmark$Three::apply@6 (line 180)
                                                 ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                 ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.04%     ↘   0x00007ff5076f08b9: mov    0xc(%r12,%rax,8),%r10d  ;*getfield groupCountUpperBound {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - groupby.GroupByBenchmark$Three::apply@59 (line 183)
                                                              ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                              ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
                0x00007ff5076f08be: mov    0x18(%r12,%rax,8),%r8d  ;*getfield numKeys {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - groupby.GroupByBenchmark$Three::apply@55 (line 183)
                                                              ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                              ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.04%         0x00007ff5076f08c3: cmp    %r10d,%r8d
                0x00007ff5076f08c6: jge    0x00007ff5076f0b40
                0x00007ff5076f08cc: mov    0x1c(%r12,%rax,8),%edi  ;*getfield groups {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - groupby.GroupByBenchmark$Three::apply@66 (line 184)
                                                              ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                              ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
  0.04%         0x00007ff5076f08d1: mov    0xc(%r12,%rdi,8),%esi  ;*baload {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - groupby.GroupByBenchmark$Three::apply@74 (line 184)
                                                              ; - groupby.GroupByBenchmark::three@24 (line 294)
                                                              ; - groupby.generated.GroupByBenchmark_three_jmhTest::three_avgt_jmhStub@19 (line 237)
                                                              ; implicit exception: dispatches to 0x00007ff5076f0b40
                0x00007ff5076f08d6: vmovd  %xmm1,%r10d
                0x00007ff5076f08db: mov    0x10(%r12,%r10,8),%r8d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                                              ; - groupby.GroupByBenchmark$Three::apply@73 (line 184)
....................................................................................................
 21.18%  <total for region 2>

@richardstartin richardstartin force-pushed the vectorized-complex-groupby-key-generation branch 4 times, most recently from d784614 to cbe0ffd Compare December 22, 2021 18:34
@codecov-commenter

codecov-commenter commented Dec 22, 2021

Codecov Report

Merging #7949 (76084cd) into master (b6eeaf3) will decrease coverage by 43.66%.
The diff coverage is 90.00%.

❗ Current head 76084cd differs from pull request most recent head 10670dd. Consider uploading reports for the commit 10670dd to get more accurate results

@@              Coverage Diff              @@
##             master    #7949       +/-   ##
=============================================
- Coverage     71.20%   27.54%   -43.67%     
=============================================
  Files          1593     1584        -9     
  Lines         82477    82144      -333     
  Branches      12305    12269       -36     
=============================================
- Hits          58729    22623    -36106     
- Misses        19788    57461    +37673     
+ Partials       3960     2060     -1900     
| Flag | Coverage Δ |
|---|---|
| integration1 | ? |
| integration2 | 27.54% <90.00%> (+0.05%) ⬆️ |
| unittests1 | ? |
| unittests2 | ? |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ |
|---|---|
| ...tion/groupby/DictionaryBasedGroupKeyGenerator.java | 43.47% <90.00%> (-46.60%) ⬇️ |
| .../java/org/apache/pinot/spi/utils/BooleanUtils.java | 0.00% <0.00%> (-100.00%) ⬇️ |
| ...ava/org/apache/pinot/spi/config/table/FSTType.java | 0.00% <0.00%> (-100.00%) ⬇️ |
| ...ava/org/apache/pinot/spi/data/MetricFieldSpec.java | 0.00% <0.00%> (-100.00%) ⬇️ |
| ...va/org/apache/pinot/spi/utils/BigDecimalUtils.java | 0.00% <0.00%> (-100.00%) ⬇️ |
| ...java/org/apache/pinot/common/tier/TierFactory.java | 0.00% <0.00%> (-100.00%) ⬇️ |
| ...a/org/apache/pinot/spi/config/table/TableType.java | 0.00% <0.00%> (-100.00%) ⬇️ |
| .../org/apache/pinot/spi/data/DimensionFieldSpec.java | 0.00% <0.00%> (-100.00%) ⬇️ |
| .../org/apache/pinot/spi/data/readers/FileFormat.java | 0.00% <0.00%> (-100.00%) ⬇️ |
| ...org/apache/pinot/spi/config/table/QuotaConfig.java | 0.00% <0.00%> (-100.00%) ⬇️ |
| ... and 1136 more | |

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6eeaf3...10670dd. Read the comment docs.

Member

@kishoreg kishoreg left a comment

Great idea!

@richardstartin richardstartin marked this pull request as ready for review December 22, 2021 22:29
@richardstartin richardstartin force-pushed the vectorized-complex-groupby-key-generation branch 2 times, most recently from bf263db to 10670dd Compare December 23, 2021 13:53
@siddharthteotia
Contributor

I am guessing that vpaddd and the similar instructions in the second region of the JIT-compiled code are SIMD instructions, which would show that the new code is being auto-vectorized. Is this correct?

@richardstartin
Member Author

Yes, the multiplications (vpmulld) and additions (vpaddd) are vectorised, so after the transformation each instruction operates on 8 ints at a time (one 256-bit ymm register), which explains the speedup.
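
As a rough illustration of the split (a sketch, not the code from this PR), the two loops for three group-by columns might look like the following. The class name, method names, the column ordering in the multiply-add chain, and the `boolean[]` used to mark groups are all assumptions made for the example. The first loop is straight-line integer arithmetic over arrays, which is the shape that lets C2 emit the `vpmulld`/`vpaddd` sequence seen in region 2; the second loop keeps the data-dependent branch and stays scalar, as in region 1.

```java
import java.util.Arrays;

// Hypothetical sketch: names and column ordering are assumptions, not Pinot's API.
class SplitGroupIdSketch {

  // Loop 1: branch-free arithmetic only, so the JIT may be able to vectorise it.
  // Combines three dictionary ids into one group id using Horner-style multiply-adds.
  static void computeGroupIds(int[] ids0, int[] ids1, int[] ids2,
                              int card0, int card1, int[] out, int n) {
    for (int i = 0; i < n; i++) {
      out[i] = (ids2[i] * card1 + ids1[i]) * card0 + ids0[i];
    }
  }

  // Loop 2: marks which group ids occur in the block; returns how many were new.
  // The data-dependent branch here is why this part is kept in its own loop.
  static int markGroups(int[] groupIds, int n, boolean[] seen) {
    int added = 0;
    for (int i = 0; i < n; i++) {
      int groupId = groupIds[i];
      if (!seen[groupId]) {
        seen[groupId] = true;
        added++;
      }
    }
    return added;
  }

  public static void main(String[] args) {
    int card0 = 3, card1 = 2, card2 = 2;
    int[] ids0 = {0, 1, 2, 0};
    int[] ids1 = {0, 1, 0, 0};
    int[] ids2 = {1, 0, 1, 1};
    int[] out = new int[ids0.length];

    computeGroupIds(ids0, ids1, ids2, card0, card1, out, out.length);
    boolean[] seen = new boolean[card0 * card1 * card2];
    int distinct = markGroups(out, out.length, seen);

    System.out.println(Arrays.toString(out) + " distinct=" + distinct);
  }
}
```

Whether the first loop is actually vectorised depends on the JVM version and hardware, so it is worth confirming with a profiler such as perfasm rather than assuming it.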

@richardstartin richardstartin force-pushed the vectorized-complex-groupby-key-generation branch from 10670dd to 47108d0 Compare December 24, 2021 01:22
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

LGTM

@siddharthteotia siddharthteotia merged commit f2f8e38 into apache:master Jan 4, 2022