Consider using RoaringBitmapWriter for bitmap construction #6764

richardstartin · 2018-12-20T15:08:38Z

I noticed you are investigating using RoaringBitmap 0.7.30. If you do upgrade, it is worth considering a different mechanism for building your bitmaps, RoaringBitmapWriter, which favours ordered insertions: values are buffered into a container, and the container is appended to the bitmap as late as possible (just before the bitmap is queried, or when the next multiple of 2^16 is crossed). There is no need to manually run-optimise bitmaps built this way, because each container is run-optimised when it is appended to the bitmap.
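To make the buffering idea concrete, here is a minimal, self-contained sketch in plain Java. It is a toy model and not the RoaringBitmap implementation: the class `OrderedWriterSketch`, its nested `Chunk` type, and the use of `java.util.BitSet` as the buffer are all assumptions made for illustration. It only shows values being buffered per 2^16-wide range and appended when the high 16 bits change; it does not model run optimisation, and it simply rejects out-of-order keys.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model of an append-only bitmap writer; NOT the RoaringBitmap implementation.
final class OrderedWriterSketch {

  // One finished chunk: the high 16 bits shared by its values, plus their low 16 bits.
  static final class Chunk {
    final int key;
    final BitSet lowBits;

    Chunk(int key, BitSet lowBits) {
      this.key = key;
      this.lowBits = lowBits;
    }
  }

  private final List<Chunk> chunks = new ArrayList<>();
  private BitSet buffer = new BitSet(1 << 16);
  private int currentKey = 0;

  // Buffers a value. This sketch simply rejects keys that go backwards.
  void add(int value) {
    int key = value >>> 16; // high 16 bits, treated as unsigned
    if (key < currentKey) {
      throw new IllegalArgumentException("sketch supports ordered insertion only");
    }
    if (key != currentKey) {
      flush(); // crossed a multiple of 2^16: append the finished chunk
      currentKey = key;
    }
    buffer.set(value & 0xFFFF);
  }

  // Appends the buffered chunk, if any, to the end of the chunk list.
  void flush() {
    if (!buffer.isEmpty()) {
      chunks.add(new Chunk(currentKey, buffer));
      buffer = new BitSet(1 << 16);
    }
  }

  // Flushes pending values and returns the chunks in key order.
  // In this sketch, call it once, after the last add.
  List<Chunk> get() {
    flush();
    return chunks;
  }
}
```

Adding 1, 2, 65535, 65536 and 200000 in that order produces three chunks with keys 0, 1 and 3. Each chunk is appended exactly once and is never searched for again.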

I am not a druid user and am unlikely to become one soon, so this PR is intended as an FYI about the feature only.

I ran the BitmapIterationBenchmark.constructAndIter on JDK8 on Ubuntu 16.0.4 at 7a09cde4de1953eee75c5033e863cfde8f94d6c1 and got:

Benchmark                                     (bitmapAlgo)  (n)  (prob)   (size)  Mode  Cnt          Score          Error  Units
BitmapIterationBenchmark.constructAndIter           bitset  N/A     0.0  1000000  avgt    5         11.958 ±        0.411  ns/op
BitmapIterationBenchmark.constructAndIter           bitset  N/A   0.001  1000000  avgt    5      55820.663 ±     4765.600  ns/op
BitmapIterationBenchmark.constructAndIter           bitset  N/A     0.1  1000000  avgt    5     853821.933 ±    10916.693  ns/op
BitmapIterationBenchmark.constructAndIter           bitset  N/A     0.5  1000000  avgt    5    3014089.931 ±    65283.409  ns/op
BitmapIterationBenchmark.constructAndIter           bitset  N/A    0.99  1000000  avgt    5    5628379.542 ±   227488.488  ns/op
BitmapIterationBenchmark.constructAndIter           bitset  N/A     1.0  1000000  avgt    5    5612304.605 ±    54199.692  ns/op
BitmapIterationBenchmark.constructAndIter          concise  N/A     0.0  1000000  avgt    5          8.073 ±        0.178  ns/op
BitmapIterationBenchmark.constructAndIter          concise  N/A   0.001  1000000  avgt    5      27473.710 ±      626.528  ns/op
BitmapIterationBenchmark.constructAndIter          concise  N/A     0.1  1000000  avgt    5    3635751.625 ±    56888.246  ns/op
BitmapIterationBenchmark.constructAndIter          concise  N/A     0.5  1000000  avgt    5    9798233.678 ±   237069.198  ns/op
BitmapIterationBenchmark.constructAndIter          concise  N/A    0.99  1000000  avgt    5    9588921.943 ±   214705.602  ns/op
BitmapIterationBenchmark.constructAndIter          concise  N/A     1.0  1000000  avgt    5    8077899.901 ±   118088.071  ns/op
BitmapIterationBenchmark.constructAndIter          roaring  N/A     0.0  1000000  avgt    5        131.791 ±        2.237  ns/op
BitmapIterationBenchmark.constructAndIter          roaring  N/A   0.001  1000000  avgt    5      46860.367 ±     1753.583  ns/op
BitmapIterationBenchmark.constructAndIter          roaring  N/A     0.1  1000000  avgt    5    1709465.928 ±    38854.875  ns/op
BitmapIterationBenchmark.constructAndIter          roaring  N/A     0.5  1000000  avgt    5    6898408.274 ±   210501.998  ns/op
BitmapIterationBenchmark.constructAndIter          roaring  N/A    0.99  1000000  avgt    5   13340397.558 ±   283832.841  ns/op
BitmapIterationBenchmark.constructAndIter          roaring  N/A     1.0  1000000  avgt    5   13415893.194 ±   170437.084  ns/op

At b313193c81ed868a9afe04c658306705f63daaef I got:

Benchmark                                  (bitmapAlgo)  (prob)   (size)  Mode  Cnt         Score         Error  Units
BitmapIterationBenchmark.constructAndIter        bitset     0.0  1000000  avgt    5        12.665 ±       1.104  ns/op
BitmapIterationBenchmark.constructAndIter        bitset   0.001  1000000  avgt    5     74471.073 ±   54445.042  ns/op
BitmapIterationBenchmark.constructAndIter        bitset     0.1  1000000  avgt    5    887366.201 ±   35143.327  ns/op
BitmapIterationBenchmark.constructAndIter        bitset     0.5  1000000  avgt    5   3166248.403 ±  495669.632  ns/op
BitmapIterationBenchmark.constructAndIter        bitset    0.99  1000000  avgt    5   6324809.163 ± 1012027.080  ns/op
BitmapIterationBenchmark.constructAndIter        bitset     1.0  1000000  avgt    5   5913067.177 ±  132629.211  ns/op
BitmapIterationBenchmark.constructAndIter       concise     0.0  1000000  avgt    5         8.068 ±       0.115  ns/op
BitmapIterationBenchmark.constructAndIter       concise   0.001  1000000  avgt    5     27547.146 ±     546.018  ns/op
BitmapIterationBenchmark.constructAndIter       concise     0.1  1000000  avgt    5   3635772.683 ±   56079.798  ns/op
BitmapIterationBenchmark.constructAndIter       concise     0.5  1000000  avgt    5  10400194.474 ±  147495.368  ns/op
BitmapIterationBenchmark.constructAndIter       concise    0.99  1000000  avgt    5   9409295.891 ±  122748.484  ns/op
BitmapIterationBenchmark.constructAndIter       concise     1.0  1000000  avgt    5   8641773.847 ±  193212.416  ns/op
BitmapIterationBenchmark.constructAndIter       roaring     0.0  1000000  avgt    5        75.142 ±       1.408  ns/op
BitmapIterationBenchmark.constructAndIter       roaring   0.001  1000000  avgt    5     13348.323 ±     205.987  ns/op
BitmapIterationBenchmark.constructAndIter       roaring     0.1  1000000  avgt    5   1300962.745 ±   30107.110  ns/op
BitmapIterationBenchmark.constructAndIter       roaring     0.5  1000000  avgt    5   5255974.593 ±  133759.112  ns/op
BitmapIterationBenchmark.constructAndIter       roaring    0.99  1000000  avgt    5  11122438.742 ±  168793.197  ns/op
BitmapIterationBenchmark.constructAndIter       roaring     1.0  1000000  avgt    5  11555370.664 ±  141606.255  ns/op

These quick results are quite noisy, so they need more careful consideration on your part. I also don't know how likely out-of-order insertions are in a typical Druid workload; in that case there would be no penalty for using this abstraction, but no benefit either.

b-slim · 2018-12-20T18:21:16Z

@richardstartin can you please fix

Selected: Unused declaration (1)
processing/src/main/java/org/apache/druid/collections/bitmap
WrappedRoaringBitmap.java (1)
54: WrappedRoaringBitmap() Parameter compressRunOnSerialization is not used in either this method or any of its derived methods

to have a clean build. Thanks

richardstartin · 2018-12-20T18:58:30Z

@b-slim this is more of a working heads-up (i.e. I think you can probably build your roaring bitmaps faster) than something I would expect you to merge. Because run compression can be applied incrementally, you don't need to do it when you serialise the bitmaps; that change ripples out fairly quickly to the "BitmapSerdeFactory" JSON formats, and I can't offer a view on how they should be kept backward compatible. If you update, please feel free to use this commit as a reference.

clintropolis

@richardstartin Thanks for taking interest enough to relay this information! Outside of benchmarks, I believe all of the bitmaps we construct are done in order, so if I understand you correctly there would be little benefit to making this change, and I suspect the main difference in the original benchmarks you collected was the performance improvements from the version bump, which I saw as well.

That said, if there is no harm either, I don't see any reason not to make the switch, and it could be useful in the event we find ourselves in need of creating out-of-order bitmaps. But maybe we should repeat the benchmarks against the latest master to ensure there is no penalty for this change?

I am unsure what the best thing to do is with regard to the compressRunOnSerialization parameter. I can't imagine the resulting segment size is very viable without it, but yes, it's very hard and annoying to remove things like that.

richardstartin · 2019-01-02T08:22:55Z

@clintropolis the change optimises ordered insertion, because it avoids binary search on the high 16 bits of the bitmap. The change is useless for unordered insertions.
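As a rough illustration of the search being avoided, the sketch below contrasts the two ways of locating the container for a value. It is a simplified model and not RoaringBitmap code: the class `HighKeyLookupSketch`, its method names, and the `char[]` key array are hypothetical. The general path binary-searches the sorted high-16-bit keys; the ordered path only compares against the last key, since an in-order value belongs either to the last container or to a new one appended after it.

```java
import java.util.Arrays;

// Simplified model of locating a container by the high 16 bits of a value.
final class HighKeyLookupSketch {

  // General path: binary search over the sorted keys of the containers built so far.
  // Returns the index if the key is present, otherwise -(insertionPoint) - 1.
  static int findByBinarySearch(char[] sortedKeys, int size, int value) {
    char key = (char) (value >>> 16);
    return Arrays.binarySearch(sortedKeys, 0, size, key);
  }

  // Ordered path: an in-order value can only match the last key; no search is needed.
  static boolean belongsToLastContainer(char[] sortedKeys, int size, int value) {
    return size > 0 && sortedKeys[size - 1] == (char) (value >>> 16);
  }
}
```

With keys {0, 1, 3}, the value 200000 (high bits 3) is found at index 2 by the search, while the ordered path reaches the same conclusion with a single comparison.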

clintropolis · 2019-01-02T08:28:15Z

Ah, I misread the PR description, 👍

richardstartin · 2019-01-02T11:57:52Z

@clintropolis You were right about where most of the performance came from. I cleaned up the commits a bit and ran against master.

Here's the benchmark at 114a9fc38feda5f85799d24889007bc572d04dea (master), with RoaringBitmap 0.7.30:

Benchmark                                  (bitmapAlgo)  (prob)   (size)  Mode  Cnt         Score         Error  Units
BitmapIterationBenchmark.constructAndIter       roaring     0.0  1000000  avgt    5       130.624 ±       2.645  ns/op
BitmapIterationBenchmark.constructAndIter       roaring   0.001  1000000  avgt    5     17553.925 ±    1177.041  ns/op
BitmapIterationBenchmark.constructAndIter       roaring     0.1  1000000  avgt    5   1704213.394 ±   51487.534  ns/op
BitmapIterationBenchmark.constructAndIter       roaring     0.5  1000000  avgt    5   6831889.531 ±  146377.716  ns/op
BitmapIterationBenchmark.constructAndIter       roaring    0.99  1000000  avgt    5  13106844.584 ±  661339.555  ns/op
BitmapIterationBenchmark.constructAndIter       roaring     1.0  1000000  avgt    5  15204652.686 ± 1441562.179  ns/op

Here's a slight improvement on this branch at 1afb602de27d31367440b1cccc86ec799c59dc4c, owing to reduced construction times:

Benchmark                                  (bitmapAlgo)  (prob)   (size)  Mode  Cnt         Score        Error  Units
BitmapIterationBenchmark.constructAndIter       roaring     0.0  1000000  avgt    5       189.940 ±      3.313  ns/op
BitmapIterationBenchmark.constructAndIter       roaring   0.001  1000000  avgt    5     13719.152 ±     42.376  ns/op
BitmapIterationBenchmark.constructAndIter       roaring     0.1  1000000  avgt    5   1268587.758 ±  42864.087  ns/op
BitmapIterationBenchmark.constructAndIter       roaring     0.5  1000000  avgt    5   4658899.187 ± 163463.751  ns/op
BitmapIterationBenchmark.constructAndIter       roaring    0.99  1000000  avgt    5  10556288.928 ± 212975.696  ns/op
BitmapIterationBenchmark.constructAndIter       roaring     1.0  1000000  avgt    5  11036729.972 ± 346125.258  ns/op

PS: this has been squashed and force-pushed, so please take another look.

richardstartin · 2019-01-02T12:09:54Z

In fact, here's a benchmark to isolate the construction time from the iteration time:

  @Benchmark
  public Object construct(ConstructAndIterState state)
  {
    int dataSize = state.dataSize;
    int[] data = state.data;
    // Build the bitmap only; nothing is iterated afterwards.
    MutableBitmap mutableBitmap = factory.makeEmptyMutableBitmap();
    for (int i = 0; i < dataSize; i++) {
      mutableBitmap.add(data[i]);
    }
    // Return the result so JMH cannot eliminate the construction as dead code.
    return factory.makeImmutableBitmap(mutableBitmap);
  }

and at 114a9fc38feda5f85799d24889007bc572d04dea (master) I get

Benchmark                           (bitmapAlgo)  (prob)   (size)  Mode  Cnt         Score        Error  Units
BitmapIterationBenchmark.construct       roaring     0.0  1000000  avgt    5       124.236 ±      6.597  ns/op
BitmapIterationBenchmark.construct       roaring   0.001  1000000  avgt    5     14045.045 ±   1026.400  ns/op
BitmapIterationBenchmark.construct       roaring     0.1  1000000  avgt    5   1317274.340 ± 153275.511  ns/op
BitmapIterationBenchmark.construct       roaring     0.5  1000000  avgt    5   7415001.388 ± 377532.457  ns/op
BitmapIterationBenchmark.construct       roaring    0.99  1000000  avgt    5  10687372.213 ± 860813.095  ns/op
BitmapIterationBenchmark.construct       roaring     1.0  1000000  avgt    5  10790794.579 ± 961663.105  ns/op

And at 1afb602de27d31367440b1cccc86ec799c59dc4c (this PR) I get:

Benchmark                           (bitmapAlgo)  (prob)   (size)  Mode  Cnt        Score        Error  Units
BitmapIterationBenchmark.construct       roaring     0.0  1000000  avgt    5      187.924 ±     12.126  ns/op
BitmapIterationBenchmark.construct       roaring   0.001  1000000  avgt    5    12674.625 ±    267.506  ns/op
BitmapIterationBenchmark.construct       roaring     0.1  1000000  avgt    5   868981.551 ±  21139.938  ns/op
BitmapIterationBenchmark.construct       roaring     0.5  1000000  avgt    5  3391372.332 ±  86345.168  ns/op
BitmapIterationBenchmark.construct       roaring    0.99  1000000  avgt    5  7314761.225 ± 335907.952  ns/op
BitmapIterationBenchmark.construct       roaring     1.0  1000000  avgt    5  7147180.460 ± 107730.956  ns/op

So there's a good boost to be had from avoiding the binary searches, and it sounds like druid does write its bitmaps in an ordered fashion.

clintropolis reviewed Jan 2, 2019

clintropolis approved these changes Jan 2, 2019

use RoaringBitmapWriter for RoaringBitmap construction

1afb602

clintropolis approved these changes Jan 9, 2019

clintropolis merged commit 9909761 into apache:master Jan 9, 2019

jon-wei added this to the 0.14.0 milestone Feb 20, 2019

richardstartin mentioned this pull request Nov 5, 2019

optimize numeric column null value checking for low filter selectivity (more rows) #8822

Merged

richardstartin mentioned this pull request May 25, 2020

Druid 0.13 ~ 0.18 version roaringbitmap benchmark becomes slow #9920

Closed
