Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BoundFilter optimizations, and related interface changes. #2727

Merged
merged 1 commit into from
Mar 27, 2016

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Mar 24, 2016

BoundFilter:

  • For lexicographic bounds, use bitmapIndex.getIndex to find the start and end points,
    then union all bitmaps between those points.
  • For alphanumeric bounds, iterate through dimValues, and union all bitmaps for values
    matching the predicate.
  • Change behavior for nulls: it used to be that the BoundFilter would never match nulls,
    now it matches nulls if "" is allowed by the lower limit and not excluded by the
    upper limit.

Interface changes:

  • BitmapIndex: add int getIndex(value) to make it possible to get the index for a
    value without retrieving the bitmap.
  • BitmapIndex: remove ImmutableBitmap getBitmap(value), change callers to getBitmap(getIndex(value)).
  • BitmapIndexSelector: allow retrieving the underlying BitmapIndex through getBitmapIndex.
  • Clarified contract of indexOf in Indexed, GenericIndexed.

Also added tests for SelectorFilter, NotFilter, and BoundFilter.

@gianm gianm added this to the 0.9.1 milestone Mar 24, 2016
@gianm
Copy link
Contributor Author

gianm commented Mar 24, 2016

benchmarks:

roaring - new code
Benchmark                                          (cardinality)  Mode  Cnt       Score       Error  Units
BoundFilterBenchmark.matchEverythingAlphaNumeric            1000  avgt   10     922.892 ±    49.307  us/op
BoundFilterBenchmark.matchEverythingAlphaNumeric          100000  avgt   10   55084.688 ±  2612.769  us/op
BoundFilterBenchmark.matchEverythingAlphaNumeric         1000000  avgt   10  496876.650 ± 24082.952  us/op
BoundFilterBenchmark.matchEverythingLexicographic           1000  avgt   10     685.352 ±    21.630  us/op
BoundFilterBenchmark.matchEverythingLexicographic         100000  avgt   10   15020.188 ±   677.871  us/op
BoundFilterBenchmark.matchEverythingLexicographic        1000000  avgt   10  138014.970 ±  7010.974  us/op
BoundFilterBenchmark.matchHalfAlphaNumeric                  1000  avgt   10     419.717 ±    14.421  us/op
BoundFilterBenchmark.matchHalfAlphaNumeric                100000  avgt   10   44819.571 ±  2382.226  us/op
BoundFilterBenchmark.matchHalfAlphaNumeric               1000000  avgt   10  358544.263 ± 12146.582  us/op
BoundFilterBenchmark.matchHalfLexicographic                 1000  avgt   10     229.177 ±     9.528  us/op
BoundFilterBenchmark.matchHalfLexicographic               100000  avgt   10   25188.704 ±   626.630  us/op
BoundFilterBenchmark.matchHalfLexicographic              1000000  avgt   10   72892.182 ±  2968.422  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric               1000  avgt   10     194.890 ±     7.406  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric             100000  avgt   10   20405.402 ±   932.016  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric            1000000  avgt   10  197780.159 ±  8140.001  us/op
BoundFilterBenchmark.matchNothingLexicographic              1000  avgt   10       1.611 ±     0.063  us/op
BoundFilterBenchmark.matchNothingLexicographic            100000  avgt   10       2.663 ±     0.156  us/op
BoundFilterBenchmark.matchNothingLexicographic           1000000  avgt   10       3.264 ±     0.172  us/op

roaring - old code
Benchmark                                          (cardinality)  Mode  Cnt        Score       Error  Units
BoundFilterBenchmark.matchEverythingAlphaNumeric            1000  avgt   10     1813.263 ±    51.723  us/op
BoundFilterBenchmark.matchEverythingAlphaNumeric          100000  avgt   10   203253.535 ±  7073.590  us/op
BoundFilterBenchmark.matchEverythingAlphaNumeric         1000000  avgt   10  2608738.649 ± 89007.817  us/op
BoundFilterBenchmark.matchEverythingLexicographic           1000  avgt   10     2010.454 ±    65.619  us/op
BoundFilterBenchmark.matchEverythingLexicographic         100000  avgt   10   231639.118 ± 12078.642  us/op
BoundFilterBenchmark.matchEverythingLexicographic        1000000  avgt   10  2608982.848 ± 41008.018  us/op
BoundFilterBenchmark.matchHalfAlphaNumeric                  1000  avgt   10      917.824 ±    45.221  us/op
BoundFilterBenchmark.matchHalfAlphaNumeric                100000  avgt   10   129332.609 ±  3458.142  us/op
BoundFilterBenchmark.matchHalfAlphaNumeric               1000000  avgt   10  1397767.744 ± 37115.753  us/op
BoundFilterBenchmark.matchHalfLexicographic                 1000  avgt   10     1009.587 ±    37.842  us/op
BoundFilterBenchmark.matchHalfLexicographic               100000  avgt   10   129598.064 ±  6365.075  us/op
BoundFilterBenchmark.matchHalfLexicographic              1000000  avgt   10  1502001.883 ± 59413.536  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric               1000  avgt   10      202.195 ±     9.855  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric             100000  avgt   10    20334.395 ±   694.945  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric            1000000  avgt   10   200217.915 ±  3852.198  us/op
BoundFilterBenchmark.matchNothingLexicographic              1000  avgt   10      289.969 ±     5.781  us/op
BoundFilterBenchmark.matchNothingLexicographic            100000  avgt   10    28739.014 ±  1620.569  us/op
BoundFilterBenchmark.matchNothingLexicographic           1000000  avgt   10   288558.449 ±  8818.323  us/op

concise - new code
Benchmark                                          (cardinality)  Mode  Cnt        Score        Error  Units
BoundFilterBenchmark.matchEverythingAlphaNumeric            1000  avgt   10      825.754 ±     34.183  us/op
BoundFilterBenchmark.matchEverythingAlphaNumeric          100000  avgt   10   162995.602 ±   8164.037  us/op
BoundFilterBenchmark.matchEverythingAlphaNumeric         1000000  avgt   10  1730761.586 ±  65781.270  us/op
BoundFilterBenchmark.matchEverythingLexicographic           1000  avgt   10      706.111 ±     31.862  us/op
BoundFilterBenchmark.matchEverythingLexicographic         100000  avgt   10   146757.525 ±  10171.468  us/op
BoundFilterBenchmark.matchEverythingLexicographic        1000000  avgt   10  1501751.336 ± 175176.476  us/op
BoundFilterBenchmark.matchHalfLexicographic                 1000  avgt   10      341.310 ±     25.054  us/op
BoundFilterBenchmark.matchHalfLexicographic               100000  avgt   10    45627.867 ±   2141.370  us/op
BoundFilterBenchmark.matchHalfLexicographic              1000000  avgt   10   500524.691 ±  39671.188  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric               1000  avgt   10      127.806 ±      6.626  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric             100000  avgt   10    13703.036 ±    758.373  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric            1000000  avgt   10   122963.784 ±   6927.395  us/op
BoundFilterBenchmark.matchNothingLexicographic              1000  avgt   10        1.646 ±      0.080  us/op
BoundFilterBenchmark.matchNothingLexicographic            100000  avgt   10        2.673 ±      0.179  us/op
BoundFilterBenchmark.matchNothingLexicographic           1000000  avgt   10        3.268 ±      0.147  us/op

concise - old code
Benchmark                                          (cardinality)  Mode  Cnt        Score        Error  Units
BoundFilterBenchmark.matchEverythingAlphaNumeric            1000  avgt   10     1599.303 ±     45.809  us/op
BoundFilterBenchmark.matchEverythingAlphaNumeric          100000  avgt   10   345393.965 ±  14189.766  us/op
BoundFilterBenchmark.matchEverythingAlphaNumeric         1000000  avgt   10  3684987.846 ±  84406.901  us/op
BoundFilterBenchmark.matchEverythingLexicographic           1000  avgt   10     1976.324 ±     54.248  us/op
BoundFilterBenchmark.matchEverythingLexicographic         100000  avgt   10   357991.768 ±   7785.695  us/op
BoundFilterBenchmark.matchEverythingLexicographic        1000000  avgt   10  3582925.614 ± 164209.427  us/op
BoundFilterBenchmark.matchHalfLexicographic                 1000  avgt   10     1186.107 ±     74.529  us/op
BoundFilterBenchmark.matchHalfLexicographic               100000  avgt   10   167451.872 ±   9642.969  us/op
BoundFilterBenchmark.matchHalfLexicographic              1000000  avgt   10  1890634.368 ±  61274.830  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric               1000  avgt   10      128.502 ±      8.668  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric             100000  avgt   10    13224.496 ±    931.381  us/op
BoundFilterBenchmark.matchNothingAlphaNumeric            1000000  avgt   10   128046.164 ±   5413.171  us/op
BoundFilterBenchmark.matchNothingLexicographic              1000  avgt   10      286.863 ±     12.302  us/op
BoundFilterBenchmark.matchNothingLexicographic            100000  avgt   10    28887.794 ±   1434.066  us/op
BoundFilterBenchmark.matchNothingLexicographic           1000000  avgt   10   286646.487 ±  13587.081  us/op

@fjy
Copy link
Contributor

fjy commented Mar 25, 2016

@gianm should we change default to use roaring as it seems to be much better in most cases

}
}
);
}

private boolean doesMatch(String input)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this logic just copypasta of what was there before?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, no, it is not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly yes, except for the null handling.

@fjy
Copy link
Contributor

fjy commented Mar 25, 2016

👍

@gianm
Copy link
Contributor Author

gianm commented Mar 25, 2016

Some of the new tests depend on #2737 to work.

BoundFilter:

- For lexicographic bounds, use bitmapIndex.getIndex to find the start and end points,
  then union all bitmaps between those points.
- For alphanumeric bounds, iterate through dimValues, and union all bitmaps for values
  matching the predicate.
- Change behavior for nulls: it used to be that the BoundFilter would never match nulls,
  now it matches nulls if "" is allowed by the lower limit and not excluded by the
  upper limit.

Interface changes:

- BitmapIndex: add `int getIndex(value)` to make it possible to get the index for a
  value without retrieving the bitmap.
- BitmapIndex: remove `ImmutableBitmap getBitmap(value)`, change callers to `getBitmap(getIndex(value))`.
- BitmapIndexSelector: allow retrieving the underlying BitmapIndex through getBitmapIndex.
- Clarified contract of indexOf in Indexed, GenericIndexed.

Also added tests for SelectorFilter, NotFilter, and BoundFilter.
@@ -111,6 +112,17 @@ public BitmapFactory getBitmapFactory()
}

@Override
public BitmapIndex getBitmapIndex(String dimension)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should public BitmapIndex getBitmapIndex(String dimension) and public ImmutableBitmap getBitmapIndex(String dimension, String value) be named differently? Maybe slightly confusing to have functions with a different return type with the same name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I didn't want to touch too many files in this patch (all of the other filters), but I also didn't want to use a different name for this method, since it's the one that really should be called getBitmapIndex.

I am indifferent though & happy to go with whatever people think is best

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'm fine with this PR as-is, that rename if desired could be handled in later patch

@jon-wei
Copy link
Contributor

jon-wei commented Mar 25, 2016

👍 after travis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants