Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two-stage filtering #3018

Merged
merged 2 commits into from
Jun 22, 2016
Merged

Two-stage filtering #3018

merged 2 commits into from
Jun 22, 2016

Conversation

jon-wei
Copy link
Contributor

@jon-wei jon-wei commented May 25, 2016

Implements the proposal in #2742

This PR adds support for filtering without bitmap indexes, by adding a new filtering step during Cursor iteration in the QueryableIndexStorageAdapter.

The linked proposal contains details about the approach taken, with the exception that filter match result caching is not implemented (can be a later PR).


Benchmarks, using the benchmarks in #2875

Master:

GroupByBenchmark.queryMultiQueryableIndex                  N/A              4                N/A            100000       N/A           basic.A          N/A  avgt   50   420317.283 ±  9960.026  us/op
GroupByBenchmark.querySingleIncrementalIndex               N/A              4                N/A            100000       N/A           basic.A          N/A  avgt   50   237684.936 ±  4982.999  us/op
GroupByBenchmark.querySingleQueryableIndex                 N/A              4                N/A            100000       N/A           basic.A          N/A  avgt   50   192516.042 ±  1247.105  us/op
SearchBenchmark.queryMultiQueryableIndex                  1000              1                N/A            750000       N/A           basic.A          N/A  avgt   50    18569.254 ±   130.116  us/op
SearchBenchmark.querySingleIncrementalIndex               1000              1                N/A            750000       N/A           basic.A          N/A  avgt   50   252040.937 ±  5094.427  us/op
SearchBenchmark.querySingleQueryableIndex                 1000              1                N/A            750000       N/A           basic.A          N/A  avgt   50    18522.747 ±    96.640  us/op
SelectBenchmark.queryIncrementalIndex                      N/A              1               1000             25000       N/A           basic.A          N/A  avgt   50    59077.585 ±   775.733  us/op
SelectBenchmark.queryMultiQueryableIndex                   N/A              1               1000             25000       N/A           basic.A          N/A  avgt   50    84701.323 ±   591.074  us/op
SelectBenchmark.queryQueryableIndex                        N/A              1               1000             25000       N/A           basic.A          N/A  avgt   50    86754.784 ±   536.997  us/op
TimeseriesBenchmark.queryFilteredSingleQueryableIndex      N/A              1                N/A            750000       N/A           basic.A          N/A  avgt   50    15513.263 ±    34.824  us/op
TimeseriesBenchmark.queryMultiQueryableIndex               N/A              1                N/A            750000       N/A           basic.A          N/A  avgt   50   141221.807 ±   922.429  us/op
TimeseriesBenchmark.querySingleIncrementalIndex            N/A              1                N/A            750000       N/A           basic.A          N/A  avgt   50  1491135.435 ± 16168.599  us/op
TimeseriesBenchmark.querySingleQueryableIndex              N/A              1                N/A            750000       N/A           basic.A          N/A  avgt   50   142651.506 ±   734.796  us/op
TopNBenchmark.queryMultiQueryableIndex                     N/A              1                N/A            750000       N/A           basic.A           10  avgt   50   183568.518 ±   929.395  us/op
TopNBenchmark.querySingleIncrementalIndex                  N/A              1                N/A            750000       N/A           basic.A           10  avgt   50  1459638.382 ± 27925.835  us/op
TopNBenchmark.querySingleQueryableIndex                    N/A              1                N/A            750000       N/A           basic.A           10  avgt   50   177885.035 ±  2059.651  us/op

Patch:

GroupByBenchmark.queryMultiQueryableIndex                  N/A              4                N/A            100000           basic.A          N/A  avgt   50   425610.302 ± 12920.928  us/op
GroupByBenchmark.querySingleIncrementalIndex               N/A              4                N/A            100000           basic.A          N/A  avgt   50   233186.139 ±  1117.466  us/op
GroupByBenchmark.querySingleQueryableIndex                 N/A              4                N/A            100000           basic.A          N/A  avgt   50   185192.077 ±  1269.109  us/op
SearchBenchmark.queryMultiQueryableIndex                  1000              1                N/A            750000           basic.A          N/A  avgt   50    18409.140 ±    97.739  us/op
SearchBenchmark.querySingleIncrementalIndex               1000              1                N/A            750000           basic.A          N/A  avgt   50   259170.630 ±  5773.437  us/op
SearchBenchmark.querySingleQueryableIndex                 1000              1                N/A            750000           basic.A          N/A  avgt   50    18474.211 ±   143.386  us/op
SelectBenchmark.queryIncrementalIndex                      N/A              1               1000             25000           basic.A          N/A  avgt   50    60913.878 ±   569.650  us/op
SelectBenchmark.queryMultiQueryableIndex                   N/A              1               1000             25000           basic.A          N/A  avgt   50    85294.587 ±   451.770  us/op
SelectBenchmark.queryQueryableIndex                        N/A              1               1000             25000           basic.A          N/A  avgt   50    83608.035 ±   558.624  us/op
TimeseriesBenchmark.queryFilteredSingleQueryableIndex      N/A              1                N/A            750000           basic.A          N/A  avgt   50    15723.450 ±    54.988  us/op
TimeseriesBenchmark.queryMultiQueryableIndex               N/A              1                N/A            750000           basic.A          N/A  avgt   50   139903.759 ±   783.649  us/op
TimeseriesBenchmark.querySingleIncrementalIndex            N/A              1                N/A            750000           basic.A          N/A  avgt   50  1477342.648 ± 13718.615  us/op
TimeseriesBenchmark.querySingleQueryableIndex              N/A              1                N/A            750000           basic.A          N/A  avgt   50   138654.076 ±   640.646  us/op
TopNBenchmark.queryMultiQueryableIndex                     N/A              1                N/A            750000           basic.A           10  avgt   50   184123.006 ±   787.114  us/op
TopNBenchmark.querySingleIncrementalIndex                  N/A              1                N/A            750000           basic.A           10  avgt   50  1465099.858 ± 26925.094  us/op
TopNBenchmark.querySingleQueryableIndex                    N/A              1                N/A            750000           basic.A           10  avgt   50   176951.700 ±  1448.476  us/op

current results for reading 1,500,000 rows from a long column, no filters:

master
FilterPartitionBenchmark.longRead           1500000     basic  avgt   50  44587.905 ± 116.094  us/op

patch
FilterPartitionBenchmark.longRead           1500000     basic  avgt   50  44774.780 ± 117.027  us/op

Benchmark that applies a single selector filter to a 750,000 row segment:
Master

FilterPartitionBenchmark.readWithPreFilter                 N/A            N/A                N/A            750000     basic               N/A          N/A  avgt   50     5713.381 ±     6.405  us/op

Patch

FilterPartitionBenchmark.readWithPreFilter                 N/A            N/A                N/A            750000     basic               N/A          N/A  avgt   50     5998.602 ±   567.914  us/op

Benchmark only applicable to this patch:

Benchmark                                              (limit)  (numSegments)  (pagingThreshold)  (rowsPerSegment)  (schema)  (schemaAndQuery)  (threshold)  Mode  Cnt        Score       Error  Units
FilterPartitionBenchmark.longRead                          750000     basic  avgt   50   24258.797 ±   86.279  us/op
FilterPartitionBenchmark.readComplexOrFilter               750000     basic  avgt   50  209979.618 ± 7549.892  us/op
FilterPartitionBenchmark.readComplexOrFilterCNF            750000     basic  avgt   50  412001.016 ± 9765.903  us/op
FilterPartitionBenchmark.readOrFilter                      750000     basic  avgt   50   67674.482 ± 1301.258  us/op
FilterPartitionBenchmark.readOrFilterCNF                   750000     basic  avgt   50   76440.205 ± 1675.725  us/op
FilterPartitionBenchmark.readWithExFnPostFilter            750000     basic  avgt   50  135087.429 ±  795.120  us/op
FilterPartitionBenchmark.readWithExFnPreFilter             750000     basic  avgt   50    6146.368 ±  178.410  us/op
FilterPartitionBenchmark.readWithPostFilter                750000     basic  avgt   50   19275.322 ±  614.610  us/op
FilterPartitionBenchmark.readWithPreFilter                 750000     basic  avgt   50    5943.020 ±  492.261  us/op
FilterPartitionBenchmark.stringRead                        750000     basic  avgt   50   87951.983 ± 2488.246  us/op

@jon-wei jon-wei added this to the 0.9.2 milestone May 25, 2016
{
ImmutableBitmap roaringBitmap = bitmapIndex;
if (!(bitmapIndex instanceof WrappedImmutableRoaringBitmap)) {
final BitmapFactory factory = RoaringBitmapSerdeFactory.bitmapFactory;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems pretty expensive if the bitmap isn't roaring

are we planning on making roaring the default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not planning on changing the default bitmap to roaring with this PR at least.

if anyone knows of a faster way to get a reverse bitmap iterator, please let me know

Copy link
Contributor

@gianm gianm Jun 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a better way for concise.

Also this code is just moved from lower down in the same file, it was added as part of the descending select patch. It should only get used for descending queries.

@navis navis mentioned this pull request May 26, 2016
@@ -327,7 +394,34 @@ public Cursor apply(final Long input)
{
private final Offset initOffset = offset.clone();
private final DateTime myBucket = gran.toDateTime(input);
private Offset cursorOffset = offset;
private CursorOffsetHolder cursorOffsetHolder = new CursorOffsetHolder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added final

return current;
}

private static Filter flatten(Filter root) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not from Hive?

public boolean supportsBitmapIndex(BitmapIndexSelector selector)
{
for (Filter filter : filters) {
if(!filter.supportsBitmapIndex(selector)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting

@jon-wei jon-wei force-pushed the twostage_filter branch 2 times, most recently from 49823ad to df2a1bd Compare June 13, 2016 19:07
if(row.size() == 0) {
return matchNull;
}
for (int rowVal : row) {
Copy link
Contributor

@gianm gianm Jun 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely will be better to use get to avoid boxing and unboxing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed for loop to use get

@gianm
Copy link
Contributor

gianm commented Jun 16, 2016

@gianm
Copy link
Contributor

gianm commented Jun 16, 2016

Would be great to see the ability to filter on long columns (including __time, both as a long and with time-based extractionFn) as a follow up. It seems like that should be doable, and useful, with existing filters like "selector" and "bound".

use case e.g. https://groups.google.com/d/msg/druid-user/9rJBeMT0SeQ/G_hBwzwbAAAJ

@gianm
Copy link
Contributor

gianm commented Jun 17, 2016

@nishantmonu51 are you able to take another look?

@nishantmonu51
Copy link
Member

@gianm: sure, I ll check out the changes.

@@ -115,7 +115,56 @@ public BitmapFactory getBitmapFactory()
public BitmapIndex getBitmapIndex(String dimension)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably also need to modify getBitmapIndex(String dimension, String value) below in this class so as to return null when column does not have bitmap index ?
would also be nice to have the same logic at one place and use it in both the methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I changed getBitmapIndex(String dimension, String value) to return null for that case.

re: reusing logic, I'd prefer to leave this as-is, I assume you mean this block:

if (Strings.isNullOrEmpty(value)) {
        return bitmapFactory.complement(bitmapFactory.makeEmptyImmutableBitmap(), getNumRows());
      } else {
        return bitmapFactory.makeEmptyImmutableBitmap();
      }

I could have public ImmutableBitmap getBitmap(int idx) in the anonymous BitmapIndex call getBitmapIndex(String dimension, String value), but I would still have to check for idx == 0, and would need to pass in some bogus non-null/empty value for the idx != 0 case.

I think it looks cleaner as-is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool

@nishantmonu51
Copy link
Member

👍 after #3018 (comment) gets addressed.

@gianm gianm assigned gianm and unassigned gianm Jun 22, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants