-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Proposal] Two-stage Filtering #2742
Comments
SGTM |
In your hypothetical string example, lets say we have But if I ask for a NOT of the filter, the getBimapIndex would allow the cursor to advance to In a crazier case, if you just had This issue arrises from an implicit boolean connection between the two steps in the special filter that is not preserved in the If I have miss-read part of the methodology please let me know. |
@drcrallen Thanks for the feedback, let me update this proposal in a bit, it doesn't currently handle boolean filters properly as you mentioned, (e.g., even without the hypothetical filter, a selector on string OR with selector on float wouldn't work properly either). |
Here are my current thoughts on the boolean filtering: I suppose the general issue is that "filter cursor offsets using bitmap, then apply value matcher" only works if the conditions on either side are parts of an AND (implicit in the pruning), doing OR wouldn't work. One option could be to convert filters to conjunctive normal form, which is what @navis does in: #2217. A combination of filters would be converted into a set of AND clauses. This set of AND clauses would then be partitioned into two sets:
The first set would be used to create bitmap index during cursor generation, the second set is applied as a ValueMatcher during iteration. I suppose it would also be possible to allow this two-stage filtering to work with an OR split, i.e.:
Maybe a viable approach would be to let the user configure this aspect of filtering behavior:
Returning to the example you provided with a NOT on an AND filter on two of my hypothetical string filters, maybe it's better to postpone support for allowing a single non-boolean filter to return both a bitmap index and value matcher. There would have to be a new way for the individual filter to express whether the returned bitmap index/valuematcher pair is in an AND relation or an OR relation, and none of the current filters seem like they would require support for that feature. |
It sounds good to me to skip allowing the basic filters to return both bitmaps and value matchers. You're right that we don't really have a use case for that right now. however- I think it'd still be good for a particular filter class to be able to decide to do a bitmap index vs a value matcher depending on the actual column capabilities, rather than as a class-level thing. For example a selector filter would want to return a bitmap on a string dimension but a value matcher on a column that doesn't have an index (like |
agreed, I think returning bitmap/value matcher should be dynamically chosen based on the column capabilities |
I've updated the proposal with:
|
@jon-wei I'm a bit confused as to what we are proposing vs. what we are implementing. You mention several time that what you are proposing is not what is being done in the PR you submitted. Wouldn't it make sense to rework those PRs to do what we ultimately want to do? |
The PRs I mentioned were closed, with the intent to reopen after this proposal settles, since this would have a significant impact on the implementation. They are there just for reference as use cases for the proposed filtering change. I would like to get more feedback on the proposal before working on a potentially non-viable implementation |
@jon-wei ah ok, that makes more sense |
Two-stage filtering
Related PRs:
Support multi-dimensional filters
#2217
Support filtering on _time with column scan
#2673
Support Long/Float dimensions with generic dimension handling interface
#2621
Motivation
Currently there are two ways which filters are applied when generating a Cursor:
The PRs linked above introduce the concept of filters that cannot be applied using bitmap indexes:
A simple implementation of filtering without bitmap indexes might be to have each Filter's getBitmapIndex() function perform a column scan when a bitmap index is not usable; this is the approach currently taken in PR #2673 and #2621.
However, this is not optimal.
Suppose the user wishes to perform an AND on a set of filters that do not support bitmap indexes. This would result in a full column scan for each filter in the AND set. Ideally, the filtering should perform a single scan that applies all of the filters at once, per-row.
PR #2217 handles this problem more efficiently. It splits the filter set into two parts based on whether each filter supports bitmap indexes.
Proposed Implementation
Filters would be applied in two stages (Cursor generation via bitmap indexes, Cursor iteration via ValueMatcher).
The following method would be added to the Filter interface:
public boolean supportsBitmapIndex(BitmapIndexSelector selector);
Given a BitmapIndexSelector, a filter would return true if a bitmap index is present and usable for the filter, false otherwise.
If true is returned, this indicates that the Filter should be applied in the Cursor generation (pre-filtering) stage.
If false is returned, this indicates that the Filter should be applied in the Cursor iteration (post-filtering) stage.
The QueryableIndexStorageAdapter will call supportsBitmapIndex() to decide how to partition the set of filters in makeCursor().
Example behavior
Suppose a SelectorFilter is applied on a String dimension. Since String dimensions support bitmap indexes:
Suppose a SelectorFilter is applied on a Float dimension. Since Float dimensions do not support bitmap indexes:
A hypothetical example where splitting filter operations across both stages could be useful/more efficient:
Suppose there is an AndFilter, containing a SelectorFilter and an ExtractionFilter on another dimension with very high cardinality even after extraction.
Boolean Filters
Moving filters across "stages", RowOffsetMatcher
Pruning row offsets via bitmap indexes, followed by filtering of rows during iteration, can be thought of as an implicit AND with two "sides".
In the presence of boolean filters, if an OR condition spans both "sides", the entire OR condition needs to be moved to one "side".
Suppose we have the filter: Selector(dimA, "Hello") OR Selector(dimB, 3.1415)
For efficiency, we could still reuse the bitmap index available on dimA. Instead of returning a ValueMatcher that looks at row values, dimA's selector filter could return a matcher object that stores an internal reference to the bitmap for "Hello" in dimA. This bitmap would be used to match on the current row offset during cursor iteration. For example:
if we have the following rows:
The bitmap for (dimA, "Hello") would have bit 0 and 3 set. When iterating through the cursor:
To support this, new matcher object that has such knowledge of matching bitmap offsets, RowOffsetMatcher, will be added. Like the ValueMatcher it would have a single method, match().
Any RowOffsetMatcher contained within AND or NOT filters that have been moved to the iteration stage should be evaluated before any "real" ValueMatchers within the same AND/OR clause, to take advantage of bitmap indexes.
The Filter interface would have this method added:
Using the BitmapIndexSelector to determine whether bitmaps are present and usable, the Filter would create either a ValueMatcher or RowOffsetMatcher.
The ValueMatcherFactory and RowOffsetMatcherFactory would be created and provided by the QueryableIndexStorageAdapter.
A RowOffsetMatcherFactory has the following method:
public RowOffsetMatcher makeRowOffsetMatcher(ImmutableBitmap bitmap);
Given a bitmap indicating rows that match a Filter, return a RowOffsetMatcher that returns true when the underlying Cursor has advanced to a matching row.
CNF example
Because of the implicit AND present in the two stages of cursor generation/iteration, it may be useful to convert a boolean filter expression to conjunctive normal form (https://en.wikipedia.org/wiki/Conjunctive_normal_form).
You can see this being done in @navis's PR: https://github.com/navis/druid/blob/multi-dimensional-filter/processing/src/main/java/io/druid/segment/filter/Filters.java#L143
This converts the filter expression to a set of AND clauses. These clauses can then be moved to stage 1 or stage 2 of filtering based on whether they require column scans. The decision to convert to CNF should be user-configurable.
Suppose we have three dimensions:
Suppose we have the following boolean expression:
where:
This is an example where CNF conversion may be useful. We have an OR clause at the "top" level, and this set of filters would return both a bitmap index and a valuematcher, since dimY does not support bitmap indexes. As is, we cannot split the filtering here.
If the expression is converted to CNF, it has the form:
The subclause
(X OR Z)
filters on two dimensions that both support bitmap indexes, while the subclause(Y OR Z)
requires a ValueMatcher.(X OR Z)
can be pushed to stage 1, while(Y OR Z)
can be pushed to stage 2.The filtering operation would take the following steps after CNF conversion:
Alternatively, suppose the user does not want the CNF conversion to occur. The entire expression (X AND Y) OR Z could be moved to the iteration stage, and filtering would take the following steps:
The user can specify whether CNF conversion should occur via a new boolean query context flag,
useFilterCNF
.Filter Result Caching
Suppose we have four dimensions:
Suppose we have a set of filters:
In CNF, this would be:
where:
In either case, the ValueMatcher stage has to evaluate the expression
(B OR (C AND D))
or(B OR C) AND (B OR D)
per row (These expressions are moved to stage 2 because they require column scans).A's rows can be filtered on during cursor generation, since it is a stand alone AND clause on a dimension that supports bitmap indexes.
While D can be evaluated efficiently during iteration using a RowOffsetMatcher because D supports bitmap indexes, B and C's expensive filters would have to be evaluated at least once per row.
It may be useful to cache the ValueMatcher results for filter B and C. This cache could be indexed by a filter's cache key + dimension value, e.g. a Map indexed by filter cache key, with each entry containing a cache structure indexed by dimension value.
The text was updated successfully, but these errors were encountered: