Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Two-stage Filtering #2742

Closed
jon-wei opened this issue Mar 26, 2016 · 14 comments
Closed

[Proposal] Two-stage Filtering #2742

jon-wei opened this issue Mar 26, 2016 · 14 comments

Comments

@jon-wei
Copy link
Contributor

jon-wei commented Mar 26, 2016

Two-stage filtering

Related PRs:

Support multi-dimensional filters
#2217

  • Introduces filter partitioning, conversion to CNF concepts that were adapted/expanded for this proposal

Support filtering on _time with column scan
#2673

  • Shows example use-case for filtering without bitmap index

Support Long/Float dimensions with generic dimension handling interface
#2621

  • Another example of use case with filtering without bitmap indexes

Motivation

Currently there are two ways which filters are applied when generating a Cursor:

  • Retrieve BitmapIndex objects for values that matched on a filter. This is used by QueryableIndexStorageAdapter.
  • Create a ValueMatcher and apply the matcher as the Cursor iterates through rows. This is used by IncrementalIndexStorageAdapter.

The PRs linked above introduce the concept of filters that cannot be applied using bitmap indexes:

A simple implementation of filtering without bitmap indexes might be to have each Filter's getBitmapIndex() function perform a column scan when a bitmap index is not usable; this is the approach currently taken in PR #2673 and #2621.

However, this is not optimal.

Suppose the user wishes to perform an AND on a set of filters that do not support bitmap indexes. This would result in a full column scan for each filter in the AND set. Ideally, the filtering should perform a single scan that applies all of the filters at once, per-row.

PR #2217 handles this problem more efficiently. It splits the filter set into two parts based on whether each filter supports bitmap indexes.

  • Filters with bitmap support are used to create a bitmap of matching rows for generating the Cursor, in a "pre-filtering" step.
  • Filters without bitmap support are used to create a ValueMatcher that applies to each row as the Cursor advances, in a "post-filtering" step.

Proposed Implementation

Filters would be applied in two stages (Cursor generation via bitmap indexes, Cursor iteration via ValueMatcher).

The following method would be added to the Filter interface:
public boolean supportsBitmapIndex(BitmapIndexSelector selector);

Given a BitmapIndexSelector, a filter would return true if a bitmap index is present and usable for the filter, false otherwise.

If true is returned, this indicates that the Filter should be applied in the Cursor generation (pre-filtering) stage.

If false is returned, this indicates that the Filter should be applied in the Cursor iteration (post-filtering) stage.

The QueryableIndexStorageAdapter will call supportsBitmapIndex() to decide how to partition the set of filters in makeCursor().

Example behavior

Suppose a SelectorFilter is applied on a String dimension. Since String dimensions support bitmap indexes:

  • supportsBitmapIndex() would return true, and this SelectorFilter would be applied in the pre-filtering stage.

Suppose a SelectorFilter is applied on a Float dimension. Since Float dimensions do not support bitmap indexes:

  • supportsBitmapIndex() would return false, and this SelectorFilter would be applied in the post-filtering stage.

A hypothetical example where splitting filter operations across both stages could be useful/more efficient:
Suppose there is an AndFilter, containing a SelectorFilter and an ExtractionFilter on another dimension with very high cardinality even after extraction.

  • the SelectorFilter could return true for supportsBitmapIndex() (retrieve a single bitmap for the selector value)
  • the ExtractionFilter could return false for supportsBitmapIndex(), skipping the bitmap generation for the ExtractionFilter (union of bitmaps can be expensive with high cardinality), instead being applied during iteration of the rows pruned by the SelectorFilter

Boolean Filters

Moving filters across "stages", RowOffsetMatcher

Pruning row offsets via bitmap indexes, followed by filtering of rows during iteration, can be thought of as an implicit AND with two "sides".

In the presence of boolean filters, if an OR condition spans both "sides", the entire OR condition needs to be moved to one "side".

Suppose we have the filter: Selector(dimA, "Hello") OR Selector(dimB, 3.1415)

  • dimA is a String column, supports bitmap indexes
  • dimB is a Float column, does not support bitmap indexes
  • This OR needs to be handled on the ValueMatcher (row iteration) side, since the selector on dimB requires a column scan.
  • The filtering should be done with a combined ValueMatcher that matches rows on dimA == "Hello" OR dimB == 3.1415. i.e., while a selector on dimA would be handled in the first "stage" normally, it will be moved to the second "stage" so that the OR can be handled properly.
  • No rows will be pruned by bitmap indexes during cursor generation

For efficiency, we could still reuse the bitmap index available on dimA. Instead of returning a ValueMatcher that looks at row values, dimA's selector filter could return a matcher object that stores an internal reference to the bitmap for "Hello" in dimA. This bitmap would be used to match on the current row offset during cursor iteration. For example:

if we have the following rows:

{dimA: "Hello", dimB: 1.0}
{dimA: "World", dimB: 2.0}
{dimA: "Foo", dimB: 3.1415}
{dimA: "Hello", dimB: 3.1415}

The bitmap for (dimA, "Hello") would have bit 0 and 3 set. When iterating through the cursor:

row 0: dimA bitmap matches, don't need to evaluate dimB ValueMatcher
row 1: dimA bitmap does not match, dimB ValueMatcher does not match either
row 2: dimA bitmap does not match, dimB ValueMatcher returns true
row 3: dimA bitmap matches, don't need to evaluate dimB ValueMatcher

To support this, new matcher object that has such knowledge of matching bitmap offsets, RowOffsetMatcher, will be added. Like the ValueMatcher it would have a single method, match().

Any RowOffsetMatcher contained within AND or NOT filters that have been moved to the iteration stage should be evaluated before any "real" ValueMatchers within the same AND/OR clause, to take advantage of bitmap indexes.

The Filter interface would have this method added:

  public ValueMatcher makeMatcher(
      BitmapIndexSelector selector,
      ValueMatcherFactory valueMatcherFactory,
      RowOffsetMatcherFactory rowOffsetMatcherFactory
  );

Using the BitmapIndexSelector to determine whether bitmaps are present and usable, the Filter would create either a ValueMatcher or RowOffsetMatcher.

The ValueMatcherFactory and RowOffsetMatcherFactory would be created and provided by the QueryableIndexStorageAdapter.

A RowOffsetMatcherFactory has the following method:
public RowOffsetMatcher makeRowOffsetMatcher(ImmutableBitmap bitmap);

Given a bitmap indicating rows that match a Filter, return a RowOffsetMatcher that returns true when the underlying Cursor has advanced to a matching row.

CNF example

Because of the implicit AND present in the two stages of cursor generation/iteration, it may be useful to convert a boolean filter expression to conjunctive normal form (https://en.wikipedia.org/wiki/Conjunctive_normal_form).

You can see this being done in @navis's PR: https://github.com/navis/druid/blob/multi-dimensional-filter/processing/src/main/java/io/druid/segment/filter/Filters.java#L143

This converts the filter expression to a set of AND clauses. These clauses can then be moved to stage 1 or stage 2 of filtering based on whether they require column scans. The decision to convert to CNF should be user-configurable.

Suppose we have three dimensions:

dimX: String
dimY: Float (no bitmaps)
dimZ: String

Suppose we have the following boolean expression:

(X AND Y) OR Z

where:

X is a selector filter on dimX
Y is a selector filter on dimY (no bitmaps)
Z is a selector filter on dimZ

This is an example where CNF conversion may be useful. We have an OR clause at the "top" level, and this set of filters would return both a bitmap index and a valuematcher, since dimY does not support bitmap indexes. As is, we cannot split the filtering here.

If the expression is converted to CNF, it has the form:

(X OR Z) AND (Y OR Z)

The subclause (X OR Z) filters on two dimensions that both support bitmap indexes, while the subclause (Y OR Z) requires a ValueMatcher. (X OR Z) can be pushed to stage 1, while (Y OR Z) can be pushed to stage 2.

The filtering operation would take the following steps after CNF conversion:

  • During cursor generation, get the bitmap for (X OR Z), and prune rows that are not included in that bitmap.
  • During cursor iteration, we will have a ValueMatcher for (Y OR Z). Since Z has a bitmap, it should be checked first before applying Y's ValueMatcher.

Alternatively, suppose the user does not want the CNF conversion to occur. The entire expression (X AND Y) OR Z could be moved to the iteration stage, and filtering would take the following steps:

  • During cursor generation, don't prune any rows
  • During cursor iteration, check condition Z first, since Z supports bitmap indexes, while (X AND Y) requires row values to be read (since Y does not support bitmaps).
  • If Z matches, return true, no need to evaluate (X OR Y). If Z doesn't match, evaluate (X AND Y):
  • Check condition X first, since X supports bitmap indexes. If X matches return true, otherwise evaluate filter Y's ValueMatcher.

The user can specify whether CNF conversion should occur via a new boolean query context flag, useFilterCNF.

Filter Result Caching

Suppose we have four dimensions:

A: String
B: Float (no bitmaps)
C: Float (no bitmaps)
D: String

Suppose we have a set of filters:

A AND (B OR (C AND D))

In CNF, this would be:

A AND (B OR C) AND (B OR D)

where:

A is a selector filter on dim A
B is an expensive javascript filter on dim B (no bitmaps)
C is an expensive javascript filter on dim C (no bitmaps)
D is a selector filter on dim D

In either case, the ValueMatcher stage has to evaluate the expression (B OR (C AND D)) or (B OR C) AND (B OR D) per row (These expressions are moved to stage 2 because they require column scans).

A's rows can be filtered on during cursor generation, since it is a stand alone AND clause on a dimension that supports bitmap indexes.

While D can be evaluated efficiently during iteration using a RowOffsetMatcher because D supports bitmap indexes, B and C's expensive filters would have to be evaluated at least once per row.

It may be useful to cache the ValueMatcher results for filter B and C. This cache could be indexed by a filter's cache key + dimension value, e.g. a Map indexed by filter cache key, with each entry containing a cache structure indexed by dimension value.

@himanshug
Copy link
Contributor

SGTM

@drcrallen
Copy link
Contributor

In your hypothetical string example, lets say we have "AZ" and "BZ" as dimension values, and we have a filter which matches strings that start with A and end with Z using the method proposed. For a simple filter case you would expect "AZ" to be returned, and thus for the negation case you would expect "BZ" to be returned.

But if I ask for a NOT of the filter, the getBimapIndex would allow the cursor to advance to "BZ" and the makeMatcher would exclude it, so you would end up with no results.

In a crazier case, if you just had "A" and "Z" as dimension values, the filter of "starts with A and ends with Z" evaluated using the method previously proposed, would match neither value. But if you did a NOT of the filter, you would exclude the "starts with A" bitmap, leaving you Z when generating the cursor, then exclude the "Z" field when iterating through the values.

This issue arrises from an implicit boolean connection between the two steps in the special filter that is not preserved in the not filter.

If I have miss-read part of the methodology please let me know.

@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 29, 2016

@drcrallen Thanks for the feedback, let me update this proposal in a bit, it doesn't currently handle boolean filters properly as you mentioned, (e.g., even without the hypothetical filter, a selector on string OR with selector on float wouldn't work properly either).

@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 29, 2016

@drcrallen

Here are my current thoughts on the boolean filtering:

I suppose the general issue is that "filter cursor offsets using bitmap, then apply value matcher" only works if the conditions on either side are parts of an AND (implicit in the pruning), doing OR wouldn't work.

One option could be to convert filters to conjunctive normal form, which is what @navis does in: #2217.

A combination of filters would be converted into a set of AND clauses. This set of AND clauses would then be partitioned into two sets:

  • First set contains AND clauses containing filters that can use bitmap indexes
  • Second set contains AND clauses with at least one filter requiring a column scan

The first set would be used to create bitmap index during cursor generation, the second set is applied as a ValueMatcher during iteration.

I suppose it would also be possible to allow this two-stage filtering to work with an OR split, i.e.:

  • Bitmaps are used not as a pre-filter but as a "cache" of matched rows (don't pre-filter any rows, check the bitmap first before appying ValueMatcher to rows during column scan)
  • Maybe accompanied with a similar filter transformation to disjunctive normal form? (set of OR clauses)

Maybe a viable approach would be to let the user configure this aspect of filtering behavior:

  • Query-level option that configures whether to convert filters to CNF, DNF, or leave user-provided filters as-is
    • This could address situations where the normalized form is needlessly complex/inefficient, let the user manually arrange the AND/OR clauses if needed
  • If top-level filter is AND, use bitmaps as filter
  • If top-level filter is OR, use bitmaps as matched row "cache"
  • If top-level is NOT, convert boolean children accordingly and then apply the rules above

Returning to the example you provided with a NOT on an AND filter on two of my hypothetical string filters, maybe it's better to postpone support for allowing a single non-boolean filter to return both a bitmap index and value matcher. There would have to be a new way for the individual filter to express whether the returned bitmap index/valuematcher pair is in an AND relation or an OR relation, and none of the current filters seem like they would require support for that feature.

@gianm
Copy link
Contributor

gianm commented Mar 29, 2016

It sounds good to me to skip allowing the basic filters to return both bitmaps and value matchers. You're right that we don't really have a use case for that right now.

however- I think it'd still be good for a particular filter class to be able to decide to do a bitmap index vs a value matcher depending on the actual column capabilities, rather than as a class-level thing. For example a selector filter would want to return a bitmap on a string dimension but a value matcher on a column that doesn't have an index (like __time or a numeric dimension without an index).

@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 29, 2016

agreed, I think returning bitmap/value matcher should be dynamically chosen based on the column capabilities

@jon-wei
Copy link
Contributor Author

jon-wei commented Mar 30, 2016

@himanshug @drcrallen @gianm

I've updated the proposal with:

  • note that a single non-boolean filter can only return either a bitmap index or value matcher, but not both (decision is dynamically made)
  • more details/examples on boolean filter handling
  • description of a new kind of 'ValueMatcher' that uses bitmaps/row offsets during cursor iteration
  • filter result caching

@xvrl
Copy link
Member

xvrl commented Apr 5, 2016

@jon-wei I'm a bit confused as to what we are proposing vs. what we are implementing. You mention several time that what you are proposing is not what is being done in the PR you submitted. Wouldn't it make sense to rework those PRs to do what we ultimately want to do?

@jon-wei
Copy link
Contributor Author

jon-wei commented Apr 5, 2016

@xvrl

The PRs I mentioned were closed, with the intent to reopen after this proposal settles, since this would have a significant impact on the implementation. They are there just for reference as use cases for the proposed filtering change.

I would like to get more feedback on the proposal before working on a potentially non-viable implementation

@xvrl
Copy link
Member

xvrl commented Apr 5, 2016

@jon-wei ah ok, that makes more sense

@xvrl
Copy link
Member

xvrl commented Apr 5, 2016

@jon-wei can you update this proposal to explain how #2760 fits into this?

@jon-wei
Copy link
Contributor Author

jon-wei commented Apr 5, 2016

@xvrl #2760 wouldn't be affected by it, since that doesn't introduce any non-bitmap index/dictionary based scans.

@jon-wei
Copy link
Contributor Author

jon-wei commented Apr 5, 2016

I would say that this proposal and PR #2760 are separate from each other, but both introduce features that would be used later to implement Long/Float dimension support (filtering without bitmap indexes in this proposal, type-specific dimension handling implementations in #2760)

@jon-wei
Copy link
Contributor Author

jon-wei commented Apr 12, 2016

@navis Can you take a look at this proposal when you have time? It uses/expands on the filter partitioning and CNF conversion that you introduced for PR #2217

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants