Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix filtering bug in filtering unnest cols and dim cols: Received a non-applicable rewrite #14587

Merged

Conversation

pranavbhole
Copy link
Contributor

@pranavbhole pranavbhole commented Jul 14, 2023

Fixes #XXXX.

Description

This PR fixes the bug in the following query, currently it throws the exception org.apache.druid.query.QueryException: Received a non-applicable rewrite: {j0.unnest=c2}, filter's dimension: c1
Whenever we have And / OR filters on unnest columns, we rewrite the filters on unnest cols.
We were covering OR filters and disqualifying the rewrite for non unnest cols and we have to to do same thing with AND filters as well. This PR checks for Boolean filter and disqualify the rewrite for non unnest cols.

Adding the 3 tests that covers the given scenario.

select * from ( 
          select * from (
                         select * from mytest, unnest(mv_to_array(c2)) as u(n1)
                     ), unnest(mv_to_array(c2)) as u(n2)
               ), unnest(mv_to_array(c2)) as u(n3)
where n1='A' and c1>=1

3rd test case cover the nested filer scenario for unnest filters, This PR address the unpack the nested OR/AND filters and attempt to rewrite only for qualified filters.

SELECT dimZipf, dim3_unnest1, dim3_unnest2, dim3_unnest3 FROM 
      ( SELECT * FROM 
           ( SELECT * FROM lotsocolumns, UNNEST(MV_TO_ARRAY(dimMultivalEnumerated)) as ut(dim3_unnest1) )           ,UNNEST(MV_TO_ARRAY(dimMultivalEnumerated)) as ut(dim3_unnest2) 
      ), UNNEST(MV_TO_ARRAY(dimMultivalEnumerated)) as ut(dim3_unnest3)  WHERE dimZipf=27 AND (dim3_unnest1='Baz' OR dim3_unnest2='Hello') AND dim3_unnest3='World'

Thank you

Fixed the bug ...

Renamed the class ...

Added a forbidden-apis entry ...

Release note


Key changed/added classes in this PR
  • MyFoo
  • OurBar
  • TheirBaz

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch from a3dc3ac to 89aa0b5 Compare July 14, 2023 21:52
Copy link
Contributor

@somu-imply somu-imply left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for the PR. The recursive one is a nice touch to handle the complex filters

@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch 2 times, most recently from 209c889 to 93afd47 Compare July 14, 2023 22:35
Comment on lines 380 to 381
final OrFilter preOrFilter = new OrFilter(preFilterList);
filterSplitter.addPreFilter(preOrFilter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we always make an OrFilter if we are only checking for BooleanFilter to break down? Isn't this going to turn AndFilter into OrFilter? Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are previously this code was dealign with only OR filters, we need to change the final filter based on type. I Need refactor the code to create new PreFilterList.

Copy link
Contributor

@somu-imply somu-imply Jul 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh shoot, I missed that. Good catch. We are extracting the filters and adding each to the pre-filter list. Maybe at each level of recursion, we need to recreate the And/Or filter and add it back to the pre-filter list. In such a case the number of filters in pre-filter list will be equal to the original filter size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@somu-imply @clintropolis Pushed a change to address the comment. Can you please review?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be an issue with the test test_pushdown_or_filters_unnested_and_original_dimension_with_unnest_adapters as the expected results have changed. I'm going through the code today

@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch 3 times, most recently from 35bee11 to 42c0a61 Compare July 17, 2023 22:15
))
));

final OrFilter expectedPushDownFilter = new OrFilter(ImmutableList.of(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a similar test case on AND of OR filters as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding more test case

@somu-imply
Copy link
Contributor

somu-imply commented Jul 17, 2023

Almost there. Suggesting 1 more change:

There are 2 cases:

  1. Consider the query
select * from foo, UNNEST(mv_to_array(dim3)) as u(d3) where d3 IN (a,b) or m1 < 10

Here since d3 can be mapped directly to an input column dim3, the filter can be reconstructed as a pre-filter with dim3 IN (a,b) or m1<10 and pushed to the base cursor.

  1. Consider the query
select * from foo, UNNEST(ARRAY[dim1,dim2]) as u(d12) where d12 IN (a,b) or m1 < 10

since the filter outside contains an OR filter and d12 has no direct match to an input column there should be no pre-filters and the entire filter should be just used as a post-filter. In the current code, this needs to be accounted for and a test case needs to be added for the same. Currently, for the failing query in CI, this does not seem to be the case and returns incorrect results

filterSplitter.filtersForPostUnnestCursor = {ArrayList@10853}  size = 1
0 = {OrFilter@8617} "(m1 as numeric < 2 || j0.unnest IN (a, aa))"
filterSplitter.filtersPushedDownToBaseCursor = {ArrayList@10854}  size = 1
0 = {OrFilter@10852} "(m1 as numeric < 2)"

@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch from 42c0a61 to 7d9518e Compare July 18, 2023 19:15
Comment on lines 2881 to 2936
bound("dimZipf", "27", "27", false, false, null, StringComparators.NUMERIC),
new SelectorDimFilter("j0.unnest", "Baz", null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since #14542 has been merged, this should use range and equality instead of bound and making a SelectorDimFilter directly. same for other tests added in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch 3 times, most recently from 3b27907 to dd1f92d Compare July 31, 2023 16:48
@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch 3 times, most recently from 26c4fb1 to 77b62a6 Compare July 31, 2023 23:39
Copy link
Contributor

@somu-imply somu-imply left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the code and adding those tests. I have some comments so far, mostly cosmetic.

return new SelectorDimFilter(dimension, value, extractionFn).toFilter();
}

public static DimFilter sdfd(String dimension, String value, ExtractionFn extractionFn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. maybe some better names...but am good with these too

* 3. Filters on unnest column which is derived from Array or any other Expression can not be pushe down to base.
* for example: a=1 and vc=3 , lets say vc is ExpressionVirtualColumn, and vc=3 can not be push down to base even if top level is AND filter.
* 4.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments, for future developers can we make it a bit clear with why in case 1, the entire filter (b=2 OR c=2) cannot be pushed down to the preFilter. Probably we can add an example saying that either b2 or c2 references a column that is not a part of the input for the case of unnest.

Something like, in unnest we come across 2 cases:

  1. unnesting a single dimension e.g. select * from foo, UNNEST(MV_TO_ARRAY(dim3)) as u(d3)
  2. Unnesting an expression from multiple columns e.g. select * from foo, UNNEST(ARRAY[dim1,dim2]) as u(d12)

In case 1, d3 is a direct reference to dim3 so any filter using d3 can be rewritten using dim3 and added to pre filter while in case2, due to presence of the expression virtual column expressionVirtualColumn("j0.unnest", "array(\"dim1\",\"dim2\")", ColumnType.STRING_ARRAY) the filters on d12 cannot be pushed to the pre filters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the comments

@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch from 77b62a6 to ccabe2b Compare August 1, 2023 01:23
@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch from ccabe2b to 7210ae7 Compare August 1, 2023 18:41
Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to add some tests with deeper nesting to UnnestStorageAdapterTest just to flex the recursive stuff a bit more

Comment on lines 3025 to 3019
NullHandling.sqlCompatible()
? equality("dimZipf", "27", ColumnType.LONG)
: bound("dimZipf", "27", "27", false, false, null, StringComparators.NUMERIC),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use numericEquality instead of this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -44,4 +47,19 @@ public static SelectorFilter selector(final String fieldName, final String value
{
return new SelectorFilter(fieldName, value, null);
}

public static Filter sdf(String dimension, String value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this need to exist instead of just using selector? I don't see any callers using the version that takes extractionFn, and only one caller using sdfd which seems overkill..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed static functions that I added


public FilterSplitter(
String inputColumn,
ColumnCapabilities inputColumnCapabilites, VirtualColumns queryVirtualColumns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit formatting, VirtualColumns should be on a separate line (the style checker has trouble enforcing this sometimes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch 2 times, most recently from c3e0a33 to 0508ec4 Compare August 2, 2023 19:38
preFilterList.add(filter);
// requires check for OR and And filters, disqualify rewrite for non-unnest filters
if (queryFilter instanceof BooleanFilter) {
int originalFilterCount = Filters.countNumberOfFilters(queryFilter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like it would be a lot more efficient to do both the before and after count as part of the traversal that the filter splitter does, any reason I'm missing not to do that? Otherwise it seems like we have to basically traverse the filter tree 3 times per segment, before to get a count, then to try to rewrite, then after to get the new count. It seems like this could all just be baked into the FilterSplitter itself. Apologies for not spotting this earlier on in review... but if we do this then i think we no longer need Filters.countNumberOfFilters

@@ -422,4 +423,19 @@ private static LinkedHashSet<Filter> flattenOrChildren(final Collection<Filter>

return retVal;
}

public static int countNumberOfFilters(Filter filter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if this method needs to stay (re if other comment suggesting pushing counting inside of filter splitter does not work out for some reason) then this argument should be marked @Nullable

));
testComputeBaseAndPostUnnestFilters(
testQueryFilter,
"unnested-multi-string1 = 3",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem right, shouldn't it be pushing down multi-string1 = 3 since unnested-multi-string1 doesn't exist on the base adapter? I think the problem is that testComputeBaseAndPostUnnestFilters is not getting the virtual column capabilities correctly, which makes it use the default capabilities to define it as float typed rather than string typed, so it isn't rewritten correctly in rewriteFilterOnUnnestColumnIfPossible because it is not string typed.

If i change the line to get the capabilities from the unnest storage adapter

vc.capabilities(UNNEST_STORAGE_ADAPTER, inputColumn)

, the test fails because the pushed down filter is (multi-string1 = 3 && multi-string1 = 2) which seems like the correct answer? Also makes most of the other tests fail expectations. I think we should probably add a check to ensure that filters on the unnest column never get pushed down, since I think that would effectively make the results null since the filter would match nothing since the column doesn't exist (unless it was filtering for nulls i guess)

null,
VirtualColumns.EMPTY,
inputColumn,
vc.capabilities(inputColumn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be

Suggested change
vc.capabilities(inputColumn)
vc.capabilities(UNNEST_STORAGE_ADAPTER, inputColumn)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

));
testComputeBaseAndPostUnnestFilters(
testQueryFilter,
"multi-string1 = 3",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it that only the filter on the unnest column pushed down instead of the whole and? same question for other test cases involving an AND? I think I would expect that in the case of AND especially, everything which can be pushed down is pushed down. This is what the comments in computeBaseAndPostUnnestFilters indicates should happen..

In the case of an AND filter, I think that everything can be pushed down except for the case where the unnest output column is not a simple direct access to a MVD. Since it is an AND though, I think all of the other clauses could still be pushed down in that case, leaving only the unnest filter which could not be rewritten as a post filter.

But in the case of OR filters, I think its either all or nothing can be pushed down. So if there is a filter on the unnest output column where the virtual column is not a simple direct column access to a multi-value string, then nothing can be pushed down, else if the unnest column filters can be rewritten because they are simple mvd direct access, everything can be pushed down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this test, it is case of And filters, and We only push down the filters to base where rewriteFilterOnUnnestColumnIfPossible is successful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the bug in filter counts and now tests looks good, Can you please verify again?

filterSplitter.addPreFilter(newFilter != null ? newFilter : filter);
filterSplitter.addToPreFilterCount(1);
}
filterSplitter.addToOriginalFilterCount(1);
Copy link
Contributor

@somu-imply somu-imply Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the filter getting added to the preFilter list twice or counted twice with addToPreFilterCount ? It already got added to preFilterList once in line 503. This would cause the count to go up and when you make the comparison of count on line 330, it would give the wrong count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch 2 times, most recently from b4b0ec1 to a0a8324 Compare August 16, 2023 06:02
@pranavbhole pranavbhole force-pushed the fixBugInfilterOnUnnestColWithDimFilers branch from a0a8324 to 96a8208 Compare August 16, 2023 16:31
Copy link
Contributor

@somu-imply somu-imply left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, two optimizations that need to be done are:

  1. Make sure not filters on unnested column are not pushed to base
  2. Update post filters for and filters to include only the subset not pushed to base

These can be done in a followup PR to not delay this one

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

per comment (and agreeing with @somu-imply) I think there is still some follow-up work to be done. Also it would be nice to have a few more tests where it can't push anything down, such as when filter is on unnest column that is wrapped in some expression like array_slice which would make the virtual column more than a direct access and so shouldn't be eligible for pushdown.

Comment on lines +473 to +477
testComputeBaseAndPostUnnestFilters(
testQueryFilter,
"(multi-string1 = 3 && multi-string1 = 2)",
"(unnested-multi-string1 = 3 && multi-string1 = 2)"
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is still some work to do here, but is ok to do as a follow-up.

Side note, this would be a very strange query since there is like coincidentally a filter on the underlying unnest column that is referring to it directly alongside a filter on the unnested column itself. I think it would make sense to migrate a lot of these tests away from having filters on both the unnest column and its underlying column, since I cant imagine its going to be common in practice.

But thinking about the more general case, like if it was WHERE unnested-multi-string1 = 3 && other-column = 2 i the best rewrite would be

        testQueryFilter,
        "(multi-string1 = 3 && other-column= 2)",
        "unnested-multi-string1 = 3"
    );```
but the current code leaves the `other-column = 2` in the post-filters, which isn't necessary. It would be a pretty big optimization to leave these off and only have to do post-filters when absolutely necessary. OR filters cannot have the same optimization.

But, i think it should be ok to do this in a follow-up PR, since there is some other correctness stuff to be done as well, such as skipping pushdown of NOT filters, which could incorrectly filter out too much before the post-filter has a chance to match stuff.

@clintropolis clintropolis merged commit 26d82fd into apache:master Aug 17, 2023
74 checks passed
@pranavbhole pranavbhole deleted the fixBugInfilterOnUnnestColWithDimFilers branch September 6, 2023 16:11
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants