-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Count distinct returned incorrect results without useApproximateCountDistinct #14748
Conversation
.../main/java/org/apache/druid/query/groupby/epinephelinae/SummaryRowSupplierVectorGrouper.java
Fixed
Show fixed
Hide fixed
for (int i = 0; i < aggSpec.size(); i++) { | ||
values[i] = aggSpec.get(i).factorize(new AllNullColumnSelectorFactory()).get(); | ||
} | ||
return Collections.singleton(ResultRow.of(values)).iterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a Collections.singletonIterator
that you can use instead. It's a nit, but will save on an object allocation.
Sequence<ResultRow> process; | ||
if (isNestedQueryPushDown(query)) { | ||
return mergeResultsWithNestedQueryPushDown(query, resource, runner, context); | ||
process = mergeResultsWithNestedQueryPushDown(query, resource, runner, context); | ||
} else { | ||
process = mergeGroupByResultsWithoutPushDown(query, resource, runner, context); | ||
} | ||
return mergeGroupByResultsWithoutPushDown(query, resource, runner, context); | ||
return GroupByQueryRunnerFactory.wrapSummaryRowIfNeeded(query, process); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised that this was required, which test caused you to need this change? I say this because the only way you should be able to get a completely empty sequence here is if the "leaf nodes" are producing completely empty sequences. The change in the other place should ensure that no leaf node ever produces a completely empty sequence, meaning that this change shouldn't be necessary...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for taking a look!
unfortunately its needed - I've linked the test(s) checking this.
The leaf nodes are not necessarily aggregating (in case of distinct
) so an empty sequence may be produced - the merger supposed to aggregate them - that's why this is needed.
For nested query stuff the merge runner becomes this lambda (note: I don't know why I didn't placed this call there - just moved it)
example tests
- testCountDistinctNonApproximateEmptySet is a sql level one
- testSummaryrowForEmptySubqueryInput as a runnertest
The last test results have uncovered that To avoid that issue I've moved the insertion of the optional summary row to be right before postprocessing is applied |
With
useApproximateCountDistinct=false
queries like:may have returned incorrected results.
This PR has: