Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count distinct returned incorrect results without useApproximateCountDistinct #14748

Merged
merged 101 commits into from
Sep 12, 2023

Conversation

kgyrtkirk
Copy link
Member

@kgyrtkirk kgyrtkirk commented Aug 3, 2023

With useApproximateCountDistinct=false queries like:

select count(distinct m1) from druid.foo where m1 < -1.0

may have returned incorrected results.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

for (int i = 0; i < aggSpec.size(); i++) {
values[i] = aggSpec.get(i).factorize(new AllNullColumnSelectorFactory()).get();
}
return Collections.singleton(ResultRow.of(values)).iterator();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a Collections.singletonIterator that you can use instead. It's a nit, but will save on an object allocation.

Comment on lines 178 to 184
Sequence<ResultRow> process;
if (isNestedQueryPushDown(query)) {
return mergeResultsWithNestedQueryPushDown(query, resource, runner, context);
process = mergeResultsWithNestedQueryPushDown(query, resource, runner, context);
} else {
process = mergeGroupByResultsWithoutPushDown(query, resource, runner, context);
}
return mergeGroupByResultsWithoutPushDown(query, resource, runner, context);
return GroupByQueryRunnerFactory.wrapSummaryRowIfNeeded(query, process);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that this was required, which test caused you to need this change? I say this because the only way you should be able to get a completely empty sequence here is if the "leaf nodes" are producing completely empty sequences. The change in the other place should ensure that no leaf node ever produces a completely empty sequence, meaning that this change shouldn't be necessary...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking a look!
unfortunately its needed - I've linked the test(s) checking this.

The leaf nodes are not necessarily aggregating (in case of distinct) so an empty sequence may be produced - the merger supposed to aggregate them - that's why this is needed.

For nested query stuff the merge runner becomes this lambda (note: I don't know why I didn't placed this call there - just moved it)

example tests

@kgyrtkirk
Copy link
Member Author

The last test results have uncovered that HAVING clauses were not able to filter the summary row - because it was added after those were processed.

To avoid that issue I've moved the insertion of the optional summary row to be right before postprocessing is applied

@clintropolis clintropolis merged commit 5d16d0e into apache:master Sep 12, 2023
74 checks passed
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants