Count distinct returned incorrect results without useApproximateCountDistinct #14748

kgyrtkirk · 2023-08-03T14:52:33Z

With useApproximateCountDistinct=false queries like:

select count(distinct m1) from druid.foo where m1 < -1.0

may have returned incorrect results.

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java

.../main/java/org/apache/druid/query/groupby/epinephelinae/SummaryRowSupplierVectorGrouper.java

imply-cheddar · 2023-08-30T01:18:28Z

processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryRunnerFactory.java

+    for (int i = 0; i < aggSpec.size(); i++) {
+      values[i] = aggSpec.get(i).factorize(new AllNullColumnSelectorFactory()).get();
+    }
+    return Collections.singleton(ResultRow.of(values)).iterator();


There's a Collections.singletonIterator that you can use instead. It's a nit, but will save on an object allocation.
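
For context, the loop in the diff above builds one row out of whatever each aggregator reports when it has seen no input. A minimal sketch of that idea follows. It does not use Druid's classes: the `EmptyAggregate` interface and the `COUNT` and `MAX` constants are hypothetical stand-ins for `AggregatorFactory` and the `AllNullColumnSelectorFactory` trick.

```java
import java.util.Arrays;
import java.util.List;

public class SummaryRowSketch {
  // Hypothetical stand-in for an aggregator factory: it only reports the
  // value the aggregate takes when it has been fed zero rows.
  interface EmptyAggregate {
    Object valueOverNoRows();
  }

  static final EmptyAggregate COUNT = () -> 0L;   // COUNT over no rows is 0
  static final EmptyAggregate MAX = () -> null;   // MAX over no rows is NULL in standard SQL

  // Builds the single "summary row" for a query with no grouping keys.
  static Object[] summaryRow(List<EmptyAggregate> aggs) {
    Object[] values = new Object[aggs.size()];
    for (int i = 0; i < aggs.size(); i++) {
      values[i] = aggs.get(i).valueOverNoRows();
    }
    return values;
  }

  public static void main(String[] args) {
    // prints [0, null]
    System.out.println(Arrays.toString(summaryRow(Arrays.asList(COUNT, MAX))));
  }
}
```

The point of the sketch is only that the row is derived from the aggregators themselves, so each aggregate decides its own empty value.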

imply-cheddar · 2023-08-30T01:20:21Z

processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryQueryToolChest.java

+    Sequence<ResultRow> process;
    if (isNestedQueryPushDown(query)) {
-      return mergeResultsWithNestedQueryPushDown(query, resource, runner, context);
+      process = mergeResultsWithNestedQueryPushDown(query, resource, runner, context);
+    } else {
+      process = mergeGroupByResultsWithoutPushDown(query, resource, runner, context);
    }
-    return mergeGroupByResultsWithoutPushDown(query, resource, runner, context);
+    return GroupByQueryRunnerFactory.wrapSummaryRowIfNeeded(query, process);


I'm surprised that this was required. Which test caused you to need this change? I ask because the only way you should be able to get a completely empty sequence here is if the "leaf nodes" are producing completely empty sequences. The change in the other place should ensure that no leaf node ever produces a completely empty sequence, meaning that this change shouldn't be necessary...

Thank you for taking a look!
Unfortunately, it's needed - I've linked the test(s) checking this.

The leaf nodes are not necessarily aggregating (in the case of distinct), so an empty sequence may be produced. The merger is supposed to aggregate them, which is why this is needed.

For nested queries the merge runner becomes this lambda (note: I don't know why I didn't place this call there - I've just moved it)

example tests

testCountDistinctNonApproximateEmptySet is a sql level one

testSummaryrowForEmptySubqueryInput as a runnertest
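
The reasoning in this reply can be sketched as a wrapper around the merged result: rows pass through untouched, and a summary row is substituted only when the merged input is empty. This is an illustrative sketch, not the code of `wrapSummaryRowIfNeeded`. The helper name `wrapSummaryRowIfEmpty`, the `hasGroupingKeys` flag, and the use of plain `Iterator` instead of Druid's `Sequence` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class WrapSummaryRowSketch {
  // Hypothetical helper: returns the merged rows unchanged, unless the query
  // has no grouping keys and the merged input is empty, in which case a
  // single summary row is returned instead.
  static <T> Iterator<T> wrapSummaryRowIfEmpty(Iterator<T> merged, boolean hasGroupingKeys, T summaryRow) {
    if (hasGroupingKeys || merged.hasNext()) {
      return merged;
    }
    return Collections.singletonList(summaryRow).iterator();
  }

  // Collects the remaining rows of an iterator into a list.
  static <T> List<T> drain(Iterator<T> rows) {
    List<T> out = new ArrayList<>();
    rows.forEachRemaining(out::add);
    return out;
  }

  public static void main(String[] args) {
    Iterator<Long> empty = Collections.emptyIterator();
    // prints [0]
    System.out.println(drain(wrapSummaryRowIfEmpty(empty, false, 0L)));
  }
}
```

Doing the check at merge time matters in the non-aggregating case described above: each leaf may legitimately return nothing, so only the merger can tell that the overall input was empty.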

kgyrtkirk · 2023-08-30T18:48:28Z

The last test results have uncovered that HAVING clauses were not able to filter the summary row - because it was added after those were processed.

To avoid that issue, I've moved the insertion of the optional summary row to right before postprocessing is applied.
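
Why the ordering matters can be shown with a toy pipeline. The sketch below is an assumption-laden illustration, not Druid code: rows are plain `Long` counts, HAVING is a `Predicate`, and both method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class HavingOrderSketch {
  // Summary row is inserted first, so HAVING gets a chance to filter it.
  static List<Long> summaryThenHaving(List<Long> merged, long summary, Predicate<Long> having) {
    List<Long> rows = new ArrayList<>(merged);
    if (rows.isEmpty()) {
      rows.add(summary);
    }
    return rows.stream().filter(having).collect(Collectors.toList());
  }

  // Summary row is appended after HAVING has already run, so it escapes the filter.
  static List<Long> havingThenSummary(List<Long> merged, long summary, Predicate<Long> having) {
    List<Long> rows = new ArrayList<>(merged.stream().filter(having).collect(Collectors.toList()));
    if (merged.isEmpty()) {
      rows.add(summary);
    }
    return rows;
  }

  public static void main(String[] args) {
    List<Long> noRows = Collections.emptyList();
    Predicate<Long> having = count -> count > 0;  // e.g. HAVING COUNT(DISTINCT m1) > 0
    System.out.println(summaryThenHaving(noRows, 0L, having));  // prints []
    System.out.println(havingThenSummary(noRows, 0L, having));  // prints [0]
  }
}
```

In the second variant the summary row with count 0 survives a `HAVING ... > 0` clause, which is the kind of incorrect result described above.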

kgyrtkirk added 25 commits August 3, 2023 14:49

add test-copy1

64479e0

let the autoformat work

899badd

add tryies

8be630c

some more test

b63df49

build q

5089f8b

build q

453a81d

update test

34eace2

remove tries

def97c0

add test for good behaiour

5595f45

fix0

f8756c8

cleanup

d91e134

possible fix

ba54ce7

ignore test

058f1b9

fix format

00f5f8e

half-fix 1 test

ae581a2

test for #2

6e75a68

some changes

1ef0354

updates

7e99f43

fix a set of tests

f2e2fc5

fix more tests

5d2cdfe

unpatch

bcb4b3c

tries

2231831

allow timeseries in ingestion

6c66381

fix more tests

cb30e7c

fix a few more

8e4a3fc

github-advanced-security bot found potential problems Aug 7, 2023

sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java Fixed

kgyrtkirk added 4 commits August 8, 2023 13:51

fix-b

3907672

remove some unrelated stuff

33a2293

fix-B2

0f3b3a6

remove test framework changes

e88f3cf

fix

2931898

github-advanced-security bot found potential problems Aug 28, 2023

.../main/java/org/apache/druid/query/groupby/epinephelinae/SummaryRowSupplierVectorGrouper.java Fixed

kgyrtkirk added 13 commits August 29, 2023 09:56

commented runnerfactory level

7325f72

updates

13a5097

remove grouper approach; migrate to runnerfactory

1dbb496

cleanup/format/etc

3d4403a

cleanup; add test for subq at processing

d37bf3d

ugly-subq handling

05d719c

updates

1e0f99b

clenaup

cf79d1c

remove type args; add safevarags

be18860

cleanup

cd6cc79

move stuff to toolchest

39b0ada

put back into factory

2281a71

cleanup

13eb306

imply-cheddar approved these changes Aug 30, 2023

kgyrtkirk added 2 commits August 30, 2023 04:50

move to mergeResults fn

cf59ed3

fix NullHandling.replaceWithDefault in GroupByQueryRunnerTest

faeec4e

clintropolis approved these changes Aug 30, 2023

kgyrtkirk added 3 commits August 30, 2023 18:25

having test+fix

3fb27dc

move to GroupingEngine#applyPostProcessing

7d4a7bf

processing-test

adcb001

kgyrtkirk added 2 commits August 31, 2023 05:48

make IS_FINER_THAN final

5332a95

fix asList

f85a732

clintropolis approved these changes Sep 6, 2023

asdf2014 added the Area - Querying label Sep 8, 2023

clintropolis merged commit 5d16d0e into apache:master Sep 12, 2023
74 checks passed

LakshSingla added this to the 28.0 milestone Oct 12, 2023

kgyrtkirk mentioned this pull request Oct 23, 2023

Fix summary row issues in case postaggregations are happening #15232

Merged
