Support for ARG_MIN and ARG_MAX Functions #10636

jasperjiaguo · 2023-04-18T17:45:26Z

This PR adds the ArgMin/ArgMax aggregation functions.

It also adds the prerequisite code for ArgMin/ArgMax query rewriting and query-result rewriting.
ArgMin/ArgMax Function

Syntax:

SELECT ArgMin(measuringCol1, measuringCol2, measuringCol3, projectionCol1), ArgMin(measuringCol1, measuringCol2, measuringCol3, projectionCol2) FROM table

Both functions order rows lexicographically by the tuple <measuringCol1, measuringCol2, measuringCol3>. The first projects projectionCol1 and the second projects projectionCol2, for every row that attains the minimum of that tuple.

E.g. for input data

floatCol	intCol	stringCol	doubleCol	longCol
1.0	1	"a2"	2.0	2
2.5	1	"a11"	3.0	2
5.0	2	"a2"	4.0	1

Query 1

SELECT 
argmin(intCol, stringCol, floatCol), 
argmin(intCol, stringCol, intCol) , 
argmin(intCol, stringCol, stringCol), 
argmin(intCol, stringCol, doubleCol)  
FROM table

Result table:

argmin(intCol, stringCol, floatCol)	argmin(intCol, stringCol, intCol)	argmin(intCol, stringCol, stringCol)	argmin(intCol, stringCol, doubleCol)
2.5	1	"a11"	3.0
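The semantics described so far can be sketched in a few lines of Java. This is only an illustration of lexicographic argmin with ties kept, not the PR's implementation: `ArgMinSketch`, `compareKeys` and `argMin` are hypothetical names, and the data is the three-row sample table above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical names, not classes from this PR.
final class ArgMinSketch {

  // Compares two measuring keys column by column (lexicographic order).
  @SuppressWarnings({"unchecked", "rawtypes"})
  static int compareKeys(List<Comparable> a, List<Comparable> b) {
    for (int i = 0; i < a.size(); i++) {
      int c = a.get(i).compareTo(b.get(i));
      if (c != 0) {
        return c;
      }
    }
    return 0;
  }

  // Returns the projection value of every row whose key equals the minimum key.
  @SuppressWarnings("rawtypes")
  static List<Object> argMin(List<List<Comparable>> keys, List<Object> projection) {
    List<Object> result = new ArrayList<>();
    List<Comparable> best = null;
    for (int row = 0; row < keys.size(); row++) {
      int c = best == null ? -1 : compareKeys(keys.get(row), best);
      if (c < 0) {
        // Strictly smaller key: discard the rows collected so far.
        best = keys.get(row);
        result.clear();
      }
      if (c <= 0) {
        // Equal to the current minimum (or the new minimum): keep this row.
        result.add(projection.get(row));
      }
    }
    return result;
  }

  @SuppressWarnings("rawtypes")
  public static void main(String[] args) {
    // Measuring key <intCol, stringCol> for the three sample rows.
    List<List<Comparable>> keys = Arrays.asList(
        Arrays.<Comparable>asList(1, "a2"),
        Arrays.<Comparable>asList(1, "a11"),
        Arrays.<Comparable>asList(2, "a2"));
    List<Object> floatCol = Arrays.<Object>asList(1.0f, 2.5f, 5.0f);
    System.out.println(argMin(keys, floatCol)); // argmin(intCol, stringCol, floatCol)
  }
}
```

With the key <intCol, stringCol> the only minimal row is (1, "a11"), because "a11" sorts before "a2" as a string, so the sketch returns the single value 2.5, matching Query 1. With the key <intCol> alone, the first two rows tie and both projected values are returned.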

Query 2

SELECT 
argmin(intCol, stringCol),
argmin(intCol, doubleCol),
sum(doubleCol)  
FROM table

Result table

argmin(intCol, stringCol)	argmin(intCol, doubleCol)	sum(doubleCol)
"a2"	2.0	9.0
"a11"*	3.0	null**

Note
* Without dedup, every row that shares the same extremum key is output.
** A regular aggregation function still outputs a single value, in the first row; the remaining rows are filled with null. The same applies when two different argmin/argmax functions produce different numbers of projection rows.

Query 3

SELECT 
intCol, 
argmin(longCol, doubleCol),
argmin(longCol, longCol)
FROM table 
GROUP BY intCol

Result table

intCol	argmin(longCol, doubleCol)	argmin(longCol, longCol)
1	2.0	2
1*	3.0	2
2	4.0	1

Note
* The group-by key is filled in on rows whose group id is the same as the previous row.

Notes:

This implementation does not work with an AS clause (e.g. SELECT argmin(longCol, doubleCol) AS argmin won't work), because the result rewriter depends on the returned column name.
Putting an argmin/argmax column in the ORDER BY clause (e.g. SELECT intCol, argmin(longCol, doubleCol) FROM table GROUP BY intCol ORDER BY argmin(longCol, doubleCol)) is not supported, because ordering by a multi-column, multi-row argmin(longCol, doubleCol) result has no well-defined meaning.
Projecting a multi-value (MV) bytes column currently doesn't work, because DataBlock cannot serialize it correctly.

codecov-commenter · 2023-04-18T22:10:45Z

Codecov Report

Merging #10636 (b8a15ad) into master (53cb451) will increase coverage by 0.15%.
The diff coverage is 86.31%.

@@             Coverage Diff              @@
##             master   #10636      +/-   ##
============================================
+ Coverage     70.28%   70.43%   +0.15%     
- Complexity     6430     6462      +32     
============================================
  Files          2112     2140      +28     
  Lines        113994   115088    +1094     
  Branches      17219    17348     +129     
============================================
+ Hits          80121    81063     +942     
- Misses        28275    28384     +109     
- Partials       5598     5641      +43

Flag	Coverage Δ
integration1	`24.20% <5.30%> (-0.12%)`	⬇️
integration2	`23.92% <5.86%> (-0.18%)`	⬇️
unittests1	`68.04% <85.71%> (+0.20%)`	⬆️
unittests2	`13.74% <0.27%> (-0.08%)`	⬇️


Impacted Files	Coverage Δ
...va/org/apache/pinot/spi/utils/CommonConstants.java	`26.74% <0.00%> (-0.32%)`	⬇️
...regation/groupby/DummyAggregationResultHolder.java	`14.28% <14.28%> (ø)`
.../aggregation/groupby/DummyGroupByResultHolder.java	`25.00% <25.00%> (ø)`
...org/apache/pinot/core/common/ObjectSerDeUtils.java	`89.92% <53.33%> (-1.46%)`	⬇️
...ls/argminmax/ArgMinMaxProjectionValSetWrapper.java	`60.00% <60.00%> (ø)`
...ils/argminmax/ArgMinMaxMeasuringValSetWrapper.java	`66.66% <66.66%> (ø)`
...gation/utils/argminmax/ArgMinMaxWrapperValSet.java	`72.22% <72.22%> (ø)`
...y/aggregation/utils/argminmax/ArgMinMaxObject.java	`81.48% <81.48%> (ø)`
...re/query/utils/rewriter/ResultRewriterFactory.java	`84.21% <84.21%> (ø)`
...n/function/ParentArgMinMaxAggregationFunction.java	`87.55% <87.55%> (ø)`
... and 13 more

... and 131 files with indirect coverage changes


somandal

overall looks good, left mostly minor comments

...src/main/java/org/apache/pinot/core/query/aggregation/function/ChildAggregationFunction.java

pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java

...e/src/main/java/org/apache/pinot/core/query/aggregation/utils/argminmax/ArgMinMaxObject.java

pinot-core/src/test/java/org/apache/pinot/queries/ArgMinMaxTest.java

...ava/org/apache/pinot/core/query/aggregation/function/ParentArgMinMaxAggregationFunction.java

...ain/java/org/apache/pinot/core/query/aggregation/utils/argminmax/ArgMinMaxWrapperValSet.java

...ava/org/apache/pinot/core/query/aggregation/function/ParentArgMinMaxAggregationFunction.java

...src/main/java/org/apache/pinot/core/query/aggregation/function/ChildAggregationFunction.java

...rc/main/java/org/apache/pinot/core/query/aggregation/function/ParentAggregationFunction.java

pinot-core/src/test/java/org/apache/pinot/queries/ArgMinMaxTest.java

jasperjiaguo · 2023-05-04T01:38:30Z

@somandal Addressed the code-related comments; I will add more test cases tomorrow.

siddharthteotia · 2023-05-08T00:32:34Z

...c/main/java/org/apache/pinot/core/query/aggregation/function/AggregationFunctionFactory.java

+          case ARGMAX:
+          case ARGMIN:
+            throw new IllegalArgumentException("Aggregation function: " + function
+                + " is only supported in selection without alias.");


Not sure I am following this exception. Why do we need to throw this exception for ARG_MIN and ARG_MAX ?

This is for argmin/argmax calls that were not rewritten (the invalid ones), i.e. calls that are not in the selection list, or that are in the selection list but used with an alias.

siddharthteotia · 2023-05-08T00:34:59Z

...c/main/java/org/apache/pinot/core/query/aggregation/function/AggregationFunctionFactory.java

+            return new ParentArgMinMaxAggregationFunction(arguments, false);
+          case PINOTCHILDAGGREGATIONARGMAX:
+            return new ChildArgMinMaxAggregationFunction(arguments, true);
+          case PINOTCHILDAGGREGATIONARGMIN:


I vaguely remember that we hit this in some recent multi-stage work as well. Aggregation functions that users will never write in SQL also need to be exposed here, and ideally they shouldn't be. I think it happened for the 3rd / 4th moment / reduce functions.

Is it possible to only add ARG_MIN and ARG_MAX (the user level AggregationFunctions) in this interface ?

@siddharthteotia in that case we would need to use one specific argument in the argument list to denote if the function is parent or children and the factory here would need to look into the argument details, which IMO is not very clean.

siddharthteotia · 2023-05-08T01:07:21Z

pinot-core/src/test/java/org/apache/pinot/queries/ArgMinMaxTest.java

+
+    // Test transformation function inside argmax/argmin, for both projection and measuring
+    // the max of 3000x-x^2 is 2250000, which is the max of 3000x-x^2
+    query = "SELECT sum(intColumn), argmax(3000 * doubleColumn - intColumn * intColumn, doubleColumn),"


So essentially the first set of arguments for lex ordering (for min or max) can be a mix of identifier (column) or a transform (scalar or non-scalar). Correct ?

Is the same true for projectionColumn ? Can we project the transform instead of identifier ? Do we have tests for that ?

> So essentially the first set of arguments for lex ordering (for min or max) can be a mix of identifier (column) or a transform (scalar or non-scalar). Correct ?

Yes.

There are tests for projecting transformed columns in the test case:

query = "SELECT sum(intColumn), argmax(3000 * doubleColumn - intColumn * intColumn, doubleColumn),"
    + "argmax(3000 * doubleColumn - intColumn * intColumn, 3000 * doubleColumn - intColumn * intColumn),"
    + "argmax(3000 * doubleColumn - intColumn * intColumn, doubleColumn), "
    + "argmin(replace(stringColumn, \'a\', \'bb\'), replace(stringColumn, \'a\', \'bb\'))"
    + "FROM testTable";

siddharthteotia · 2023-05-08T01:11:51Z

pinot-core/src/test/java/org/apache/pinot/queries/ArgMinMaxTest.java

+
+    // TODO: The following query throws an exception,
+    //       requires fix for multi-value bytes column serialization in DataBlock
+    query = "SELECT arg_min(intColumn, mvBytesColumn) FROM testTable";


Can we also add other failure scenario tests ? For example, invalid number or types of arguments in the arg_min or arg_max function if not already added ?

Good point, added

siddharthteotia · 2023-05-08T01:17:36Z

pinot-core/src/test/java/org/apache/pinot/queries/ArgMinMaxTest.java

+    assertEquals(rows.get(3)[0], 1200);
+
+    // test1, with dedupe
+    query = "SELECT  "


Shall we have a query option or system option to bypass the default behavior and instead return just one of the rows for all duplicate / matching rows where min / max is happening ?

We can maybe do this as a follow-up. It will require stable sorting/dedup of the projection results, which would be better done in a separate PR.

siddharthteotia · 2023-05-08T01:18:52Z

pinot-core/src/test/java/org/apache/pinot/queries/ResultRewriterRegressionTest.java

+/**
+ * Regression test for queries with result rewriter.
+ */
+public class ResultRewriterRegressionTest {


What are we testing here? Also, I'm curious why this is suffixed with RegressionTest.

These test cases show that existing aggregation functions are not impacted by the new pluggable result rewriter.

siddharthteotia · 2023-05-08T01:34:42Z

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/AggregationFunctionType.java

+  PINOTPARENTAGGREGATIONARGMIN(CommonConstants.RewriterConstants.PARENT_AGGREGATION_NAME_PREFIX + ARGMIN.getName()),
+  PINOTPARENTAGGREGATIONARGMAX(CommonConstants.RewriterConstants.PARENT_AGGREGATION_NAME_PREFIX + ARGMAX.getName()),
+  PINOTCHILDAGGREGATIONARGMIN(CommonConstants.RewriterConstants.CHILD_AGGREGATION_NAME_PREFIX + ARGMIN.getName()),
+  PINOTCHILDAGGREGATIONARGMAX(CommonConstants.RewriterConstants.CHILD_AGGREGATION_NAME_PREFIX + ARGMAX.getName());


Can 103-106 be encapsulated within the rewriter itself as opposed to exposing them in AggregationFunctionType.java ?

I feel we should only have user exposed aggregation functions which can be used in SQL in this file ?

> I feel we should only have user exposed aggregation functions which can be used in SQL in this file ?

Do we have a specific reason for doing this?

Main reason is clean interface. AggregationFunctionType is for user exposed in-built functions ideally.

So as a follow-up we should try to see how we can do this cleanly in future otherwise this file will end up having mix of things imo.

siddharthteotia · 2023-05-08T01:35:49Z

pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java

@@ -972,4 +974,11 @@ public static class Range {
  public static class IdealState {
    public static final String HYBRID_TABLE_TIME_BOUNDARY = "HYBRID_TABLE_TIME_BOUNDARY";
  }
+
+  public static class RewriterConstants {


I wonder why should this be defined in CommonConstants where we typically define Broker or Server instance level config constants ? Can this be taken inside rewriter as public constants ?

Putting this in the rewriter would make it inaccessible from the SPI.

I see. Hmm we should try to fix this in follow-ups

pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java

pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/AggregationFunctionType.java

siddharthteotia · 2023-05-08T01:38:47Z

pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java

@@ -303,6 +303,8 @@ public static class Broker {
        "pinot.broker.instance.enableThreadAllocatedBytesMeasurement";
    public static final boolean DEFAULT_ENABLE_THREAD_CPU_TIME_MEASUREMENT = false;
    public static final boolean DEFAULT_THREAD_ALLOCATED_BYTES_MEASUREMENT = false;
+    public static final String CONFIG_OF_BROKER_RESULT_REWRITER_CLASS_NAMES
+        = "pinot.broker.result.rewriter.class.names";


I thought we already had rewriter configuration ? Can we reuse the same config ?

That is for query rewriter, we should be using a different rewriter class for results right?

Sounds good. Thanks.

siddharthteotia · 2023-05-08T01:39:20Z

pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java

+    public static final String PARENT_AGGREGATION_NAME_PREFIX = "pinotparentaggregation";
+    public static final String CHILD_AGGREGATION_NAME_PREFIX = "pinotchildaggregation";
+    public static final String CHILD_AGGREGATION_SEPERATOR = "@";
+    public static final String CHILD_KEY_SEPERATOR = "_";


Can you elaborate on purpose of CHILD_AGGREGATION_SEPERATOR and CHILD_KEY_SEPERATOR ?

/**
 * The name of the column as follows:
 * CHILD_AGGREGATION_NAME_PREFIX + actual function type + operands + CHILD_AGGREGATION_SEPERATOR
 * + actual function type + parent aggregation function id + CHILD_KEY_SEPERATOR + column key in parent function
 * e.g. if the child aggregation function is "argmax(0,a,b,x)", the name of the column is
 * "pinotchildaggregationargmax(a,b,x)@argmax0_x"
 */

This makes it easy to associate a child aggregation function with its parent and to extract the result by key.
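A minimal sketch of that naming scheme, assuming only what the Javadoc states. The three constants are copied from the diff above; the class `ChildColumnNameSketch` and both helper methods are hypothetical and are not taken from the PR.

```java
// Hypothetical helper; only the constant values are quoted from the diff.
final class ChildColumnNameSketch {
  static final String CHILD_AGGREGATION_NAME_PREFIX = "pinotchildaggregation";
  static final String CHILD_AGGREGATION_SEPERATOR = "@";
  static final String CHILD_KEY_SEPERATOR = "_";

  // Builds prefix + type + (operands) + "@" + type + parentId + "_" + columnKey.
  static String childColumnName(String type, String operands, int parentId, String columnKey) {
    return CHILD_AGGREGATION_NAME_PREFIX + type + "(" + operands + ")"
        + CHILD_AGGREGATION_SEPERATOR + type + parentId
        + CHILD_KEY_SEPERATOR + columnKey;
  }

  // Splits the part after the first "@" into {parent function + id, column key}.
  // Assumption: the operands themselves contain no "@" character.
  static String[] parentAndKey(String columnName) {
    String suffix = columnName.substring(columnName.indexOf(CHILD_AGGREGATION_SEPERATOR) + 1);
    int split = suffix.indexOf(CHILD_KEY_SEPERATOR);
    return new String[] {suffix.substring(0, split), suffix.substring(split + 1)};
  }

  public static void main(String[] args) {
    String name = childColumnName("argmax", "a,b,x", 0, "x");
    System.out.println(name);
    String[] parts = parentAndKey(name);
    System.out.println(parts[0] + " / " + parts[1]);
  }
}
```

For the Javadoc's example, `childColumnName("argmax", "a,b,x", 0, "x")` yields `pinotchildaggregationargmax(a,b,x)@argmax0_x`, and splitting it back gives the parent identifier `argmax0` and the column key `x`. Splitting at the first `_` after `@` keeps column keys that themselves contain underscores intact.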

siddharthteotia · 2023-05-08T01:44:39Z

...c/main/java/org/apache/pinot/core/query/aggregation/function/AggregationFunctionFactory.java

@@ -325,6 +325,18 @@ public static AggregationFunction getAggregationFunction(FunctionContext functio
            return new FourthMomentAggregationFunction(firstArgument, FourthMomentAggregationFunction.Type.KURTOSIS);
          case FOURTHMOMENT:
            return new FourthMomentAggregationFunction(firstArgument, FourthMomentAggregationFunction.Type.MOMENT);
+          case PINOTPARENTAGGREGATIONARGMAX:


Remove PINOT prefix and AGGREGATION as well ?

PARENTARGMAX

PARENTARGMIN

jasperjiaguo · 2023-05-08T02:15:32Z

Thinking more on my previous comment.....

Maybe one way to work around the NULL business is to output an array when duplicate rows share the min or max?

This query
SELECT 
argmin(intCol, stringCol),
argmin(intCol, doubleCol),
sum(doubleCol)  
FROM table
can output

argmin(intCol, stringCol) argmin(intCol, doubleCol) sum(doubleCol)
["a2", "a11"] [2.0, 3.0] 9.0
Similarly, the following query
SELECT 
intCol, 
argmin(longCol, doubleCol),
argmin(longCol, longCol)
FROM table 
GROUP BY intCol
Can output

intCol argmin(longCol, doubleCol) argmin(longCol, longCol)
1 [2.0, 3.0] 2
2 4.0 1
This is probably a more intuitive way to reason about response and is more SQL friendly imo and avoids populating NULLs.

@jasperjiaguo wdyt ?

Agreed that null filling can be confusing for group ids. I have made a change for the group id value filling and it now behaves like:

SELECT 
intCol, 
argmin(longCol, doubleCol),
argmin(longCol, longCol)
FROM table 
GROUP BY intCol

intCol	argmin(longCol, doubleCol)	argmin(longCol, longCol)
1	2.0	2
1	3.0	2
2	4.0	1

SELECT
argmin(intCol, stringCol),
argmin(intCol, doubleCol),
sum(doubleCol)
FROM table

argmin(intCol, stringCol)	argmin(intCol, doubleCol)	sum(doubleCol)
"a2"	2.0	9.0
"a11"	3.0	9.0

which is essentially a flattened view of

intCol	argmin(longCol, doubleCol)	argmin(longCol, longCol)
1	[2.0, 3.0]	2
2	4.0	1

and

argmin(intCol, stringCol)	argmin(intCol, doubleCol)	sum(doubleCol)
["a2", "a11"]	[2.0, 3.0]	9.0

respectively
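The relationship between the array view and the flattened view can be sketched as follows. This is an illustration rather than the PR's result rewriter: `FlattenSketch` and `flatten` are hypothetical names, and padding a shorter multi-row cell with null is an assumption carried over from the note in the PR description, not something this comment confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical names, not classes from this PR.
final class FlattenSketch {

  // Expands one logical result row, whose cells are lists, into output rows.
  // A single-element cell (e.g. sum or a group-by key) is repeated on every row.
  static List<List<Object>> flatten(List<List<Object>> cells) {
    int height = 1;
    for (List<Object> cell : cells) {
      height = Math.max(height, cell.size());
    }
    List<List<Object>> rows = new ArrayList<>();
    for (int r = 0; r < height; r++) {
      List<Object> row = new ArrayList<>();
      for (List<Object> cell : cells) {
        if (cell.size() == 1) {
          row.add(cell.get(0));
        } else {
          // Assumption: a shorter multi-row cell is padded with null.
          row.add(r < cell.size() ? cell.get(r) : null);
        }
      }
      rows.add(row);
    }
    return rows;
  }

  public static void main(String[] args) {
    // Array view of: argmin(intCol, stringCol), argmin(intCol, doubleCol), sum(doubleCol)
    List<List<Object>> cells = Arrays.asList(
        Arrays.<Object>asList("a2", "a11"),
        Arrays.<Object>asList(2.0, 3.0),
        Arrays.<Object>asList(9.0));
    for (List<Object> row : flatten(cells)) {
      System.out.println(row);
    }
  }
}
```

Applied to the array view `["a2", "a11"]`, `[2.0, 3.0]`, `9.0`, the sketch produces the two flattened rows shown above, with `9.0` repeated on the second row instead of null.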

Meanwhile, I also considered returning the multiple output rows as arrays. There are a few reasons I didn't use that approach:

It wouldn't work for all MV types, as we currently don't have something like ARRAY[ARRAY[INT]] for returned results.
A flattened result is easier for the user to parse, as the user side does not need to flatten and align the columns on its own when projecting multiple columns.
The flattened view keeps the output column type the same as the data column type, which I feel is cleaner.

> IMO we should allow other aggregation functions with argmin and argmax.

+1 on this, if we have a well-defined output scheme then the user should have the power to run 1 query instead of 2

cc @siddharthteotia @somandal

siddharthteotia · 2023-05-08T04:35:19Z

When will we run into the problem of ARRAY[ARRAY[INT]] ?

jasperjiaguo · 2023-05-08T05:05:11Z

> When will we run into the problem of ARRAY[ARRAY[INT]] ?

When we are projecting multiple rows of an INT MV column


siddharthteotia · 2023-05-09T07:15:24Z

...src/main/java/org/apache/pinot/core/query/aggregation/function/ChildAggregationFunction.java

+  @Override
+  public final String getResultColumnName() {
+    String type = getType().getName().toLowerCase();
+    return CommonConstants.RewriterConstants.CHILD_AGGREGATION_NAME_PREFIX


It may be better to use StringBuilder in general, but since this function is called once per query, it should be fine for now.

siddharthteotia · 2023-05-09T07:31:54Z

I have some suggestions / questions on simplifying the implementation a bit. But don't want to hold this. Let's discuss them sometime soon.

siddharthteotia · 2023-05-09T08:07:34Z

@jasperjiaguo please add user docs soon.

jasperjiaguo force-pushed the arg_min_max branch 5 times, most recently from 684df2d to c56153a Compare April 18, 2023 21:30

jasperjiaguo force-pushed the arg_min_max branch 2 times, most recently from 9058adc to 8414629 Compare April 26, 2023 21:38

jasperjiaguo force-pushed the arg_min_max branch 10 times, most recently from 5698824 to 9546e05 Compare May 2, 2023 19:57

jasperjiaguo marked this pull request as ready for review May 2, 2023 19:59

Add ArgMinMax aggregation function

b4b6471

jasperjiaguo force-pushed the arg_min_max branch from 927d265 to b4b6471 Compare May 3, 2023 17:47

somandal reviewed May 3, 2023


Add more test cases, bug fix

6937cad

somandal reviewed May 4, 2023


jasperjiaguo added 2 commits May 3, 2023 18:24

Address refactoring/doc related comments

48e6b81

Address doc related comments

e63ab02

Added more test cases

b447e80

jasperjiaguo force-pushed the arg_min_max branch from f4ac6b7 to b447e80 Compare May 4, 2023 23:24

Work around for DataBlock not able to ser/de empty array

396b295

siddharthteotia reviewed May 8, 2023


siddharthteotia changed the title ~~Adding ArgMin/ArgMax Function~~ Support for ARG_MIN and ARG_MAX Functions May 8, 2023

siddharthteotia reviewed May 8, 2023


siddharthteotia reviewed May 8, 2023


siddharthteotia reviewed May 8, 2023


Use another value filling scheme for multi-row result

6e9431b

siddharthteotia added the feature label May 8, 2023

siddharthteotia reviewed May 8, 2023


Removing Pinot prefix from parent and child aggregation function names

a93e655

jasperjiaguo added 3 commits May 7, 2023 22:05

Trigger Test

07e9840

Removing aggregation prefix from parent and child aggregation functio…

0ad010b

…n names. Add more test cases. Refine error message.

Trigger Test

b8a15ad

siddharthteotia reviewed May 9, 2023


siddharthteotia approved these changes May 9, 2023


siddharthteotia merged commit 7a673fd into apache:master May 9, 2023

walterddr mentioned this pull request Aug 2, 2023

support EXPR_MIN/EXPR_MAX #11254

Open
