Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for ARG_MIN and ARG_MAX Functions #10636

Merged
merged 14 commits into from
May 9, 2023

Conversation

jasperjiaguo
Copy link
Contributor

@jasperjiaguo jasperjiaguo commented Apr 18, 2023

This PR adds ArgMin/ArgMax function

  • Added the prerequisite code for ArgMin/ArgMax query rewriting and query result rewriting
  • ArgMin/ArgMax Function

Syntax:

SELECT ArgMin(measuringCol1, measuringCol2, measuringCol3, projectionCol1), ArgMin(measuringCol1, measuringCol2, measuringCol3, projectionCol2) FROM table

These two functions do lexicographical ordering on <measuringCol1, measuringCol2, measuringCol3>, and project projectionCol1 and projectionCol2 for all appearances of minimum <measuringCol1, measuringCol2, measuringCol3>

E.g. for input data

floatCol intCol stringCol doubleCol longCol
1.0 1 "a2" 2.0 2
2.5 1 "a11" 3.0 2
5.0 2 "a2" 4.0 1

Query 1

SELECT 
argmin(intCol, stringCol, floatCol), 
argmin(intCol, stringCol, intCol) , 
argmin(intCol, stringCol, stringCol), 
argmin(intCol, stringCol, doubleCol)  
FROM table

Result table:

argmin(intCol, stringCol, floatCol) argmin(intCol, stringCol, intCol) argmin(intCol, stringCol, stringCol) argmin(intCol, stringCol, doubleCol)
2.5 1 "a11" 3.0

Query 2

SELECT 
argmin(intCol, **stringCol**),  
argmin(intCol, **doubleCol**), 
sum(doubleCol)  
FROM table

Result table

argmin(intCol, stringCol) argmin(intCol, doubleCol) sum(doubleCol)
"a2" 2.0 9.0
"a11"* 3.0 null**

Note
* Without dedup all the rows with the same extremum key will be output
** Regular aggregation functions will still output 1 field in the first row, the other rows will be filled by null, same applies when two different argmin/max functions has different number of projection rows

Query 3

SELECT 
intCol, 
argmin(longCol, **doubleCol**),  
argmin(longCol, **longCol**)   
FROM table 
GROUP BY intCol

Result table

intCol argmin(longCol, doubleCol) argmin(longCol, longCol)
1 2.0 2
1* 3.0 2
2 4.0 1

Note
* note that we fill the fields where the group id is the same as the previous row

Notes:

  • This impl does not work with AS clause (e.g. SELECT argmin(longCol, doubleCol) AS argmin won't work), because we depend on the return column name to rewrite the query result.
  • Putting argmin/max column inside order by clause (e.g. SELECT intCol, argmin(longCol, doubleCol) FROM table GROUP BY intCol ORDER BY argmin(longCol, doubleCol)) is not supported as semantically ordering multi-column multi-row argmin(longCol, doubleCol) results doesn't make sense
  • Currently projecting MV bytes column doesn't work because DataBlock is not able to serialize it correctly

@jasperjiaguo jasperjiaguo force-pushed the arg_min_max branch 5 times, most recently from 684df2d to c56153a Compare April 18, 2023 21:30
@codecov-commenter
Copy link

codecov-commenter commented Apr 18, 2023

Codecov Report

Merging #10636 (b8a15ad) into master (53cb451) will increase coverage by 0.15%.
The diff coverage is 86.31%.

@@             Coverage Diff              @@
##             master   #10636      +/-   ##
============================================
+ Coverage     70.28%   70.43%   +0.15%     
- Complexity     6430     6462      +32     
============================================
  Files          2112     2140      +28     
  Lines        113994   115088    +1094     
  Branches      17219    17348     +129     
============================================
+ Hits          80121    81063     +942     
- Misses        28275    28384     +109     
- Partials       5598     5641      +43     
Flag Coverage Δ
integration1 24.20% <5.30%> (-0.12%) ⬇️
integration2 23.92% <5.86%> (-0.18%) ⬇️
unittests1 68.04% <85.71%> (+0.20%) ⬆️
unittests2 13.74% <0.27%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...va/org/apache/pinot/spi/utils/CommonConstants.java 26.74% <0.00%> (-0.32%) ⬇️
...regation/groupby/DummyAggregationResultHolder.java 14.28% <14.28%> (ø)
.../aggregation/groupby/DummyGroupByResultHolder.java 25.00% <25.00%> (ø)
...org/apache/pinot/core/common/ObjectSerDeUtils.java 89.92% <53.33%> (-1.46%) ⬇️
...ls/argminmax/ArgMinMaxProjectionValSetWrapper.java 60.00% <60.00%> (ø)
...ils/argminmax/ArgMinMaxMeasuringValSetWrapper.java 66.66% <66.66%> (ø)
...gation/utils/argminmax/ArgMinMaxWrapperValSet.java 72.22% <72.22%> (ø)
...y/aggregation/utils/argminmax/ArgMinMaxObject.java 81.48% <81.48%> (ø)
...re/query/utils/rewriter/ResultRewriterFactory.java 84.21% <84.21%> (ø)
...n/function/ParentArgMinMaxAggregationFunction.java 87.55% <87.55%> (ø)
... and 13 more

... and 131 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jasperjiaguo jasperjiaguo force-pushed the arg_min_max branch 2 times, most recently from 9058adc to 8414629 Compare April 26, 2023 21:38
@jasperjiaguo jasperjiaguo force-pushed the arg_min_max branch 10 times, most recently from 5698824 to 9546e05 Compare May 2, 2023 19:57
@jasperjiaguo jasperjiaguo marked this pull request as ready for review May 2, 2023 19:59
Copy link
Contributor

@somandal somandal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good, left mostly minor comments

@jasperjiaguo
Copy link
Contributor Author

@somandal Address the code related comments, will add more test cases tomorrow.

case ARGMAX:
case ARGMIN:
throw new IllegalArgumentException("Aggregation function: " + function
+ " is only supported in selection without alias.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I am following this exception. Why do we need to throw this exception for ARG_MIN and ARG_MAX ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the argmin max that's not rewritten (invalid ones), i.e. the one not in selection or in selection but used with alias

return new ParentArgMinMaxAggregationFunction(arguments, false);
case PINOTCHILDAGGREGATIONARGMAX:
return new ChildArgMinMaxAggregationFunction(arguments, true);
case PINOTCHILDAGGREGATIONARGMIN:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vaguely remember but we hit this in some recent work as part of multi stage as well. Aggregation functions that are not going to be used by the user in SQL also need to be exposed here and ideally they shouldn't. I think it happened for the 3rd / 4th moment / reduce functions

Is it possible to only add ARG_MIN and ARG_MAX (the user level AggregationFunctions) in this interface ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddharthteotia in that case we would need to use one specific argument in the argument list to denote if the function is parent or children and the factory here would need to look into the argument details, which IMO is not very clean.


// Test transformation function inside argmax/argmin, for both projection and measuring
// the max of 3000x-x^2 is 2250000, which is the max of 3000x-x^2
query = "SELECT sum(intColumn), argmax(3000 * doubleColumn - intColumn * intColumn, doubleColumn),"
Copy link
Contributor

@siddharthteotia siddharthteotia May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So essentially the first set of arguments for lex ordering (for min or max) can be a mix of identifier (column) or a transform (scalar or non-scalar). Correct ?

Is the same true for projectionColumn ? Can we project the transform instead of identifier ? Do we have tests for that ?

Copy link
Contributor Author

@jasperjiaguo jasperjiaguo May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So essentially the first set of arguments for lex ordering (for min or max) can be a mix of identifier (column) or a transform (scalar or non-scalar). Correct ?

Yes

there are tests for projection of transformed cols in the testcase

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    query = "SELECT sum(intColumn), argmax(3000 * doubleColumn - intColumn * intColumn, doubleColumn),"
        + "argmax(3000 * doubleColumn - intColumn * intColumn, 3000 * doubleColumn - intColumn * intColumn),"
        + "argmax(3000 * doubleColumn - intColumn * intColumn, doubleColumn), "
        + "argmin(replace(stringColumn, \'a\', \'bb\'), replace(stringColumn, \'a\', \'bb\'))"
        + "FROM testTable";


// TODO: The following query throws an exception,
// requires fix for multi-value bytes column serialization in DataBlock
query = "SELECT arg_min(intColumn, mvBytesColumn) FROM testTable";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add other failure scenario tests ? For example, invalid number or types of arguments in the arg_min or arg_max function if not already added ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, added

assertEquals(rows.get(3)[0], 1200);

// test1, with dedupe
query = "SELECT "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we have a query option or system option to bypass the default behavior and instead return just one of the rows for all duplicate / matching rows where min / max is happening ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can maybe do this as a follow up. It will require the stability sorting/dedup of projection results, as would be better to do in a seperate PR.

/**
* Regression test for queries with result rewriter.
*/
public class ResultRewriterRegressionTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we testing here and curious why is this suffixed with RegressionTest ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the test cases to show existing aggregation function should not be impacted by the new pluggable result rewriter.

@siddharthteotia siddharthteotia changed the title Adding ArgMin/ArgMax Function Support for ARG_MIN and ARG_MAX Functions May 8, 2023
PINOTPARENTAGGREGATIONARGMIN(CommonConstants.RewriterConstants.PARENT_AGGREGATION_NAME_PREFIX + ARGMIN.getName()),
PINOTPARENTAGGREGATIONARGMAX(CommonConstants.RewriterConstants.PARENT_AGGREGATION_NAME_PREFIX + ARGMAX.getName()),
PINOTCHILDAGGREGATIONARGMIN(CommonConstants.RewriterConstants.CHILD_AGGREGATION_NAME_PREFIX + ARGMIN.getName()),
PINOTCHILDAGGREGATIONARGMAX(CommonConstants.RewriterConstants.CHILD_AGGREGATION_NAME_PREFIX + ARGMAX.getName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can 103-106 be encapsulated within the rewriter itself as opposed to exposing them in AggregationFunctionType.java ?

I feel we should only have user exposed aggregation functions which can be used in SQL in this file ?

Copy link
Contributor Author

@jasperjiaguo jasperjiaguo May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we should only have user exposed aggregation functions which can be used in SQL in this file ?

Do we have a specific reason for doing this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main reason is clean interface. AggregationFunctionType is for user exposed in-built functions ideally.

So as a follow-up we should try to see how we can do this cleanly in future otherwise this file will end up having mix of things imo.

@@ -972,4 +974,11 @@ public static class Range {
public static class IdealState {
public static final String HYBRID_TABLE_TIME_BOUNDARY = "HYBRID_TABLE_TIME_BOUNDARY";
}

public static class RewriterConstants {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why should this be defined in CommonConstants where we typically define Broker or Server instance level config constants ? Can this be taken inside rewriter as public constants ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

putting this in rewriter will make it not accessible from SPI

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Hmm we should try to fix this in follow-ups

@@ -303,6 +303,8 @@ public static class Broker {
"pinot.broker.instance.enableThreadAllocatedBytesMeasurement";
public static final boolean DEFAULT_ENABLE_THREAD_CPU_TIME_MEASUREMENT = false;
public static final boolean DEFAULT_THREAD_ALLOCATED_BYTES_MEASUREMENT = false;
public static final String CONFIG_OF_BROKER_RESULT_REWRITER_CLASS_NAMES
= "pinot.broker.result.rewriter.class.names";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we already had rewriter configuration ? Can we reuse the same config ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is for query rewriter, we should be using a different rewriter class for results right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Thanks.

public static final String PARENT_AGGREGATION_NAME_PREFIX = "pinotparentaggregation";
public static final String CHILD_AGGREGATION_NAME_PREFIX = "pinotchildaggregation";
public static final String CHILD_AGGREGATION_SEPERATOR = "@";
public static final String CHILD_KEY_SEPERATOR = "_";
Copy link
Contributor

@siddharthteotia siddharthteotia May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on purpose of CHILD_AGGREGATION_SEPERATOR and CHILD_KEY_SEPERATOR ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 /**
   * The name of the column as follows:
   * CHILD_AGGREGATION_NAME_PREFIX + actual function type + operands + CHILD_AGGREGATION_SEPERATOR
   * + actual function type + parent aggregation function id + CHILD_KEY_SEPERATOR + column key in parent function
   * e.g. if the child aggregation function is "argmax(0,a,b,x)", the name of the column is
   * "pinotchildaggregationargmax(a,b,x)@argmax0_x"
   */

To easily associate the child aggregation function with it's parents and extract the result with key.

@@ -325,6 +325,18 @@ public static AggregationFunction getAggregationFunction(FunctionContext functio
return new FourthMomentAggregationFunction(firstArgument, FourthMomentAggregationFunction.Type.KURTOSIS);
case FOURTHMOMENT:
return new FourthMomentAggregationFunction(firstArgument, FourthMomentAggregationFunction.Type.MOMENT);
case PINOTPARENTAGGREGATIONARGMAX:
Copy link
Contributor

@siddharthteotia siddharthteotia May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove PINOT prefix and AGGREGATION as well ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • PARENTARGMAX
  • PARENTARGMIN

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@jasperjiaguo
Copy link
Contributor Author

jasperjiaguo commented May 8, 2023

Thinking more on my previous comment.....

May be one way to workaround the NULL business is to output array when we have duplicates where the min and max is happening ?

This query

SELECT 
argmin(intCol, **stringCol**),  
argmin(intCol, **doubleCol**), 
sum(doubleCol)  
FROM table

can output

argmin(intCol, stringCol) argmin(intCol, doubleCol) sum(doubleCol)
["a2", "a11"] [2.0, 3.0] 9.0
Similarly, the following query

SELECT 
intCol, 
argmin(longCol, **doubleCol**),  
argmin(longCol, **longCol**)   
FROM table 
GROUP BY intCol

Can output

intCol argmin(longCol, doubleCol) argmin(longCol, longCol)
1 [2.0, 3.0] 2
2 4.0 1
This is probably a more intuitive way to reason about response and is more SQL friendly imo and avoids populating NULLs.

@jasperjiaguo wdyt ?

Agreed that null filling can be confusing for group ids. I have made a change for the group id value filling and it now behaves like:

SELECT 
intCol, 
argmin(longCol, **doubleCol**),  
argmin(longCol, **longCol**)   
FROM table 
GROUP BY intCol
intCol argmin(longCol, doubleCol) argmin(longCol, longCol)
1 2.0 2
1 3.0 2
2 4.0 1

SELECT
argmin(intCol, stringCol),
argmin(intCol, doubleCol),
sum(doubleCol)
FROM table

argmin(intCol, stringCol) argmin(intCol, doubleCol) sum(doubleCol)
"a2" 2.0 9.0
"a11"* 3.0 9.0

which is essentially flattened view of

intCol argmin(longCol, doubleCol) argmin(longCol, longCol)
1 [2.0, 3.0] 2
2 4.0 1

and

argmin(intCol, stringCol) argmin(intCol, doubleCol) sum(doubleCol)
["a2", "a11"] [2.0, 3.0] 9.0

respectively

Meanwhile, I have also considered the option array fashion of returning multiple rows of output, there are a few reasons I didn't use it:

  1. It wouldn't work for all MV types as we currently don't have sth like ARRAY[ARRAY[INT]] for returned results
  2. It would be easier for the user to parse the result when this is flattened, as the user side will not need to flatten + align them on their own when they are projecting multiple cols.
  3. Using the flattened view will keep the output column type the same as the data column type, which I feel is cleaner.

IMO we should allow other aggregation functions with argmin and argmax.

+1 on this, if we have a well-defined output scheme then the user should have the power to run 1 query instead of 2

cc @siddharthteotia @somandal

@siddharthteotia
Copy link
Contributor

When will we run into the problem of ARRAY[ARRAY[INT]] ?

@jasperjiaguo
Copy link
Contributor Author

When will we run into the problem of ARRAY[ARRAY[INT]] ?

When we are projecting multiple rows of an INT MV column

@Override
public final String getResultColumnName() {
String type = getType().getName().toLowerCase();
return CommonConstants.RewriterConstants.CHILD_AGGREGATION_NAME_PREFIX
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be better to use StringBuilder in general but since this function will be called once per query, it should be fine for now

@siddharthteotia
Copy link
Contributor

I have some suggestions / questions on simplifying the implementation a bit. But don't want to hold this. Let's discuss them sometime soon.

@siddharthteotia siddharthteotia merged commit 7a673fd into apache:master May 9, 2023
@siddharthteotia
Copy link
Contributor

@jasperjiaguo please add user docs soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants