-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for ARG_MIN and ARG_MAX Functions #10636
Conversation
684df2d
to
c56153a
Compare
Codecov Report
@@ Coverage Diff @@
## master #10636 +/- ##
============================================
+ Coverage 70.28% 70.43% +0.15%
- Complexity 6430 6462 +32
============================================
Files 2112 2140 +28
Lines 113994 115088 +1094
Branches 17219 17348 +129
============================================
+ Hits 80121 81063 +942
- Misses 28275 28384 +109
- Partials 5598 5641 +43
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 131 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
9058adc
to
8414629
Compare
5698824
to
9546e05
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall looks good, left mostly minor comments
...src/main/java/org/apache/pinot/core/query/aggregation/function/ChildAggregationFunction.java
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java
Outdated
Show resolved
Hide resolved
...e/src/main/java/org/apache/pinot/core/query/aggregation/utils/argminmax/ArgMinMaxObject.java
Outdated
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/ArgMinMaxTest.java
Outdated
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/ArgMinMaxTest.java
Outdated
Show resolved
Hide resolved
...ava/org/apache/pinot/core/query/aggregation/function/ParentArgMinMaxAggregationFunction.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/pinot/core/query/aggregation/utils/argminmax/ArgMinMaxWrapperValSet.java
Show resolved
Hide resolved
...ava/org/apache/pinot/core/query/aggregation/function/ParentArgMinMaxAggregationFunction.java
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/query/aggregation/function/ChildAggregationFunction.java
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/core/query/aggregation/function/ParentAggregationFunction.java
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/ArgMinMaxTest.java
Outdated
Show resolved
Hide resolved
@somandal Address the code related comments, will add more test cases tomorrow. |
case ARGMAX: | ||
case ARGMIN: | ||
throw new IllegalArgumentException("Aggregation function: " + function | ||
+ " is only supported in selection without alias."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I am following this exception. Why do we need to throw this exception for ARG_MIN
and ARG_MAX
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for the argmin max that's not rewritten (invalid ones), i.e. the one not in selection or in selection but used with alias
return new ParentArgMinMaxAggregationFunction(arguments, false); | ||
case PINOTCHILDAGGREGATIONARGMAX: | ||
return new ChildArgMinMaxAggregationFunction(arguments, true); | ||
case PINOTCHILDAGGREGATIONARGMIN: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vaguely remember but we hit this in some recent work as part of multi stage as well. Aggregation functions that are not going to be used by the user in SQL also need to be exposed here and ideally they shouldn't. I think it happened for the 3rd / 4th moment / reduce functions
Is it possible to only add ARG_MIN
and ARG_MAX
(the user level AggregationFunctions) in this interface ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddharthteotia in that case we would need to use one specific argument in the argument list to denote if the function is parent or children and the factory here would need to look into the argument details, which IMO is not very clean.
|
||
// Test transformation function inside argmax/argmin, for both projection and measuring | ||
// the max of 3000x-x^2 is 2250000, which is the max of 3000x-x^2 | ||
query = "SELECT sum(intColumn), argmax(3000 * doubleColumn - intColumn * intColumn, doubleColumn)," |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So essentially the first set of arguments for lex ordering (for min or max) can be a mix of identifier (column) or a transform (scalar or non-scalar). Correct ?
Is the same true for projectionColumn
? Can we project the transform instead of identifier ? Do we have tests for that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So essentially the first set of arguments for lex ordering (for min or max) can be a mix of identifier (column) or a transform (scalar or non-scalar). Correct ?
Yes
there are tests for projection of transformed cols in the testcase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query = "SELECT sum(intColumn), argmax(3000 * doubleColumn - intColumn * intColumn, doubleColumn),"
+ "argmax(3000 * doubleColumn - intColumn * intColumn, 3000 * doubleColumn - intColumn * intColumn),"
+ "argmax(3000 * doubleColumn - intColumn * intColumn, doubleColumn), "
+ "argmin(replace(stringColumn, \'a\', \'bb\'), replace(stringColumn, \'a\', \'bb\'))"
+ "FROM testTable";
|
||
// TODO: The following query throws an exception, | ||
// requires fix for multi-value bytes column serialization in DataBlock | ||
query = "SELECT arg_min(intColumn, mvBytesColumn) FROM testTable"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add other failure scenario tests ? For example, invalid number or types of arguments in the arg_min
or arg_max
function if not already added ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, added
assertEquals(rows.get(3)[0], 1200); | ||
|
||
// test1, with dedupe | ||
query = "SELECT " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we have a query option or system option to bypass the default behavior and instead return just one of the rows for all duplicate / matching rows where min / max is happening ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can maybe do this as a follow up. It will require the stability sorting/dedup of projection results, as would be better to do in a seperate PR.
/** | ||
* Regression test for queries with result rewriter. | ||
*/ | ||
public class ResultRewriterRegressionTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we testing here and curious why is this suffixed with RegressionTest
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the test cases to show existing aggregation function should not be impacted by the new pluggable result rewriter.
PINOTPARENTAGGREGATIONARGMIN(CommonConstants.RewriterConstants.PARENT_AGGREGATION_NAME_PREFIX + ARGMIN.getName()), | ||
PINOTPARENTAGGREGATIONARGMAX(CommonConstants.RewriterConstants.PARENT_AGGREGATION_NAME_PREFIX + ARGMAX.getName()), | ||
PINOTCHILDAGGREGATIONARGMIN(CommonConstants.RewriterConstants.CHILD_AGGREGATION_NAME_PREFIX + ARGMIN.getName()), | ||
PINOTCHILDAGGREGATIONARGMAX(CommonConstants.RewriterConstants.CHILD_AGGREGATION_NAME_PREFIX + ARGMAX.getName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can 103-106 be encapsulated within the rewriter itself as opposed to exposing them in AggregationFunctionType.java ?
I feel we should only have user exposed aggregation functions which can be used in SQL in this file ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we should only have user exposed aggregation functions which can be used in SQL in this file ?
Do we have a specific reason for doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main reason is clean interface. AggregationFunctionType is for user exposed in-built functions ideally.
So as a follow-up we should try to see how we can do this cleanly in future otherwise this file will end up having mix of things imo.
@@ -972,4 +974,11 @@ public static class Range { | |||
public static class IdealState { | |||
public static final String HYBRID_TABLE_TIME_BOUNDARY = "HYBRID_TABLE_TIME_BOUNDARY"; | |||
} | |||
|
|||
public static class RewriterConstants { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why should this be defined in CommonConstants
where we typically define Broker or Server instance level config constants ? Can this be taken inside rewriter as public constants ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
putting this in rewriter will make it not accessible from SPI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Hmm we should try to fix this in follow-ups
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java
Outdated
Show resolved
Hide resolved
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/AggregationFunctionType.java
Outdated
Show resolved
Hide resolved
@@ -303,6 +303,8 @@ public static class Broker { | |||
"pinot.broker.instance.enableThreadAllocatedBytesMeasurement"; | |||
public static final boolean DEFAULT_ENABLE_THREAD_CPU_TIME_MEASUREMENT = false; | |||
public static final boolean DEFAULT_THREAD_ALLOCATED_BYTES_MEASUREMENT = false; | |||
public static final String CONFIG_OF_BROKER_RESULT_REWRITER_CLASS_NAMES | |||
= "pinot.broker.result.rewriter.class.names"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we already had rewriter configuration ? Can we reuse the same config ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is for query rewriter, we should be using a different rewriter class for results right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Thanks.
public static final String PARENT_AGGREGATION_NAME_PREFIX = "pinotparentaggregation"; | ||
public static final String CHILD_AGGREGATION_NAME_PREFIX = "pinotchildaggregation"; | ||
public static final String CHILD_AGGREGATION_SEPERATOR = "@"; | ||
public static final String CHILD_KEY_SEPERATOR = "_"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on purpose of CHILD_AGGREGATION_SEPERATOR
and CHILD_KEY_SEPERATOR
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/**
* The name of the column as follows:
* CHILD_AGGREGATION_NAME_PREFIX + actual function type + operands + CHILD_AGGREGATION_SEPERATOR
* + actual function type + parent aggregation function id + CHILD_KEY_SEPERATOR + column key in parent function
* e.g. if the child aggregation function is "argmax(0,a,b,x)", the name of the column is
* "pinotchildaggregationargmax(a,b,x)@argmax0_x"
*/
To easily associate the child aggregation function with it's parents and extract the result with key.
@@ -325,6 +325,18 @@ public static AggregationFunction getAggregationFunction(FunctionContext functio | |||
return new FourthMomentAggregationFunction(firstArgument, FourthMomentAggregationFunction.Type.KURTOSIS); | |||
case FOURTHMOMENT: | |||
return new FourthMomentAggregationFunction(firstArgument, FourthMomentAggregationFunction.Type.MOMENT); | |||
case PINOTPARENTAGGREGATIONARGMAX: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove PINOT
prefix and AGGREGATION
as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PARENTARGMAX
PARENTARGMIN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Agreed that null filling can be confusing for group ids. I have made a change for the group id value filling and it now behaves like:
SELECT
which is essentially flattened view of
and
respectively Meanwhile, I have also considered the option array fashion of returning multiple rows of output, there are a few reasons I didn't use it:
+1 on this, if we have a well-defined output scheme then the user should have the power to run 1 query instead of 2 |
When will we run into the problem of |
When we are projecting multiple rows of an INT MV column |
…n names. Add more test cases. Refine error message.
@Override | ||
public final String getResultColumnName() { | ||
String type = getType().getName().toLowerCase(); | ||
return CommonConstants.RewriterConstants.CHILD_AGGREGATION_NAME_PREFIX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be better to use StringBuilder in general but since this function will be called once per query, it should be fine for now
I have some suggestions / questions on simplifying the implementation a bit. But don't want to hold this. Let's discuss them sometime soon. |
@jasperjiaguo please add user docs soon. |
This PR adds ArgMin/ArgMax function
Syntax:
SELECT ArgMin(measuringCol1, measuringCol2, measuringCol3, projectionCol1), ArgMin(measuringCol1, measuringCol2, measuringCol3, projectionCol2) FROM table
These two functions do lexicographical ordering on
<measuringCol1, measuringCol2, measuringCol3>
, and project projectionCol1 and projectionCol2 for all appearances of minimum<measuringCol1, measuringCol2, measuringCol3>
E.g. for input data
Query 1
Result table:
Query 2
Result table
Note
* Without dedup all the rows with the same extremum key will be output
** Regular aggregation functions will still output 1 field in the first row, the other rows will be filled by null, same applies when two different argmin/max functions has different number of projection rows
Query 3
Result table
Note
* note that we fill the fields where the group id is the same as the previous row
Notes:
argmin(longCol, doubleCol)
results doesn't make sense