-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Support for Covariance Function #9236
Conversation
…before testing invalid inputs
Codecov Report
@@ Coverage Diff @@
## master #9236 +/- ##
============================================
- Coverage 69.99% 68.64% -1.36%
- Complexity 4682 5007 +325
============================================
Files 1848 1864 +16
Lines 98572 99469 +897
Branches 14967 15133 +166
============================================
- Hits 68999 68279 -720
- Misses 24723 26285 +1562
- Partials 4850 4905 +55
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
...segment-local/src/main/java/org/apache/pinot/segment/local/customobject/CovarianceTuple.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work on both SV and MV columns ?
Please update the PR description with a brief note on what the new function is about, what it does, how to use etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also mention the precision issue with float
and further extension to correlation?
Also probably paste a short description of population covariance vs sample covariance in the description (and future docs)(https://en.wikipedia.org/wiki/Bessel%27s_correction)
Minor but LGTM
pinot-core/src/test/java/org/apache/pinot/queries/CovarianceQueriesTest.java
Outdated
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/CovarianceQueriesTest.java
Outdated
Show resolved
Hide resolved
Can you please clarify or add a link in the description to the formula you are using to calculate covariance? I don't see "average value" being used in your calculation? It will also be good to add support for "covariance of sample" (which is probably more commonly used) along with "covariance of population" |
Yes. This should be simple if we add a flag for switching Bessel's correction in |
...ain/java/org/apache/pinot/core/query/aggregation/function/CovarianceAggregationFunction.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/pinot/core/query/aggregation/function/CovarianceAggregationFunction.java
Show resolved
Hide resolved
...ain/java/org/apache/pinot/core/query/aggregation/function/CovarianceAggregationFunction.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Will wait for @jasperjiaguo to take another look before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just the previous minor comment on doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm overall. minor comments
This PR adds support for COVAR_POP and COVAR_SAMP function to Pinot. These functions only work on numerical SV columns.
Label:
feature
Issue: #8493
Use Cases
covar_pop(x, y)
returns the population covariance(formula:sum[(x_i - E(X))(y_i - E(Y))] / n
, type:DOUBLE
) of the input values.covar_samp(x, y)
returns the sample covariance(formula:sum[(x_i - E(X))(y_i - E(Y))] / (n - 1)
, type:DOUBLE
) of the input values.Note
. If any of the input columns has type
float
, it will be read asdouble
which can lead to discrepancy with Java's result due to loss of precision during type conversion between float and double.Future
CovarianceTuple
such as sum(x^2), sum(y^y), etc.