Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support for Covariance Function #9236

Merged
merged 26 commits into from
Aug 25, 2022
Merged

Conversation

SabrinaZhaozyf
Copy link
Contributor

@SabrinaZhaozyf SabrinaZhaozyf commented Aug 17, 2022

This PR adds support for COVAR_POP and COVAR_SAMP function to Pinot. These functions only work on numerical SV columns.

Label: feature
Issue: #8493
Use Cases

  • covar_pop(x, y) returns the population covariance(formula: sum[(x_i - E(X))(y_i - E(Y))] / n, type: DOUBLE) of the input values.
  • covar_samp(x, y) returns the sample covariance(formula: sum[(x_i - E(X))(y_i - E(Y))] / (n - 1), type: DOUBLE) of the input values.

Note

Future

  • Implementation can be extended to calculate Pearson's coefficient by adding extra states to CovarianceTuple such as sum(x^2), sum(y^y), etc.
  • Implementation for calculating variance and standard deviation should also be similar.

@SabrinaZhaozyf SabrinaZhaozyf changed the title [WIP] Add Support for Covariance Function Add Support for Covariance Function Aug 23, 2022
@SabrinaZhaozyf SabrinaZhaozyf marked this pull request as ready for review August 23, 2022 00:31
@codecov-commenter
Copy link

codecov-commenter commented Aug 23, 2022

Codecov Report

Merging #9236 (49d5929) into master (e8a8684) will decrease coverage by 1.35%.
The diff coverage is 63.34%.

❗ Current head 49d5929 differs from pull request most recent head b67b6d6. Consider uploading reports for the commit b67b6d6 to get more accurate results

@@             Coverage Diff              @@
##             master    #9236      +/-   ##
============================================
- Coverage     69.99%   68.64%   -1.36%     
- Complexity     4682     5007     +325     
============================================
  Files          1848     1864      +16     
  Lines         98572    99469     +897     
  Branches      14967    15133     +166     
============================================
- Hits          68999    68279     -720     
- Misses        24723    26285    +1562     
- Partials       4850     4905      +55     
Flag Coverage Δ
integration1 26.19% <23.06%> (-0.16%) ⬇️
integration2 ?
unittests1 67.09% <64.45%> (+0.07%) ⬆️
unittests2 15.29% <17.92%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...apache/pinot/broker/api/HttpRequesterIdentity.java 28.57% <0.00%> (-57.15%) ⬇️
...org/apache/pinot/broker/api/RequesterIdentity.java 50.00% <0.00%> (-50.00%) ⬇️
.../pinot/broker/api/resources/PinotBrokerLogger.java 0.00% <0.00%> (ø)
...mmon/assignment/InstanceAssignmentConfigUtils.java 14.63% <0.00%> (-0.76%) ⬇️
...he/pinot/common/assignment/InstancePartitions.java 40.00% <0.00%> (-1.18%) ⬇️
...ot/common/function/scalar/ArithmeticFunctions.java 66.66% <0.00%> (-2.90%) ⬇️
...t/common/response/broker/BrokerResponseNative.java 96.44% <ø> (ø)
...er/api/access/ZkBasicAuthAccessControlFactory.java 0.00% <ø> (ø)
...ache/pinot/controller/api/resources/Constants.java 42.10% <ø> (ø)
...er/api/resources/LLCSegmentCompletionHandlers.java 62.37% <ø> (ø)
... and 306 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@siddharthteotia siddharthteotia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work on both SV and MV columns ?

@siddharthteotia
Copy link
Contributor

Please update the PR description with a brief note on what the new function is about, what it does, how to use etc.

Copy link
Contributor

@jasperjiaguo jasperjiaguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also mention the precision issue with float and further extension to correlation?
Also probably paste a short description of population covariance vs sample covariance in the description (and future docs)(https://en.wikipedia.org/wiki/Bessel%27s_correction)

Minor but LGTM

@amrishlal
Copy link
Contributor

Can you please clarify or add a link in the description to the formula you are using to calculate covariance? I don't see "average value" being used in your calculation? It will also be good to add support for "covariance of sample" (which is probably more commonly used) along with "covariance of population"

@jasperjiaguo
Copy link
Contributor

jasperjiaguo commented Aug 23, 2022

Can you please clarify or add a link in the description to the formula you are using to calculate covariance? I don't see "average value" being used in your calculation? It will also be good to add support for "covariance of sample" (which is probably more commonly used) along with "covariance of population"

Yes. This should be simple if we add a flag for switching Bessel's correction in extract result. We probably want to add that @SabrinaZhaozyf ?

Copy link
Contributor

@siddharthteotia siddharthteotia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Will wait for @jasperjiaguo to take another look before merging.

Copy link
Contributor

@jasperjiaguo jasperjiaguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just the previous minor comment on doc

Copy link
Contributor

@walterddr walterddr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm overall. minor comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants