Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add theta sketch scalar #11153

Merged
merged 3 commits into from
Jul 27, 2023
Merged

Conversation

walterddr
Copy link
Contributor

@walterddr walterddr commented Jul 23, 2023

added several functions

  • GET_THETA_SKETCH_ESTIMATE to extract estimate from ThetaSketch
  • THETA_SKETCH_DIFF to differentiate 2 ThetaSketch
  • THETA_SKETCH_UNION/THETA_SKETCH_INTERSECT for ThetaSketch set operations

syntax examples:

select 
  GET_THETA_SKETCH_ESTIMATE(DISTINCTCOUNTRAWTHETASKETCH(runs, '')),
  GET_THETA_SKETCH_ESTIMATE(THETA_SKETCH_DIFF(DISTINCTCOUNTRAWTHETASKETCH(runs, ''), DISTINCTCOUNTRAWTHETASKETCH(runs, ''))),
  GET_THETA_SKETCH_ESTIMATE(THETA_SKETCH_UNION(DISTINCTCOUNTRAWTHETASKETCH(runs, ''), DISTINCTCOUNTRAWTHETASKETCH(runs, ''))),
  GET_THETA_SKETCH_ESTIMATE(THETA_SKETCH_INTERSECT(DISTINCTCOUNTRAWTHETASKETCH(runs, ''), DISTINCTCOUNTRAWTHETASKETCH(runs, '')))
from baseballStats limit 10

@codecov-commenter
Copy link

codecov-commenter commented Jul 24, 2023

Codecov Report

Merging #11153 (42f5bfc) into master (4d72eb5) will decrease coverage by 0.12%.
Report is 5 commits behind head on master.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master   #11153      +/-   ##
==========================================
- Coverage    0.11%    0.00%   -0.12%     
==========================================
  Files        2218     2207      -11     
  Lines      119113   118872     -241     
  Branches    18021    18004      -17     
==========================================
- Hits          137        0     -137     
+ Misses     118956   118872      -84     
+ Partials       20        0      -20     
Flag Coverage Δ
integration1temurin11 ?
integration1temurin17 ?
integration1temurin20 ?
integration2temurin11 0.00% <0.00%> (?)
integration2temurin17 0.00% <0.00%> (ø)
integration2temurin20 0.00% <0.00%> (ø)
unittests1temurin11 ?
unittests1temurin17 ?
unittests1temurin20 0.00% <0.00%> (ø)
unittests2temurin11 ?
unittests2temurin17 ?
unittests2temurin20 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
...he/pinot/core/function/scalar/SketchFunctions.java 0.00% <0.00%> (ø)

... and 100 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@cbalci
Copy link
Contributor

cbalci commented Jul 24, 2023

This is pretty cool, but I'd suggest we prefix the function names with sketch types, ie THETA_SKETCH_GET_ESTIMATE, since we use different sketches (KLL, Tuples etc.) and more coming.

@walterddr
Copy link
Contributor Author

walterddr commented Jul 24, 2023

This is pretty cool, but I'd suggest we prefix the function names with sketch types, ie THETA_SKETCH_GET_ESTIMATE, since we use different sketches (KLL, Tuples etc.) and more coming.

Ahh. LOL i didn't know KLL and Tuples are actually a different sketches :-) thank you for the notes.

qq: does it matter what sketch is being used? the scalar function here doesn't really know what sketch is being used, all it does is

  1. extract the sketch from binary or stringified format
  2. apply set operation (optionally)
  3. compute the estimate.

do we need different handling for KLL or Tuple based sketches?

@cbalci
Copy link
Contributor

cbalci commented Jul 24, 2023

Right, 'sketch' is a generic term for various probabilistic data structures. Different sketches have different binary layouts and serializers/deserializers. It would be hard to write a generic function which handles all. The one you are using in this PR is for ThetaSketch only.

Even if we manage to identify they sketch type from serialized form, interfaces may differ. For example ThetaSketch and TuplesSketch are 'distinctcount' type sketches and offer a .getEstimate interface, while KLLSketch is a 'percentile' type sketch which has a .getQuantile interface.

Hope this makes sense.

@walterddr
Copy link
Contributor Author

ok so to summarize:

  • we should have a getSketchEstimate and getSketchQuantile method which should figure out if the sketch object allows such function (this can be done as a follow up with customized defined type
  • for setUnion/Intersect/Diff only the same type of sketches are allowed to perform set operation.

feels like this should be handled by type matching instead of function name.

@walterddr walterddr marked this pull request as ready for review July 27, 2023 16:36
@walterddr walterddr merged commit 11b85df into apache:master Jul 27, 2023
22 checks passed
walterddr added a commit that referenced this pull request Jul 29, 2023
enable theta sketch test for v2 after #11144 and #11153 

Co-authored-by: Rong Rong <rongr@startree.ai>
s0nskar pushed a commit to s0nskar/pinot that referenced this pull request Aug 10, 2023
* add theta sketch scalar functions
* fix sketch name to add "theta" specifically to differentiate other sketch types

---------

Co-authored-by: Rong Rong <rongr@startree.ai>
s0nskar pushed a commit to s0nskar/pinot that referenced this pull request Aug 10, 2023
enable theta sketch test for v2 after apache#11144 and apache#11153 

Co-authored-by: Rong Rong <rongr@startree.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants