-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize in clause evaluation #11557
Optimize in clause evaluation #11557
Conversation
a4aa804
to
c7a35ea
Compare
Thanks Jackie. lets add some test case. |
@kishoreg All the existing tests will use the new algorithm (it is on by default because it always outperform binary search) |
can we comment on the performance of this depending on the number of hits for example sorted values length (m) = 1000 What is the performance for each of the following strategy when we vary the hits from 0% match to 100% match.
|
Thats amazing. may be some numbers will help for future references. |
Codecov Report
@@ Coverage Diff @@
## master #11557 +/- ##
============================================
- Coverage 63.06% 63.04% -0.02%
+ Complexity 1107 1106 -1
============================================
Files 2326 2326
Lines 124928 124974 +46
Branches 19073 19078 +5
============================================
+ Hits 78787 78793 +6
- Misses 40542 40572 +30
- Partials 5599 5609 +10
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 9 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
if (queryContext == null || values.size() <= Integer.parseInt(queryContext.getQueryOptions() | ||
.getOrDefault(CommonConstants.Broker.Request.QueryOptionKey.IN_PREDICATE_SORT_THRESHOLD, | ||
CommonConstants.Broker.Request.QueryOptionValue.DEFAULT_IN_PREDICATE_SORT_THRESHOLD))) { | ||
if (queryContext == null || Boolean.parseBoolean( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If else block is always better, then we can remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to give it an option to turn it off in case sorting is dominating the latency (very unlikely though, but who knows)
In this specific scenario:
|
@Jackie-Jiang : are there any latency numbers for comparison? (this PR's approach vs no optimization vs sort then merge-sort) |
@Jackie-Jiang So the complexity of divide and conquer is It would be ideal if we can have some JMH benchmarks before making this the default behavior; as the benefit of the optimization might be offsetted by it's complexity when |
c7a35ea
to
c2254ce
Compare
Added a benchmark and here are the results: (value is OPS, higher the better)
As shown above, divide binary search always outperforms plain binary search. When the lookup value percentage is very high (over half of the dictionary), scan start outperforming divide binary search, but not by much. |
c2254ce
to
8ebcc80
Compare
#10254 introduced the sort then merge sort algorithm to handle large in clause. It has time complexity of
O(m + n)
withm
in clause values andn
dictionary size. If we do not count the fixed sorting cost (amortized because it is done only once), comparing to binary search with complexity ofO(mlogn)
, it can perform much worse whenn >> m
.This PR optimizes the algorithm to do divide and concur: we binary search the middle value of the in clause, then divide the search into 2 parts, each with half of the in values and dictionary range. Overall the time complexity should be within
O(m)
(when m is large) toO(mlogn)
(when m is small), but always better than binary search, and at least on par with the merge sort.Removed the old query option:
inPredicateSortThreshold
Added 2 query options:
inPredicatePreSorted
: indicate that the given in clause is already sortedinPredicateLookupAlgorithm
: algorithm to use to solve in clause (default isDIVIDE_BINARY_SEARCH
)DIVIDE_BINARY_SEARCH
: algorithm introduced in this PRSCAN
: old merge sort algorithm which scans all valuesPLAIN_BINARY_SEARCH
: do not sort the in clause values and use vanilla binary search