-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add TextMatchFilterOptimizer to maximally push down text_match
filters to Lucene
#12339
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #12339 +/- ##
============================================
- Coverage 61.66% 61.62% -0.04%
Complexity 207 207
============================================
Files 2421 2422 +1
Lines 131852 131974 +122
Branches 20345 20372 +27
============================================
+ Hits 81303 81331 +28
- Misses 44582 44666 +84
- Partials 5967 5977 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@siddharthteotia I'd appreciate your thoughts/review |
pinot-core/src/test/java/org/apache/pinot/core/query/optimizer/QueryOptimizerTest.java
Show resolved
Hide resolved
"SELECT * FROM testTable WHERE TEXT_MATCH(string1, 'foo1 AND bar1') AND TEXT_MATCH(string2, 'foo2 AND bar2')"); | ||
testCannotOptimizeQuery("SELECT * FROM testTable WHERE TEXT_MATCH(string1, 'foo') OR TEXT_MATCH(string2, 'bar')"); | ||
testCannotOptimizeQuery( | ||
"SELECT * FROM testTable WHERE int = 1 AND TEXT_MATCH(string, 'foo') OR TEXT_MATCH(string, 'bar')"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this one can not be optimized? the columns are the same "string"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int = 1 AND text_match(x) OR text_match(y)
wouldn't be equivalent to int = 1 AND text_match(x OR y)
…ers to Lucene (apache#12339) * add TextMatchFilterOptimizer * fix equivalence for all not
this looks awesome. is it possible to optimize regexp_like(x) OR regexp_like(y) ? |
Motivation:
Query performance against the Lucene index suffers when chaining multiple
text_match
predicates together. Our users often programmatically generate their queries, which exacerbates the issue as 10s/100s oftext_match
predicates can be included in a single query.Because of this, users are required to understand Pinot's Lucene implementation details for them to compose an efficient query. To remove this requirement, this PR adds a
TextMatchFilterOptmizer
that performs the optimization automatically.Summary:
This functionality is best understood through the unit testcases. In short:
text_match
operands when possible, without affecting query accuracytext_match
filters are inversed, then the NOT expression remains in PinotOpen question:
There is one edge case (that I can think of) where this optimization can hurt performance: if there are a number of
text_match OR text_match OR text_match
etc, early termination whenlimit
is reached might take longer since the entire mergedtext_match
query must now complete. For this reason, it might be prudent to put this behind a query option (or rather a query option to disable it, since I believe it makes more sense to enable by default).Ideally, the
LuceneDocIdCollector
could early terminate (but doesn't currently have the required context).Testing: unit tests (query performance separately verified via running the optimized vs unoptimized queries). Sample optimization improvements:
tags:
feature
,performance
(?)