add TextMatchFilterOptimizer to maximally push down text_match filters to Lucene #12339

Merged
merged 2 commits into from
Jan 31, 2024

itschrispeck (Collaborator) commented Jan 30, 2024

Motivation:
Query performance against the Lucene index suffers when multiple text_match predicates are chained together. Our users often generate their queries programmatically, which exacerbates the issue because a single query can include tens or hundreds of text_match predicates.

As a result, users must understand Pinot's Lucene implementation details to compose an efficient query. To remove this requirement, this PR adds a TextMatchFilterOptimizer that performs the optimization automatically.

Summary:
This functionality is best understood through the unit test cases. In short:

  • Merge the text_match operands of AND and OR expressions whenever this does not affect query accuracy
  • Push NOT down into Lucene, unless all text_match filters are negated, in which case the NOT expression remains in Pinot
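As a rough illustration of the two rules above, here is a minimal string-level sketch. It is not the code in this PR: the class `TextMatchMergeSketch`, the `Predicate` type, and the `merge` method are hypothetical names, it assumes all predicates target the same column, and wrapping each sub-query in parentheses is an assumption made here to preserve grouping. The real optimizer works on Pinot's filter expression tree rather than on strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the TextMatchFilterOptimizer from this PR. */
public class TextMatchMergeSketch {

  /** One text_match predicate on the shared column, optionally wrapped in NOT. */
  public static final class Predicate {
    final String query;
    final boolean negated;

    public Predicate(String query, boolean negated) {
      this.query = query;
      this.negated = negated;
    }
  }

  /** Merges same-column predicates joined by operator ("AND" or "OR"). */
  public static String merge(String column, String operator, List<Predicate> predicates) {
    List<String> positive = new ArrayList<>();
    List<String> negative = new ArrayList<>();
    for (Predicate p : predicates) {
      // Parenthesize each sub-query so its own operators keep their grouping.
      (p.negated ? negative : positive).add("(" + p.query + ")");
    }
    if (negative.isEmpty()) {
      // Plain case: join every sub-query with the original operator.
      return textMatch(column, String.join(" " + operator + " ", positive));
    }
    if (positive.isEmpty()) {
      // Every filter is negated: merge with the dual operator (De Morgan)
      // and keep a single NOT outside text_match, i.e. in Pinot.
      String dual = operator.equals("AND") ? "OR" : "AND";
      return "NOT " + textMatch(column, String.join(" " + dual + " ", negative));
    }
    if (operator.equals("AND")) {
      // Mixed AND: the negated sub-queries move into the Lucene query.
      return textMatch(column,
          String.join(" AND ", positive) + " AND NOT " + String.join(" AND NOT ", negative));
    }
    // Mixed OR is deliberately out of scope for this sketch.
    throw new IllegalArgumentException("mixed OR with NOT is not handled in this sketch");
  }

  private static String textMatch(String column, String luceneQuery) {
    // Assumes the query contains no single quotes that would need SQL escaping.
    return "text_match(" + column + ", '" + luceneQuery + "')";
  }
}
```

For example, merging `text_match(column, 'foo') AND NOT text_match(column, 'bar')` with this sketch yields a single `text_match(column, '(foo) AND NOT (bar)')`, while two negated predicates joined by AND collapse to `NOT text_match(column, '(a) OR (b)')`.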

Open question:
There is one edge case (that I can think of) where this optimization can hurt performance. If a query contains a chain such as text_match OR text_match OR text_match, early termination when the limit is reached might take longer, since the entire merged text_match query must now complete. For this reason, it might be prudent to put this behind a query option (or rather a query option to disable it, since I believe it makes more sense to enable it by default).

Ideally, the LuceneDocIdCollector could terminate early (but it doesn't currently have the required context).

Testing: unit tests (query performance separately verified via running the optimized vs unoptimized queries). Sample optimization improvements:

count(*) = 500_000_000
optimized: 2950ms
unoptimized: 8350ms
> select count(*) from table where text_match(column, '"message_logtype:storage-event" AND /.*c28487d062.*/')
> select count(*) from table where text_match(column, '"message_logtype:storage-event"') AND text_match(column, '/.*c28487d062.*/')


count(*) = 2
optimized: 12ms
unoptimized: 8400ms
> select count(*) from table where text_match(column, '"message_logtype:storage-event" AND /.*c28487d062.*/ AND "offset:46127612"')
> select count(*) from table where text_match(column, '"message_logtype:storage-event"') AND text_match(column, '/.*c28487d062.*/') AND text_match(column, '"offset:46127612"')

tags: feature, performance (?)

codecov-commenter commented Jan 30, 2024

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (4823802) 61.66% compared to head (290712e) 61.62%.
Report is 4 commits behind head on master.

Files Patch % Lines
...ery/optimizer/filter/TextMatchFilterOptimizer.java 86.31% 5 Missing and 8 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12339      +/-   ##
============================================
- Coverage     61.66%   61.62%   -0.04%     
  Complexity      207      207              
============================================
  Files          2421     2422       +1     
  Lines        131852   131974     +122     
  Branches      20345    20372      +27     
============================================
+ Hits          81303    81331      +28     
- Misses        44582    44666      +84     
- Partials       5967     5977      +10     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (ø)
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 ?
java-11 61.62% <86.31%> (+0.03%) ⬆️
java-21 27.70% <0.00%> (-33.85%) ⬇️
skip-bytebuffers-false 61.62% <86.31%> (-0.03%) ⬇️
skip-bytebuffers-true <0.01% <0.00%> (-61.51%) ⬇️
temurin 61.62% <86.31%> (-0.04%) ⬇️
unittests 61.62% <86.31%> (-0.04%) ⬇️
unittests1 46.75% <86.31%> (-0.01%) ⬇️
unittests2 27.70% <0.00%> (-0.04%) ⬇️


@itschrispeck
Collaborator Author

@siddharthteotia I'd appreciate your thoughts/review

"SELECT * FROM testTable WHERE TEXT_MATCH(string1, 'foo1 AND bar1') AND TEXT_MATCH(string2, 'foo2 AND bar2')");
testCannotOptimizeQuery("SELECT * FROM testTable WHERE TEXT_MATCH(string1, 'foo') OR TEXT_MATCH(string2, 'bar')");
testCannotOptimizeQuery(
"SELECT * FROM testTable WHERE int = 1 AND TEXT_MATCH(string, 'foo') OR TEXT_MATCH(string, 'bar')");
Contributor

Why can't this one be optimized? Both text_match predicates are on the same column, "string".

Collaborator Author

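AND binds more tightly than OR, so int = 1 AND text_match(x) OR text_match(y) is evaluated as (int = 1 AND text_match(x)) OR text_match(y). That isn't equivalent to int = 1 AND text_match(x OR y): a row that matches y but has int != 1 satisfies the original filter and would be dropped by the merged one.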
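To make the non-equivalence concrete, the following small sketch enumerates all truth assignments and collects the ones where the two forms disagree. The class and method names (`FilterEquivalenceCheck`, `asWritten`, `ifMerged`) are hypothetical and purely illustrative; the booleans stand in for whether a row satisfies `int = 1`, `text_match(x)`, and `text_match(y)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical truth-table check; names are illustrative only. */
public class FilterEquivalenceCheck {

  /** The filter as SQL parses it: AND binds tighter than OR. */
  public static boolean asWritten(boolean intMatches, boolean x, boolean y) {
    return (intMatches && x) || y;
  }

  /** The filter that merging both text_match calls would produce. */
  public static boolean ifMerged(boolean intMatches, boolean x, boolean y) {
    return intMatches && (x || y);
  }

  /** Returns every assignment on which the two filters disagree. */
  public static List<String> counterexamples() {
    List<String> result = new ArrayList<>();
    boolean[] values = {false, true};
    for (boolean i : values) {
      for (boolean x : values) {
        for (boolean y : values) {
          if (asWritten(i, x, y) != ifMerged(i, x, y)) {
            result.add("int=" + i + ", x=" + x + ", y=" + y);
          }
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    for (String row : counterexamples()) {
      System.out.println(row);
    }
  }
}
```

The two forms disagree exactly when the `int = 1` condition is false and `text_match(y)` is true, which is why the optimizer has to leave this query untouched.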

@chenboat chenboat merged commit dd8be2a into apache:master Jan 31, 2024
19 checks passed
suyashpatel98 pushed a commit to suyashpatel98/pinot that referenced this pull request Feb 28, 2024
…ers to Lucene (apache#12339)

* add TextMatchFilterOptimizer

* fix equivalence for all not
@deemoliu
Contributor

this looks awesome. is it possible to optimize regexp_like(x) OR regexp_like(y) ?

5 participants