add TextMatchFilterOptimizer to maximally push down `text_match` filters to Lucene #12339

itschrispeck · 2024-01-30T03:30:11Z

Motivation:
Query performance against the Lucene index suffers when multiple text_match predicates are chained together. Our users often generate their queries programmatically, which exacerbates the issue because tens or hundreds of text_match predicates can end up in a single query.

Because of this, users must understand Pinot's Lucene implementation details to compose an efficient query. To remove this requirement, this PR adds a TextMatchFilterOptimizer that performs the optimization automatically.

Summary:
This functionality is best understood through the unit test cases. In short:

Merge the text_match operands of AND and OR expressions when possible, without affecting query results
Push down NOT into Lucene, unless all text_match filters are negated, in which case the NOT expression remains in Pinot
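
The two rules above can be illustrated with a small string-level sketch. This is not the code from this PR: the class `TextMatchMergeSketch`, the helper `TextMatch`, and the method `mergeAnd` are hypothetical names, the sketch only handles AND-ed predicates on a single column, and it does not escape quotes inside the query strings. It assumes at least one operand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: all names here are hypothetical, not the PR's implementation.
class TextMatchMergeSketch {

  /** One text_match predicate on a column; negated is true if it was wrapped in NOT. */
  static final class TextMatch {
    final String query;
    final boolean negated;

    TextMatch(String query, boolean negated) {
      this.query = query;
      this.negated = negated;
    }
  }

  /** Merges AND-ed text_match predicates on the same column into one filter string. */
  static String mergeAnd(String column, List<TextMatch> operands) {
    boolean allNegated = true;
    for (TextMatch op : operands) {
      allNegated &= op.negated;
    }
    List<String> clauses = new ArrayList<>();
    for (TextMatch op : operands) {
      String clause = "(" + op.query + ")";
      // Mixed case: the NOT moves inside the merged Lucene query string.
      clauses.add(op.negated && !allNegated ? "NOT " + clause : clause);
    }
    // All negated: NOT a AND NOT b == NOT (a OR b), so a single NOT stays outside text_match.
    String joined = String.join(allNegated ? " OR " : " AND ", clauses);
    String merged = "text_match(" + column + ", '" + joined + "')";
    return allNegated ? "NOT " + merged : merged;
  }

  public static void main(String[] args) {
    System.out.println(mergeAnd("col", Arrays.asList(
        new TextMatch("foo", false), new TextMatch("bar", false))));
    System.out.println(mergeAnd("col", Arrays.asList(
        new TextMatch("foo", false), new TextMatch("bar", true))));
    System.out.println(mergeAnd("col", Arrays.asList(
        new TextMatch("foo", true), new TextMatch("bar", true))));
  }
}
```

In this sketch, a mixed list yields one text_match call with the NOT inside the query string, while an all-negated list yields one text_match call with the NOT left outside it.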

Open question:
There is one edge case (that I can think of) where this optimization can hurt performance: with a chain such as text_match OR text_match OR text_match, early termination when the limit is reached might take longer, since the entire merged text_match query must now complete. For this reason, it might be prudent to put this behind a query option (or rather a query option to disable it, since I believe it makes more sense to enable it by default).

Ideally, the LuceneDocIdCollector could early terminate (but doesn't currently have the required context).

Testing: unit tests (query performance separately verified via running the optimized vs unoptimized queries). Sample optimization improvements:

count(*) = 500_000_000
optimized: 2950ms
unoptimized: 8350ms
> select count(*) from table where text_match(column, '"message_logtype:storage-event" AND /.*c28487d062.*/')
> select count(*) from table where text_match(column, '"message_logtype:storage-event"') AND text_match(column, '/.*c28487d062.*/')


count(*) = 2
optimized: 12ms
unoptimized: 8400ms
> select count(*) from table where text_match(column, '"message_logtype:storage-event" AND /.*c28487d062.*/ AND "offset:46127612"')
> select count(*) from table where text_match(column, '"message_logtype:storage-event"') AND text_match(column, '/.*c28487d062.*/') AND text_match(column, '"offset:46127612"')

tags: feature, performance (?)

codecov-commenter · 2024-01-30T04:21:05Z

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (4823802) 61.66% compared to head (290712e) 61.62%.
Report is 4 commits behind head on master.

Files	Patch %	Lines
...ery/optimizer/filter/TextMatchFilterOptimizer.java	86.31%	5 Missing and 8 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #12339      +/-   ##
============================================
- Coverage     61.66%   61.62%   -0.04%     
  Complexity      207      207              
============================================
  Files          2421     2422       +1     
  Lines        131852   131974     +122     
  Branches      20345    20372      +27     
============================================
+ Hits          81303    81331      +28     
- Misses        44582    44666      +84     
- Partials       5967     5977      +10

Flag	Coverage Δ
custom-integration1	`<0.01% <0.00%> (ø)`
integration	`<0.01% <0.00%> (ø)`
integration1	`<0.01% <0.00%> (ø)`
integration2	`?`
java-11	`61.62% <86.31%> (+0.03%)`	⬆️
java-21	`27.70% <0.00%> (-33.85%)`	⬇️
skip-bytebuffers-false	`61.62% <86.31%> (-0.03%)`	⬇️
skip-bytebuffers-true	`<0.01% <0.00%> (-61.51%)`	⬇️
temurin	`61.62% <86.31%> (-0.04%)`	⬇️
unittests	`61.62% <86.31%> (-0.04%)`	⬇️
unittests1	`46.75% <86.31%> (-0.01%)`	⬇️
unittests2	`27.70% <0.00%> (-0.04%)`	⬇️

itschrispeck · 2024-01-30T05:21:21Z

@siddharthteotia I'd appreciate your thoughts/review

chenboat · 2024-01-30T18:03:01Z

pinot-core/src/test/java/org/apache/pinot/core/query/optimizer/QueryOptimizerTest.java

+        "SELECT * FROM testTable WHERE TEXT_MATCH(string1, 'foo1 AND bar1') AND TEXT_MATCH(string2, 'foo2 AND bar2')");
+    testCannotOptimizeQuery("SELECT * FROM testTable WHERE TEXT_MATCH(string1, 'foo') OR TEXT_MATCH(string2, 'bar')");
+    testCannotOptimizeQuery(
+        "SELECT * FROM testTable WHERE int = 1 AND TEXT_MATCH(string, 'foo') OR TEXT_MATCH(string, 'bar')");


Why can't this one be optimized? The columns are the same ("string").

Because AND binds tighter than OR, the filter parses as (int = 1 AND text_match(x)) OR text_match(y), which isn't equivalent to int = 1 AND text_match(x OR y).
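
A quick way to confirm the non-equivalence is to enumerate the truth table. The sketch below is illustrative only: the class and method names are hypothetical, and each predicate is reduced to a boolean standing for whether a given row satisfies it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names are hypothetical and predicates are modelled as booleans.
class PrecedenceCheckSketch {

  /** The filter as written: AND binds tighter than OR. */
  static boolean asWritten(boolean intEq, boolean x, boolean y) {
    return (intEq && x) || y;
  }

  /** The rewrite that would result from merging the two text_match calls. */
  static boolean naiveMerge(boolean intEq, boolean x, boolean y) {
    return intEq && (x || y);
  }

  /** Returns every assignment on which the two filters disagree. */
  static List<String> counterexamples() {
    List<String> out = new ArrayList<>();
    boolean[] values = {false, true};
    for (boolean intEq : values) {
      for (boolean x : values) {
        for (boolean y : values) {
          if (asWritten(intEq, x, y) != naiveMerge(intEq, x, y)) {
            out.add("intEq=" + intEq + " x=" + x + " y=" + y);
          }
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    for (String row : counterexamples()) {
      System.out.println(row);
    }
  }
}
```

The two filters disagree exactly when int = 1 is false and text_match(y) is true: the original keeps the row, while the merged form drops it.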

deemoliu · 2024-03-15T23:38:30Z

This looks awesome. Is it possible to optimize regexp_like(x) OR regexp_like(y)?

add TextMatchFilterOptimizer

c7c3b3f

chenboat requested a review from Jackie-Jiang January 30, 2024 17:41

chenboat self-assigned this Jan 30, 2024

chenboat requested review from siddharthteotia and chenboat January 30, 2024 17:41

chenboat reviewed Jan 30, 2024

chenboat reviewed Jan 30, 2024

fix equivalence for all not

290712e

chenboat approved these changes Jan 30, 2024

chenboat merged commit dd8be2a into apache:master Jan 31, 2024
19 checks passed

suyashpatel98 pushed a commit to suyashpatel98/pinot that referenced this pull request Feb 28, 2024

add TextMatchFilterOptimizer to maximally push down text_match filters to Lucene (apache#12339)
* add TextMatchFilterOptimizer
* fix equivalence for all not

7b9ab7b

Jackie-Jiang added the feature and performance labels Mar 12, 2024
