-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add json_extract_index transform function to leverage json index for json value extraction #11739
Conversation
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very smart algorithm!
I don't think we are doing fair comparison though. We should always put a filter on the key before doing the group-by. I can see the new function can outperform json_extract_scalar when each JSON is large.
Another optimization on top of the current approach is to cache the posting list for the key. Current implementation is calculating the posting list per block, but we only need to compute it once
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
|
||
// add value to padded array | ||
for (int docId : postingList) { | ||
values[docId] = val; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work if the same doc contains multiple flattened docs of the same key but different value. They will override each other. We need to document this limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay to document this in the function reference documentation?
Looking at JSON standards there doesn't seem to be a standard behavior for handling duplicate keys
40a9ca1
to
381272d
Compare
Codecov Report
@@ Coverage Diff @@
## master #11739 +/- ##
=============================================
+ Coverage 14.45% 34.86% +20.41%
- Complexity 201 945 +744
=============================================
Files 2342 2298 -44
Lines 125917 124741 -1176
Branches 19370 19288 -82
=============================================
+ Hits 18205 43497 +25292
+ Misses 106170 78203 -27967
- Partials 1542 3041 +1499
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 1689 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/JsonIndexReader.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
|
||
String[] values = new String[docIds.length]; | ||
for (int i = 0; i < docIds.length; i++) { | ||
values[i] = docIdToValues.get(docIds[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can iterate over the map instead of the doc ids to fill the values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I iterate over the map I don't know which index of values[]
corresponds to the docId
of the current Map.Entry, so I'd need to maintain another mapping from docId
to the index of the input docIds[]
array right?
Can you share some new perf numbers after the optimization? In the query, let's add the |
Thanks for the review and suggestions! Will update w/ perf numbers. For |
...ava/org/apache/pinot/core/operator/transform/function/JsonExtractIndexTransformFunction.java
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
Updated perf results after adding caching, please take another look when convenient. Also added some short info about json_data blob size, and another query pattern w/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good
...ava/org/apache/pinot/core/operator/transform/function/JsonExtractIndexTransformFunction.java
Outdated
Show resolved
Hide resolved
...ava/org/apache/pinot/core/operator/transform/function/JsonExtractIndexTransformFunction.java
Outdated
Show resolved
Hide resolved
...ava/org/apache/pinot/core/operator/transform/function/JsonExtractIndexTransformFunction.java
Outdated
Show resolved
Hide resolved
...ava/org/apache/pinot/core/operator/transform/function/JsonExtractIndexTransformFunction.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
...al/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java
Outdated
Show resolved
Hide resolved
...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java
Outdated
Show resolved
Hide resolved
b14ae9b
to
7db865e
Compare
...ava/org/apache/pinot/core/operator/transform/function/JsonExtractIndexTransformFunction.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with minor comments
This is a follow up PR to #11494
I worked on a poc for this, and took the approach of reading the JSON index through a new transform function
JSON_EXTRACT_INDEX
. This enables group by/regexp filtering/the majority of the functionality ofJSON_EXTRACT_SCALAR
and attempts to maintain the same syntax. This does not solve the generic problem of adding a code path to use inverted index to speed up group by.We've seen large improvements for large time ranges and large tables. This can reduce query latency for equivalent queries and massively improve memory pressure, preventing the OOM caused cluster crashes we encountered.
As expected, for small time ranges and highly filtered results the existing
JSON_EXTRACT_SCALAR
function is faster.Benchmark results:
The gray bars signal a cluster crash and no data was able to be recorded.
Thought not low latency, I think this would be a solid functionality to have available as it allows for queries that were otherwise unanswerable.
Testing: benchmarks for a few of our common query shapes (see screenshot) + validation in prod clusters