Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add json_extract_index transform function to leverage json index for json value extraction #11739

Merged
merged 5 commits into from
Oct 26, 2023

Conversation

itschrispeck
Copy link
Collaborator

@itschrispeck itschrispeck commented Oct 4, 2023

This is a follow up PR to #11494

I worked on a poc for this, and took the approach of reading the JSON index through a new transform function JSON_EXTRACT_INDEX. This enables group by/regexp filtering/the majority of the functionality of JSON_EXTRACT_SCALAR and attempts to maintain the same syntax. This does not solve the generic problem of adding a code path to use inverted index to speed up group by.

We've seen large improvements for large time ranges and large tables. This can reduce query latency for equivalent queries and massively improve memory pressure, preventing the OOM caused cluster crashes we encountered.

As expected, for small time ranges and highly filtered results the existing JSON_EXTRACT_SCALAR function is faster.

Benchmark results:
image
The gray bars signal a cluster crash and no data was able to be recorded.

A: 10TB table, group by json_extract_scalar(col, ‘$.keyA’, ‘STRING’, ‘null)    
B: 10TB table, group by json_extract_scalar(col, ‘$.keyB’, ‘STRING’, ‘null)     
C: 10TB table, regexp_like(json_extract_scalar(json_data, '$.keyA, 'STRING', 'null'), 'val') group by json_extract_scalar(json_data, '$.keyA, 'STRING', 'null') 
D: 10TB table, regexp_like(json_extract_scalar(json_data, '$.keyA, 'STRING', 'null'), 'val')
E: 20GB table, group by json_extract_scalar(col, ‘$.keyA’, ‘STRING’, ‘null)
F: 20GB table, regexp_like(json_extract_scalar(json_data, '$.keyA, 'STRING', 'null'), 'val') group by json_extract_scalar(json_data, '$.keyA, 'STRING', 'null')
G: 20GB table, select json_extract_scalar(json_data, '$.keyA, 'STRING', 'null') ... where json_match(json_data, '"$.keyA" = ''val''')

- queries are repeated w/ filters `ts > now() -1m/10m/1h/3h/6h/12h/24h`
- keyA is high cardinality
- keyB is low cardinality
- 10TB table avg json_data row len: 3.4KB
- 20GB table avg json_data row len: 5.4KB

Thought not low latency, I think this would be a solid functionality to have available as it allows for queries that were otherwise unanswerable.

Testing: benchmarks for a few of our common query shapes (see screenshot) + validation in prod clusters

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very smart algorithm!

I don't think we are doing fair comparison though. We should always put a filter on the key before doing the group-by. I can see the new function can outperform json_extract_scalar when each JSON is large.

Another optimization on top of the current approach is to cache the posting list for the key. Current implementation is calculating the posting list per block, but we only need to compute it once


// add value to padded array
for (int docId : postingList) {
values[docId] = val;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work if the same doc contains multiple flattened docs of the same key but different value. They will override each other. We need to document this limitation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay to document this in the function reference documentation?

Looking at JSON standards there doesn't seem to be a standard behavior for handling duplicate keys

@codecov-commenter
Copy link

codecov-commenter commented Oct 10, 2023

Codecov Report

Merging #11739 (2ea21fb) into master (24af80d) will increase coverage by 20.41%.
Report is 69 commits behind head on master.
The diff coverage is 48.75%.

@@              Coverage Diff              @@
##             master   #11739       +/-   ##
=============================================
+ Coverage     14.45%   34.86%   +20.41%     
- Complexity      201      945      +744     
=============================================
  Files          2342     2298       -44     
  Lines        125917   124741     -1176     
  Branches      19370    19288       -82     
=============================================
+ Hits          18205    43497    +25292     
+ Misses       106170    78203    -27967     
- Partials       1542     3041     +1499     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (?)
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 34.81% <48.75%> (+20.39%) ⬆️
java-17 ?
java-20 ?
java-21 34.74% <48.75%> (?)
skip-bytebuffers-false 34.84% <48.75%> (?)
skip-bytebuffers-true 34.72% <48.75%> (?)
temurin 34.86% <48.75%> (+20.41%) ⬆️
unittests 46.66% <48.75%> (+32.21%) ⬆️
unittests1 46.66% <48.75%> (?)
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...r/transform/function/TransformFunctionFactory.java 89.65% <100.00%> (+89.65%) ⬆️
...e/pinot/common/function/TransformFunctionType.java 89.04% <75.00%> (+89.04%) ⬆️
...local/realtime/impl/json/MutableJsonIndexImpl.java 0.00% <0.00%> (ø)
...t/index/readers/json/ImmutableJsonIndexReader.java 0.00% <0.00%> (ø)
...rm/function/JsonExtractIndexTransformFunction.java 70.14% <70.14%> (ø)

... and 1689 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@itschrispeck itschrispeck marked this pull request as ready for review October 10, 2023 18:02
@itschrispeck itschrispeck changed the title draft: add json_index_extract transform function to leverage json index for json value extraction add json_index_extract transform function to leverage json index for json value extraction Oct 10, 2023

String[] values = new String[docIds.length];
for (int i = 0; i < docIds.length; i++) {
values[i] = docIdToValues.get(docIds[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can iterate over the map instead of the doc ids to fill the values

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I iterate over the map I don't know which index of values[] corresponds to the docId of the current Map.Entry, so I'd need to maintain another mapping from docId to the index of the input docIds[] array right?

@Jackie-Jiang
Copy link
Contributor

Can you share some new perf numbers after the optimization? In the query, let's add the json_match in the filter

@itschrispeck
Copy link
Collaborator Author

Thanks for the review and suggestions! Will update w/ perf numbers.

For json_match filter are you referring to some query like where json_match(col, '$.keyA = ''val''') AND group by json_extract_index(col, ‘$.keyB’, ‘STRING’, ‘null)? Where key filtered != key aggregated on so we still have some large number of groups?

@itschrispeck
Copy link
Collaborator Author

Updated perf results after adding caching, please take another look when convenient. Also added some short info about json_data blob size, and another query pattern w/ json_match filtering.

@chenboat @Jackie-Jiang

@itschrispeck itschrispeck changed the title add json_index_extract transform function to leverage json index for json value extraction add json_extract_index transform function to leverage json index for json value extraction Oct 13, 2023
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor comments

@chenboat chenboat merged commit 0663166 into apache:master Oct 26, 2023
16 of 19 checks passed
@itschrispeck itschrispeck deleted the group_by_json_index branch April 17, 2024 04:29
@Jackie-Jiang Jackie-Jiang added feature documentation release-notes Referenced by PRs that need attention when compiling the next release notes labels Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation feature release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants