add json_extract_index transform function to leverage json index for json value extraction #11739

itschrispeck · 2023-10-04T21:01:52Z

This is a follow-up PR to #11494.

I worked on a PoC for this and took the approach of reading the JSON index through a new transform function, JSON_EXTRACT_INDEX. It supports group-by, regexp filtering, and most of the functionality of JSON_EXTRACT_SCALAR, and it tries to keep the same syntax. It does not solve the generic problem of adding a code path that uses the inverted index to speed up group-by.
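To illustrate the idea of extracting values from the index instead of parsing each JSON blob, here is a minimal sketch. It is not the PR's code: it assumes the index can be modeled as a sorted map from `key + separator + value` to a posting list of doc IDs, and `JsonIndexExtractSketch`, `SEPARATOR`, and `extract` are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class JsonIndexExtractSketch {
  // Assumed separator between key and value inside one index entry.
  static final char SEPARATOR = '\u0000';

  // Returns one value per requested doc ID; docs without the key get defaultValue.
  static String[] extract(TreeMap<String, int[]> index, String key, int[] docIds, String defaultValue) {
    String from = key + SEPARATOR;             // smallest possible entry for this key
    String to = key + (char) (SEPARATOR + 1);  // exclusive upper bound of the key's range
    Map<Integer, String> docIdToValue = new HashMap<>();
    // Range scan: visit only the entries that belong to the requested key.
    for (Map.Entry<String, int[]> entry : index.subMap(from, true, to, false).entrySet()) {
      String value = entry.getKey().substring(from.length());
      for (int docId : entry.getValue()) {
        docIdToValue.put(docId, value);
      }
    }
    // Pad the output so it lines up with the requested doc IDs.
    String[] values = new String[docIds.length];
    for (int i = 0; i < docIds.length; i++) {
      values[i] = docIdToValue.getOrDefault(docIds[i], defaultValue);
    }
    return values;
  }
}
```

In this model the cost depends on the number of index entries for the key rather than on the size of each JSON document, which is one plausible reading of why the approach helps most on large blobs and wide time ranges.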

We've seen large improvements for large time ranges and large tables. For equivalent queries this can reduce latency and massively reduce memory pressure, preventing the OOM-induced cluster crashes we encountered.

As expected, for small time ranges and highly filtered results the existing JSON_EXTRACT_SCALAR function is faster.

Benchmark results:

Gray bars indicate a cluster crash, so no data could be recorded.

A: 10TB table, group by json_extract_scalar(col, '$.keyA', 'STRING', 'null')
B: 10TB table, group by json_extract_scalar(col, '$.keyB', 'STRING', 'null')
C: 10TB table, regexp_like(json_extract_scalar(json_data, '$.keyA', 'STRING', 'null'), 'val') group by json_extract_scalar(json_data, '$.keyA', 'STRING', 'null')
D: 10TB table, regexp_like(json_extract_scalar(json_data, '$.keyA', 'STRING', 'null'), 'val')
E: 20GB table, group by json_extract_scalar(col, '$.keyA', 'STRING', 'null')
F: 20GB table, regexp_like(json_extract_scalar(json_data, '$.keyA', 'STRING', 'null'), 'val') group by json_extract_scalar(json_data, '$.keyA', 'STRING', 'null')
G: 20GB table, select json_extract_scalar(json_data, '$.keyA', 'STRING', 'null') ... where json_match(json_data, '"$.keyA" = ''val''')

- queries are repeated w/ filters `ts > now() -1m/10m/1h/3h/6h/12h/24h`
- keyA is high cardinality
- keyB is low cardinality
- 10TB table avg json_data row len: 3.4KB
- 20GB table avg json_data row len: 5.4KB

Though not low latency, I think this would be solid functionality to have available, as it allows for queries that were otherwise unanswerable.

Testing: benchmarks for a few of our common query shapes (see screenshot) + validation in prod clusters

Jackie-Jiang

Very smart algorithm!

I don't think we are doing a fair comparison though. We should always put a filter on the key before doing the group-by. I can see that the new function can outperform json_extract_scalar when each JSON is large.

Another optimization on top of the current approach is to cache the posting list for the key. The current implementation calculates the posting list per block, but we only need to compute it once.
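A minimal sketch of that caching idea, assuming the per-key index lookup can be expressed as a function from key to a docId-to-value map. `PerKeyCacheSketch`, `loader`, and `valuesForBlock` are hypothetical names for illustration, not the PR's classes; the real reader works on posting lists from the JSON index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class PerKeyCacheSketch {
  // Hypothetical loader: reads the index once and returns docId -> value for a key.
  private final Function<String, Map<Integer, String>> loader;
  private final Map<String, Map<Integer, String>> cache = new HashMap<>();

  PerKeyCacheSketch(Function<String, Map<Integer, String>> loader) {
    this.loader = loader;
  }

  // Called once per block of doc IDs; only the first call per key invokes the loader.
  String[] valuesForBlock(String key, int[] docIds, String defaultValue) {
    Map<Integer, String> docIdToValue = cache.computeIfAbsent(key, loader);
    String[] values = new String[docIds.length];
    for (int i = 0; i < docIds.length; i++) {
      values[i] = docIdToValue.getOrDefault(docIds[i], defaultValue);
    }
    return values;
  }
}
```

Every block after the first reuses the cached map, so the index is read once per key. The sketch is not thread-safe and assumes a single thread drives one instance.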

Jackie-Jiang · 2023-10-04T22:02:07Z

...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java

+
+      // add value to padded array
+      for (int docId : postingList) {
+        values[docId] = val;


This won't work if the same doc contains multiple flattened docs with the same key but different values. They will overwrite each other. We need to document this limitation.

Is it okay to document this in the function reference documentation?

Looking at JSON standards, there doesn't seem to be a standard behavior for handling duplicate keys.
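The overwrite can be reproduced with a small sketch. It assumes, for illustration only, that each value's posting list has already been mapped back to real doc IDs and that one doc can appear under several values of the same key (for example, when the key sits inside an array of objects). `OverwriteLimitationSketch` and `fill` are hypothetical names.

```java
import java.util.Map;

final class OverwriteLimitationSketch {
  // valueToDocIds: for a single key, each distinct value and the docs that contain it.
  static String[] fill(Map<String, int[]> valueToDocIds, int numDocs) {
    String[] values = new String[numDocs];
    for (Map.Entry<String, int[]> entry : valueToDocIds.entrySet()) {
      for (int docId : entry.getValue()) {
        // A doc listed under two values keeps only the one written last.
        values[docId] = entry.getKey();
      }
    }
    return values;
  }
}
```

If doc 0 appears under both `a` and `b`, only one of them survives in `values[0]`, and which one depends on the iteration order of the map.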

codecov-commenter · 2023-10-10T06:40:45Z

Codecov Report

Merging #11739 (2ea21fb) into master (24af80d) will increase coverage by 20.41%.
Report is 69 commits behind head on master.
The diff coverage is 48.75%.

@@              Coverage Diff              @@
##             master   #11739       +/-   ##
=============================================
+ Coverage     14.45%   34.86%   +20.41%     
- Complexity      201      945      +744     
=============================================
  Files          2342     2298       -44     
  Lines        125917   124741     -1176     
  Branches      19370    19288       -82     
=============================================
+ Hits          18205    43497    +25292     
+ Misses       106170    78203    -27967     
- Partials       1542     3041     +1499

Flag	Coverage Δ
custom-integration1	`<0.01% <0.00%> (?)`
integration	`<0.01% <0.00%> (-0.01%)`	⬇️
integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration2	`0.00% <0.00%> (ø)`
java-11	`34.81% <48.75%> (+20.39%)`	⬆️
java-17	`?`
java-20	`?`
java-21	`34.74% <48.75%> (?)`
skip-bytebuffers-false	`34.84% <48.75%> (?)`
skip-bytebuffers-true	`34.72% <48.75%> (?)`
temurin	`34.86% <48.75%> (+20.41%)`	⬆️
unittests	`46.66% <48.75%> (+32.21%)`	⬆️
unittests1	`46.66% <48.75%> (?)`
unittests2	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
...r/transform/function/TransformFunctionFactory.java	`89.65% <100.00%> (+89.65%)`	⬆️
...e/pinot/common/function/TransformFunctionType.java	`89.04% <75.00%> (+89.04%)`	⬆️
...local/realtime/impl/json/MutableJsonIndexImpl.java	`0.00% <0.00%> (ø)`
...t/index/readers/json/ImmutableJsonIndexReader.java	`0.00% <0.00%> (ø)`
...rm/function/JsonExtractIndexTransformFunction.java	`70.14% <70.14%> (ø)`

... and 1689 files with indirect coverage changes

Jackie-Jiang · 2023-10-10T22:20:14Z

...java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java

+
+    String[] values = new String[docIds.length];
+    for (int i = 0; i < docIds.length; i++) {
+      values[i] = docIdToValues.get(docIds[i]);


We can iterate over the map instead of the doc IDs to fill the values.

If I iterate over the map, I don't know which index of values[] corresponds to the docId of the current Map.Entry, so I'd need to maintain another mapping from docId to the index in the input docIds[] array, right?
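The two options in this thread can be compared in a short sketch. It assumes the block's `docIds` array is sorted in ascending order, in which case a binary search can stand in for the extra docId-to-index mapping; if that assumption does not hold, a separate lookup map would be needed. `FillValuesSketch` and both method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;

final class FillValuesSketch {
  // Option 1: iterate over the doc IDs and look each one up in the map.
  static String[] fillByDocIds(int[] docIds, Map<Integer, String> docIdToValue) {
    String[] values = new String[docIds.length];
    for (int i = 0; i < docIds.length; i++) {
      values[i] = docIdToValue.get(docIds[i]);
    }
    return values;
  }

  // Option 2: iterate over the map and locate each doc ID's slot in the block.
  // Assumes sortedDocIds is ascending so that binary search is valid.
  static String[] fillByMap(int[] sortedDocIds, Map<Integer, String> docIdToValue) {
    String[] values = new String[sortedDocIds.length];
    for (Map.Entry<Integer, String> entry : docIdToValue.entrySet()) {
      int index = Arrays.binarySearch(sortedDocIds, entry.getKey());
      if (index >= 0) {
        values[index] = entry.getValue();
      }
    }
    return values;
  }
}
```

Option 1 does work proportional to the block size, while option 2 does work proportional to the map size times a logarithmic search, so which is cheaper depends on how many docs in the block actually carry the key.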

Jackie-Jiang · 2023-10-10T22:21:22Z

Can you share some new perf numbers after the optimization? In the query, let's add the json_match in the filter

itschrispeck · 2023-10-10T22:53:27Z

Thanks for the review and suggestions! Will update w/ perf numbers.

For the json_match filter, are you referring to a query like where json_match(col, '"$.keyA" = ''val''') group by json_extract_index(col, '$.keyB', 'STRING', 'null')? That is, the key filtered on != the key aggregated on, so we still have a large number of groups?

itschrispeck · 2023-10-13T00:08:11Z

Updated perf results after adding caching, please take another look when convenient. Also added some short info about json_data blob size, and another query pattern w/ json_match filtering.

@chenboat @Jackie-Jiang

Jackie-Jiang

Overall looks good

Jackie-Jiang

LGTM with minor comments

itschrispeck commented Oct 4, 2023

Jackie-Jiang reviewed Oct 4, 2023

add json_extract_index transform function

381272d

itschrispeck force-pushed the group_by_json_index branch from 40a9ca1 to 381272d Compare October 10, 2023 05:36

ignore jsonextractindex in function registry test

bcf8c18

itschrispeck marked this pull request as ready for review October 10, 2023 18:02

itschrispeck changed the title ~~draft: add json_index_extract transform function to leverage json index for json value extraction~~ add json_index_extract transform function to leverage json index for json value extraction Oct 10, 2023

Jackie-Jiang reviewed Oct 10, 2023

chenboat reviewed Oct 11, 2023

chenboat reviewed Oct 11, 2023

chenboat reviewed Oct 11, 2023

chenboat reviewed Oct 11, 2023

chenboat reviewed Oct 12, 2023

itschrispeck changed the title ~~add json_index_extract transform function to leverage json index for json value extraction~~ add json_extract_index transform function to leverage json index for json value extraction Oct 13, 2023

itschrispeck requested a review from Jackie-Jiang October 16, 2023 17:38

chenboat approved these changes Oct 17, 2023

Jackie-Jiang reviewed Oct 18, 2023

Jackie-Jiang reviewed Oct 21, 2023

helper method to cache intermediate results

7db865e

itschrispeck force-pushed the group_by_json_index branch from b14ae9b to 7db865e Compare October 22, 2023 23:27

chenboat reviewed Oct 23, 2023

Jackie-Jiang approved these changes Oct 25, 2023

itschrispeck added 2 commits October 25, 2023 23:14

address comments

759715f

rename missed var

2ea21fb

chenboat merged commit 0663166 into apache:master Oct 26, 2023
16 of 19 checks passed

itschrispeck deleted the group_by_json_index branch April 17, 2024 04:29

Jackie-Jiang added the feature, documentation, and release-notes labels Oct 10, 2024

Jackie-Jiang mentioned this pull request Oct 10, 2024

Add documentation for JSON_EXTRACT_INDEX #14202

Closed
