Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leverage Json index in the group-by clause #11494

Closed
chenboat opened this issue Sep 1, 2023 · 5 comments
Closed

Leverage Json index in the group-by clause #11494

chenboat opened this issue Sep 1, 2023 · 5 comments
Assignees

Comments

@chenboat
Copy link
Contributor

chenboat commented Sep 1, 2023

Today Pinot's Json index works in the Filter-By clause (doc).

When a group-by clause involves an indexed Json path, it is NOT clear if/how we can leverage Pinot json index support. The closest we can utilize syntax-wise is JSON_EXTRACT_SCALAR(json_data, '$.country', 'STRING', 'null'). But apparently it does not use Json index.

E.g.
select count(*)
from table
group by JSON_EXTRACT_SCALAR(json_data, '$.country', 'STRING', 'null')

@chenboat
Copy link
Contributor Author

chenboat commented Sep 1, 2023

@Jackie-Jiang @mayankshriv

@Jackie-Jiang
Copy link
Contributor

The current implementation of json index cannot be used within the group-by clause (or select clause) because it doesn't support extracting values from it, but only support checking if the key-value pair exist.
In order to support functionality similar to JSON_EXTRACT_SCALAR, we need to re-design the json index.

@chenboat
Copy link
Contributor Author

chenboat commented Sep 8, 2023

@Jackie-Jiang Today's json index stores the list of mapping from json_path+value to their corresponding posting lists. To make json index work for group-by, I think we can open the json index dictionary and scan the list of json_paths for those matching the input json path. If I understand correctly, the result can then be returned as part of the GroupKeyGenerator interface. Why should we re-design the json index?

@chenboat chenboat reopened this Sep 8, 2023
@chenboat chenboat self-assigned this Sep 8, 2023
@Jackie-Jiang
Copy link
Contributor

Think of json index as an inverted index. It is extremely expensive to use inverted index as forward index because each value read requires scanning the whole inverted index.
For group-by clause itself, we can potentially use inverted index to accelerate the query processing, but that is also not easy and we don't have that support for regular inverted index yet.

@chenboat
Copy link
Contributor Author

Today Pinot supports group-by json-extract-scalar already but in the worst possible by parsing every possible json string. It caused massively oom and timeout even with 1 query. Using the json index would be a huge improvement over that. Pinot json index today already collects and sort json path+value. Group by json path (e.g. $a.b) can take average of the sorted dictionary to quickly locate the portion of the dictionary ids. I think doing this will provide great speed up compared with what we have today. We are working on a proof of concept and will share the test results soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants