Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(SqlSmith): gen Table Subquery EXISTS #4431

Merged
merged 14 commits into from
Nov 26, 2022
Merged

Conversation

marvenlee2486
Copy link
Contributor

@marvenlee2486 marvenlee2486 commented Aug 4, 2022

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

EXISTS implement

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

close #4296

@marvenlee2486
Copy link
Contributor Author

There is 1 out of 1024 query that I run have this issue #4434 . Need to fix this first

@codecov
Copy link

codecov bot commented Aug 4, 2022

Codecov Report

Merging #4431 (1845a80) into main (a07f267) will decrease coverage by 0.02%.
The diff coverage is 74.61%.

@@            Coverage Diff             @@
##             main    #4431      +/-   ##
==========================================
- Coverage   73.86%   73.83%   -0.03%     
==========================================
  Files        1002     1004       +2     
  Lines      162583   162749     +166     
==========================================
+ Hits       120088   120166      +78     
- Misses      42495    42583      +88     
Flag Coverage Δ
rust 73.83% <74.61%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/batch/src/executor/group_top_n.rs 68.42% <ø> (-6.44%) ⬇️
...batch/src/executor/join/distributed_lookup_join.rs 0.00% <0.00%> (ø)
src/batch/src/executor/join/nested_loop_join.rs 91.73% <ø> (ø)
src/batch/src/executor/join/sort_merge_join.rs 79.04% <ø> (ø)
src/batch/src/executor/order_by.rs 95.32% <ø> (ø)
src/batch/src/executor/project_set.rs 76.15% <ø> (ø)
src/batch/src/executor/sys_row_seq_scan.rs 0.00% <0.00%> (ø)
src/batch/src/executor/top_n.rs 75.00% <ø> (ø)
src/batch/src/executor/update.rs 81.36% <ø> (ø)
.../batch/src/task/consistent_hash_shuffle_channel.rs 0.00% <0.00%> (ø)
... and 107 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@marvenlee2486
Copy link
Contributor Author

There is 1 out of 1024 query that I run have this issue #4434 . Need to fix this first

alright, I think it is nothing related to exists, we can merge it. Because the problem occur in subquery with aggregate. Probably this probability is very low, until now it is notice.

Copy link
Contributor

@lmatz lmatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some exists query, e.g.

SELECT t1.* FROM t1
WHERE EXISTS (SELECT * FROM t3 WHERE t1.c1 = t3.c1)

It can be transformed into a hash join, not necessarily a nested loop join.

SELECT t1.* FROM t1, t3 WHERE t1.c1 = t3.c1

So we can probably do more than this.

@kwannoel
Copy link
Contributor

kwannoel commented Aug 5, 2022

For some exists query, e.g.

SELECT t1.* FROM t1
WHERE EXISTS (SELECT * FROM t3 WHERE t1.c1 = t3.c1)

It can be transformed into a hash join, not necessarily a nested loop join.

SELECT t1.* FROM t1, t3 WHERE t1.c1 = t3.c1

So we can probably do more than this.

Clarification: Do you mean that exists can be transformed into something we already test (hash join), so we probably don't need to generate it, there are other more important things to generate?

@lmatz
Copy link
Contributor

lmatz commented Aug 5, 2022

We still need to generate in these cases as we can test the transformed into something.
I mean, we can generate correlated subqueries in the exists clause.

There is a comment in this PR saying:

// TODO: Streaming nested loop join is not implemented yet.
// Tracked by: #2655.`

I thought it tries to say that because of this limitation, there is a set of queries we cannot generate in exists clause.
So I just try to clarify that there is a subset of queries we still can generate.

@kwannoel
Copy link
Contributor

kwannoel commented Aug 22, 2022

Any updates? Edit: Zongyu's laptop is broken unfortunately. Resume after new laptop is setup.

@lmatz lmatz changed the title feat(SqlSmith): gen Table Subquery ' EXISTS' feat(SqlSmith): gen Table Subquery EXISTS Sep 28, 2022
@marvenlee2486
Copy link
Contributor Author

Currently a few feature not yet implemented

  1. Local subquery inside Aggregate
  2. Correlated subquery in HAVING or SELECT with agg is not implemented

so I decided to workaround, as such
If currently the subquery is inside an aggregate function, then it won't generate EXISTS subquery

With this, when I want to generate correlated subquery. The following problem shown up #6350

As for Imatz comment on the possible EXISTS subquery generated by materialised view. I will look into what I can do in a later time.

@marvenlee2486 marvenlee2486 marked this pull request as draft November 14, 2022 10:07
@marvenlee2486
Copy link
Contributor Author

marvenlee2486 commented Nov 25, 2022

The problem I facing is "Feature is not yet implemented: correlated subquery in HAVING or SELECT with agg,"

This show that if I would want to generate correlated query (Not the local query), I need to pass the information about "Cannot have agg" all the way until gen_expr which is too much of parameter passing.

I am planning to refactor some code to able to workaround unsupported feature.

  1. put the two variable 'inside_agg' and 'can_agg' (
    /// In generating expression, there are two execution modes:
    ) as internal state instead of passing around gen_expr. Meaning that the variable will become global to all the methods (functions).
  2. Since the state is "global" to the method so I need to ensure correctness, the same way as how the gen_local_query done is, saving all the state and restoring the state upon leaving the function. (This may need some some refactor).
  3. By doing so, we can actually think of a few more internal state that can help us with workaround for example, if it is mv, then we avoid generating Nested Loop Join.

Previously, I didn't, store it as internal state because the function is keep on being called with in different context so to avoid mistake when I put them in internal state I didn't put them in internal state. But then now I feel it would be better to refactor this part to allow more workaround.

Since this PR is not about correlated query, and it is about EXISTS, would it be alright if I dun generate correlated query (meaning I generate local context) and have another PR about generating correlated query.

I need some advise on this as I think the refactor may need to change not less part.

@kwannoel
Copy link
Contributor

Since this PR is not about correlated query, and it is about EXISTS, would it be alright if I dun generate correlated query (meaning I generate local context) and have another PR about generating correlated query.

Sounds good.

@kwannoel
Copy link
Contributor

kwannoel commented Nov 25, 2022

This show that if I would want to generate correlated query (Not the local query), I need to pass the information about "Cannot have agg" all the way until gen_expr which is too much of parameter passing.

This is a good point, I think the way we workaround unsupported features have added complexity to the code.
I think an alternative to your suggestion is to handle by matching on the error unimplemented!(). I think this will handle unimplemented expressions without having to manually check can_agg, inside_agg depending on the unsupported expression.

So sqlsmith will still generate these invalid queries, but will just ignore them.

This is similar to our approach for permissible runtime errors:

fn is_permissible_error(db_error: &DbError) -> bool {
let is_internal_error = *db_error.code() == SqlState::INTERNAL_ERROR;
is_internal_error
&& (is_numeric_out_of_range_err(db_error)
|| is_broken_chan_err(db_error)
|| is_division_by_zero_err(db_error))
}

Drawback is if sqlsmith generates too many of these. We can verify if this is the case by writing sqlsmith test to check if percentage of generated unimplemented!() queries is above a certain threshold.

@marvenlee2486
Copy link
Contributor Author

Lets continue discussion under this issue #6592

@mergify
Copy link
Contributor

mergify bot commented Nov 25, 2022

Hey @marvenlee2486, this pull request failed to merge and has been dequeued from the merge train. If you believe your PR failed in the merge train because of a flaky test, requeue it by clicking "Update branch" or pushing an empty commit with git commit --allow-empty -m "rerun" && git push.

@mergify mergify bot merged commit cbdab8c into main Nov 26, 2022
@mergify mergify bot deleted the zongyu/sqlsmith_table_sub branch November 26, 2022 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(sqlsmith): Implement Table subqueries in where statement
3 participants