[Multi-stage] Support partition based leaf stage processing #11234
Codecov Report
@@ Coverage Diff @@
## master #11234 +/- ##
=========================================
Coverage 0.11% 0.11%
=========================================
Files 2229 2155 -74
Lines 119951 116802 -3149
Branches 18171 17727 -444
=========================================
Hits 137 137
+ Misses 119794 116645 -3149
Partials 20 20
... and 74 files with indirect coverage changes
looks good to me. minor comments
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java
...-planner/src/main/java/org/apache/pinot/query/planner/physical/MailboxAssignmentVisitor.java
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java
@@ -354,14 +349,12 @@ private ColocatedTableInfo getColocatedTableInfo(String tableName) {
 TimeBoundaryInfo timeBoundaryInfo = _routingManager.getTimeBoundaryInfo(offlineTableName);
 // Ignore OFFLINE side when time boundary info is unavailable
 if (timeBoundaryInfo == null) {
-  return getRealtimeColocatedTableInfo(realtimeTableName);
+  return getRealtimeColocatedTableInfo(realtimeTableName, partitionKey, numPartitions);
should we add an integration test for hybrid table colocated join? i am not sure how we can mock that in a unit test, but an integration test would be super helpful to cover this.
Agree. That won't be trivial, so probably can be addressed in a separate PR.
...y-runtime/src/test/java/org/apache/pinot/query/runtime/queries/ResourceBasedQueriesTest.java
 },
 {
 "description": "Colocated, Dynamic broadcast SEMI-JOIN with partially empty right table result for some servers",
- "sql": "SELECT /*+ joinOptions(join_strategy='dynamic_broadcast', is_colocated_by_join_keys='true') */ {tbl1}.name, COUNT(*) FROM {tbl1} WHERE {tbl1}.num IN (SELECT {tbl2}.num FROM {tbl2} WHERE {tbl2}.val = 'z') GROUP BY {tbl1}.name"
+ "sql": "SELECT /*+ joinOptions(join_strategy='dynamic_broadcast') */ {tbl1}.name, COUNT(*) FROM {tbl1} /*+ tableOptions(partition_key='num', partition_size='4') */ WHERE {tbl1}.num IN (SELECT {tbl2}.num FROM {tbl2} /*+ tableOptions(partition_key='num', partition_size='4') */ WHERE {tbl2}.val = 'z') GROUP BY {tbl1}.name"
can we add a `set stageParallelism = 2;` variant to the hinted tests here? we never tested option + hints at the same time, good to add. (but can be followed up, i just tried and they all passed)
Currently `stageParallelism` is ignored. Can be added when we add the support to further split on single partition data
it will not be if we do agg on top of join that joins on one key, which is colocated, but group by on a different key -- that one will still be run with the stage parallelism setting, yes?
Correct, even though they won't be applied to the same stage at the same time. Added one query
5f966f8 to e23c7d0
 {
 "description": "Inner join with group by",
- "sql": "EXPLAIN PLAN FOR SELECT /*+ joinOptions(is_colocated_by_join_keys='true'), aggOptions(is_partitioned_by_group_by_keys='true') */a.col1, AVG(b.col3) FROM a JOIN b ON a.col1 = b.col2 WHERE a.col3 >= 0 AND a.col2 = 'a' AND b.col3 < 0 GROUP BY a.col1",
+ "sql": "EXPLAIN PLAN FOR SELECT /*+ aggOptions(is_partitioned_by_group_by_keys='true') */ a.col1, AVG(b.col3) FROM a JOIN b ON a.col1 = b.col2 WHERE a.col3 >= 0 AND a.col2 = 'a' AND b.col3 < 0 GROUP BY a.col1",
it'll be good to have a few planner side tests that use both the `aggOptions` for `is_partitioned_by_group_by_keys` and the new table options for partitioning. One which also uses the join option to set `dynamic_broadcast` with table partitioning and the `is_partitioned_by_group_by_keys` `aggOption` as well
The reason why I didn't add plan test for this hint is because this hint is not applied during the planning time, so the plan is always the same with/without the hint
as @Jackie-Jiang mentioned the table hints are not altering the logical plan so the only thing we can test here is verify the hints were attached properly (e.g. modify the TableScanNode toString method)
however, since we added physical plan in #11052 we can try to verify the physical plan but that's definitely a follow up IMO
Preconditions.checkState(tablePartitionInfo.getPartitionColumn().equals(partitionKey),
    "Partition key: %s does not match partition column: %s for table: %s", partitionKey,
    tablePartitionInfo.getPartitionColumn(), tableNameWithType);
Preconditions.checkState(tablePartitionInfo.getNumPartitions() == numPartitions,
i was just curious, why even allow users to pass in a `numPartitions` if we expect it to match the `tablePartitionInfo`? Can't we just look this up and set the numPartitions ourselves?
Good point. I think we keep it this way for future proof, when we want different partitions within query vs table. cc @walterddr for more details
yes technically speaking the table partition hint is used for 2 reasons (mixed)
- to indicate what the table partition is - which we can improve in the future to be automatic
- to indicate that data is already partitioned and there is no need to reshuffle it - this we can debate whether we want to create a new hint for such behavior change
Got it, thanks for the explanation
@@ -1,33 +1,19 @@
 {
 "pinot_hint_option_tests": {
 "queries": [
 {
why not add some planner tests for the tableOptions for partition key + num partitions?
see #11234 (comment)
Adds support for the new `tableOptions` hint with `partition_key` and `partition_size`.
When this table option (hint) is attached to the leaf stage, we honor it and process the leaf stage for each partition with a separate thread. For this to work, the table must be partitioned, and all the segments for any partition must be served by the same server. An exception is thrown if the table does not meet this condition.
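The requirement above (every segment of a partition must live on one server) can be illustrated with a minimal sketch. This is not the code from this PR: `PartitionWorkerAssignment`, `SegmentLocation`, and `assignPartitionsToServers` are hypothetical names invented for illustration, and the real logic in `WorkerManager` may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: class, field, and method names are hypothetical, not Pinot APIs. */
final class PartitionWorkerAssignment {

  /** One segment, the partition it belongs to, and the server that hosts it. */
  static final class SegmentLocation {
    final String segmentName;
    final int partitionId;
    final String server;

    SegmentLocation(String segmentName, int partitionId, String server) {
      this.segmentName = segmentName;
      this.partitionId = partitionId;
      this.server = server;
    }
  }

  /**
   * Returns partitionId -> server, so that each partition can be handed to its own worker.
   * Throws IllegalStateException if a partition id is out of range, or if the segments of
   * any partition are spread across more than one server.
   */
  static Map<Integer, String> assignPartitionsToServers(List<SegmentLocation> segments, int numPartitions) {
    Map<Integer, String> partitionToServer = new TreeMap<>();
    for (SegmentLocation segment : segments) {
      if (segment.partitionId < 0 || segment.partitionId >= numPartitions) {
        throw new IllegalStateException(
            "Segment " + segment.segmentName + " has invalid partition id: " + segment.partitionId);
      }
      // putIfAbsent returns the server already recorded for this partition, or null if none.
      String previous = partitionToServer.putIfAbsent(segment.partitionId, segment.server);
      if (previous != null && !previous.equals(segment.server)) {
        throw new IllegalStateException("Partition " + segment.partitionId
            + " is served by multiple servers: " + previous + ", " + segment.server);
      }
    }
    return partitionToServer;
  }
}
```

With such a mapping in hand, a planner could create one leaf-stage worker per entry. Failing fast on a partition that spans servers mirrors the behavior described above, where the query throws rather than silently falling back.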
Without the hint, the leaf stage is always processed as a whole, which is less efficient. More importantly, when the leaf stage result is partitioned, the intermediate stage can also benefit by increasing parallelism and avoiding a data shuffle. JOIN (which becomes a colocated join) and GROUP BY (which gets higher parallelism) benefit the most.
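To see why a colocated join needs no shuffle, consider a small sketch. It assumes both tables are partitioned on the join key with the same partition function and the same number of partitions. The modulo function used here is an assumption for illustration only, and `ColocatedJoinSketch` and its methods are hypothetical names, not Pinot APIs. Under that assumption, equal keys always land in the same partition, so joining partition by partition finds exactly the matches a global join would.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: all names here are hypothetical, not Pinot APIs. */
final class ColocatedJoinSketch {

  /** Assumed partition function: modulo on the join key. */
  static int partitionOf(int key, int numPartitions) {
    return Math.floorMod(key, numPartitions);
  }

  /** Splits the keys into numPartitions buckets using partitionOf. */
  static List<List<Integer>> partition(List<Integer> keys, int numPartitions) {
    List<List<Integer>> partitions = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      partitions.add(new ArrayList<>());
    }
    for (int key : keys) {
      partitions.get(partitionOf(key, numPartitions)).add(key);
    }
    return partitions;
  }

  /** Counts matching (left, right) pairs of an equi-join with a plain nested loop. */
  static long countJoinMatches(List<Integer> left, List<Integer> right) {
    long matches = 0;
    for (int l : left) {
      for (int r : right) {
        if (l == r) {
          matches++;
        }
      }
    }
    return matches;
  }

  /** Joins partition i of the left side only with partition i of the right side. */
  static long countPartitionedJoinMatches(List<Integer> left, List<Integer> right, int numPartitions) {
    List<List<Integer>> leftPartitions = partition(left, numPartitions);
    List<List<Integer>> rightPartitions = partition(right, numPartitions);
    long matches = 0;
    for (int i = 0; i < numPartitions; i++) {
      // Each iteration is independent, so it could run on its own worker without exchanging data.
      matches += countJoinMatches(leftPartitions.get(i), rightPartitions.get(i));
    }
    return matches;
  }
}
```

Because the per-partition joins are independent, they can run in parallel, which is also where the higher parallelism for GROUP BY on the partition key comes from.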
The old `joinOptions` hint `is_colocated_by_join_keys` is removed because the same effect can be achieved with the new hint on both the left and right tables.
See some example queries in `QueryHints.json`.