-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow missing intervals for Parallel task with hash/range partitioning #10592
Conversation
No comment on the code itself, but very exciting to remove the extra requirement from the user! |
final boolean needsInputSampling = | ||
partitionsSpec.getNumShards() == null | ||
|| ingestionSchemaToUse.getDataSchema().getGranularitySpec().inputIntervals().isEmpty(); | ||
if (needsInputSampling) { | ||
// 0. need to determine numShards by scanning the data | ||
LOG.info("numShards is unspecified, beginning %s phase.", PartialDimensionCardinalityTask.TYPE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this log statement needs a change now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
LOG.info("Automatically determined numShards: " + numShardsOverride); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there an equivalent info being logged somewhere now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I didn't add one for intervalToNumShards
because it could have lots of intervals.
@abhishekagarwal87 do you have more comments? |
@jihoonson LGTM |
@clintropolis @himanshug @abhishekagarwal87 thanks for the review 👍 |
apache#10592) * Allow missing intervals for Parallel task * fix row filter * fix tests * fix log
Description
This PR allows Parallel task to run without explicit intervals in granularitySpec. If intervals are missing, the parallel task executes an extra step for input sampling which collects the intervals to index.
This PR additionally fixes a bug when
numShards
is missing in hash partitioning. WhennumShards
is missing, the parallel task computes it by scanning the whole input. However, the computed numShards was ignored when it's serialized into JSON. To fix it, this PR adds another fieldintervalToNumShardsOverride
which stores the computed numShards per interval so that we can handle data skew well across intervals.This PR has: