-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fail MSQ compaction if multi-valued partition dimensions are found #17344
Fail MSQ compaction if multi-valued partition dimensions are found #17344
Conversation
@gargvishesh , is this something MSQ always does silently? |
...-core/multi-stage-query/src/main/java/org/apache/druid/msq/indexing/MSQCompactionRunner.java
Outdated
Show resolved
Hide resolved
...-core/multi-stage-query/src/main/java/org/apache/druid/msq/indexing/MSQCompactionRunner.java
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/client/indexing/ClientCompactionRunnerInfo.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/client/indexing/ClientCompactionRunnerInfo.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/client/indexing/ClientCompactionRunnerInfo.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/client/indexing/ClientCompactionRunnerInfo.java
Outdated
Show resolved
Hide resolved
@kfaraz: yes, if MVD fields are found, there is just a log that states shard spec is changed to druid/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerImpl.java Line 1104 in c2149d5
This in fact also means that compaction will keeping continuing in a loop. I was thinking about looking for a solution to this problem post this PR. |
Oh, that can be a problem. When can this happen? |
Right. So this case can happen when there is no need to download the segments, ie. user specifies dimensions, metrics, granularity and rollup in the compaction spec itself, but one of the dimensions happens to be MVD. |
Can we ensure we download the segments in this case too? |
Drawback was mainly the high cost of downloading segments. But yes I think this would be the only viable way for now. |
@kfaraz: I have now updated the logic to download segments when we need MVD info. The check is also updated to be more fine grained to avoid redundant download. |
// Segments are downloaded even when just needMultiValuedDimensions=true since MSQ switches to dynamic | ||
// partitioning on finding any 'range' partition dimension to be multi-valued at runtime, which ends up in a | ||
// mismatch between the compaction config and the actual segments (lastCompactionState), leading to | ||
// repeated compactions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put this in the javadoc of this method instead.
static boolean isPossiblyRollup(@Nullable ClientCompactionTaskGranularitySpec granularitySpec) | ||
{ | ||
return !(granularitySpec != null && Boolean.FALSE.equals(granularitySpec.isRollup())); | ||
} | ||
|
||
static boolean isRangePartitioned(@Nullable TuningConfig tuningConfig) | ||
{ | ||
return tuningConfig != null && tuningConfig.getPartitionsSpec() instanceof DimensionRangePartitionsSpec; | ||
} | ||
|
||
static boolean isRollupOnStringDimension( | ||
DimensionsSpec dimensionsSpec, | ||
@Nullable ClientCompactionTaskGranularitySpec granularitySpec | ||
) | ||
{ | ||
if (!isPossiblyRollup(granularitySpec)) { | ||
return false; | ||
} | ||
return dimensionsSpec.getDimensions() | ||
.stream() | ||
.anyMatch(dim -> ColumnType.STRING.equals(dim.getColumnType())); | ||
} | ||
|
||
static boolean isPartitionedOnStringDimension(DimensionsSpec dimensionsSpec, @Nullable TuningConfig tuningConfig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think all of these need to be separate methods as they are not being reused anywhere afaict. They can just be variables inside the needMVD
method.
* therefore is particularly for MSQCompactionRunner; it returns true when dimension types aren't known | ||
* from CompactionTask spec, or if either rollup or partition dimensions contain any string-type column. | ||
*/ | ||
public static boolean needMultiValuedDimensions(CompactionTask compactionTask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No point making this method static if it requires passing the CompactionTask
into it.
Also, instead of making this method public or package private, CompactionTask
should instead expose a method needsToDownloadSegments()
or something. That can internally call this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No point making this method static if it requires passing the CompactionTask into it.
Agree. One possibility is that its value can be assigned to a final field and the getter is exposed
Also, instead of making this method public or package private, CompactionTask should instead expose a method needsToDownloadSegments() or something. That can internally call this one.
This would need some refactoring since the effective granularitySpec on which download depends is computed inside createDataSchemaForIntervals
, just before passing this computed value to ExistingSegmentAnalyzer
where the final need to download is evaluated. Also, each need*
flag is separately communicated to the Analyzer class to do selective processing.
Does the proposed approach of assigning needMultiValuedDimensions
to a private field and creating tests on it work? In fact, all such fields apart from needGranularitySpec
can be populated in CompactionTask constructor and passed to the Analyzer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just relooked at code. Only segmentGranularity is computed later, so all need* flags and an additional new needsDownload flag can be computed in the CompactionTask constructor itself. The needsDownload can have a public getter for tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If only needed in test, keep the method as package private rather than public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public is needed since the tests reside in the MSQ extension
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach looks good. Minor cleanup suggestions.
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
indexing-service/src/main/java/org/apache/druid/indexing/common/task/CompactionTask.java
Outdated
Show resolved
Hide resolved
@@ -2015,6 +2100,37 @@ public SettableColumnValueSelector<?> makeNewSettableColumnValueSelector() | |||
|
|||
} | |||
|
|||
/** | |||
* A class to mimic MSQCompactionRunner behaviour, since the actual class resides in the MSQ extn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we mimicking the behaviour? The methods seem to return only null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just mimics for validations. Updated the comment
...-core/multi-stage-query/src/main/java/org/apache/druid/msq/indexing/MSQCompactionRunner.java
Outdated
Show resolved
Hide resolved
...-core/multi-stage-query/src/main/java/org/apache/druid/msq/indexing/MSQCompactionRunner.java
Outdated
Show resolved
Hide resolved
...-core/multi-stage-query/src/main/java/org/apache/druid/msq/indexing/MSQCompactionRunner.java
Outdated
Show resolved
Hide resolved
…pache#17344) MSQ currently supports only single-valued string dimensions as partition keys. This patch adds a check to ensure that partition keys are single-valued in case this info is available by virtue of segment download for schema inference. During compaction, if MSQ finds multi-valued dimensions (MVDs) declared as part of `range` partitionsSpec, it switches partitioning type to dynamic, ending up in repeated compactions of the same interval. To avoid this scenario, the segment download logic is also updated to always download segments if info on multi-valued dimensions is required.
MSQ currently supports only single-valued string dimensions as partition keys. This patch adds a check to ensure that partition keys are single-valued in case this info is available by virtue of segment download for schema inference.
During compaction, if MSQ finds multi-valued dimensions (MVDs) declared as part of
range
partitionsSpec, it switches partitioning type todynamic
, ending up in repeated compactions of the same interval. To avoid this scenario, the segment download logic is also updated to always download segments if info on multi-valued dimensions is required.