Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail MSQ compaction if multi-valued partition dimensions are found #17344

Merged

Conversation

gargvishesh
Copy link
Contributor

@gargvishesh gargvishesh commented Oct 15, 2024

MSQ currently supports only single-valued string dimensions as partition keys. This patch adds a check to ensure that partition keys are single-valued in case this info is available by virtue of segment download for schema inference.

During compaction, if MSQ finds multi-valued dimensions (MVDs) declared as part of range partitionsSpec, it switches partitioning type to dynamic, ending up in repeated compactions of the same interval. To avoid this scenario, the segment download logic is also updated to always download segments if info on multi-valued dimensions is required.

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Oct 15, 2024
@kfaraz
Copy link
Contributor

kfaraz commented Oct 15, 2024

the partitioning eventually ends up getting switched to dynamic type.

@gargvishesh , is this something MSQ always does silently?

@gargvishesh
Copy link
Contributor Author

@kfaraz: yes, if MVD fields are found, there is just a log that states shard spec is changed to NumberedShardSpec type:

This in fact also means that compaction will keeping continuing in a loop. I was thinking about looking for a solution to this problem post this PR.

@kfaraz
Copy link
Contributor

kfaraz commented Oct 15, 2024

This in fact also means that compaction will keeping continuing in a loop. I was thinking about looking for a solution to this problem post this PR.

Oh, that can be a problem. When can this happen?
Once we have downloaded the segments, we know if it has multi-valued dimensions or not. So the compaction task would always fail, correct?

@gargvishesh
Copy link
Contributor Author

Right. So this case can happen when there is no need to download the segments, ie. user specifies dimensions, metrics, granularity and rollup in the compaction spec itself, but one of the dimensions happens to be MVD.

@kfaraz
Copy link
Contributor

kfaraz commented Oct 15, 2024

Can we ensure we download the segments in this case too?
What would be the drawback?

@gargvishesh
Copy link
Contributor Author

Drawback was mainly the high cost of downloading segments. But yes I think this would be the only viable way for now.

@gargvishesh gargvishesh changed the title Add multi-valued check for partition dimensions Fail MSQ compaction if multi-valued partition dimensions are found Oct 15, 2024
@gargvishesh
Copy link
Contributor Author

@kfaraz: I have now updated the logic to download segments when we need MVD info. The check is also updated to be more fine grained to avoid redundant download.

Comment on lines 861 to 864
// Segments are downloaded even when just needMultiValuedDimensions=true since MSQ switches to dynamic
// partitioning on finding any 'range' partition dimension to be multi-valued at runtime, which ends up in a
// mismatch between the compaction config and the actual segments (lastCompactionState), leading to
// repeated compactions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put this in the javadoc of this method instead.

Comment on lines 460 to 483
static boolean isPossiblyRollup(@Nullable ClientCompactionTaskGranularitySpec granularitySpec)
{
return !(granularitySpec != null && Boolean.FALSE.equals(granularitySpec.isRollup()));
}

static boolean isRangePartitioned(@Nullable TuningConfig tuningConfig)
{
return tuningConfig != null && tuningConfig.getPartitionsSpec() instanceof DimensionRangePartitionsSpec;
}

static boolean isRollupOnStringDimension(
DimensionsSpec dimensionsSpec,
@Nullable ClientCompactionTaskGranularitySpec granularitySpec
)
{
if (!isPossiblyRollup(granularitySpec)) {
return false;
}
return dimensionsSpec.getDimensions()
.stream()
.anyMatch(dim -> ColumnType.STRING.equals(dim.getColumnType()));
}

static boolean isPartitionedOnStringDimension(DimensionsSpec dimensionsSpec, @Nullable TuningConfig tuningConfig)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think all of these need to be separate methods as they are not being reused anywhere afaict. They can just be variables inside the needMVD method.

* therefore is particularly for MSQCompactionRunner; it returns true when dimension types aren't known
* from CompactionTask spec, or if either rollup or partition dimensions contain any string-type column.
*/
public static boolean needMultiValuedDimensions(CompactionTask compactionTask)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No point making this method static if it requires passing the CompactionTask into it.
Also, instead of making this method public or package private, CompactionTask should instead expose a method needsToDownloadSegments() or something. That can internally call this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No point making this method static if it requires passing the CompactionTask into it.

Agree. One possibility is that its value can be assigned to a final field and the getter is exposed

Also, instead of making this method public or package private, CompactionTask should instead expose a method needsToDownloadSegments() or something. That can internally call this one.

This would need some refactoring since the effective granularitySpec on which download depends is computed inside createDataSchemaForIntervals, just before passing this computed value to ExistingSegmentAnalyzer where the final need to download is evaluated. Also, each need* flag is separately communicated to the Analyzer class to do selective processing.

Does the proposed approach of assigning needMultiValuedDimensions to a private field and creating tests on it work? In fact, all such fields apart from needGranularitySpec can be populated in CompactionTask constructor and passed to the Analyzer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just relooked at code. Only segmentGranularity is computed later, so all need* flags and an additional new needsDownload flag can be computed in the CompactionTask constructor itself. The needsDownload can have a public getter for tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If only needed in test, keep the method as package private rather than public.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public is needed since the tests reside in the MSQ extension

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks good. Minor cleanup suggestions.

@@ -2015,6 +2100,37 @@ public SettableColumnValueSelector<?> makeNewSettableColumnValueSelector()

}

/**
* A class to mimic MSQCompactionRunner behaviour, since the actual class resides in the MSQ extn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we mimicking the behaviour? The methods seem to return only null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just mimics for validations. Updated the comment

@kfaraz kfaraz merged commit 5da9949 into apache:master Oct 19, 2024
93 of 94 checks passed
kirangadhave-imply pushed a commit to kirangadhave-imply/druid that referenced this pull request Oct 22, 2024
…pache#17344)

MSQ currently supports only single-valued string dimensions as partition keys.
This patch adds a check to ensure that partition keys are single-valued in case
this info is available by virtue of segment download for schema inference.

During compaction, if MSQ finds multi-valued dimensions (MVDs) declared as part
of `range` partitionsSpec, it switches partitioning type to dynamic, ending up in
repeated compactions of the same interval. To avoid this scenario, the segment
download logic is also updated to always download segments if info on multi-valued
dimensions is required.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants