Allow missing intervals for Parallel task with hash/range partitioning #10592

Merged (4 commits) on Nov 25, 2020

jihoonson
Contributor

Description

This PR allows the Parallel task to run without explicit intervals in granularitySpec. If intervals are missing, the Parallel task runs an extra input-sampling step that collects the intervals to index.
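
As a rough illustration of what "collecting the intervals to index" can mean, the sketch below buckets sampled row timestamps into UTC day buckets. It is not the code from this PR: the class name `IntervalSamplingSketch`, the method `collectDayBuckets`, and the fixed DAY granularity are all assumptions made for the example.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch; these names are not Druid classes.
class IntervalSamplingSketch {

  /**
   * Returns the start of every distinct UTC day that contains at least one
   * of the given timestamps. Each returned instant stands for the interval
   * [start, start + 1 day). DAY granularity is an assumption of this sketch.
   */
  static SortedSet<Instant> collectDayBuckets(List<Instant> timestamps) {
    SortedSet<Instant> buckets = new TreeSet<>();
    for (Instant ts : timestamps) {
      // Truncating to DAYS maps every row in the same UTC day to one bucket.
      buckets.add(ts.truncatedTo(ChronoUnit.DAYS));
    }
    return buckets;
  }

  public static void main(String[] args) {
    List<Instant> sampled = Arrays.asList(
        Instant.parse("2020-11-18T03:00:00Z"),
        Instant.parse("2020-11-18T22:15:00Z"),
        Instant.parse("2020-11-25T00:00:00Z"));
    // Two of the three rows share a day, so two buckets are collected.
    System.out.println(collectDayBuckets(sampled));
  }
}
```

With explicit intervals this step is unnecessary, since the set of buckets is already known from the spec.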

This PR also fixes a bug that occurs when numShards is missing in hash partitioning. When numShards is missing, the Parallel task computes it by scanning the whole input. However, the computed numShards was dropped during JSON serialization. To fix this, the PR adds a new field, intervalToNumShardsOverride, which stores the computed numShards per interval, so that data skew across intervals is handled well.
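
To make the per-interval idea concrete, here is a minimal sketch of deriving a separate numShards for each interval from an estimated row count. The description above does not say how the task turns its scan results into numShards, so the ceiling division by a target segment size is an assumption, as are the names `NumShardsPerIntervalSketch`, `computeNumShards`, and `maxRowsPerSegment` as used here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the implementation in this PR.
class NumShardsPerIntervalSketch {

  /**
   * Maps each interval to a shard count of ceil(rows / maxRowsPerSegment),
   * with a minimum of one shard. The formula is an assumption of this sketch.
   */
  static Map<String, Integer> computeNumShards(
      Map<String, Long> estimatedRowsPerInterval,
      int maxRowsPerSegment) {
    if (maxRowsPerSegment <= 0) {
      throw new IllegalArgumentException("maxRowsPerSegment must be positive");
    }
    Map<String, Integer> intervalToNumShards = new LinkedHashMap<>();
    for (Map.Entry<String, Long> entry : estimatedRowsPerInterval.entrySet()) {
      long rows = entry.getValue();
      // Ceiling division, so a partially filled last shard still counts.
      long shards = (rows + maxRowsPerSegment - 1) / maxRowsPerSegment;
      intervalToNumShards.put(entry.getKey(), (int) Math.max(1L, shards));
    }
    return intervalToNumShards;
  }

  public static void main(String[] args) {
    Map<String, Long> rows = new LinkedHashMap<>();
    rows.put("2020-11-18/2020-11-19", 12_000_000L); // skewed, heavy day
    rows.put("2020-11-19/2020-11-20", 300_000L);    // light day
    System.out.println(computeNumShards(rows, 5_000_000));
  }
}
```

A single global numShards would either over-split the light interval or under-split the heavy one, which is the skew that a per-interval map avoids.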


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jihoonson changed the title from "Allow missing intervals for Parallel task" to "Allow missing intervals for Parallel task with hash/range partitioning" on Nov 18, 2020
@vogievetsky
Contributor

No comment on the code itself, but very exciting to remove the extra requirement from the user!

final boolean needsInputSampling =
    partitionsSpec.getNumShards() == null
    || ingestionSchemaToUse.getDataSchema().getGranularitySpec().inputIntervals().isEmpty();
if (needsInputSampling) {
  // 0. need to determine numShards by scanning the data
  LOG.info("numShards is unspecified, beginning %s phase.", PartialDimensionCardinalityTask.TYPE);
Contributor

this log statement needs a change now

Contributor Author

👍


LOG.info("Automatically determined numShards: " + numShardsOverride);
Contributor

is there an equivalent info being logged somewhere now?

Contributor Author

No, I didn't add one for intervalToNumShards because it could have lots of intervals.

@jihoonson
Contributor Author

@abhishekagarwal87 do you have more comments?

@abhishekagarwal87
Contributor

> @abhishekagarwal87 do you have more comments?

@jihoonson LGTM

@jihoonson
Contributor Author

@clintropolis @himanshug @abhishekagarwal87 thanks for the review 👍

@jihoonson merged commit 7462b0b into apache:master on Nov 25, 2020
@jihoonson added this to the 0.21.0 milestone on Jan 4, 2021
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
apache#10592)

* Allow missing intervals for Parallel task

* fix row filter

* fix tests

* fix log