Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post build index creation #11711

Merged
merged 6 commits into from
Oct 4, 2023

Conversation

saurabhd336
Copy link
Contributor

@saurabhd336 saurabhd336 commented Sep 29, 2023

It may be desirable for some index types to be build post segment file has been created. This PR allows for that.
The behaviour can be achieved for a custom index being added (using index-spi) by overriding the getIndexBuildLifecycle method to return BuildLifecycle.POST_SEGMENT_CREATION

Release notes

Adds support for building indexes post segment file creation, allowing indexes that may depend on a completed segment to be built as part of the segment creation process.

@saurabhd336 saurabhd336 marked this pull request as ready for review September 29, 2023 10:01
@codecov-commenter
Copy link

codecov-commenter commented Sep 29, 2023

Codecov Report

Merging #11711 (de5f04e) into master (e0a1f62) will decrease coverage by 0.04%.
The diff coverage is 37.50%.

@@             Coverage Diff              @@
##             master   #11711      +/-   ##
============================================
- Coverage     63.13%   63.09%   -0.04%     
+ Complexity     1118     1117       -1     
============================================
  Files          2342     2342              
  Lines        125883   125913      +30     
  Branches      19357    19359       +2     
============================================
- Hits          79477    79446      -31     
- Misses        40749    40811      +62     
+ Partials       5657     5656       -1     
Flag Coverage Δ
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 63.03% <37.50%> (-0.03%) ⬇️
java-17 62.96% <37.50%> (+0.02%) ⬆️
java-20 62.94% <37.50%> (-0.05%) ⬇️
temurin 63.09% <37.50%> (-0.04%) ⬇️
unittests 63.09% <37.50%> (-0.04%) ⬇️
unittests1 67.24% <37.50%> (-0.04%) ⬇️
unittests2 14.43% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...ment/creator/impl/SegmentColumnarIndexCreator.java 86.21% <100.00%> (-0.05%) ⬇️
.../segment/index/dictionary/DictionaryIndexType.java 83.92% <100.00%> (+0.09%) ⬆️
...al/segment/index/nullvalue/NullValueIndexType.java 84.61% <100.00%> (+0.61%) ⬆️
.../org/apache/pinot/segment/spi/index/IndexType.java 87.50% <100.00%> (+20.83%) ⬆️
...t/creator/impl/SegmentIndexCreationDriverImpl.java 72.68% <16.66%> (-7.43%) ⬇️

... and 14 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

if (postSegCreationIndexes.size() > 0) {
// Build other indexes
Map<String, Object> props = new HashMap<>();
props.put(IndexLoadingConfig.READ_MODE_KEY, ReadMode.mmap);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naive question: is mmap mode alway the best choice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say so.. This method will be called by any component when building a segment (eg by the server when commiting consuming segment, by the minions when building a segment, by spark job when building building and pushing data etc). For example the PinotRecordReader always uses mmap (used by minion jobs to read segment and build new segments out of it) https://github.com/apache/pinot/blob/master/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/readers/PinotSegmentRecordReader.java#L116.

Any thoughts on this? I guess a new config can be used to specify the load type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! I think it's a good starting point if PinotRecordReader always uses mmap. We can revisit this later.

@zhtaoxiang
Copy link
Contributor

I approved the PR by accident 😛

@Jackie-Jiang Jackie-Jiang added enhancement Configuration Config changes (addition/deletion/change in behavior) labels Sep 30, 2023
@Jackie-Jiang
Copy link
Contributor

cc @gortiz

enum IndexBuildLifecycle {
DURING_SEGMENT_CREATION,
POST_SEGMENT_CREATION,
CUSTOM
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a javadoc explaining the meaning of each literal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just call itBuildTime as this enum is defined as part of IndexType

Copy link
Contributor

@gortiz gortiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said, I think it would be great to have a javadoc explaining the semantics of each lifecycle for each enum. Apart from that, I like the PR. Nice improvement!

enum IndexBuildLifecycle {
DURING_SEGMENT_CREATION,
POST_SEGMENT_CREATION,
CUSTOM
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just call itBuildTime as this enum is defined as part of IndexType

@@ -313,6 +325,36 @@ private void handlePostCreation()
buildStarTreeV2IfNecessary(segmentOutputDir);
}

Set<IndexType> postSegCreationIndexes = IndexService.getInstance().getAllIndexes().stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for a bit more readability, wrap those new changes to a method like addIndexIfNecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@Jackie-Jiang Jackie-Jiang added the release-notes Referenced by PRs that need attention when compiling the next release notes label Oct 3, 2023
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@saurabhd336 saurabhd336 merged commit 873992f into apache:master Oct 4, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Configuration Config changes (addition/deletion/change in behavior) enhancement ingestion release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants