Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodically delete Tmp Segment file brought by the SplitCommit End phase #10815

Merged
merged 23 commits into from
Sep 19, 2023

Conversation

wirybeaver
Copy link
Contributor

@wirybeaver wirybeaver commented May 29, 2023

issue #10243

  • Unify the tmp file naming format
  • Delete tmp files at a regular cadence by extending the ControllerPeriodicTask

Unify the tmp file naming format

In the 2nd phase (Segment Upload) of segment split commit, Server2ControllerSegmentUploader and PinotFSSegmentUploader diverges on the naming pattern of temporary segment file, which are <segmentName>.tmp.<UUID> and <segmentName><UUID> respectively. In the 3rd phase (CommitEnd), the controller moves the temporary file to permanent file and then synchronously delete other potential temporary files that were incidentally generated due to the flaky network. However, the deletion currently only filters tmp files matches Server2ControllerSegmentUploader's pattern. Thus, we need to unify the tmp file pattern for both uploaders.

Periodic File Deletion

Add a clusters level knob to turn on async tmp file deletion. If it's turned on, the time consuming FS.listFiles() won't be invoked and thus increase the throughput of realtime consuming.

Deletion files only if the following conditions are meet:
(1) the name ends with .tmp.<UUID>
(2) out of the configurable retention period, e.g. 10 min. Don't want to make the retention too short or zero. Because it's likely that the current consuming segment is being committed by the lead server and the lead server will generate an intermediate tmp segment, which is a valid case.
(3) the file path is not in the set of segment metadata's download url

New config

  • controller.realtime.segment.tmpFileAsyncDeletionEnabled (default false)
  • controller.realtime.segment.tmpFileRetentionInSeconds (default 3600)

@wirybeaver wirybeaver marked this pull request as draft May 29, 2023 16:57
@wirybeaver wirybeaver changed the title Periodical Tmp Segment file deletion [Draft] Periodical Tmp Segment file deletion May 29, 2023
Copy link
Contributor Author

@wirybeaver wirybeaver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankitsultana Please check my self-review. Wanted to solicit your idea.

@@ -74,7 +73,7 @@ public URI uploadSegment(File segmentFile, LLCSegmentName segmentName, int timeo
final String rawTableName = TableNameBuilder.extractRawTableName(segmentName.getTableName());
Callable<URI> uploadTask = () -> {
URI destUri = new URI(StringUtil.join(File.separator, _segmentStoreUriStr, segmentName.getTableName(),
segmentName.getSegmentName() + UUID.randomUUID().toString()));
SegmentCompletionUtils.generateSegmentFileName(segmentName.getSegmentName())));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Server2ControllerSegmentUploader and PinotFSSegmentUploader has different naming pattern for tmp segments before.
Server2ControllerSegmentUploader: segment_name.tmp.
PinotFSSegmentUploader: segment_name.

@ankitsultana
Copy link
Contributor

@wirybeaver Let's follow-up offline

@Jackie-Jiang
Copy link
Contributor

Do we need to clean up tmp files periodically, or just when the controller starts? Will controller create extra tmp files during the regular life cycle without shut down?

@wirybeaver
Copy link
Contributor Author

wirybeaver commented Jun 28, 2023

Do we need to clean up tmp files periodically, or just when the controller starts? Will controller create extra tmp files during the regular life cycle without shut down?

Hi Jackie, there are two kind of segment uploader during SegmentUpload phase for split commit.

  1. Server2ControllerSegmentUploader, controller upload the tmp file <seg_name>.tmp.<uuid> to deep store.
  2. PinotFSSegmentUploader, server upload the tmp file <seg_name><uuid> to deep store.

In the CommitEnd phase, the deep storage can have outage due to flaky network when controller attempt to rename the tmp file. If peer downloading is enabled, controller will continue to commit the segment in ZK after timeout. The tmp file named with unique <uuid> is left there in the regular lifecycle.

@ankitsultana
Copy link
Contributor

Do we need to clean up tmp files periodically, or just when the controller starts?

To add to @wirybeaver 's response, we will need to add a periodic job for this clean-up. This PR is corresponding to this doc we had written a while back: https://docs.google.com/document/u/1/d/18PqeN7eUP9Yc2MqwihSrGMmadJ3nuDsrcWf6F_ks_hQ/edit

In our internal clusters, we found that 50% of all deep-store data for realtime tables was just leaked data.

@codecov-commenter
Copy link

codecov-commenter commented Jul 6, 2023

Codecov Report

Merging #10815 (7a91eff) into master (75ab18b) will increase coverage by 1.55%.
Report is 135 commits behind head on master.
The diff coverage is 56.09%.

@@             Coverage Diff              @@
##             master   #10815      +/-   ##
============================================
+ Coverage     61.52%   63.07%   +1.55%     
+ Complexity     6513     1106    -5407     
============================================
  Files          2233     2326      +93     
  Lines        120083   125049    +4966     
  Branches      18223    19094     +871     
============================================
+ Hits          73878    78873    +4995     
+ Misses        40798    40563     -235     
- Partials       5407     5613     +206     
Flag Coverage Δ
integration <0.01% <0.00%> (?)
integration1 <0.01% <0.00%> (+<0.01%) ⬆️
integration2 0.00% <0.00%> (ø)
java-11 63.03% <56.09%> (+1.54%) ⬆️
java-17 62.93% <56.09%> (+1.57%) ⬆️
java-20 62.92% <56.09%> (+1.53%) ⬆️
temurin 63.07% <56.09%> (+1.55%) ⬆️
unittests 63.06% <56.09%> (?)
unittests1 67.49% <66.66%> (+0.50%) ⬆️
unittests2 14.51% <41.46%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
...g/apache/pinot/common/metrics/ControllerMeter.java 0.00% <0.00%> (ø)
...er/api/resources/LLCSegmentCompletionHandlers.java 0.00% <0.00%> (ø)
...r/validation/RealtimeSegmentValidationManager.java 39.02% <12.50%> (-2.87%) ⬇️
...g/apache/pinot/common/metrics/AbstractMetrics.java 39.51% <37.50%> (-0.39%) ⬇️
.../core/realtime/PinotLLCRealtimeSegmentManager.java 59.66% <57.69%> (-0.61%) ⬇️
...apache/pinot/controller/BaseControllerStarter.java 78.16% <100.00%> (+0.06%) ⬆️
...va/org/apache/pinot/controller/ControllerConf.java 57.78% <100.00%> (+0.03%) ⬆️
.../data/manager/realtime/PinotFSSegmentUploader.java 75.55% <100.00%> (ø)
.../data/manager/realtime/SegmentCompletionUtils.java 100.00% <100.00%> (ø)

... and 523 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@wirybeaver wirybeaver marked this pull request as ready for review July 7, 2023 04:15
@wirybeaver wirybeaver changed the title [Draft] Periodical Tmp Segment file deletion Periodical Tmp Segment file deletion Jul 7, 2023
@wirybeaver wirybeaver changed the title Periodical Tmp Segment file deletion Periodically delete Tmp Segment file brought the allowed failure in SplitCommit End phase Jul 7, 2023
@wirybeaver wirybeaver changed the title Periodically delete Tmp Segment file brought the allowed failure in SplitCommit End phase Periodically delete Tmp Segment file brought by the SplitCommit End phase Jul 7, 2023
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some high level questions:

  1. Should we always rely on the async cleanup and remove the sync cleanup (assuming listing files is expensive)? I can see we can integrate more clean up logic into this periodic task in the future. Also, let's make the periodic task name more generic for future prove
  2. Do we create temp file for non-split-commit scenario?

@wirybeaver
Copy link
Contributor Author

wirybeaver commented Jul 14, 2023

Some high level questions:

  1. Should we always rely on the async cleanup and remove the sync cleanup (assuming listing files is expensive)? I can see we can integrate more clean up logic into this periodic task in the future. Also, let's make the periodic task name more generic for future prove
  2. Do we create temp file for non-split-commit scenario?

For question 1, I already ensured the sync cleanup will be skipped if async deletion is enabled.

    if (!isTmpSegmentAsyncDeletionEnabled()) 
      try {
        for (String uri : pinotFS.listFiles(tableDirURI, false)) {
          if (uri.contains(SegmentCompletionUtils.getSegmentNamePrefix(segmentName))) {
            LOGGER.warn("Deleting temporary segment file: {}", uri);
            Preconditions.checkState(pinotFS.delete(new URI(uri), true), "Failed to delete file: %s", uri);

Rename it as TmpSegmentCleaner? From my point of view, have a dedicated class for a specific job is more modular and gets better parallel. Put multiple tasks into the same Thread.run() code will result a lot of if/else, try/catch logic. For example, the failure of task1 should not block task2.

For question 2, I need to inspect the source code.

@Jackie-Jiang
Copy link
Contributor

What I meant is to always enable this periodic task and remove the sync cleanup logic.
Adding a dedicated periodic task per specific job is more modular, but we also need to consider if it can cause problem when running in parallel with another periodic task (e.g. we had encountered some problems when 2 periodic task working on the same table at the same time). I feel this clean up logic can be fit into the existing RealtimeSegmentValidationManager

@wirybeaver
Copy link
Contributor Author

wirybeaver commented Jul 20, 2023

always enable this periodic task and remove the sync cleanup logic

I tried to set the default value to true but a bunch of unit test failure. directly set to true seems to be incompatible.

I feel this clean up logic can be fit into the existing RealtimeSegmentValidationManager

In the initial version, I append the clean up task at the end of validation manager. I am good to rollback.

@wirybeaver
Copy link
Contributor Author

Have an manual integration test with Uber internal pinotFS and the log shows that the tmp file was deleted successfully.

Succeed to delete file: path/to/<table>/<segment>.tmp.<UUID> 

Copy link
Contributor

@ankitsultana ankitsultana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all the feedback.

List<SegmentZKMetadata> segmentsZKMetadata = _pinotHelixResourceManager.getSegmentsZKMetadata(realtimeTableName);

// Delete tmp segments
try {
long numDeleteTmpSegments = _llcRealtimeSegmentManager.deleteTmpSegments(realtimeTableName, segmentsZKMetadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: one alternative for this may be to check isTmpSegmentAsyncDeletionEnabled outside deleteTmpSegments and avoid calling deleteTmpSegments altogether if the feature is not enabled. Also deleteTmpSegments could add a Precondition(isTmpSegmentAsyncDeletionEnabled(), "called even though not enabled"). Though this also works so feel free to ignore.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

@@ -239,6 +239,14 @@ public void cleanupSegmentCountGauge(final String resource) {
removeGauge(resource, ValidationMetricName.SEGMENT_COUNT);
}

public void updateTmpSegmentCountGauge(final String resource, final long tmpSegmentCount) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we define it as a gauge? Should it be a meter instead?

Let's match the name of the method and the metric

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public static <T> PinotGauge<T> makeGauge(PinotMetricsRegistry registry, PinotMetricName name, PinotGauge<T> gauge) {
    return registry.newGauge(name, gauge);
  }

  public static PinotMeter makePinotMeter(PinotMetricsRegistry registry, PinotMetricName name, String eventType,
      TimeUnit unit) {
    return registry.newMeter(name, eventType, unit);
  }

Meter metrics would output .m1, .m5, .m15 I am not whether we need those statics. Existing count related metrics of ValidationMetrics also use gauge

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gauge means refreshing the value, so that can be used to for total docs, total segments etc, but not for incremental values such as segments deleted in this round

Copy link
Contributor Author

@wirybeaver wirybeaver Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate for sharing the knowledge. I already replaced gauge with meter and add removeTableMeter method into AbstractMetrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jackie-Jiang Could you have another review?

Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>
@Jackie-Jiang Jackie-Jiang merged commit d9c0790 into apache:master Sep 19, 2023
21 checks passed
&& getCurrentTimeMs() - lastModified > _controllerConf.getTmpSegmentRetentionInSeconds() * 1000L;
}

private boolean isLowLevelConsumer(String tableNameWithType, TableConfig tableConfig) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isLowLevelConsumer is deprecated.

@Jackie-Jiang Jackie-Jiang added Configuration Config changes (addition/deletion/change in behavior) release-notes Referenced by PRs that need attention when compiling the next release notes documentation labels Sep 23, 2023
@wirybeaver wirybeaver deleted the periodicDeepStoreClean branch January 8, 2024 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Configuration Config changes (addition/deletion/change in behavior) documentation enhancement release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants