Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around the problem of Helix sending 2 transitions for CONSUMING -> DROPPED #12351

Merged

Conversation

Jackie-Jiang
Copy link
Contributor

Helix might send 2 state transitions when a segment is dropped from a server. This behavior breaks the onConsumingToDropped() callback because it is only invoked when processing CONSUMING -> DROPPED state transition.
This PR added a cache for segments that just went through CONSUMING -> OFFLINE state transition to work around this problem.

@Jackie-Jiang
Copy link
Contributor Author

Related to #12343 which has more context.
cc @gortiz

Copy link
Contributor

@zhtaoxiang zhtaoxiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Jackie-Jiang
Copy link
Contributor Author

Submitted a Helix issue to verify the observation: apache/helix#2751

@codecov-commenter
Copy link

Codecov Report

Attention: 58 lines in your changes are missing coverage. Please review.

Comparison is base (17db0fd) 61.69% compared to head (d5ab682) 61.74%.
Report is 3 commits behind head on master.

Files Patch % Lines
...e/assignment/instance/InstanceTagPoolSelector.java 32.35% 22 Missing and 1 partial ⚠️
...nstance/InstanceReplicaGroupPartitionSelector.java 94.58% 9 Missing and 6 partials ⚠️
...r/helix/SegmentOnlineOfflineStateModelFactory.java 0.00% 15 Missing ⚠️
...elix/core/periodictask/ControllerPeriodicTask.java 83.33% 1 Missing and 1 partial ⚠️
...mmon/assignment/InstanceAssignmentConfigUtils.java 0.00% 1 Missing ⚠️
...e/pinot/controller/helix/SegmentStatusChecker.java 0.00% 1 Missing ⚠️
...assignment/instance/InstancePartitionSelector.java 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12351      +/-   ##
============================================
+ Coverage     61.69%   61.74%   +0.04%     
  Complexity      207      207              
============================================
  Files          2424     2424              
  Lines        132340   132512     +172     
  Branches      20436    20481      +45     
============================================
+ Hits          81651    81820     +169     
- Misses        44693    44694       +1     
- Partials       5996     5998       +2     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.70% <83.42%> (+26.67%) ⬆️
java-21 61.61% <83.42%> (+0.03%) ⬆️
skip-bytebuffers-false 61.73% <83.42%> (+0.04%) ⬆️
skip-bytebuffers-true 61.58% <83.42%> (+0.02%) ⬆️
temurin 61.74% <83.42%> (+0.04%) ⬆️
unittests 61.74% <83.42%> (+0.04%) ⬆️
unittests1 46.92% <75.00%> (+0.01%) ⬆️
unittests2 27.70% <82.57%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@snleee snleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thank you for catching this.

// by OFFLINE -> DROPPED. Use this cache to track the segments that just went through CONSUMING -> OFFLINE transition
// to detect CONSUMING -> DROPPED transition.
// TODO: Check how Helix handle CONSUMING -> DROPPED transition and remove this cache if it's not needed.
private final Cache<Pair<String, String>, Boolean> _recentlyOffloadedConsumingSegments =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about move the cache into the tableDataMgr? Considering the key of the map is pair<tableName, segmentName> anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer keeping it here since it is just a workaround for Helix state transition handling. Once Helix fixes the problem we should be able to remove it

@Jackie-Jiang Jackie-Jiang merged commit 33074e1 into apache:master Feb 1, 2024
19 checks passed
@Jackie-Jiang Jackie-Jiang deleted the fix_consuming_drop_state_transition branch February 1, 2024 22:13
if (_recentlyOffloadedConsumingSegments.getIfPresent(tableSegmentPair) != null) {
_recentlyOffloadedConsumingSegments.invalidate(tableSegmentPair);
onConsumingToDropped(tableNameWithType, segmentName);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an actual problem, but it is a bit confusing.

_recentlyOffloadedConsumingSegments is only populated with realtime tables, why do we need to check if (TableNameBuilder.isRealtimeTableResource(tableNameWithType))?

Is the reason to do not create the pair?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and to reduce a cache lookup. OFFLINE table doesn't really have CONSUMING state

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants