Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with realtime partition mismatch metric #11871

Conversation

sajjad-moradi
Copy link
Contributor

Fixes the issue described in #11870

@@ -1577,7 +1577,6 @@ private void setPartitionParameters(RealtimeSegmentConfig.Builder realtimeSegmen
_segmentLogger.warn(
"Number of stream partitions: {} does not match number of partitions in the partition config: {}, "
+ "using number of stream " + "partitions", numPartitionGroups, numPartitions);
_serverMetrics.addMeteredTableValue(_tableNameWithType, ServerMeter.REALTIME_PARTITION_MISMATCH, 1);
Copy link
Contributor Author

@sajjad-moradi sajjad-moradi Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emitting the metric here is wrong, because numPartitions fetched form the stream is used if there's a mismatch between table config and stream metadata. So in reality no event with incorrect partition is consumed here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should change the logger to an info log.

Better yet, drop the comparison altogether between the tableconfig value and the number of partitions we come across in the stream. Two reasons:

  • The tableconfig value is not used while routing queries to realtime tables
  • Stream partitions change dynamically, and we don't update the table config when we detect new partitions. So, it is impossible to keep them in sync.

Copy link
Contributor Author

@sajjad-moradi sajjad-moradi Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the best action item here is to drop numPartition from table config altogether as it shouldn't be used anywhere, and does not provide any value.
To keep the scope of this PR limited to the metric emission fix, I'm just changing the log level to info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numPartitions in the table config seems to be used for instance assignment when a table is created (or rebalanced). I am not able to determine easily how the actual number of partitions in the topic is used for instance creation instead of the configured value (for realtime tables). The value is also used for table rebalancing.

Lastly, realtime and offline tables have the same table config definition.

So we cannot really remove the num partitions from table config definitions.

But it is clear that there is no use for this field for realtime tables, and hence there is really no need to compare this with the actual number of partitions, and hence the suggestion to remove the compare.

@Jackie-Jiang what do you think?

I am also ok to keep it as an info, but I feel it is extraneous code. If we can remove it, we should.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a closer look at the logic and I feel we are making wrong assumptions based on Kafka behavior. numPartitions is used to let pinot know how the data is supposed to be partitioned. For Kafka, data is always partitioned by the upstream Kafka partition, but this assumption doesn't hold for all streams. The clean solution is to introduce an interface in the stream metadata provider to provide the actual numPartitions to use by passing in the value from the table config.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jackie-Jiang are you ok with creating an issue to clean this up, because we are facing this problem in production now.

@codecov-commenter
Copy link

codecov-commenter commented Oct 25, 2023

Codecov Report

Merging #11871 (db11c9c) into master (649477a) will increase coverage by 26.58%.
Report is 1 commits behind head on master.
The diff coverage is 37.50%.

@@              Coverage Diff              @@
##             master   #11871       +/-   ##
=============================================
+ Coverage     34.84%   61.43%   +26.58%     
- Complexity      946     1147      +201     
=============================================
  Files          2297     2373       +76     
  Lines        124532   128277     +3745     
  Branches      19245    19804      +559     
=============================================
+ Hits          43395    78808    +35413     
+ Misses        78111    43771    -34340     
- Partials       3026     5698     +2672     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (ø)
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 61.33% <37.50%> (+14.70%) ⬆️
java-21 61.31% <37.50%> (+26.59%) ⬆️
skip-bytebuffers-false 61.42% <37.50%> (+26.58%) ⬆️
skip-bytebuffers-true 61.23% <37.50%> (+26.54%) ⬆️
temurin 61.43% <37.50%> (+26.58%) ⬆️
unittests 61.43% <37.50%> (+14.76%) ⬆️
unittests1 46.64% <0.00%> (-0.03%) ⬇️
unittests2 27.61% <37.50%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...a/manager/realtime/RealtimeSegmentDataManager.java 50.66% <0.00%> (+0.06%) ⬆️
...local/indexsegment/mutable/MutableSegmentImpl.java 71.84% <42.85%> (+71.84%) ⬆️

... and 825 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@@ -138,6 +138,7 @@ public class MutableSegmentImpl implements MutableSegment {
private final RealtimeSegmentStatsHistory _statsHistory;
private final String _partitionColumn;
private final PartitionFunction _partitionFunction;
private final int _mainPartitionId;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private final int _mainPartitionId;
private final int _consumingPartitionId;

Just a suggestion. Or else, add a comment there as to what you mean by "main"

@@ -1577,7 +1577,6 @@ private void setPartitionParameters(RealtimeSegmentConfig.Builder realtimeSegmen
_segmentLogger.warn(
"Number of stream partitions: {} does not match number of partitions in the partition config: {}, "
+ "using number of stream " + "partitions", numPartitionGroups, numPartitions);
_serverMetrics.addMeteredTableValue(_tableNameWithType, ServerMeter.REALTIME_PARTITION_MISMATCH, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should change the logger to an info log.

Better yet, drop the comparison altogether between the tableconfig value and the number of partitions we come across in the stream. Two reasons:

  • The tableconfig value is not used while routing queries to realtime tables
  • Stream partitions change dynamically, and we don't update the table config when we detect new partitions. So, it is impossible to keep them in sync.

@mcvsubbu
Copy link
Contributor

A minor ask: Can you add a comment in the line where the metric REALTIME_PARTITION_MISMATCH is defined that it is a count of the number of records that are mismatched? It may help prevent someone from adding it back to this if statement.
Thanks, otherwise I am fine with this PR.

Copy link
Contributor

@mcvsubbu mcvsubbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of time, I am approving this. Please do the following:

  1. Create an issue (or keep the current one open) with Jackie's suggestion in it.
  2. Add a comment in ServerMetrics to reflect the meaning of REALTIME_PARTITION_MISMATCH metric.

Thanks!

@sajjad-moradi sajjad-moradi merged commit 8feec41 into apache:master Oct 27, 2023
16 of 19 checks passed
@sajjad-moradi sajjad-moradi deleted the bug/fix.realtime.partition.mismatch.metric branch October 27, 2023 01:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants