-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue with realtime partition mismatch metric #11871
Fix issue with realtime partition mismatch metric #11871
Conversation
@@ -1577,7 +1577,6 @@ private void setPartitionParameters(RealtimeSegmentConfig.Builder realtimeSegmen | |||
_segmentLogger.warn( | |||
"Number of stream partitions: {} does not match number of partitions in the partition config: {}, " | |||
+ "using number of stream " + "partitions", numPartitionGroups, numPartitions); | |||
_serverMetrics.addMeteredTableValue(_tableNameWithType, ServerMeter.REALTIME_PARTITION_MISMATCH, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emitting the metric here is wrong, because numPartitions fetched form the stream is used if there's a mismatch between table config and stream metadata. So in reality no event with incorrect partition is consumed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we should change the logger to an info log.
Better yet, drop the comparison altogether between the tableconfig value and the number of partitions we come across in the stream. Two reasons:
- The tableconfig value is not used while routing queries to realtime tables
- Stream partitions change dynamically, and we don't update the table config when we detect new partitions. So, it is impossible to keep them in sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the best action item here is to drop numPartition from table config altogether as it shouldn't be used anywhere, and does not provide any value.
To keep the scope of this PR limited to the metric emission fix, I'm just changing the log level to info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numPartitions in the table config seems to be used for instance assignment when a table is created (or rebalanced). I am not able to determine easily how the actual number of partitions in the topic is used for instance creation instead of the configured value (for realtime tables). The value is also used for table rebalancing.
Lastly, realtime and offline tables have the same table config definition.
So we cannot really remove the num partitions from table config definitions.
But it is clear that there is no use for this field for realtime tables, and hence there is really no need to compare this with the actual number of partitions, and hence the suggestion to remove the compare.
@Jackie-Jiang what do you think?
I am also ok to keep it as an info, but I feel it is extraneous code. If we can remove it, we should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking a closer look at the logic and I feel we are making wrong assumptions based on Kafka behavior. numPartitions
is used to let pinot know how the data is supposed to be partitioned. For Kafka, data is always partitioned by the upstream Kafka partition, but this assumption doesn't hold for all streams. The clean solution is to introduce an interface in the stream metadata provider to provide the actual numPartitions
to use by passing in the value from the table config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jackie-Jiang are you ok with creating an issue to clean this up, because we are facing this problem in production now.
Codecov Report
@@ Coverage Diff @@
## master #11871 +/- ##
=============================================
+ Coverage 34.84% 61.43% +26.58%
- Complexity 946 1147 +201
=============================================
Files 2297 2373 +76
Lines 124532 128277 +3745
Branches 19245 19804 +559
=============================================
+ Hits 43395 78808 +35413
+ Misses 78111 43771 -34340
- Partials 3026 5698 +2672
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 825 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@@ -138,6 +138,7 @@ public class MutableSegmentImpl implements MutableSegment { | |||
private final RealtimeSegmentStatsHistory _statsHistory; | |||
private final String _partitionColumn; | |||
private final PartitionFunction _partitionFunction; | |||
private final int _mainPartitionId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private final int _mainPartitionId; | |
private final int _consumingPartitionId; |
Just a suggestion. Or else, add a comment there as to what you mean by "main"
@@ -1577,7 +1577,6 @@ private void setPartitionParameters(RealtimeSegmentConfig.Builder realtimeSegmen | |||
_segmentLogger.warn( | |||
"Number of stream partitions: {} does not match number of partitions in the partition config: {}, " | |||
+ "using number of stream " + "partitions", numPartitionGroups, numPartitions); | |||
_serverMetrics.addMeteredTableValue(_tableNameWithType, ServerMeter.REALTIME_PARTITION_MISMATCH, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we should change the logger to an info log.
Better yet, drop the comparison altogether between the tableconfig value and the number of partitions we come across in the stream. Two reasons:
- The tableconfig value is not used while routing queries to realtime tables
- Stream partitions change dynamically, and we don't update the table config when we detect new partitions. So, it is impossible to keep them in sync.
A minor ask: Can you add a comment in the line where the metric REALTIME_PARTITION_MISMATCH is defined that it is a count of the number of records that are mismatched? It may help prevent someone from adding it back to this if statement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of time, I am approving this. Please do the following:
- Create an issue (or keep the current one open) with Jackie's suggestion in it.
- Add a comment in ServerMetrics to reflect the meaning of REALTIME_PARTITION_MISMATCH metric.
Thanks!
Fixes the issue described in #11870