Optimize segment commit to not read partition group metadata #11943
Conversation
cc @jadami10
Codecov Report

```
@@            Coverage Diff            @@
##            master   #11943    +/-   ##
============================================
- Coverage    61.63%   61.58%   -0.06%
+ Complexity    1153     1146       -7
============================================
  Files         2385     2385
  Lines       129324   129359      +35
  Branches     20022    20025       +3
============================================
- Hits         79707    79660      -47
- Misses       43808    43881      +73
- Partials      5809     5818       +9
```
[Without looking at the code changes]
Let's say I have 8 partitions and all of them commit at roughly the same time, with partition 0 committing last. We would lose the segment size tuning from the other 7 partitions. This PR changes it so that every segment commit contributes to the segment size tuning, and IMO that gives better segment size prediction.
No, you want to count only one segment in the group that finishes together. That is because of the formula we use, which pays most attention to past segments and less attention to the most recent segment size (R_next = R_prev * 0.75 + R_current * 0.25; it could be 0.9 and 0.1, not sure). So the most recent segment, counted 8 times (in your example), would skew the "past" value to be incorrectly weighted toward the present value.
The past weight is 0.9 and the current weight is 0.1.
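To make the skew concrete, here is a small sketch of the smoothing update discussed above, using the 0.9/0.1 weights from the comment. The class and variable names are illustrative, not Pinot's actual code; it only shows how folding the same observation in 8 times (once per partition) pulls the smoothed ratio far more toward the current value than counting it once.

```java
// Hypothetical sketch of the exponential smoothing discussed above.
// Weights (0.9 past, 0.1 current) come from the comment; names are illustrative.
public class RatioSmoothingDemo {
    static final double PAST_WEIGHT = 0.9;
    static final double CURRENT_WEIGHT = 0.1;

    static double update(double prevRatio, double currentRatio) {
        return PAST_WEIGHT * prevRatio + CURRENT_WEIGHT * currentRatio;
    }

    public static void main(String[] args) {
        double prev = 1.0;     // historical smoothed ratio
        double current = 2.0;  // ratio observed for the committing group

        // Counting the group once: the past still dominates.
        double once = update(prev, current);  // 0.9*1.0 + 0.1*2.0 = 1.1

        // Counting the same observation for all 8 partitions:
        // the "current" value is folded in 8 times and dominates the past.
        double eightTimes = prev;
        for (int i = 0; i < 8; i++) {
            eightTimes = update(eightTimes, current);
        }
        System.out.printf("once=%.4f eightTimes=%.4f%n", once, eightTimes);
    }
}
```

After 8 repeated updates the smoothed value is 2 - 0.9^8 ≈ 1.57, versus 1.1 when counted once, which is the "soiling" of the past value the comment describes.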
I wrote this program to test out some hypotheses. According to it, it takes slightly more iterations to stabilize at the right segment size if we apply the algorithm for all partitions. I tried with numPartitions = 1 and numPartitions = 32. Assumptions made here:
I think all partitions are somewhat similar in behavior, so we should count each partition group once so as to learn best from previous completions. One way we could improve is (assuming partitions complete at roughly the same time) to pick the partition that completes first (which may not be partition 0, of course). This was considered at the time but was hard to do, given that partitions may complete within minutes of each other in some cases.
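The kind of experiment described above can be sketched as follows. This is a hedged, simplified stand-in for the author's program (which is not shown in the thread), with deterministic observations and hypothetical names; it just counts how long the smoothed ratio takes to settle near a new "true" ratio when one partition updates it per commit round versus all 32.

```java
// Hedged sketch (NOT the author's actual program) comparing convergence of the
// smoothed ratio when one vs. all partitions update it each commit round.
public class ConvergenceSketch {
    static double update(double prev, double current) {
        return 0.9 * prev + 0.1 * current;  // weights from the discussion above
    }

    /** Commit rounds until the smoothed ratio is within tolerance of the target. */
    static int roundsToConverge(int updatesPerRound, double start, double target, double tol) {
        double ratio = start;
        int rounds = 0;
        while (Math.abs(ratio - target) > tol && rounds < 1000) {
            for (int i = 0; i < updatesPerRound; i++) {
                ratio = update(ratio, target);  // each commit observes the new ratio
            }
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println("1 update/round:   " + roundsToConverge(1, 1.0, 2.0, 0.01) + " rounds");
        System.out.println("32 updates/round: " + roundsToConverge(32, 1.0, 2.0, 0.01) + " rounds");
    }
}
```

Note the two ways of counting: with 32 updates per round this toy model converges in far fewer rounds but spends more total update iterations than the single-update case, which is consistent with the observation that counting all partitions takes slightly more iterations to stabilize.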
@mcvsubbu Thanks for taking the time to write this program!
Based on the above analysis, do you think we can simplify this handling to just count all committing segments?
Solves #11812
Currently, when committing a real-time segment, the controller needs to read partition group metadata for all partitions from upstream, which can be very slow for streams with many partitions.
The partition group metadata is used only to extract the partition ids, which can be derived directly from the partition count, except for streams that close partitions, such as Kinesis.
In this PR, we made the following changes:
- In SegmentFlushThresholdComputer, remove the logic of only counting the segments with the smallest partition id toward the size ratio, because it complicates the handling a lot (quite an anti-pattern, as any segment commit would require info from all partitions), and I don't see much value in it. Quickly converging to the size ratio of the recent data trend should be a pro rather than a con, because this ratio is used to decide the segment size when consuming data over the same period of time.
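The "derive partition ids from partition count" idea in the description can be sketched as below. This is an illustrative stand-in, not Pinot's actual API: for streams that never close partitions, the ids are simply 0 through partitionCount - 1, so no per-partition metadata fetch is needed on commit; streams that do close partitions (e.g. Kinesis) would still need the metadata path.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustrative sketch: for streams that never close partitions, partition ids
// are just 0..partitionCount-1, so per-partition group metadata need not be
// fetched on every segment commit. Names here are hypothetical, not Pinot's API.
public class PartitionIdDerivation {
    static List<Integer> derivePartitionIds(int partitionCount) {
        return IntStream.range(0, partitionCount)
            .boxed()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(derivePartitionIds(4));  // [0, 1, 2, 3]
    }
}
```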