We want to change the in-memory sorting of GenericRows in the SegmentReducer to an off-heap implementation. The tasks would be:
1. Serialize the GenericRow with a custom serializer, not the default object serde.
2. Store the serialized rows off-heap in a PinotDataBuffer.
3. Sort the rows in the buffer.
Refer to OffHeapSingleTreeBuilder’s sortAndAggregateSegmentRecords method. It pretty much does a similar thing.
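The tasks can be sketched roughly as follows. This is only an illustration under assumptions: a direct `java.nio.ByteBuffer` stands in for PinotDataBuffer, the fixed-width row layout (an int key plus a long metric) is made up, and the class `OffHeapRowSorter` is hypothetical. It is not a copy of what OffHeapSingleTreeBuilder does.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: rows live in a direct (off-heap) buffer and only row ids are sorted.
final class OffHeapRowSorter {
  // Assumed fixed-width layout per row: [int key][long metric] = 12 bytes.
  static final int ROW_SIZE = Integer.BYTES + Long.BYTES;

  private final ByteBuffer buffer; // stand-in for PinotDataBuffer
  private int numRows;

  OffHeapRowSorter(int maxRows) {
    buffer = ByteBuffer.allocateDirect(maxRows * ROW_SIZE);
  }

  // Custom serialization: write the fields at a computed offset, no object serde involved.
  void add(int key, long metric) {
    int offset = numRows * ROW_SIZE;
    buffer.putInt(offset, key);
    buffer.putLong(offset + Integer.BYTES, metric);
    numRows++;
  }

  int keyAt(int rowId) {
    return buffer.getInt(rowId * ROW_SIZE);
  }

  long metricAt(int rowId) {
    return buffer.getLong(rowId * ROW_SIZE + Integer.BYTES);
  }

  // Returns row ids ordered by key. The row bytes are not moved;
  // the comparator reads the keys directly from the buffer.
  int[] sortedRowIds() {
    Integer[] ids = new Integer[numRows];
    for (int i = 0; i < numRows; i++) {
      ids[i] = i;
    }
    Arrays.sort(ids, (a, b) -> Integer.compare(keyAt(a), keyAt(b)));
    int[] result = new int[numRows];
    for (int i = 0; i < numRows; i++) {
      result[i] = ids[i];
    }
    return result;
  }
}
```

In this sketch the only on-heap state that grows with the input is the array of row ids, which is much smaller than the rows themselves. Variable-width values such as strings would need an extra offsets buffer, which is left out here.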
As a decent first step, we can also start by simply changing the intermediate record format from GenericRow to something else like Record. GenericRow stores all values in a map, so the column names get repeated with every record, which unnecessarily inflates the memory needed.
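To illustrate the difference, here is a minimal sketch of a record format that keeps the column names once in a shared schema and stores only a value array per record. The names `RecordSchema` and `CompactRecord` are hypothetical and are not the existing Pinot classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical schema: column names are stored once and shared by all records.
final class RecordSchema {
  private final Map<String, Integer> indexByName = new HashMap<>();

  RecordSchema(List<String> columnNames) {
    for (int i = 0; i < columnNames.size(); i++) {
      indexByName.put(columnNames.get(i), i);
    }
  }

  int indexOf(String column) {
    Integer index = indexByName.get(column);
    if (index == null) {
      throw new IllegalArgumentException("Unknown column: " + column);
    }
    return index;
  }

  int size() {
    return indexByName.size();
  }
}

// Hypothetical record: one Object[] per row instead of one map per row.
final class CompactRecord {
  private final RecordSchema schema;
  private final Object[] values;

  CompactRecord(RecordSchema schema) {
    this.schema = schema;
    this.values = new Object[schema.size()];
  }

  void putValue(String column, Object value) {
    values[schema.indexOf(column)] = value;
  }

  Object getValue(String column) {
    return values[schema.indexOf(column)];
  }
}
```

Each record then costs one array plus its values, while the name-to-index map is paid for once per schema rather than once per row.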
We added this SegmentProcessorFramework: https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/segment/processing/framework/SegmentProcessorFramework.java
Here’s some background about the framework: https://docs.google.com/document/d/1-e_9aHQB4HXS38ONtofdxNvMsGmAoYfSnc2LP88MbIc/edit#heading=h.ths3qhhyyy7w
This framework takes M segments and converts them into N segments. It has three stages: segment mapper, segment reducer, and segment creation. The reducer stage does the aggregation and sorting: https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/segment/processing/framework/SegmentReducer.java#L114
The sorting of this collection of GenericRows happens in memory, on the heap, and can get very expensive.