Reduce memory used for collection/sorting in SegmentReducer in the SegmentProcessorFramework #6770

Closed
npawar opened this issue Apr 9, 2021 · 1 comment

npawar commented Apr 9, 2021

We added this SegmentProcessorFramework: https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/segment/processing/framework/SegmentProcessorFramework.java
Here’s some background about the framework: https://docs.google.com/document/d/1-e_9aHQB4HXS38ONtofdxNvMsGmAoYfSnc2LP88MbIc/edit#heading=h.ths3qhhyyy7w

This framework takes M segments and converts them into N segments. It has 3 stages: segment mapper, segment reducer, and segment creation. In the reducer stage, we do aggregations and sorting: https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/segment/processing/framework/SegmentReducer.java#L114
The collection and sorting of GenericRows happens in memory and can get very expensive.

We want to change this to an off-heap implementation. The tasks would be:

  1. Serialize the GenericRow (using a custom serde, not the default object serde).
  2. Use a PinotDataBuffer to store the serialized rows off-heap.
  3. Sort the rows in the buffer.
    Refer to OffHeapSingleTreeBuilder’s sortAndAggregateSegmentRecords method, which does much the same thing.
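
To make the three tasks concrete, here is a minimal sketch of the idea, not the actual Pinot implementation. It uses a direct `java.nio.ByteBuffer` as a stand-in for PinotDataBuffer, and it assumes a hypothetical fixed-width row layout of one int sort key plus one long metric. The class and method names (`OffHeapRowSorter`, `add`, `sortedRowIds`) are made up for illustration. Rows are written into the off-heap buffer, and only an array of row ids is sorted on heap by reading keys back from the buffer.

```java
import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.stream.IntStream;

// Hypothetical sketch: a direct ByteBuffer stands in for PinotDataBuffer.
class OffHeapRowSorter {
  // Assumed fixed-width layout per row: [int key][long metric] = 12 bytes.
  static final int ROW_SIZE = Integer.BYTES + Long.BYTES;

  private final ByteBuffer buffer;
  private int numRows;

  public OffHeapRowSorter(int maxRows) {
    // Allocated outside the Java heap.
    buffer = ByteBuffer.allocateDirect(maxRows * ROW_SIZE);
  }

  // Task 1 and 2: serialize the row with a custom layout and store it off-heap.
  public void add(int key, long metric) {
    int offset = numRows * ROW_SIZE;
    buffer.putInt(offset, key);
    buffer.putLong(offset + Integer.BYTES, metric);
    numRows++;
  }

  public int keyAt(int rowId) {
    return buffer.getInt(rowId * ROW_SIZE);
  }

  public long metricAt(int rowId) {
    return buffer.getLong(rowId * ROW_SIZE + Integer.BYTES);
  }

  // Task 3: sort row ids by key. Only the ids live on heap; the row data
  // stays in the buffer. Boxing the ids is a simplification for brevity.
  public int[] sortedRowIds() {
    return IntStream.range(0, numRows)
        .boxed()
        .sorted(Comparator.comparingInt(this::keyAt))
        .mapToInt(Integer::intValue)
        .toArray();
  }
}
```

Sorting an id array rather than moving bytes around keeps the sketch short. A real implementation would likely need variable-width values, multi-column comparison, and a primitive sort that avoids boxing.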

As a first step, we can simply change the intermediate record format from GenericRow to something else, like Record. GenericRow stores all values in a map, so the column names are repeated with every record, which unnecessarily inflates the memory needed.
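
A rough illustration of why a positional format is cheaper than a map per row. This is a sketch under assumptions, not Pinot's Record class: `PositionalRecord` and `fromMap` are hypothetical names. The column order is held once in a shared `String[]`, and each record stores only an `Object[]` of values at the matching positions.

```java
import java.util.Map;

// Hypothetical sketch: values are stored by position, so column names
// are kept once in a shared array instead of once per row.
class PositionalRecord {
  private final Object[] values;

  public PositionalRecord(Object[] values) {
    this.values = values;
  }

  // Convert a map-based row into a positional one using a shared column order.
  public static PositionalRecord fromMap(String[] columnOrder, Map<String, Object> row) {
    Object[] values = new Object[columnOrder.length];
    for (int i = 0; i < columnOrder.length; i++) {
      values[i] = row.get(columnOrder[i]);
    }
    return new PositionalRecord(values);
  }

  public Object getValue(int position) {
    return values[position];
  }

  public int size() {
    return values.length;
  }
}
```

With this shape, per-row overhead is one array instead of a hash map with an entry per column.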


npawar commented Aug 4, 2021

Closing this, as these issues are resolved after the PR above and Jackie's refactorings.

npawar closed this as completed Aug 4, 2021