
Add instrumentation for DataTable Creation #11942

Merged

Conversation

Contributor

@SabrinaZhaozyf SabrinaZhaozyf commented Nov 3, 2023

When the InstanceResponseBlock creates a DataTable from the final results of the combine operator, that data table creation can hit OOM (see the heap-dump examples below). This PR adds instrumentation for this path under the query-killing framework.

Instrumentation is added to the implementations of BaseResultsBlock.getDataTable(), specifically to GroupByResultsBlock, SelectionResultsBlock, and DistinctResultsBlock, which are the most memory-intensive; a minimal sketch of the pattern is shown below.
AggregationResultsBlock, ExceptionResultsBlock, ExplainResultsBlock, and MetadataResultsBlock are skipped, as they do not have as much memory overhead.
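
For illustration, here is a minimal, hedged sketch of the sampling pattern on the row-serialization loop. It is not the exact PR diff: the class and method names are placeholders, the schema is assumed to contain only LONG columns for brevity, and the DataTableBuilder/DataTableBuilderFactory APIs are assumed to behave as in current Pinot. Only the Tracing.ThreadAccountantOps call reflects the added instrumentation.

import java.io.IOException;
import java.util.List;

import org.apache.pinot.common.datatable.DataTable;
import org.apache.pinot.common.utils.DataSchema;
import org.apache.pinot.core.common.datatable.DataTableBuilder;
import org.apache.pinot.core.common.datatable.DataTableBuilderFactory;
import org.apache.pinot.spi.trace.Tracing;

public class DataTableCreationInstrumentationSketch {

  // Builds a DataTable from in-memory rows, sampling heap usage periodically so the
  // accountant can kill the query (EarlyTerminationException) instead of letting the
  // serialization loop run into an OutOfMemoryError.
  static DataTable fromRows(List<Object[]> rows, DataSchema dataSchema)
      throws IOException {
    DataTableBuilder dataTableBuilder = DataTableBuilderFactory.getDataTableBuilder(dataSchema);
    int numColumns = dataSchema.size();
    int rowId = 0;
    for (Object[] row : rows) {
      // Every few rows, record the current memory usage and check whether the
      // accountant has flagged this query for interruption.
      Tracing.ThreadAccountantOps.sampleAndCheckInterruptionPeriodically(rowId++);
      dataTableBuilder.startRow();
      for (int colId = 0; colId < numColumns; colId++) {
        // LONG columns only, to keep the sketch short.
        dataTableBuilder.setColumn(colId, (long) row[colId]);
      }
      dataTableBuilder.finishRow();
    }
    return dataTableBuilder.build();
  }
}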

Examples from heap dumps
Selection

java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:48)
java.util.Arrays.copyOf(Arrays.java:3745)
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
java.io.OutputStream.write(OutputStream.java:122)
org.apache.pinot.core.common.datatable.BaseDataTableBuilder.finishRow(BaseDataTableBuilder.java:163)
org.apache.pinot.core.query.selection.SelectionOperatorUtils.getDataTableFromRows(SelectionOperatorUtils.java:342)
org.apache.pinot.core.operator.blocks.results.SelectionResultsBlock.getDataTable(SelectionResultsBlock.java:85)
org.apache.pinot.core.operator.blocks.InstanceResponseBlock.toDataOnlyDataTable(InstanceResponseBlock.java:128)
org.apache.pinot.core.operator.blocks.InstanceResponseBlock.toDataTable(InstanceResponseBlock.java:121)
org.apache.pinot.core.query.scheduler.QueryScheduler.serializeResponse(QueryScheduler.java:337)

GroupBy

java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:48)
java.util.Arrays.copyOf(Arrays.java:3745)
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
java.io.OutputStream.write(OutputStream.java:122)
org.apache.pinot.core.common.datatable.BaseDataTableBuilder.setColumn(BaseDataTableBuilder.java:112)
org.apache.pinot.core.operator.blocks.results.GroupByResultsBlock.setDataTableColumn(GroupByResultsBlock.java:262)
org.apache.pinot.core.operator.blocks.results.GroupByResultsBlock.getDataTable(GroupByResultsBlock.java:209)
org.apache.pinot.core.operator.blocks.InstanceResponseBlock.toDataOnlyDataTable(InstanceResponseBlock.java:117)
org.apache.pinot.core.operator.blocks.InstanceResponseBlock.toDataTable(InstanceResponseBlock.java:110)
org.apache.pinot.core.query.scheduler.QueryScheduler.serializeResponse(QueryScheduler.java:337)

Testing

  • Tests added to ResourceManagerAccountingTest verify that memory usage is sampled, interruption is checked, and EarlyTerminationException is thrown in getDataTable. They do this by setting up the accountant manually after the rows/result blocks have been created, so that only data table creation is sampled. Without the instrumentation, these tests throw an OOM exception. A sketch of the test shape follows this list.
  • Writing a deterministic integration test that checks queries are killed exactly at data table creation is hard: setting the triggering threshold too low kills the query at the combine level before it reaches data table creation, while a higher threshold only kills the query on the server nondeterministically.
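
For illustration, a hedged sketch of the test shape (not the actual test code): several runner threads each build one DataTable from pre-built rows, and the test expects the accountant to kill at least one of them during data table creation. The accountant setup is elided, and the helper names (runKillTest, rows, dataSchema, numRunnerThreads) are placeholders; only SelectionOperatorUtils.getDataTableFromRows and EarlyTerminationException come from the PR.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.pinot.common.utils.DataSchema;
import org.apache.pinot.core.query.selection.SelectionOperatorUtils;
import org.apache.pinot.spi.exception.EarlyTerminationException;
import org.testng.Assert;

public class DataTableCreationKillTestSketch {

  static void runKillTest(List<Object[]> rows, DataSchema dataSchema, int numRunnerThreads)
      throws Exception {
    // Assumes the per-query accountant has already been registered and started, as in
    // ResourceManagerAccountingTest; that wiring is elided here.
    ExecutorService executor = Executors.newFixedThreadPool(numRunnerThreads);
    List<Future<Boolean>> killed = new ArrayList<>();
    for (int i = 0; i < numRunnerThreads; i++) {
      killed.add(executor.submit(() -> {
        try {
          // Each runner builds one large DataTable, mirroring the real serialize path.
          SelectionOperatorUtils.getDataTableFromRows(rows, dataSchema, false);
          return false; // Finished without being interrupted.
        } catch (EarlyTerminationException e) {
          return true; // The accountant killed the query during data table creation.
        }
      }));
    }
    executor.shutdown();
    boolean anyKilled = false;
    for (Future<Boolean> result : killed) {
      anyKilled |= result.get();
    }
    // Without the instrumentation this point is typically never reached: the runners
    // hit OutOfMemoryError inside getDataTableFromRows instead.
    Assert.assertTrue(anyKilled, "Expected at least one query kill during data table creation");
  }
}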

Below is an example of the query exception in the response when a kill happened exactly at data table creation. While it is not deterministic, you can try to reproduce this by setting the server's accounting.oom.critical.heap.usage.ratio higher in OfflineClusterMemBasedServerQueryKillingTest and running testSelectionOnlyOOM.

{"message":"QueryCancellationError:\norg.apache.pinot.spi.exception.QueryCancelledException: Cancelled while building data table java.lang.RuntimeException:  Query Broker_172.25.200.137_18099_752082720000000006_O got killed because using 2353711296 bytes of memory on SERVER: Server_localhost_8098, exceeding the quota\n\tat org.apache.pinot.core.query.scheduler.QueryScheduler.serializeResponse(QueryScheduler.java:227)\n\tat org.apache.pinot.core.query.scheduler.QueryScheduler.processQueryAndSerialize(QueryScheduler.java:156)\n\tat org.apache.pinot.core.query.scheduler.QueryScheduler.lambda$createQueryFutureTask$0(QueryScheduler.java:124)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n...\nCaused by: org.apache.pinot.spi.exception.EarlyTerminationException: Interrupted while merging records\n\tat org.apache.pinot.spi.trace.Tracing$ThreadAccountantOps.sampleAndCheckInterruption(Tracing.java:288)\n\tat org.apache.pinot.spi.trace.Tracing$ThreadAccountantOps.sampleAndCheckInterruptionPeriodically(Tracing.java:304)\n\tat org.apache.pinot.core.query.selection.SelectionOperatorUtils.getDataTableFromRows(SelectionOperatorUtils.java:388)\n\tat org.apache.pinot.core.operator.blocks.results.SelectionResultsBlock.getDataTable(SelectionResultsBlock.java:84)","errorCode":503}

cc @jasperjiaguo @siddharthteotia


codecov-commenter commented Nov 3, 2023

Codecov Report

Merging #11942 (d3dc89f) into master (972b555) will increase coverage by 0.05%.
Report is 6 commits behind head on master.
The diff coverage is 66.66%.

@@             Coverage Diff              @@
##             master   #11942      +/-   ##
============================================
+ Coverage     61.44%   61.49%   +0.05%     
+ Complexity     1141      207     -934     
============================================
  Files          2385     2385              
  Lines        129189   129216      +27     
  Branches      19998    20001       +3     
============================================
+ Hits          79381    79465      +84     
+ Misses        44061    43951     -110     
- Partials       5747     5800      +53     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (ø)
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 27.58% <0.00%> (-33.82%) ⬇️
java-21 61.49% <66.66%> (+0.18%) ⬆️
skip-bytebuffers-false 61.48% <66.66%> (+0.04%) ⬆️
skip-bytebuffers-true 27.58% <0.00%> (-33.68%) ⬇️
temurin 61.49% <66.66%> (+0.05%) ⬆️
unittests 61.49% <66.66%> (+0.05%) ⬆️
unittests1 46.77% <66.66%> (+0.08%) ⬆️
unittests2 27.59% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...g/apache/pinot/common/datatable/BaseDataTable.java 92.50% <100.00%> (+0.19%) ⬆️
...apache/pinot/common/datatable/DataTableImplV4.java 91.54% <100.00%> (+0.09%) ⬆️
...e/operator/blocks/results/GroupByResultsBlock.java 78.67% <100.00%> (+0.81%) ⬆️
...t/core/query/selection/SelectionOperatorUtils.java 92.83% <100.00%> (+0.07%) ⬆️
...che/pinot/core/query/scheduler/QueryScheduler.java 56.96% <0.00%> (-5.54%) ⬇️

... and 38 files with indirect coverage changes


@@ -215,6 +219,14 @@ private byte[] serializeResponse(ServerQueryRequest queryRequest, InstanceRespon
byte[] responseByte = null;
try {
responseByte = instanceResponse.toDataTable().toBytes();
Contributor

@jasperjiaguo jasperjiaguo Nov 3, 2023


Could we verify whether there's memory allocation in toBytes()? Do we need to instrument that part as well? Meanwhile, IMO it might be helpful to briefly explain in the PR description why we added instrumentation only for the group-by and selection operator utils @SabrinaZhaozyf

Contributor Author


Could we verify whether there's memory allocation in toBytes()? Do we need to instrument that part as well?

Good point. There's memory allocation in toBytes (mostly from writeLeadingSections) and we have seen OOM here as well. Added instrumentation to serializeStringDictionary in toBytes.
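
For illustration, a hedged sketch of where that instrumentation lands on the serialization side. The method body is a simplified stand-in for serializeStringDictionary() (the real implementation lives in BaseDataTable/DataTableImplV4 and uses their internal fields); only the Tracing.ThreadAccountantOps call reflects the actual change.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.pinot.spi.trace.Tracing;

public class StringDictionarySerializationSketch {

  static byte[] serializeStringDictionary(String[] stringDictionary)
      throws IOException {
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    DataOutputStream dataOutputStream = new DataOutputStream(byteArrayOutputStream);
    dataOutputStream.writeInt(stringDictionary.length);
    int numEntries = 0;
    for (String entry : stringDictionary) {
      byte[] valueBytes = entry.getBytes(StandardCharsets.UTF_8);
      dataOutputStream.writeInt(valueBytes.length);
      dataOutputStream.write(valueBytes);
      // Sample heap usage periodically so an oversized dictionary triggers a query kill
      // (EarlyTerminationException) rather than an OutOfMemoryError while growing the
      // underlying byte array.
      Tracing.ThreadAccountantOps.sampleAndCheckInterruptionPeriodically(numEntries++);
    }
    return byteArrayOutputStream.toByteArray();
  }
}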

IMO it might be helpful to very briefly explain why we only added instrumentation only for groupby and selection operation utils in the PR description

Done.

List<DataTable> dataTables = new ArrayList<>();
while (numIterations < 100) {
// build data table with sampling
dataTables.add(SelectionOperatorUtils.getDataTableFromRows(rows, dataSchema, false));
Contributor


Could we add DataTable.toBytes() as well here? Meanwhile, the test would fit the real use case better if we call SelectionOperatorUtils.getDataTableFromRows(rows, dataSchema, false) only once in each thread and submit multiple runner threads. Similar for the group-by case.

Contributor Author


Updated both tests.

Contributor

@jasperjiaguo jasperjiaguo left a comment


LGTM, thanks for addressing the comments.

Could you take a look at the tests? They are failing, and the log size looks pretty large. Maybe set the log level to info in the new tests?

Contributor Author

SabrinaZhaozyf commented Nov 9, 2023

Could you take a look at the tests? They are failing, and the log size looks pretty large. Maybe set the log level to info in the new tests?

The failed tests were due to a flaky test, which has since been fixed.
Rebased and turned off logging for the tests added.

Thank you for the review! @jasperjiaguo

@siddharthteotia siddharthteotia merged commit bb52625 into apache:master Nov 9, 2023
19 checks passed