
Limit the subquery results by memory usage #13952

Merged (56 commits, Jun 26, 2023)
Conversation

@LakshSingla LakshSingla commented Mar 20, 2023

Description

Overview

Currently, in the ClientQuerySegmentWalker, when data sources get inlined, they can be limited only by the number of rows, to prevent a query (subquery) from hogging the broker's memory. However, a row count does not correspond well to actual memory usage, since a row can have multiple columns holding varying amounts of data. It would therefore be better to also offer a memory limit, which prevents a subquery's results from growing beyond a certain number of bytes.

This PR uses the Frame, introduced along with MSQ, to store the inline results. Since Frames are backed by memory, we can read the memory used by each frame and, correspondingly, by the inlined data source. This gives a close estimate of the size taken up by the subquery results.

Configuration

  1. Users can set the maxSubqueryBytes key in the query context to an upper bound on the number of bytes the subquery's results may occupy (a minimal sketch of setting these keys follows this list).
  2. Often, the result types of a query are unknown to the broker. Materializing to frames requires that the types be present, which can cause results to not be materialized. Therefore, there is an additional undocumented parameter, useNestedForUnknownTypeInSubquery, which defaults to false and controls how columns with unknown types (i.e. an empty/null columnType) are handled. If it is set to true, such columns are serded as nested JSON data, which should handle most of the common cases.
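
For illustration, here is a minimal sketch of a query context carrying these keys. The keys maxSubqueryBytes and useNestedForUnknownTypeInSubquery come from this PR; the surrounding class and the example values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SubqueryLimitContextExample
{
  // Builds a query context map with the new guardrail keys described above.
  public static Map<String, Object> buildContext()
  {
    final Map<String, Object> context = new HashMap<>();
    // Upper bound, in bytes, on the materialized subquery results (example value).
    context.put("maxSubqueryBytes", 100_000_000L);
    // Optional: serde columns with unknown types as nested JSON so they can still
    // be materialized to frames.
    context.put("useNestedForUnknownTypeInSubquery", true);
    return context;
  }
}
```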

Behaviour

As proposed in the review comments, the behaviour is as follows (a sketch of the decision logic appears after this list):

  • If maxSubqueryBytes is not set:
    • We use maxSubqueryRows to limit the results of the subquery by the number of rows, and we execute the older code path, which doesn't materialize the results to frames.
  • If maxSubqueryBytes is set:
    • If we can materialize the results of all the subqueries to frames:
      • We limit the results of the subquery by the number of bytes.
    • If we can't materialize the results of all the subqueries to frames:
      • We default to the old code path and limit the results of the subquery by the number of rows (if set).
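
A minimal sketch of that decision logic, assuming a convention where an unset maxSubqueryBytes is represented as a non-positive value. Apart from the SubqueryResultLimit values (which appear later in this PR), the names here are illustrative rather than the actual ClientQuerySegmentWalker code.

```java
public class SubqueryLimitSelection
{
  public enum SubqueryResultLimit
  {
    ROW_LIMIT,
    MEMORY_LIMIT
  }

  // Chooses which guardrail applies, mirroring the bullet list above.
  public static SubqueryResultLimit chooseLimit(long maxSubqueryBytes, boolean allSubqueriesMaterializableAsFrames)
  {
    if (maxSubqueryBytes <= 0) {
      // maxSubqueryBytes is not set: keep the older, row-count based code path.
      return SubqueryResultLimit.ROW_LIMIT;
    }
    // maxSubqueryBytes is set: use the byte limit only if every subquery can be
    // materialized to frames; otherwise fall back to row-based limiting.
    return allSubqueriesMaterializableAsFrames
           ? SubqueryResultLimit.MEMORY_LIMIT
           : SubqueryResultLimit.ROW_LIMIT;
  }
}
```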

Proposed changes

  1. ClientQuerySegmentWalker#toInlineDataSource now has distinct code paths that convert the results either to frames (new code path) or to an iterable of rows (existing code path).
  2. The query tool chests have been updated to materialize the results as frames if possible.

Supporting changes

  1. FrameBasedIndexedTable - An indexed table that works on the FrameBasedInlineDataSource. It indexes the key columns and provides a way to extract the columnReader for the columns of the data source.
  2. FrameBasedInlineDataSource - Inline data source that is based upon an underlying list of frames. Frames can be written using individual row signatures, while the data source itself has its own "cumulative" signature built from the underlying frames.
  3. IterableRowsCursorHelper - Creates a cursor from an iterable representing the rows of a data source.
  4. ConcatCursor - Cursor representing the concatenation of multiple underlying cursors (see the sketch after this list).
  5. ScanResultValue has been updated to provide the row signature whenever it is present in the underlying data sources.
  6. FrameBasedInlineDataSourceSerializer - Serializes a FrameBasedInlineDataSource as if it were the traditional InlineDataSource. This keeps the communication protocol between the broker and the data servers (historicals) unchanged.
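
As a rough illustration of the ConcatCursor idea, here is a concatenating iterator built only from standard-library types. Druid's actual Cursor interface is richer (column selectors, advance/reset), so this is an analogy rather than the real class.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class ConcatIterator<T> implements Iterator<T>
{
  private final Iterator<Iterator<T>> delegates;
  private Iterator<T> current = null;

  ConcatIterator(List<Iterator<T>> delegates)
  {
    this.delegates = delegates.iterator();
  }

  @Override
  public boolean hasNext()
  {
    // Skip past exhausted delegates until one with remaining elements is found.
    while ((current == null || !current.hasNext()) && delegates.hasNext()) {
      current = delegates.next();
    }
    return current != null && current.hasNext();
  }

  @Override
  public T next()
  {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }

  public static void main(String[] args)
  {
    Iterator<Integer> concatenated = new ConcatIterator<>(
        Arrays.asList(Arrays.asList(1, 2).iterator(), Arrays.asList(3).iterator())
    );
    concatenated.forEachRemaining(System.out::println); // prints 1, 2, 3
  }
}
```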

Testing

  • The set of changes has been tested on the existing CalciteQueryTests stack.
  • The changes have been tested on a local deployment

Impact on existing deployments

  1. This change shouldn't affect existing deployments. The communication protocol between the broker and the historicals is unchanged: the frame-based data sources are serialized as if they were traditional inline data sources.
  2. The change to ScanResultValue shouldn't affect the upgrade process, since historicals are updated before brokers in the recommended upgrade order, so brokers can always assume that the ScanResultValue contains the updated information (the signature). Brokers currently don't break even if a fetched ScanResultValue has a null rowSignature, so existing deployments shouldn't be affected.
  3. The new code paths are only exercised when the user specifies the memory limit in the query, so existing queries should continue to work as-is, without any performance impact.

Follow up

  1. Create an automatic user-friendly configuration based on the guidelines.

Release note

Users can now add a guardrail to prevent a subquery's results from exceeding a set number of bytes by setting druid.server.http.maxSubqueryBytes in the Broker's config or maxSubqueryBytes in the query context. This feature is experimental for now and falls back to row-based limiting if it cannot determine an accurate size for the results consumed by the query.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@cryptoe cryptoe left a comment

Partial review!

@@ -300,7 +363,7 @@ public boolean equals(Object o)
return false;
}
InlineDataSource that = (InlineDataSource) o;
return rowsEqual(rows, that.rows) &&

Contributor:

Maybe compare frames directly if possible and save on the cost of initializing various frameReaders and then deserializing them to rows.

Contributor Author:

I'll check if this works, since that would be faster.
InlineDataSource equality is only checked in tests AFAIK, so we should be fine even if it doesn't work as such.

@@ -565,10 +616,14 @@ private static <T, QueryType extends Query<T>> InlineDataSource toInlineDataSour
final Sequence<T> results,
final QueryToolChest<T, QueryType> toolChest,
final AtomicInteger limitAccumulator,

Contributor:

We should never be using both limits, I guess. Can we remove the additional 3 params, just pass another param called type, and re-use the same limit variables?

Contributor Author:

I am not sure of the use case. Theoretically, we can pass both limits and error the query out if either one is reached. Is that the behaviour we want to encourage, or do we want the user to give only one of the limits?

Contributor:

If we have two limits, we might want to give them names so we know which limit that limitAccumulator is accumulating: row count or memory bytes?

Frame frame = null;

// Try to serialize the results into a frame only if the memory limit is set on the server or the query
if (memoryLimitSet) {

Contributor @cryptoe, Mar 20, 2023:

We should serialize to frames only when the memoryLimit is set; otherwise the old code path should be invoked.



@Test
public void testTimeseriesOnGroupByOnTableErrorTooLarge()

Contributor:

I think tests around

  1. String cols
  2. Long cols
  3. Double/float cols
  4. Complex cols
  5. Array cols
  6. Nested cols

would help us build confidence in the feature.

Some tests to check out:

  • CalciteQueryTests#testMaxSubqueryRows
  • GroupByQueryRunnerTest#testGroupByMaxRowsLimitContextOverride

Ideally, all tests which have a subquery should be executed using the new code path, but since it's feature-flagged it might not be a hard requirement.

@@ -300,7 +363,7 @@ public boolean equals(Object o)
return false;
}
InlineDataSource that = (InlineDataSource) o;
return rowsEqual(rows, that.rows) &&
return rowsEqual(getRowsAsList(), that.getRowsAsList()) &&

Contributor:

Does it even make semantic sense to compare two input sources for equality? Are we adding this because some static check told us we need it, and not because we actually use it?


new ArrayList<>()
);

final Cursor cursor = new InlineResultsCursor(resultList, signature);

Contributor:

All of this seems too complex to put inside the segment walker. For one thing, it is hard to test if it is an implementation detail. Perhaps pull out this logic into a separate class that can be unit tested extensively. For example, we'd want tests for hitting each of the limits, for handling variable-width columns, etc.

@cryptoe cryptoe left a comment

Left some comments.
Looking forward to the UTs.

default:
throw new ISE("Unrecognized frame type [%s]", frameType);
}
}

public static FrameWriterFactory makeFrameWriterFactory(

Contributor:

This method seems weird. You can call the base method directly.

Contributor Author:

Added this as a separate method originally because we don't require the boolean in the rest of the cases (i.e. the original ones in MSQ). Therefore it made sense to me to hide this complexity from the callers of this method that aren't residing in the broker.

@@ -52,6 +54,8 @@ public DruidDefaultSerializersModule()

JodaStuff.register(this);

addSerializer(FramesBackedInlineDataSource.class, new FramesBackedInlineDataSourceSerializer());

Contributor:

Can you let us know why you think adding the serializer here makes sense?

Contributor:

I would think we don't need to serialize this, as it should exist in-memory only. So I'm also wondering where this was needed.

Contributor Author:

This serialization is required when the broker inlines the subquery results and sends the inlined query to the historicals. In that case, we serialize the Frames and the FrameBasedInlineDataSource so that they behave as if an equivalent row-based InlineDataSource had been serialized.
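
For context, here is a minimal sketch of what such a serializer can look like: it writes the frame-backed source in the JSON shape of a row-based inline data source. The view interface, accessor names, and JSON field names are assumptions for illustration; the actual FrameBasedInlineDataSourceSerializer and InlineDataSource JSON should be checked in the code.

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonSerializer;
import com.fasterxml.jackson.databind.SerializerProvider;
import java.io.IOException;
import java.util.List;

final class FrameBackedAsInlineSerializer extends JsonSerializer<FrameBackedAsInlineSerializer.DataSourceView>
{
  // Hypothetical read-only view of the data source: column names plus materialized rows.
  interface DataSourceView
  {
    List<String> getColumnNames();
    List<Object[]> getRowsAsList();
  }

  @Override
  public void serialize(DataSourceView value, JsonGenerator gen, SerializerProvider serializers) throws IOException
  {
    // Emit the same overall shape as a traditional inline data source so the
    // historicals can deserialize it without any protocol change.
    gen.writeStartObject();
    gen.writeStringField("type", "inline");
    gen.writeObjectField("columnNames", value.getColumnNames());
    gen.writeObjectField("rows", value.getRowsAsList());
    gen.writeEndObject();
  }
}
```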

frame = Frame.wrap(frameWriter.toByteArray());
}

return new FrameSignaturePair(frame, result.getRowSignature());

Contributor:

We are making one frame per result sequence, and each result sequence represents one segment. This does not seem very scalable.
Let's leave a note here so that we can get back to this in a future PR.

Contributor:

Since we can have different row signatures per segment, what we can do is only start a new frame when the row signature changes. This will reduce the number of frames by a lot.
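
A small sketch of that suggestion: group consecutive per-segment results that share a row signature so each group can be written into one frame. The types here are placeholders rather than Druid's actual result and signature classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

public class SignatureBatcher
{
  // Groups consecutive results that share a row signature into one batch, so that
  // each batch can be written into a single frame instead of one frame per segment.
  public static <T> List<List<T>> batchBySignature(List<T> results, Function<T, String> signatureOf)
  {
    final List<List<T>> batches = new ArrayList<>();
    String currentSignature = null;
    List<T> currentBatch = null;
    for (T result : results) {
      final String signature = signatureOf.apply(result);
      if (currentBatch == null || !Objects.equals(signature, currentSignature)) {
        // Signature changed (or this is the first result): start a new batch.
        currentBatch = new ArrayList<>();
        batches.add(currentBatch);
        currentSignature = signature;
      }
      currentBatch.add(result);
    }
    return batches;
  }
}
```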

@LakshSingla (Contributor Author):

Thanks for the reviews!
Following the latest batch of changes, here are the major things that I have updated:

  1. Created a FrameBasedIndexedTable to build an indexed table on top of a frame-based data source. This also changed the frame type to COLUMNAR, since indexing the table requires reading the columns corresponding to the indexed keys.
  2. Used a memory allocator factory to allow flexibility in converting the inline results to frames.
  3. Added configurations; the fallback code paths still need to be provided.
  4. Refactored the code and made stylistic changes.

@cryptoe cryptoe left a comment

Changes LGTM.
Thanks @LakshSingla !!

{
public enum SubqueryResultLimit
{
ROW_LIMIT,

Contributor:

Please add Javadocs here.

limitAccumulator.addAndGet(frame.getFrame().numRows());
if (memoryLimitAccumulator.addAndGet(frame.getFrame().numBytes()) >= memoryLimit) {
throw ResourceLimitExceededException.withMessage(
"Subquery generated results beyond maximum[%d] bytes",

Contributor:

This method needs Javadocs. Line 730 swallows exceptions in order to fall back; let's document this.

Contributor:

What's the expected action from the user?

Contributor Author:

Updating with a more appropriate error message
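
For reference, a stripped-down sketch of the accumulator check in the hunk quoted above. The class and exception here are stand-ins; the real code uses Druid's ResourceLimitExceededException inside ClientQuerySegmentWalker.

```java
import java.util.concurrent.atomic.AtomicLong;

final class SubqueryByteGuard
{
  private final AtomicLong bytesAccumulated = new AtomicLong();
  private final long memoryLimit;

  SubqueryByteGuard(long memoryLimit)
  {
    this.memoryLimit = memoryLimit;
  }

  // Adds the size of one materialized frame and fails fast once the byte limit is crossed.
  void add(long frameBytes)
  {
    if (bytesAccumulated.addAndGet(frameBytes) >= memoryLimit) {
      throw new IllegalStateException(
          "Subquery generated results beyond maximum[" + memoryLimit + "] bytes"
      );
    }
  }
}
```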

);
}
catch (Exception e) {
return Optional.empty();

Contributor:

Please add a debug log line so that we know the exception here.

Contributor:

why is this a debug log though? It should be WARN.

Contributor Author:

This will be executed per query, so DEBUG is more appropriate in my opinion; otherwise the logs will be cluttered with the exception message. Either we should:

  1. Keep it at DEBUG so that we don't clutter the logs. This has the disadvantage that we won't be able to readily observe when we fall back to the default code path.
  2. Not catch the exception and let it propagate. The user will then report the issue and we can fix it.

The 2nd option means there won't be a fallback in case we aren't able to convert the results to frames. Since this is a newer feature, I think we should still have a fallback until we are confident that we can convert every query; once it is more mature and frames can handle array types (currently they handle string arrays only), we can remove the fallback altogether and let the exception pass through.

Contributor:

Yeah, we shouldn't do the 2nd option. We can do something like the following (sketched below):

  • Log the fallback at INFO.
  • Log the exception stack trace if debug is set in the query context.
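
A sketch of that approach, assuming SLF4J for brevity (Druid has its own Logger wrapper); the method and flag names are illustrative.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SubqueryFallbackLogging
{
  private static final Logger LOG = LoggerFactory.getLogger(SubqueryFallbackLogging.class);

  // Logs the fall back to row-based limiting at INFO, attaching the stack trace
  // only when the query context asks for debug output.
  public static void logFallback(Exception e, boolean debugSetInQueryContext)
  {
    if (debugSetInQueryContext) {
      LOG.info("Cannot materialize subquery results as frames; falling back to row-based limiting", e);
    } else {
      LOG.info("Cannot materialize subquery results as frames; falling back to row-based limiting: {}", e.getMessage());
    }
  }
}
```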

}

if (!firstRowWritten) {
throw new ISE("Row size is greater than the frame size.");

Contributor:

Will a user ever see this error message? Please use the DruidException class instead and add an error message that's more actionable.

Contributor Author:

Yes, this can be seen at the top level. Refactored with DruidException and a more actionable error message.


cryptoe commented Jun 26, 2023

Since this PR is liable to break due to merge conflicts, going ahead and merging this.
@LakshSingla Please address the logging feedback from @abhishekagarwal87 as part of a separate PR.

@cryptoe cryptoe merged commit 1647d5f into apache:master Jun 26, 2023
cryptoe commented Jun 26, 2023

Thanks for the contribution @LakshSingla !!

@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023