MSQ controller: Support in-memory shuffles; towards JVM reuse. #16168

gianm · 2024-03-19T09:49:26Z

This patch contains two controller changes that make progress towards a lower-latency MSQ.

These are both larger changes, but I developed them together and it was most straightforward to put them into a single PR rather than separating into multiple PRs, especially given they both involved changes in common files like ControllerImpl and ControllerContext.

Key classes:

These files are the most interesting IMO.

Controller
ControllerImpl
ControllerQueryKernel
ControllerQueryKernelConfig
ControllerQueryKernelUtils (especially computeStageGroups)
ControllerContext
MSQWorkerTaskLauncher

Key changes:

First, support for in-memory shuffles. The main feature of in-memory shuffles, as far as the controller is concerned, is that they are not fully buffered. That means that whenever a producer stage uses in-memory output, its consumer must run concurrently. The controller determines which stages run concurrently, and when they start and stop.

"Leapfrogging" allows any chain of sort-based stages to use in-memory shuffles even if we can only run two stages at once. For example, in a linear chain of stages 0 -> 1 -> 2 where all do sort-based shuffles, we can use in-memory shuffling for each one while only running two at once. (When stage 1 is done reading input and about to start writing its output, we can stop 0 and start 2.)

New OutputChannelMode enum attached to WorkOrder that tells workers
whether stage output should be in memory (MEMORY), or use local or durable
storage.
New logic in the ControllerQueryKernel to determine which stages can use
in-memory shuffling (ControllerUtils#computeStageGroups) and to launch them
at the appropriate time (ControllerQueryKernel#createNewKernels).
New doneReadingInput method on Controller (passed down to the stage kernels)
which allows stages to transition to POST_READING even if they are not
gathering statistics. This is important because it enables "leapfrogging"
for HASH_LOCAL_SORT shuffles, and for GLOBAL_SORT shuffles with 1 partition.
Moved result-reading from ControllerContext#writeReports to new QueryListener
interface, which ControllerImpl feeds results to row-by-row while the query
is still running. Important so we can read query results from the final
stage using an in-memory channel.
New class ControllerQueryKernelConfig holds configs that control kernel
behavior (such as whether to pipeline, maximum number of concurrent stages,
etc). Generated by the ControllerContext.

Second, a refactor towards running workers in persistent JVMs that are able to cache data across queries. This is helpful because I believe we'll want to reuse JVMs and cached data for latency reasons.

Move creation of WorkerManager and TableInputSpecSlicer to the
ControllerContext, rather than ControllerImpl. This allows managing workers and
work assignment differently when JVMs are reusable.
Lift the Controller Jersey resource out from ControllerChatHandler to a
reusable resource ControllerResource.
Move memory introspection to a MemoryIntrospector interface, and introduce
ControllerMemoryParameters that uses it. This makes it easier to run MSQ in
process types other than Indexer and Peon.

Both of these areas will have follow-ups that make similar changes on the worker side.

This patch contains two controller changes that make progress towards a lower-latency MSQ. First, support for in-memory shuffles. The main feature of in-memory shuffles, as far as the controller is concerned, is that they are not fully buffered. That means that whenever a producer stage uses in-memory output, its consumer must run concurrently. The controller determines which stages run concurrently, and when they start and stop. "Leapfrogging" allows any chain of sort-based stages to use in-memory shuffles even if we can only run two stages at once. For example, in a linear chain of stages 0 -> 1 -> 2 where all do sort-based shuffles, we can use in-memory shuffling for each one while only running two at once. (When stage 1 is done reading input and about to start writing its output, we can stop 0 and start 2.) 1) New OutputChannelMode enum attached to WorkOrders that tells workers whether stage output should be in memory (MEMORY), or use local or durable storage. 2) New logic in the ControllerQueryKernel to determine which stages can use in-memory shuffling (ControllerUtils#computeStageGroups) and to launch them at the appropriate time (ControllerQueryKernel#createNewKernels). 3) New "doneReadingInput" method on Controller (passed down to the stage kernels) which allows stages to transition to POST_READING even if they are not gathering statistics. This is important because it enables "leapfrogging" for HASH_LOCAL_SORT shuffles, and for GLOBAL_SORT shuffles with 1 partition. 4) Moved result-reading from ControllerContext#writeReports to new QueryListener interface, which ControllerImpl feeds results to row-by-row while the query is still running. Important so we can read query results from the final stage using an in-memory channel. 5) New class ControllerQueryKernelConfig holds configs that control kernel behavior (such as whether to pipeline, maximum number of concurrent stages, etc). Generated by the ControllerContext. Second, a refactor towards running workers in persistent JVMs that are able to cache data across queries. This is helpful because I believe we'll want to reuse JVMs and cached data for latency reasons. 1) Move creation of WorkerManager and TableInputSpecSlicer to the ControllerContext, rather than ControllerImpl. This allows managing workers and work assignment differently when JVMs are reusable. 2) Lift the Controller Jersey resource out from ControllerChatHandler to a reusable resource. 3) Move memory introspection to a MemoryIntrospector interface, and introduce ControllerMemoryParameters that uses it. This makes it easier to run MSQ in process types other than Indexer and Peon. Both of these areas will have follow-ups that make similar changes on the worker side.

...nsions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerContext.java

gianm · 2024-04-09T18:22:03Z

...nsions-core/multi-stage-query/src/main/java/org/apache/druid/msq/rpc/ControllerResource.java

+  @Consumes(MediaType.APPLICATION_JSON)
+  public Response httpPostWorkerError(
+      final MSQErrorReport errorReport,
+      @PathParam("taskId") final String taskId,


This code was just moved; the parameter is unused in master as well

gianm · 2024-04-09T18:22:09Z

...nsions-core/multi-stage-query/src/main/java/org/apache/druid/msq/rpc/ControllerResource.java

+  @Consumes(MediaType.APPLICATION_JSON)
+  public Response httpPostPartialKeyStatistics(
+      final Object partialKeyStatisticsObject,
+      @PathParam("queryId") final String queryId,


This code was just moved; the parameter is unused in master as well

...e-query/src/test/java/org/apache/druid/msq/kernel/controller/MockQueryDefinitionBuilder.java

gianm · 2024-04-09T18:23:06Z

...e/multi-stage-query/src/main/java/org/apache/druid/msq/indexing/report/MSQResultsReport.java

@@ -132,9 +72,9 @@
  }

  @JsonProperty("results")
-  public Yielder<Object[]> getResultYielder()
+  public List<Object[]> getResults()


We trust the caller to not modify the arrays here.

extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerImpl.java

…roller-refactor

LakshSingla

Thanks for the PR! I am leaving a partial review, primarily centered around the computeStageGroups method. My understanding is incomplete, and there seem to be some assumptions baked into the code, which if clarified, would make the code easier to reason about for someone unfamiliar with the logic.

LakshSingla · 2024-04-19T19:37:34Z

...-core/multi-stage-query/src/main/java/org/apache/druid/msq/kernel/controller/StageGroup.java

+ */
+public class StageGroup
+{
+  private final List<StageId> stageIds;


It seems that there's an implicit contract that the flow of the data between the stage group is linear (A -> B -> C ...) , and cannot be branched (
A
|
B
|
C D
| /
E
)
We should probably mention that. Unrelated, but is it subject to change in the future?

Yes, it is required to be linear at this point (mostly for simplicity). It could be changed in the future. I have added a comment.

LakshSingla · 2024-04-19T19:38:19Z

...-core/multi-stage-query/src/main/java/org/apache/druid/msq/kernel/controller/StageGroup.java

+  }
+
+  /**
+   * List of stage IDs in this group.


As mentioned previously, there's an implicit relation between the stages in the list, and [A, B, C] is not the same as [A, C, B]

That's true. I have added a comment.

LakshSingla · 2024-04-19T19:43:10Z