Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compaction with MSQ #14810

Closed
wants to merge 4 commits into from
Closed

Conversation

AmatyaAvadhanula
Copy link
Contributor

@AmatyaAvadhanula AmatyaAvadhanula commented Aug 14, 2023

Adds support for compaction using MSQ.

Description

This PR adds logic to convert the parallel index specs generated by compact tasks into MSQ REPLACE statements.
It then uses the newly added RouterClient to submit these MSQ controllers to /druid/v2/sql/task

Release note

Add the following context to the compact task or auto-compaction config to enable compaction using MSQ.
"context": {
"useMSQCompaction": true
}



This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula AmatyaAvadhanula changed the title [WIP] Compaction with MSQ Compaction with MSQ Aug 15, 2023
@gianm
Copy link
Contributor

gianm commented Aug 16, 2023

Architecturally it seems weirdly complicated for the compact task to start a new top-level controller task. It will also probably lead to issues with locking.

One of these seems like a better approach to me. What do you think of these approaches?

  1. Have the compact task itself be an MSQ controller task (similar to how it can itself be an index_parallel controller task). It would need to generate a native query rather than a SQL query for this to work, because it doesn't have direct access to the SQL planner.
  2. Forget the compact task, focus on auto-compaction. Have the Coordinator submit SQL statements rather than compact tasks.

@gianm
Copy link
Contributor

gianm commented Aug 16, 2023

Adding a bit to the prior comment: personally I think the second approach, where we don't use compact tasks at all, is best. IMO, the ideal way to do it is something like this:

First, incorporate enough metadata into the metadata store, such that compaction doesn't need to fetch segments from deep storage to figure out what to do. There are a couple of approaches we could take here, including a catalog-based approaches (where metadata is explicitly specified) or a metadata-stashing approach where we save segment row signatures in the metadata store when segments are published. This latter idea would be useful for a bunch of other reasons.

Second, introduce a VACUUM [table] or COMPACT [table] command in SQL. It should take an optional interval and it should leverage the metadata from the first step in order to determine what to do.

Third, have auto-compaction at the Coordinator issue one of these SQL commands.

@cryptoe
Copy link
Contributor

cryptoe commented Dec 17, 2023

@AmatyaAvadhanula , I am closing this PR since the approach mentioned above would require patches unrelated to this PR.
Please feel free to reopen the PR in-case the work is revived.

@cryptoe cryptoe closed this Dec 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants