docs: query from deep storage #14609
Merged
+1,453
−61
43 commits
6dadb9d cold tier wip (317brian)
04e784d Merge remote-tracking branch 'upstream/master' into hybrid-query-docs (317brian)
671d31c Merge remote-tracking branch 'upstream/master' into hybrid-query-docs (317brian)
8dbad1d wip (317brian)
9ae0893 copyedits (317brian)
20f808b wip (317brian)
7dc14c1 copyedits (317brian)
9d0d4a3 copyedits (317brian)
05057df wip (317brian)
3c1d839 wip (317brian)
8fb3675 update rules page (317brian)
11c749a typo (317brian)
9b45bfa typo (317brian)
52e0d2f update sidebar (317brian)
ac0f39e moves durable storage info to its own page in operations (317brian)
ef8039c update screenshots (317brian)
cf24f72 add apache license (317brian)
a771418 Merge branch 'master' into hybrid-query-docs (317brian)
7880654 Apply suggestions from code review (317brian)
a1af5ca add query from deep storage tutorial stub (317brian)
f2b1526 address some of the feedback (317brian)
ca26e60 revert screenshot update. handled in separate pr (317brian)
61dc630 load rule update (317brian)
7afd40f wip tutorial (317brian)
334cf62 reformat deep storage endpoints (demo-kratia)
dcf98ef Merge remote-tracking branch 'origin/hybrid-query-docs' into hybrid-q… (demo-kratia)
102f421 rest of tutorial (317brian)
d310277 typo (317brian)
c2c12a7 cleanup (317brian)
bc4d974 screenshot and sidebar for tutorial (317brian)
110e840 add license (317brian)
5dd46b1 typos (317brian)
26a9032 Apply suggestions from code review (317brian)
c8ae087 rest of review comments (317brian)
832a58d clarify where results are stored (317brian)
774adca update api reference for durablestorage context param (317brian)
ff632a4 Apply suggestions from code review (317brian)
9e3109c comments (317brian)
9840cb5 incorporate #14720 (317brian)
a5d1e44 address rest of comments (317brian)
a17a19d missed one (317brian)
231171b Update docs/api-reference/sql-api.md (cryptoe)
39e3633 Update docs/api-reference/sql-api.md (cryptoe)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
--- | ||
id: durable-storage | ||
title: "Durable storage for the multi-stage query engine" | ||
sidebar_label: "Durable storage" | ||
--- | ||
|
||
<!-- | ||
~ Licensed to the Apache Software Foundation (ASF) under one | ||
~ or more contributor license agreements. See the NOTICE file | ||
~ distributed with this work for additional information | ||
~ regarding copyright ownership. The ASF licenses this file | ||
~ to you under the Apache License, Version 2.0 (the | ||
~ "License"); you may not use this file except in compliance | ||
~ with the License. You may obtain a copy of the License at | ||
~ | ||
~ http://www.apache.org/licenses/LICENSE-2.0 | ||
~ | ||
~ Unless required by applicable law or agreed to in writing, | ||
~ software distributed under the License is distributed on an | ||
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
~ KIND, either express or implied. See the License for the | ||
~ specific language governing permissions and limitations | ||
~ under the License. | ||
--> | ||
|
||
You can use durable storage to improve querying from deep storage and SQL-based ingestion. | ||
|
||
> Note that only S3 is supported as a durable storage location. | ||
Durable storage for queries from deep storage provides a location where you can write the results of deep storage queries. Durable storage for SQL-based ingestion temporarily houses intermediate files, which can improve reliability. | ||
|
||
Enabling durable storage also enables the use of local disk to store temporary files, such as the intermediate files produced | ||
while sorting the data. Tasks use whatever has been configured for their temporary usage, as described in [Configuring task storage sizes](../ingestion/tasks.md#configuring-task-storage-sizes). | ||
If the configured limit is too low, Druid may throw a `NotEnoughTemporaryStorageFault` error. | ||
|
||
## Enable durable storage | ||
|
||
To enable durable storage, you need to set the following common service properties: | ||
|
||
``` | ||
druid.msq.intermediate.storage.enable=true | ||
druid.msq.intermediate.storage.type=s3 | ||
druid.msq.intermediate.storage.bucket=YOUR_BUCKET | ||
druid.msq.intermediate.storage.prefix=YOUR_PREFIX | ||
druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir | ||
``` | ||
|
||
For detailed information about the settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). | ||
|
||
|
||
## Use durable storage for SQL-based ingestion queries | ||
|
||
When you run a query, include the context parameter `durableShuffleStorage` and set it to `true`. | ||
|
||
For queries where you want to use fault tolerance for workers, set `faultTolerance` to `true`, which automatically sets `durableShuffleStorage` to `true`. | ||
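
A minimal sketch of how these context parameters might be attached to an SQL-based ingestion request. The helper name `build_ingest_payload` and the placeholder query are hypothetical, and the `query`/`context` body shape is an assumption based on the SQL API reference, not something this page specifies.

```python
import json


def build_ingest_payload(sql, fault_tolerance=False):
    """Build a request body with durable storage context parameters (illustrative helper)."""
    context = {"durableShuffleStorage": True}
    if fault_tolerance:
        # Per the text above, faultTolerance=true already implies
        # durableShuffleStorage=true, so sending both is redundant but harmless.
        context["faultTolerance"] = True
    return {"query": sql, "context": context}


if __name__ == "__main__":
    # Placeholder statement; substitute your own INSERT or REPLACE query.
    payload = build_ingest_payload("YOUR_INGESTION_QUERY", fault_tolerance=True)
    print(json.dumps(payload, indent=2))
```

The printed JSON is what you would send as the request body; the sketch deliberately stops short of making the HTTP call.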
|
||
## Use durable storage for queries from deep storage | ||
|
||
|
||
Depending on the size of the results you expect, you might need to save the final results for queries from deep storage to durable storage. | ||
|
||
By default, Druid saves the final results for queries from deep storage to task reports. Generally, this is acceptable for smaller result sets but may lead to timeouts for larger result sets. | ||
|
||
When you run a query, include the context parameter `selectDestination` and set it to `DURABLESTORAGE`: | ||
|
||
```json | ||
"context":{ | ||
... | ||
"selectDestination": "DURABLESTORAGE" | ||
} | ||
``` | ||
|
||
You can also write intermediate results to durable storage (`durableShuffleStorage`) for better reliability. The location where workers write intermediate results is different from the location where final results get stored, so you can enable durable storage for results even if you don't write intermediate results to durable storage. | ||
|
||
If you write the results for queries from deep storage to durable storage, the results are cleaned up when the task is removed from the metadata store. | ||
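
As a rough illustration of how the two settings in this section are independent, the sketch below assembles a query context with either or both enabled. The helper name `build_query_context` and its arguments are hypothetical; only the two context parameter names and the `DURABLESTORAGE` value come from the text above.

```python
import json


def build_query_context(results_to_durable_storage=True, durable_shuffle=False):
    """Assemble context parameters for a query from deep storage (illustrative helper)."""
    context = {}
    if results_to_durable_storage:
        # Final results go to durable storage instead of the task report.
        context["selectDestination"] = "DURABLESTORAGE"
    if durable_shuffle:
        # Intermediate results use a separate location, so this flag is independent.
        context["durableShuffleStorage"] = True
    return context


if __name__ == "__main__":
    print(json.dumps({"context": build_query_context()}, indent=2))
```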
|
||
## Durable storage clean up | ||
|
||
To prevent durable storage from filling up with temporary files when tasks fail to clean them up, you can schedule a periodic | ||
cleaner that removes directories for which no controller task is running. The cleaner uses | ||
the storage connector to operate on durable storage. Use the durable storage location only to store the output | ||
for the cluster's MSQ tasks. If the location contains other files or directories, the cleaner removes them as well. | ||
|
||
Use `druid.msq.intermediate.storage.cleaner.enabled` and `druid.msq.intermediate.storage.cleaner.delaySeconds` to configure the cleaner. For more information, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). | ||
|
||
Note that if you choose to write query results to durable storage, the results are cleaned up when the task is removed from the metadata store. | ||
|
||
@cryptoe are there any new durable storage configs specifically for query from deep storage? Or are results written to the `tempDir` property?

The `tempDir` is used more as a staging directory before pushing out bytes to S3. It's not related to the results of the query.