diff --git a/docs/api-reference/sql-api.md b/docs/api-reference/sql-api.md
index aaaf499851d5..bfb74a4e02c0 100644
--- a/docs/api-reference/sql-api.md
+++ b/docs/api-reference/sql-api.md
@@ -4,7 +4,12 @@ title: Druid SQL API
sidebar_label: Druid SQL
---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+
-Deep storage is where segments are stored. It is a storage mechanism that Apache Druid does not provide. This deep storage infrastructure defines the level of durability of your data, as long as Druid processes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented.
+Deep storage is where segments are stored. It is a storage mechanism that Apache Druid does not provide. This deep storage infrastructure defines the level of durability of your data. As long as Druid processes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented.
-## Local
+In addition to being the backing store for segments, you can use [query from deep storage](#querying-from-deep-storage) to run queries against segments stored primarily in deep storage. The [load rules](../operations/rule-configuration.md#load-rules) you configure determine whether segments exist primarily in deep storage or in a combination of deep storage and Historical processes.
+
+## Deep storage options
+
+Druid supports multiple options for deep storage, including blob storage from major cloud providers. Select the one that fits your environment.
+
+### Local
Local storage is intended for use in the following situations:
@@ -55,22 +61,28 @@ druid.storage.storageDirectory=/tmp/druid/localStorage
The `druid.storage.storageDirectory` must be set to a different path than `druid.segmentCache.locations` or `druid.segmentCache.infoDir`.
-## Amazon S3 or S3-compatible
+### Amazon S3 or S3-compatible
See [`druid-s3-extensions`](../development/extensions-core/s3.md).
-## Google Cloud Storage
+### Google Cloud Storage
See [`druid-google-extensions`](../development/extensions-core/google.md).
-## Azure Blob Storage
+### Azure Blob Storage
See [`druid-azure-extensions`](../development/extensions-core/azure.md).
-## HDFS
+### HDFS
See [druid-hdfs-storage extension documentation](../development/extensions-core/hdfs.md).
-## Additional options
+### Additional options
For additional deep storage options, please see our [extensions list](../configuration/extensions.md).
+
+## Querying from deep storage
+
+Querying from deep storage is not as performant as querying segments loaded on Historical processes, but it lets you access segments that you need infrequently or that do not require the very low latency that Druid queries traditionally provide. You trade some performance for a lower total storage cost, because you can access more of your data without increasing the number or capacity of your Historical processes.
+
+For information about how to run queries, see [Query from deep storage](../querying/query-from-deep-storage.md).
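To make the deep storage options above a little more concrete, here is a minimal sketch of what pointing deep storage at S3 might look like in the common service properties. This is an illustration only: the bucket name and base key are placeholders, and a real deployment usually needs additional credential and region settings, so treat the `druid-s3-extensions` documentation as the authoritative reference.

```
# Illustrative sketch only: load the S3 extension and point deep storage at a bucket.
druid.extensions.loadList=["druid-s3-extensions"]

# Store segments in S3. The bucket and base key below are placeholders.
druid.storage.type=s3
druid.storage.bucket=your-deep-storage-bucket
druid.storage.baseKey=druid/segments
```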
\ No newline at end of file diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index 8f9e53b557bd..0cab3a2fd836 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -342,59 +342,24 @@ CLUSTERED BY user The context parameter that sets `sqlJoinAlgorithm` to `sortMerge` is not shown in the above example. -## Durable Storage +## Durable storage -Using durable storage with your SQL-based ingestion can improve their reliability by writing intermediate files to a storage location temporarily. +SQL-based ingestion supports using durable storage to store intermediate files temporarily. Enabling it can improve reliability. For more information, see [Durable storage](../operations/durable-storage.md). -To prevent durable storage from getting filled up with temporary files in case the tasks fail to clean them up, a periodic -cleaner can be scheduled to clean the directories corresponding to which there isn't a controller task running. It utilizes -the storage connector to work upon the durable storage. The durable storage location should only be utilized to store the output -for cluster's MSQ tasks. If the location contains other files or directories, then they will get cleaned up as well. - -Enabling durable storage also enables the use of local disk to store temporary files, such as the intermediate files produced -by the super sorter. Tasks will use whatever has been configured for their temporary usage as described in [Configuring task storage sizes](../ingestion/tasks.md#configuring-task-storage-sizes) -If the configured limit is too low, `NotEnoughTemporaryStorageFault` may be thrown. - -### Enable durable storage - -To enable durable storage, you need to set the following common service properties: - -``` -druid.msq.intermediate.storage.enable=true -druid.msq.intermediate.storage.type=s3 -druid.msq.intermediate.storage.bucket=YOUR_BUCKET -druid.msq.intermediate.storage.prefix=YOUR_PREFIX -druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir -``` - -For detailed information about the settings related to durable storage, see [Durable storage configurations](#durable-storage-configurations). - - -### Use durable storage for queries - -When you run a query, include the context parameter `durableShuffleStorage` and set it to `true`. - -For queries where you want to use fault tolerance for workers, set `faultTolerance` to `true`, which automatically sets `durableShuffleStorage` to `true`. - -Set `selectDestination`:`durableStorage` for select queries that want to write the final results to durable storage instead of the task reports. Saving the results in the durable -storage allows users to fetch large result sets. The location where the workers write the intermediate results is different than the location where final results get stored. Therefore, `durableShuffleStorage`:`false` and -`selectDestination`:`durableStorage` is a valid configuration to use in the query context, that instructs the controller to persist only the final result in the durable storage, and not the -intermediate results. - - -## Durable storage configurations +### Durable storage configurations The following common service properties control how durable storage behaves: |Parameter |Default | Description | |-------------------|----------------------------------------|----------------------| -|`druid.msq.intermediate.storage.bucket` | n/a | The bucket in S3 where you want to store intermediate files. 
|
-|`druid.msq.intermediate.storage.chunkSize` | 100MiB | Optional. Defines the size of each chunk to temporarily store in `druid.msq.intermediate.storage.tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls made to the durable storage, however it requires more disk space to store the temporary chunks. Druid uses a default of 100MiB if the value is not provided.|
-|`druid.msq.intermediate.storage.enable` | true | Required. Whether to enable durable storage for the cluster.|
-|`druid.msq.intermediate.storage.maxRetry` | 10 | Optional. Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. |
-|`druid.msq.intermediate.storage.prefix` | n/a | S3 prefix to store intermediate stage results. Provide a unique value for the prefix. Don't share the same prefix between clusters. If the location includes other files or directories, then they will get cleaned up as well. |
+|`druid.msq.intermediate.storage.enable` | true | Required. Whether to enable durable storage for the cluster. For more information about enabling durable storage, see [Durable storage](../operations/durable-storage.md).|
+|`druid.msq.intermediate.storage.type` | `s3` for Amazon S3 | Required. The type of storage to use. `s3` is the only supported storage type. |
+|`druid.msq.intermediate.storage.bucket` | n/a | The S3 bucket to store intermediate files. |
+|`druid.msq.intermediate.storage.prefix` | n/a | S3 prefix to store intermediate stage results. Provide a unique value for the prefix. Don't share the same prefix between clusters. If the location includes other files or directories, then they will get cleaned up as well. |
|`druid.msq.intermediate.storage.tempDir`| n/a | Required. Directory path on the local disk to temporarily store intermediate stage results. |
-|`druid.msq.intermediate.storage.type` | `s3` if your deep storage is S3 | Required. The type of storage to use. You can either set this to `local` or `s3`. |
+|`druid.msq.intermediate.storage.maxRetry` | 10 | Optional. Defines the max number of times to attempt S3 API calls to avoid failures due to transient errors. |
+|`druid.msq.intermediate.storage.chunkSize` | 100MiB | Optional. Defines the size of each chunk to temporarily store in `druid.msq.intermediate.storage.tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls made to the durable storage; however, it requires more disk space to store the temporary chunks. Druid uses a default of 100MiB if the value is not provided.|
+
In addition to the common service properties, there are certain properties that you configure on the Overlord specifically to clean up intermediate files:
diff --git a/docs/operations/durable-storage.md b/docs/operations/durable-storage.md
new file mode 100644
index 000000000000..80545f9a9b28
--- /dev/null
+++ b/docs/operations/durable-storage.md
@@ -0,0 +1,86 @@
+---
+id: durable-storage
+title: "Durable storage for the multi-stage query engine"
+sidebar_label: "Durable storage"
+---
+
+
+You can use durable storage to improve querying from deep storage and SQL-based ingestion.
+
+> Note that only S3 is supported as a durable storage location.
+
+Durable storage for queries from deep storage provides a location where you can write the results of deep storage queries. Durable storage for SQL-based ingestion is used to temporarily house intermediate files, which can improve reliability.
+
+Enabling durable storage also enables the use of local disk to store temporary files, such as the intermediate files produced
+while sorting the data. Tasks will use whatever has been configured for their temporary usage as described in [Configuring task storage sizes](../ingestion/tasks.md#configuring-task-storage-sizes).
+If the configured limit is too low, Druid may throw the error `NotEnoughTemporaryStorageFault`.
+
+## Enable durable storage
+
+To enable durable storage, you need to set the following common service properties:
+
+```
+druid.msq.intermediate.storage.enable=true
+druid.msq.intermediate.storage.type=s3
+druid.msq.intermediate.storage.bucket=YOUR_BUCKET
+druid.msq.intermediate.storage.prefix=YOUR_PREFIX
+druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir
+```
+
+For detailed information about the settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations).
+
+
+## Use durable storage for SQL-based ingestion queries
+
+When you run a query, include the context parameter `durableShuffleStorage` and set it to `true`.
+
+For queries where you want to use fault tolerance for workers, set `faultTolerance` to `true`, which automatically sets `durableShuffleStorage` to `true`.
+
+## Use durable storage for queries from deep storage
+
+Depending on the size of the results you're expecting, you might need to save the final results for queries from deep storage to durable storage.
+
+By default, Druid saves the final results for queries from deep storage to task reports. Generally, this is acceptable for smaller result sets but may lead to timeouts for larger result sets.
+
+When you run a query, include the context parameter `selectDestination` and set it to `DURABLESTORAGE`:
+
+```json
+ "context":{
+    ...
+    "selectDestination": "DURABLESTORAGE"
+ }
+```
+
+You can also write intermediate results to durable storage (`durableShuffleStorage`) for better reliability. The location where workers write intermediate results is different from the location where final results get stored. This means that durable storage for results can be enabled even if you don't write intermediate results to durable storage.
+
+If you write the results for queries from deep storage to durable storage, the results are cleaned up when the task is removed from the metadata store.
+
+## Durable storage clean up
+
+To prevent durable storage from filling up with temporary files when tasks fail to clean them up, you can schedule a periodic
+cleaner that removes the directories for which there is no controller task running. The cleaner uses the storage connector to operate
+on the durable storage. Use the durable storage location only to store the output for the cluster's MSQ tasks. If the location contains
+other files or directories, then they will get cleaned up as well.
+
+Use `druid.msq.intermediate.storage.cleaner.enabled` and `druid.msq.intermediate.storage.cleaner.delaySeconds` to configure the cleaner. For more information, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations).
+
+Note that if you choose to write query results to durable storage, the results are cleaned up when the task is removed from the metadata store.
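The cleaner properties named above would typically go in the Overlord's runtime properties. The following is only an illustrative sketch: the property names come from the durable storage configuration reference, but the delay value here is an example for demonstration, not a recommended setting.

```
# Illustrative sketch: schedule the periodic cleaner on the Overlord.
druid.msq.intermediate.storage.cleaner.enabled=true
# Example delay between cleaner runs, in seconds (value shown is illustrative).
druid.msq.intermediate.storage.cleaner.delaySeconds=86400
```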
+ diff --git a/docs/operations/rule-configuration.md b/docs/operations/rule-configuration.md index 8d12beac96ca..aa42ef461b28 100644 --- a/docs/operations/rule-configuration.md +++ b/docs/operations/rule-configuration.md @@ -107,7 +107,7 @@ In the web console you can use the up and down arrows on the right side of the i ## Load rules -Load rules define how Druid assigns segments to [historical process tiers](./mixed-workloads.md#historical-tiering), and how many replicas of a segment exist in each tier. +Load rules define how Druid assigns segments to [Historical process tiers](./mixed-workloads.md#historical-tiering), and how many replicas of a segment exist in each tier. If you have a single tier, Druid automatically names the tier `_default`. If you define an additional tier, you must define a load rule to specify which segments to load on that tier. Until you define a load rule, your new tier remains empty. @@ -120,6 +120,8 @@ All load rules can have these properties: Specific types of load rules discussed below may have other properties too. +Load rules are also how you take advantage of the resource savings that [query the data from deep storage](../querying/query-from-deep-storage.md) provides. One way to configure data so that certain segments are not loaded onto Historical tiers but are available to query from deep storage is to set `tieredReplicants` to an empty array and `useDefaultTierForNull` to `false` for those segments, either by interval or by period. + ### Forever load rule The forever load rule assigns all datasource segments to specified tiers. It is the default rule Druid applies to datasources. Forever load rules have type `loadForever`. @@ -167,7 +169,7 @@ Set the following properties: - the segment interval starts any time after the rule interval starts. You can use this property to load segments with future start and end dates, where "future" is relative to the time when the Coordinator evaluates data against the rule. Defaults to `true`. -- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. +- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. - `useDefaultTierForNull`: This parameter determines the default value of `tieredReplicants` and only has an effect if the field is not present. The default value of `useDefaultTierForNull` is true. ### Interval load rule @@ -190,7 +192,7 @@ Interval load rules have type `loadByInterval`. The following example places one Set the following properties: - `interval`: the load interval specified as an [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) range encoded as a string. -- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. +- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. - `useDefaultTierForNull`: This parameter determines the default value of `tieredReplicants` and only has an effect if the field is not present. The default value of `useDefaultTierForNull` is true. ## Drop rules @@ -256,7 +258,7 @@ Set the following property: ### Interval drop rule -You can use a drop interval rule to prevent Druid from loading a specified range of data onto any tier. The range is typically your oldest data. The dropped data resides in cold storage, but is not queryable. If you need to query the data, update or remove the interval drop rule so that Druid reloads the data. +You can use a drop interval rule to prevent Druid from loading a specified range of data onto any tier. 
The range is typically your oldest data. The dropped data resides in deep storage and can still be [queried from deep storage](../querying/query-from-deep-storage.md).
Interval drop rules have type `dropByInterval` and the following JSON structure:
diff --git a/docs/querying/query-from-deep-storage.md b/docs/querying/query-from-deep-storage.md
new file mode 100644
index 000000000000..5f076ca47cdf
--- /dev/null
+++ b/docs/querying/query-from-deep-storage.md
@@ -0,0 +1,195 @@
+---
+id: query-deep-storage
+title: "Query from deep storage"
+---
+
+
+> Query from deep storage is an [experimental feature](../development/experimental.md).
+
+Druid can query segments that are only stored in deep storage. Running a query from deep storage is slower than running queries against segments that are loaded on Historical processes, but it's a great tool for data that you access infrequently or that doesn't need the low-latency results that typical Druid queries provide. Queries from deep storage can increase the surface area of data available to query without requiring you to scale your Historical processes to accommodate more segments.
+
+## Keep segments in deep storage only
+
+Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. However, to take advantage of the cost savings that querying from deep storage provides, make sure not all your segments get loaded onto Historical processes.
+
+To do this, configure [load rules](../operations/rule-configuration.md#load-rules) to manage which segments are only in deep storage and which get loaded onto Historical processes.
+
+The easiest way to do this is to explicitly configure the segments that don't get loaded onto Historical processes. Set `tieredReplicants` to an empty map and `useDefaultTierForNull` to `false`. For example, if you configure the following rule for a datasource:
+
+```json
+[
+  {
+    "interval": "2016-06-27T00:00:00.000Z/2016-06-27T02:59:00.000Z",
+    "tieredReplicants": {},
+    "useDefaultTierForNull": false,
+    "type": "loadByInterval"
+  }
+]
+```
+
+Any segment that falls within the specified interval exists only in deep storage. Segments that aren't in this interval use the default cluster load rules or any other load rules you configure.
+
+To configure the load rules through the Druid console, go to **Datasources > ... in the Actions column > Edit retention rules**. Then, paste the provided JSON into the JSON tab:
+
+![](../assets/tutorial-query-deepstorage-retention-rule.png)
+
+
+You can verify that a segment is not loaded on any Historical tiers by querying the `sys.segments` system table:
+
+```sql
+SELECT "segment_id", "replication_factor" FROM sys."segments" WHERE "replication_factor" = 0 AND "datasource" = 'YOUR_DATASOURCE'
+```
+
+Segments with a `replication_factor` of `0` are not assigned to any Historical tiers. Queries against these segments are run directly against the segment in deep storage.
+
+You can also confirm this through the Druid console. On the **Segments** page, see the **Replication factor** column.
+
+Keep the following in mind when working with load rules to control what exists only in deep storage:
+
+- At least one of the segments in a datasource must be loaded onto a Historical process so that Druid can plan the query. The segment on the Historical process can be any segment from the datasource. It does not need to be a specific segment.
One way to verify that a datasource has at least one segment on a Historical process is to check whether the datasource is visible in the Druid console.
+- The actual number of replicas may differ from the replication factor temporarily as Druid processes your load rules.
+
+## Run a query from deep storage
+
+### Submit a query
+
+You can query data from deep storage by submitting a query to the API using `POST /sql/statements` or the Druid console. Druid uses the multi-stage query (MSQ) task engine to perform the query.
+
+To run a query from deep storage, send your query to the Router using the POST method:
+
+```
+POST https://ROUTER:8888/druid/v2/sql/statements
+```
+
+Submitting a query from deep storage uses the same syntax as any other Druid SQL query: the query is contained in the `query` field of the JSON object within the request payload. For example:
+
+```json
+{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"}
+```
+
+Generally, the request body fields are the same between the `sql` and `sql/statements` endpoints.
+
+There are additional context parameters for `sql/statements` specifically:
+
+ - `executionMode` (required) determines how query results are fetched. Set this to `ASYNC`.
+ - `selectDestination` (optional), when set to `durableStorage`, instructs Druid to write the results from SELECT queries to durable storage. Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md).
+
+The following sample query includes the two additional context parameters that querying from deep storage supports:
+
+```
+curl --location 'http://localhost:8888/druid/v2/sql/statements' \
+--header 'Content-Type: application/json' \
+--data '{
+    "query":"SELECT * FROM \"YOUR_DATASOURCE\" where \"__time\" > TIMESTAMP'\''2017-09-01'\'' and \"__time\" <= TIMESTAMP'\''2017-09-02'\''",
+    "context":{
+        "executionMode":"ASYNC",
+        "selectDestination": "durableStorage"
+    }
+}'
+```
+
+The response for submitting a query includes the query ID along with basic information, such as when you submitted the query and the schema of the results:
+
+```json
+{
+  "queryId": "query-ALPHANUMERIC-STRING",
+  "state": "ACCEPTED",
+  "createdAt": CREATION_TIMESTAMP,
+  "schema": [
+    {
+      "name": COLUMN_NAME,
+      "type": COLUMN_TYPE,
+      "nativeType": COLUMN_TYPE
+    },
+    ...
+  ],
+  "durationMs": DURATION_IN_MS
+}
+```
+
+### Get query status
+
+You can check the status of a query with the following API call:
+
+```
+GET https://ROUTER:8888/druid/v2/sql/statements/QUERYID
+```
+
+The call returns the status of the query, such as `ACCEPTED` or `RUNNING`. Before you attempt to get results, make sure the state is `SUCCESS`.
+
+When you check the status of a successful query, it includes useful information about your query results, including a sample record and information about how the results are organized by `pages`. The information for each page includes the following:
+
+- `numRows`: the number of rows in that page of results
+- `sizeInBytes`: the size of the page
+- `id`: the indexed page number that you can use to reference a specific page when you get query results
+
+You can use `page` as a parameter to refine the results you retrieve.
+
+The following snippet shows the structure of the `result` object:
+
+```json
+{
+    ...
+    "result": {
+        "numTotalRows": INTEGER,
+        "totalSizeInBytes": INTEGER,
+        "dataSource": "__query_select",
+        "sampleRecords": [
+            [
+                RECORD_1,
+                RECORD_2,
+                ...
+ ] + ], + "pages": [ + { + "numRows": INTEGER, + "sizeInBytes": INTEGER, + "id": INTEGER_PAGE_NUMBER + } + ... + ] +} +} +``` + +### Get query results + +Only the user who submitted a query can retrieve the results for the query. + +Use the following endpoint to retrieve results: + +``` +GET https://ROUTER:8888/druid/v2/sql/statements/QUERYID/results?page=PAGENUMBER&size=RESULT_SIZE&timeout=TIMEOUT_MS +``` + +Results are returned in JSON format. + +You can use the optional `page`, `size`, and `timeout` parameters to refine your results. You can retrieve the `page` information for your results by fetching the status of the completed query. + +When you try to get results for a query from deep storage, you may receive an error that states the query is still running. Wait until the query completes before you try again. + +## Further reading + +* [Query from deep storage tutorial](../tutorials/tutorial-query-deep-storage.md) +* [Query from deep storage API reference](../api-reference/sql-api.md#query-from-deep-storage) diff --git a/docs/tutorials/tutorial-query-deep-storage.md b/docs/tutorials/tutorial-query-deep-storage.md new file mode 100644 index 000000000000..5502ad94228d --- /dev/null +++ b/docs/tutorials/tutorial-query-deep-storage.md @@ -0,0 +1,293 @@ +--- +id: tutorial-query-deep-storage +title: "Tutorial: Query from deep storage" +sidebar_label: "Query from deep storage" +--- + + + + +> Query from deep storage is an [experimental feature](../development/experimental.md). + +Query from deep storage allows you to query segments that are stored only in deep storage, which provides lower costs than if you were to load everything onto Historical processes. The tradeoff is that queries from deep storage may take longer to complete. + +This tutorial walks you through loading example data, configuring load rules so that not all the segments get loaded onto Historical processes, and querying data from deep storage. + +To run the queries in this tutorial, replace `ROUTER:PORT` with the location of the Router process and its port number. For example, use `localhost:8888` for the quickstart deployment. + +For more general information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +## Load example data + +Use the **Load data** wizard or the following SQL query to ingest the `wikipedia` sample datasource bundled with Druid. If you use the wizard, make sure you change the partitioning to be by hour. + +Partitioning by hour provides more segment granularity, so you can selectively load segments onto Historicals or keep them in deep storage. + +
Show the query + +```sql +REPLACE INTO "wikipedia" OVERWRITE ALL +WITH "ext" AS (SELECT * +FROM TABLE( + EXTERN( + '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}', + '{"type":"json"}' + ) +) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)) +SELECT + TIME_PARSE("timestamp") AS "__time", + "isRobot", + "channel", + "flags", + "isUnpatrolled", + "page", + "diffUrl", + "added", + "comment", + "commentLength", + "isNew", + "isMinor", + "delta", + "isAnonymous", + "user", + "deltaBucket", + "deleted", + "namespace", + "cityName", + "countryName", + "regionIsoCode", + "metroCode", + "countryIsoCode", + "regionName" +FROM "ext" +PARTITIONED BY HOUR +``` + +
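If you want to confirm that the ingestion produced hour-granularity segments before you move on, a query along these lines against the `sys.segments` system table can help. This is an optional check rather than a required tutorial step:

```sql
-- Optional check: list the wikipedia segments and their time ranges.
SELECT "segment_id", "start", "end", "num_rows"
FROM sys.segments
WHERE "datasource" = 'wikipedia'
ORDER BY "start"
```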
+ +## Configure a load rule + +The load rule configures Druid to keep any segments that fall within the following interval only in deep storage: + +``` +2016-06-27T00:00:00.000Z/2016-06-27T02:59:00.000Z +``` + +The JSON form of the rule is as follows: + +```json +[ + { + "interval": "2016-06-27T00:00:00.000Z/2016-06-27T02:59:00.000Z", + "tieredReplicants": {}, + "useDefaultTierForNull": false, + "type": "loadByInterval" + } +] +``` + +The rest of the segments use the default load rules for the cluster. For the quickstart, that means all the other segments get loaded onto Historical processes. + +You can configure the load rules through the API or the Druid console. To configure the load rules through the Druid console, go to **Datasources > ... in the Actions column > Edit retention rules**. Then, paste the provided JSON into the JSON tab: + +![](../assets/tutorial-query-deepstorage-retention-rule.png) + + +### Verify the replication factor + +Segments that are only available from deep storage have a `replication_factor` of 0 in the Druid system table. You can verify that your load rule worked as intended using the following query: + +```sql +SELECT "segment_id", "replication_factor", "num_replicas" FROM sys."segments" WHERE datasource = 'wikipedia' +``` + +You can also verify it through the Druid console by checking the **Replication factor** column in the **Segments** view. + +Note that the number of replicas and replication factor may differ temporarily as Druid processes your retention rules. + +## Query from deep storage + +Now that there are segments that are only available from deep storage, run the following query: + +```sql +SELECT page FROM wikipedia WHERE __time < TIMESTAMP'2016-06-27 00:10:00' LIMIT 10 +``` + +With the context parameter: + +```json +"executionMode": "ASYNC" +``` + +For example, run the following curl command: + +``` +curl --location 'http://localhost:8888/druid/v2/sql/statements' \ +--header 'Content-Type: application/json' \ +--data '{ + "query":"SELECT page FROM wikipedia WHERE __time < TIMESTAMP'\''2016-06-27 00:10:00'\'' LIMIT 10", + "context":{ + "executionMode":"ASYNC" + } +}' +``` + +This query looks for records with timestamps that precede `00:10:00`. Based on the load rule you configured earlier, this data is only available from deep storage. + +When you submit the query from deep storage through the API, you get the following response: + +
Show the response</summary>
+
+```json
+{
+    "queryId": "query-6888b6f6-e597-456c-9004-222b05b97051",
+    "state": "ACCEPTED",
+    "createdAt": "2023-07-28T21:59:02.334Z",
+    "schema": [
+        {
+            "name": "page",
+            "type": "VARCHAR",
+            "nativeType": "STRING"
+        }
+    ],
+    "durationMs": -1
+}
+```
+
+Make sure you note the `queryId`. You'll need it to interact with the query.
+
+</details>
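If you're working from the command line, it can be convenient to capture the query ID directly instead of copying it by hand. The following sketch is not part of the original tutorial steps; it assumes `jq` is installed and the quickstart Router is at `localhost:8888`:

```
# Submit the query and capture the query ID for the follow-up status and results calls.
QUERY_ID=$(curl -s --location 'http://localhost:8888/druid/v2/sql/statements' \
  --header 'Content-Type: application/json' \
  --data '{
    "query":"SELECT page FROM wikipedia WHERE __time < TIMESTAMP'\''2016-06-27 00:10:00'\'' LIMIT 10",
    "context":{
      "executionMode":"ASYNC"
    }
  }' | jq -r '.queryId')

echo "$QUERY_ID"
```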
+
+Compare this to submitting the same query to Druid SQL's regular endpoint, `POST /sql`:
+
+```
+curl --location 'http://localhost:8888/druid/v2/sql/' \
+--header 'Content-Type: application/json' \
+--data '{
+    "query":"SELECT page FROM wikipedia WHERE __time < TIMESTAMP'\''2016-06-27 00:10:00'\'' LIMIT 10",
+    "context":{
+        "executionMode":"ASYNC"
+    }
+}'
+```
+
+The response is empty because there are no records on the Historicals that match the query.
+
+## Get query status
+
+Replace `:queryId` with the ID for your query and run the following curl command to get your query status:
+
+```
+curl --location --request GET 'http://localhost:8888/druid/v2/sql/statements/:queryId' \
+--header 'Content-Type: application/json'
+```
+
+
+### Response for a running query
+
+The response for a running query is the same as the response from when you submitted the query, except the `state` is `RUNNING` instead of `ACCEPTED`.
+
+### Response for a completed query
+
+A successful query also returns a `pages` object that includes the page numbers (`id`), rows per page (`numRows`), and the size of the page (`sizeInBytes`). You can pass the page number as a parameter when you get results to control which portion of the results you retrieve.
+
+Note that `sampleRecords` has been truncated for brevity.
+
+<details><summary>
Show the response + +```json +{ + "queryId": "query-6888b6f6-e597-456c-9004-222b05b97051", + "state": "SUCCESS", + "createdAt": "2023-07-28T21:59:02.334Z", + "schema": [ + { + "name": "page", + "type": "VARCHAR", + "nativeType": "STRING" + } + ], + "durationMs": 87351, + "result": { + "numTotalRows": 152, + "totalSizeInBytes": 9036, + "dataSource": "__query_select", + "sampleRecords": [ + [ + "Salo Toraut" + ], + [ + "利用者:ワーナー成増/放送ウーマン賞" + ], + [ + "Bailando 2015" + ], + ... + ... + ... + ], + "pages": [ + { + "id": 0, + "numRows": 152, + "sizeInBytes": 9036 + } + ] + } +} +``` + +
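Before fetching results, you can also poll the status endpoint until the query finishes. This sketch is not part of the original tutorial; it assumes `jq` is installed, the quickstart Router is at `localhost:8888`, and `QUERY_ID` holds your query ID. The sleep interval is arbitrary.

```
# Poll the status endpoint until the query reports SUCCESS.
while [ "$(curl -s "http://localhost:8888/druid/v2/sql/statements/$QUERY_ID" | jq -r '.state')" != "SUCCESS" ]; do
  sleep 5
done
echo "Query $QUERY_ID finished"
```

A production script would also check for a `FAILED` state so the loop can exit on errors.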
+
+## Get query results
+
+Replace `:queryId` with the ID for your query and run the following curl command to get your query results:
+
+```
+curl --location 'http://ROUTER:PORT/druid/v2/sql/statements/:queryId/results'
+```
+
+Note that the response has been truncated for brevity.
+
+<details><summary>
Show the response + +```json +[ + { + "page": "Salo Toraut" + }, + { + "page": "利用者:ワーナー成増/放送ウーマン賞" + }, + { + "page": "Bailando 2015" + }, + ... + ... + ... +] +``` + +
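If your result set spans multiple pages, you can fetch one page at a time using the `page` parameter described in the status response and the API reference. For example, to fetch the first page:

```
curl --location 'http://ROUTER:PORT/druid/v2/sql/statements/:queryId/results?page=0'
```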
+ +## Further reading + +* [Query from deep storage](../querying/query-from-deep-storage.md) +* [Query from deep storage API reference](../api-reference/sql-api.md#query-from-deep-storage) \ No newline at end of file diff --git a/website/.spelling b/website/.spelling index 5b12d9b3eaf4..3ea9178552a1 100644 --- a/website/.spelling +++ b/website/.spelling @@ -504,6 +504,7 @@ supervisorTaskId SVG symlink syntaxes +TabItem tiering timeseries Timeseries diff --git a/website/sidebars.json b/website/sidebars.json index 458d2bfe033d..0dbb95a44b3f 100644 --- a/website/sidebars.json +++ b/website/sidebars.json @@ -24,6 +24,7 @@ "tutorials/tutorial-kerberos-hadoop", "tutorials/tutorial-sql-query-view", "tutorials/tutorial-unnest-arrays", + "tutorials/tutorial-query-deep-storage", "tutorials/tutorial-jupyter-index", "tutorials/tutorial-jupyter-docker", "tutorials/tutorial-jdbc" @@ -98,6 +99,7 @@ "label": "Druid SQL", "ids": [ "querying/sql", + "querying/query-deep-storage", "querying/sql-data-types", "querying/sql-operators", "querying/sql-scalar", @@ -200,6 +202,7 @@ "Operations": [ "operations/web-console", "operations/java", + "operations/durable-storage", { "type": "subcategory", "label": "Security",