explicit outputType for ExpressionPostAggregator, better documentation for the differences between arrays and mvds #15245

Merged 13 commits on Nov 2, 2023
14 changes: 10 additions & 4 deletions docs/multi-stage-query/concepts.md
@@ -88,6 +88,9 @@ When deciding whether to use `REPLACE` or `INSERT`, keep in mind that segments g
with dimension-based pruning but those generated with `INSERT` cannot. For more information about the requirements
for dimension-based pruning, see [Clustering](#clustering).

To insert [ARRAY types](../querying/arrays.md), be sure to set context flag `"arrayIngestMode":"array"` which allows
ARRAY types to be stored in segments. This flag is not enabled by default.

Review comment (Contributor):

Hmm this seems like the wrong place to put this. It's generic docs about INSERT, we don't want to gunk it up with stuff about specific data that might be inserted. (Otherwise this would be, like, 10 times longer.)

I suggest cutting it, and relying on the examples and the array docs to guide people.

For more information about the syntax, see [INSERT](./reference.md#insert).

<a name="replace"></a>
@@ -192,10 +195,13 @@ To perform ingestion with rollup:
2. Set [`finalizeAggregations: false`](reference.md#context-parameters) in your context. This causes aggregation
functions to write their internal state to the generated segments, instead of the finalized end result, and enables
further aggregation at query time.
3. Wrap all multi-value strings in `MV_TO_ARRAY(...)` and set [`groupByEnableMultiValueUnnesting:
false`](reference.md#context-parameters) in your context. This ensures that multi-value strings are left alone and
remain lists, instead of being [automatically unnested](../querying/sql-data-types.md#multi-value-strings) by the
`GROUP BY` operator.
3. To ingest [Druid multi-value dimensions](../querying/multi-value-dimensions.md), wrap all multi-value strings
in `MV_TO_ARRAY(...)` in the grouping clause and set [`groupByEnableMultiValueUnnesting: false`](reference.md#context-parameters) in your context.
This ensures that multi-value strings are left alone and remain lists, instead of being [automatically unnested](../querying/sql-data-types.md#multi-value-strings) by the
`GROUP BY` operator. To INSERT these arrays as multi-value strings, wrap the expressions in the SELECT clause with
`ARRAY_TO_MV` to coerce the ARRAY back to a VARCHAR.

Review comment (@gianm, Oct 25, 2023):

This direction has become too complicated for people to understand, so I think we'll need an example. Or to link to one.

4. To ingest [ARRAY types](../querying/arrays.md), be sure to set context flag `"arrayIngestMode":"array"` which allows
ARRAY types to be stored in segments. This flag is not enabled by default.
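
A minimal sketch of the multi-value pattern described in step 3 may help here. It is not part of the PR: the table names `example_rollup` and `example_source` and the column `tags` are hypothetical, and the context parameters are assumed to be supplied with the query context rather than in the SQL text.

```sql
-- Sketch only: "example_rollup", "example_source", and "tags" are hypothetical names.
-- "tags" is assumed to be a multi-value string column.
-- Assumed query context (set outside the SQL text):
--   "finalizeAggregations": false
--   "groupByEnableMultiValueUnnesting": false
INSERT INTO "example_rollup"
SELECT
  FLOOR(__time TO HOUR) AS __time,
  -- Coerce the grouped ARRAY back to a multi-value string for storage.
  ARRAY_TO_MV(MV_TO_ARRAY("tags")) AS "tags",
  COUNT(*) AS "cnt"
FROM "example_source"
-- Group on the whole list so GROUP BY does not unnest the individual values.
GROUP BY 1, MV_TO_ARRAY("tags")
PARTITIONED BY DAY
```

The same shape appears in the `INSERT with rollup` example changed by this PR, where `"language"` is wrapped in `MV_TO_ARRAY` in the `GROUP BY` clause and in `ARRAY_TO_MV(MV_TO_ARRAY(...))` in the `SELECT` clause.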

When you do all of these things, Druid understands that you intend to do an ingestion with rollup, and it writes
rollup-related metadata into the generated segments. Other applications can then use [`segmentMetadata`
47 changes: 45 additions & 2 deletions docs/multi-stage-query/examples.md
@@ -79,7 +79,7 @@ CLUSTERED BY channel

## INSERT with rollup

This example inserts data into a table named `kttm_data` and performs data rollup. This example implements the recommendations described in [Rollup](./concepts.md#rollup).
This example inserts data into a table named `kttm_rollup` and performs data rollup. The ARRAY inputs are stored in a [multi-value dimension](../querying/multi-value-dimensions.md). This example implements the recommendations described in [Rollup](./concepts.md#rollup).

<details><summary>Show the query</summary>

@@ -102,7 +102,50 @@ SELECT
agent_type,
browser,
browser_version,
MV_TO_ARRAY("language") AS "language", -- Multi-value string dimension
ARRAY_TO_MV(MV_TO_ARRAY("language")) AS "language", -- Multi-value string dimension
os,
city,
country,
forwarded_for AS ip_address,

COUNT(*) AS "cnt",
SUM(session_length) AS session_length,
APPROX_COUNT_DISTINCT_DS_HLL(event_type) AS unique_event_types
FROM kttm_data
WHERE os = 'iOS'
GROUP BY 1, 2, 3, 4, 5, 6, MV_TO_ARRAY("language"), 8, 9, 10, 11
PARTITIONED BY HOUR
CLUSTERED BY browser, session
```
</details>

## INSERT with rollup and ARRAY types

This example inserts data into a table named `kttm_rollup_arrays` and performs data rollup. The ARRAY inputs are stored in an [ARRAY column](../querying/arrays.md). This example also implements the recommendations described in [Rollup](./concepts.md#rollup). Be sure to set context flag `"arrayIngestMode":"array"` which allows
ARRAY types to be stored in segments.

<details><summary>Show the query</summary>

```sql
INSERT INTO "kttm_rollup_arrays"

WITH kttm_data AS (
SELECT * FROM TABLE(
EXTERN(
'{"type":"http","uris":["https://static.imply.io/example-data/kttm-v2/kttm-v2-2019-08-25.json.gz"]}',
'{"type":"json"}',
'[{"name":"timestamp","type":"string"},{"name":"agent_category","type":"string"},{"name":"agent_type","type":"string"},{"name":"browser","type":"string"},{"name":"browser_version","type":"string"},{"name":"city","type":"string"},{"name":"continent","type":"string"},{"name":"country","type":"string"},{"name":"version","type":"string"},{"name":"event_type","type":"string"},{"name":"event_subtype","type":"string"},{"name":"loaded_image","type":"string"},{"name":"adblock_list","type":"string"},{"name":"forwarded_for","type":"string"},{"name":"language","type":"array<string>"},{"name":"number","type":"long"},{"name":"os","type":"string"},{"name":"path","type":"string"},{"name":"platform","type":"string"},{"name":"referrer","type":"string"},{"name":"referrer_host","type":"string"},{"name":"region","type":"string"},{"name":"remote_address","type":"string"},{"name":"screen","type":"string"},{"name":"session","type":"string"},{"name":"session_length","type":"long"},{"name":"timezone","type":"string"},{"name":"timezone_offset","type":"long"},{"name":"window","type":"string"}]'
)
))

SELECT
FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time,
session,
agent_category,
agent_type,
browser,
browser_version,
"language", -- array
os,
city,
country,