document expression aggregator #14497

Merged
merged 12 commits into from
Aug 8, 2023
148 changes: 115 additions & 33 deletions docs/querying/aggregations.md
@@ -310,39 +310,6 @@ Returns any value including null. This aggregator can simplify and optimize the
}
```

### JavaScript aggregator

Computes an arbitrary JavaScript function over a set of columns (both metrics and dimensions are allowed). Your
JavaScript functions are expected to return floating-point values.

```json
{ "type": "javascript",
"name": "<output_name>",
"fieldNames" : [ <column1>, <column2>, ... ],
"fnAggregate" : "function(current, column1, column2, ...) {
<updates partial aggregate (current) based on the current row values>
return <updated partial aggregate>
}",
"fnCombine" : "function(partialA, partialB) { return <combined partial results>; }",
"fnReset" : "function() { return <initial value>; }"
}
```

**Example**

```json
{
"type": "javascript",
"name": "sum(log(x)*y) + 10",
"fieldNames": ["x", "y"],
"fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
"fnCombine" : "function(partialA, partialB) { return partialA + partialB; }",
"fnReset" : "function() { return 10; }"
}
```

> JavaScript-based functionality is disabled by default. Please refer to the Druid [JavaScript programming guide](../development/javascript.md) for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

<a name="approx"></a>

## Approximate aggregations
@@ -422,6 +389,121 @@ It is not possible to determine a priori how well this aggregator will behave fo

For these reasons, we have deprecated this aggregator and recommend using the DataSketches Quantiles aggregator instead for new and existing use cases, although we will continue to support Approximate Histogram for backwards compatibility.


## Expression aggregators
Contributor

@ektravel ektravel Jun 29, 2023

Do we need to list both the expression aggregator and the JavaScript aggregator under Expression aggregators? Can we remove line 393 and change line 395 to H2 (## Expression aggregator)? If so, you can add an H3 section called Examples and list all of the expression aggregator examples there.
For example:

## Expression aggregator
### Examples
## JavaScript aggregator

Member Author

I'm just wondering, is the JavaScript agg worth its own section? It's not enabled by default due to security issues.

Contributor

What about "Expression aggregations" for H2 and then "Expression aggregator" and "JavaScript aggregator"? So something like this:

H2 Expression aggregations
H3 Expression aggregator
H4 Examples
H3 JavaScript aggregator
H4 Example


### Expression aggregator
Member Author

i considered both this aggregator and the javascript aggregator as free form "expression" aggregators since you can just write whatever functions you want, which is why they are both under the category. can you think of a better category name than "Expression aggregators"? The native expression aggregator and javascript aggregator are totally separate, just similar in spirit...

Contributor

What about "Expression aggregations"? For example:
H2 Expression aggregations
H3 Expression aggregator
H4 Examples
H3 JavaScript aggregator
H4 Example


Query time only aggregator that can aggregate results using [Druid expressions](./math-expr.md) functions to facilitate building custom functions.
Contributor

I think it would be helpful to show the expression aggregator here. Similar to what you have for the JavaScript aggregator on line 478.

Member Author

i don't really understand, there are several real examples of the expression aggregator after the table, and the table seems a more suitable place to explain the syntax rather than a pseudo json example like the javascript aggregator uses...

Member Author

@clintropolis clintropolis Jun 29, 2023

I think it would probably be better if all of the aggregators on this page had a table to explain all of the parameters and then saved the json for realistic examples, but .. i wasn't very motivated to fix the whole page 😅

Contributor

Works for me. Fixing the whole page is quite an undertaking :)

Member Author

eh, i think maybe i will just try to fix the whole page once i get around to fixing up this PR based on review, seems like it won't be that bad

Contributor

Suggested change
Query time only aggregator that can aggregate results using [Druid expressions](./math-expr.md) functions to facilitate building custom functions.
Aggregator applicable only at query time. Aggregates results using [Druid expressions](./math-expr.md) to facilitate building custom functions.


| property | description | required |
Contributor

@ektravel ektravel Jun 29, 2023

Suggested change
| property | description | required |
| Property | Description | Required |

Contributor

The rows of the Required column should say Yes or No instead of true or false.
For example:
| type | Must be expression. | Yes |

| --- | --- | --- |
| `type` | must be `expression` | true |
Contributor

Suggested change
| `type` | must be `expression` | true |
| `type` | Must be `expression`. | true |

| `name` | aggregator output name | true |
Contributor

Suggested change
| `name` | aggregator output name | true |
| `name` | The aggregator output name. | true |

| `fields` | list of aggregator input columns | true |
Contributor

Suggested change
| `fields` | list of aggregator input columns | true |
| `fields` | The list of aggregator input columns. | true |

| `accumulatorIdentifier` | variable which identifies the accumulator value in the `fold` and `combine` expressions | false (default `__acc`)|
Contributor

Suggested change
| `accumulatorIdentifier` | variable which identifies the accumulator value in the `fold` and `combine` expressions | false (default `__acc`)|
| `accumulatorIdentifier` | The variable which identifies the accumulator value in the `fold` and `combine` expressions. | false (default `__acc`)|

| `fold` | expression to accumulate values from `fields`. The result of the expression will be stored in `accumulatorIdentifier` and available to the next computation. | true |
Contributor

Suggested change
| `fold` | expression to accumulate values from `fields`. The result of the expression will be stored in `accumulatorIdentifier` and available to the next computation. | true |
| `fold` | The expression to accumulate values from `fields`. The result of the expression is stored in `accumulatorIdentifier` and available to the next computation. | true |

| `combine` | expression to combine the results of various `fold` expressions of each segment when merging results. If not defined and `fold` has a single input column in `fields`, then the `fold` expression may be used, otherwise the input is available to the expression as the `name`| false (default to `fold` expression if and only if the expression has a single input in `fields`)|
Contributor

Suggested change
| `combine` | expression to combine the results of various `fold` expressions of each segment when merging results. If not defined and `fold` has a single input column in `fields`, then the `fold` expression may be used, otherwise the input is available to the expression as the `name`| false (default to `fold` expression if and only if the expression has a single input in `fields`)|
| `combine` | The expression to combine the results of various `fold` expressions of each segment when merging results. You can use the `fold` expression if `combine` is not defined and `fold` has a single input column in `fields`. Otherwise, the input is available to the expression as `name`.| false (Defaults to `fold` expression if the expression has a single input in `fields`.)|

| `compare` | comparator expression which can only refer to 2 input variables, `o1` and `o2`, where `o1` and `o2` are the output of `fold` or `combine` expressions, and must adhere to the Java comparator contract. If not set, this will try to fall back to an output type appropriate comparator | false |
Contributor

Suggested change
| `compare` | comparator expression which can only refer to 2 input variables, `o1` and `o2`, where `o1` and `o2` are the output of `fold` or `combine` expressions, and must adhere to the Java comparator contract. If not set, this will try to fall back to an output type appropriate comparator | false |
| `compare` | The comparator expression which can only refer to two input variables, `o1` and `o2`. `o1` and `o2` represent the output of `fold` or `combine` expressions and must adhere to the Java comparator contract. If not set, `compare` will try to fall back to an output type appropriate comparator. | false |

What do you mean by "try to fall back"? What happens if it doesn't fall back?

| `finalize` | finalize expression which can only refer to a single input variable, `o`, and is used to perform any final transformation of the output of `fold` or `combine` expressions. If not set, then the value is not transformed | false |
Contributor

Suggested change
| `finalize` | finalize expression which can only refer to a single input variable, `o`, and is used to perform any final transformation of the output of `fold` or `combine` expressions. If not set, then the value is not transformed | false |
| `finalize` | The finalize expression which can only refer to a single input variable, `o`. You use `finalize` to perform final transformation of the output of `fold` or `combine` expressions. If not set, the value is not transformed. | false |

| `initialValue` | initial value of the accumulator for `fold` (and `combine`, if `InitialCombineValue` is null) expression | true |
Contributor

Suggested change
| `initialValue` | initial value of the accumulator for `fold` (and `combine`, if `InitialCombineValue` is null) expression | true |
| `initialValue` | The initial value of the accumulator for `fold` (and `combine`, if `InitialCombineValue` is null) expression. | true |

Contributor

Suggested change
| `initialValue` | initial value of the accumulator for `fold` (and `combine`, if `InitialCombineValue` is null) expression | true |
| `initialValue` | initial value of the accumulator for the `fold` (and `combine`, if `InitialCombineValue` is null) expression | true |

| `initialCombineValue` | initial value of the accumulator for `combine` expression | false (default `initialValue`) |
Contributor

Suggested change
| `initialCombineValue` | initial value of the accumulator for `combine` expression | false (default `initialValue`) |
| `initialCombineValue` | The initial value of the accumulator for the `combine` expression. | false (defaults to `initialValue`) |

| `isNullUnlessAggregated` | indicates that the default output value should be `null` if the aggregator does not process any rows. If true, the value is `null`, if false, the result of running the expressions with initial values is used instead. | false (defaults to value of `druid.generic.useDefaultValueForNull`)|
Contributor

Suggested change
| `isNullUnlessAggregated` | indicates that the default output value should be `null` if the aggregator does not process any rows. If true, the value is `null`, if false, the result of running the expressions with initial values is used instead. | false (defaults to value of `druid.generic.useDefaultValueForNull`)|
| `isNullUnlessAggregated` | Indicates that the default output value should be `null` if the aggregator does not process any rows. If true, the value is `null`, if false, the result of running the expressions with initial values is used instead. | false (defaults to the value of `druid.generic.useDefaultValueForNull`)|

| `shouldAggregateNullInputs` | indicates if the `fold` expression should operate on any `null` input values | false (default value is `true`) |
Contributor

@ektravel ektravel Jun 29, 2023

Suggested change
| `shouldAggregateNullInputs` | indicates if the `fold` expression should operate on any `null` input values | false (default value is `true`) |
| `shouldAggregateNullInputs` | Indicates that the `fold` expression should operate on any `null` input values. | false (defaults to `true`) |

| `shouldCombineAggregateNullInputs` | indicates if the `combine` expression should operate on any `null` input values | false (default value is `shouldAggregateNullInputs`) |
Contributor

@ektravel ektravel Jun 29, 2023

Suggested change
| `shouldCombineAggregateNullInputs` | indicates if the `combine` expression should operate on any `null` input values | false (default value is `shouldAggregateNullInputs`) |
| `shouldCombineAggregateNullInputs` | Indicates if the `combine` expression should operate on any `null` input values. | false (defaults to the value of `shouldAggregateNullInputs`) |

| `maxSizeBytes` | maximum size in bytes that variably sized aggregator output types such as strings and arrays are allowed to grow before the aggregation will fail. | false (8192 bytes) |
Contributor

Suggested change
| `maxSizeBytes` | maximum size in bytes that variably sized aggregator output types such as strings and arrays are allowed to grow before the aggregation will fail. | false (8192 bytes) |
| `maxSizeBytes` | The maximum size in bytes that variably sized aggregator output types such as strings and arrays are allowed to grow to before the aggregation fails. | false (8192 bytes) |


#### Example: a "count" aggregator
The initial value is `0` and adds `1` for each row processed.
Contributor

Suggested change
The initial value is `0` and adds `1` for each row processed.
The initial value is `0`. `fold` adds `1` for each row processed.


```json
{
"type": "expression",
"name": "expression_count",
"fields": [],
"initialValue": "0",
"fold": "__acc + 1",
"combine": "__acc + expression_count"
}
```
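
To see why `combine` in the count example adds the partial result instead of `1`, here is a minimal Python sketch of the fold-then-combine flow. It is a simulation only, not Druid code: the two "segments", the helper names `fold_count` and `combine_count`, and the use of `functools.reduce` are illustrative assumptions.

```python
from functools import reduce

def fold_count(acc, _row):
    # Mirrors "fold": "__acc + 1" (the row itself is ignored).
    return acc + 1

def combine_count(acc, partial):
    # Mirrors "combine": "__acc + expression_count",
    # where expression_count is a per-segment partial result.
    return acc + partial

# Two hypothetical segments holding 3 and 2 rows.
segments = [["a", "b", "c"], ["d", "e"]]
initial_value = 0

# Fold each segment on its own, then merge the partial counts.
partials = [reduce(fold_count, rows, initial_value) for rows in segments]
total = reduce(combine_count, partials, initial_value)

# Reusing the fold logic as combine would count partials, not rows.
wrong_total = reduce(fold_count, partials, initial_value)

print(partials, total, wrong_total)
```

In this sketch, merging with the fold logic yields the number of partial results rather than the number of rows, which is why the count example spells out a separate `combine`.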

#### Example: a "sum" aggregator
The initial value is `0`, adds the numeric value `column_a` for each row processed.
Contributor

Suggested change
The initial value is `0`, adds the numeric value `column_a` for each row processed.
The initial value is `0`. `fold` adds the numeric value `column_a` for each row processed.


```json
{
"type": "expression",
"name": "expression_sum",
"fields": ["column_a"],
"initialValue": "0",
"fold": "__acc + column_a"
}
```
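
The sum example omits `combine`, relying on the table's rule that `combine` defaults to `fold` when `fields` has a single input. A rough Python sketch of that default follows. It assumes the per-segment partial result is passed to `fold` in place of `column_a`; how Druid binds the value internally is not shown in this PR, and the segment data and function names are made up for illustration.

```python
from functools import reduce

def fold_sum(acc, column_a):
    # Mirrors "fold": "__acc + column_a".
    return acc + column_a

# Assumption: with a single input field and no explicit combine,
# the fold logic is reused to merge partial results.
combine_sum = fold_sum

# Two hypothetical segments of numeric column_a values.
segments = [[1, 2, 3], [10]]

partials = [reduce(fold_sum, rows, 0) for rows in segments]
total = reduce(combine_sum, partials, 0)

print(partials, total)
```

Reusing `fold` works here because addition treats a raw value and a partial sum the same way, which is not true of the count example.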

#### Example: a "distinct array element" aggregator, sorted by array_length
The initial value is an empty array, `fold` adds the elements of `column_a` to the accumulator using set semantics, `combine` merges the sets, and `compare` orders the values by `array_length`.
Contributor

Suggested change
The initial value is an empty array, `fold` adds the elements of `column_a` to the accumulator using set semantics, `combine` merges the sets, and `compare` orders the values by `array_length`.
The initial value is an empty array. `fold` adds the elements of `column_a` to the accumulator using set semantics, `combine` merges the sets, and `compare` orders the values by `array_length`.


```json
{
"type": "expression",
"name": "expression_array_agg_distinct",
"fields": ["column_a"],
"initialValue": "[]",
"fold": "array_set_add(__acc, column_a)",
"combine": "array_set_add_all(__acc, expression_array_agg_distinct)",
"compare": "if(array_length(o1) > array_length(o2), 1, if (array_length(o1) == array_length(o2), 0, -1))"
}
```
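
The `compare` expression above follows the usual comparator contract: positive, zero, or negative depending on array length. The Python sketch below imitates the whole example under stated assumptions. The functions named `array_set_add` and `array_set_add_all` are plain-list stand-ins written for this illustration (they are not Druid's implementations, and their element ordering is an assumption), and the sample data is invented.

```python
from functools import cmp_to_key, reduce

def array_set_add(acc, value):
    # Stand-in: add value only if it is not already present.
    return acc if value in acc else acc + [value]

def array_set_add_all(acc, other):
    # Stand-in: merge another distinct-array into the accumulator.
    return reduce(array_set_add, other, acc)

def compare(o1, o2):
    # Mirrors the nested if(): 1, 0, or -1 by array length.
    if len(o1) > len(o2):
        return 1
    if len(o1) == len(o2):
        return 0
    return -1

# Two hypothetical segments of column_a values.
segments = [["x", "y", "x"], ["y", "z"]]

partials = [reduce(array_set_add, rows, []) for rows in segments]
merged = reduce(array_set_add_all, partials, [])

# Order a few aggregated results the way compare would.
ordered = sorted([merged, ["q"], partials[0]], key=cmp_to_key(compare))

print(partials, merged, ordered)
```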

#### Example: an "approximate count" aggregator using the built-in hyper-unique
Similar to the 'cardinality' aggregator, the default value is an empty hyper-unique sketch, `fold` adds the value of `column_a` to the sketch, `combine` merges the sketches, and `finalize` gets the estimated count from the accumulated sketch.
Contributor

Suggested change
Similar to the 'cardinality' aggregator, the default value is an empty hyper-unique sketch, `fold` adds the value of `column_a` to the sketch, `combine` merges the sketches, and `finalize` gets the estimated count from the accumulated sketch.
Similar to the cardinality aggregator, the default value is an empty hyper-unique sketch, `fold` adds the value of `column_a` to the sketch, `combine` merges the sketches, and `finalize` gets the estimated count from the accumulated sketch.


```json
{
"type": "expression",
"name": "expression_cardinality",
"fields": ["column_a"],
"initialValue": "hyper_unique()",
"fold": "hyper_unique_add(column_a, __acc)",
"combine": "hyper_unique_add(expression_cardinality, __acc)",
"finalize": "hyper_unique_estimate(o)"
}
```
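
The role of `finalize` is easier to see in a small simulation. The Python sketch below swaps the hyper-unique sketch for an exact `frozenset`, so it counts exactly rather than approximately. The three functions reuse the expression names from the example only as labels; they are stand-ins written for this illustration, and the segment data is hypothetical. Note the argument order copied from the example: the value comes first and the accumulator second.

```python
from functools import reduce

def hyper_unique():
    # Stand-in for an empty sketch: an exact, immutable set.
    return frozenset()

def hyper_unique_add(value, acc):
    # Stand-in: value may be a single item (fold) or another
    # stand-in sketch (combine).
    if isinstance(value, frozenset):
        return acc | value
    return acc | {value}

def hyper_unique_estimate(o):
    # Stand-in for finalize: exact size instead of an estimate.
    return len(o)

# Two hypothetical segments of column_a values.
segments = [["a", "b", "a"], ["b", "c"]]

partials = [
    reduce(lambda acc, v: hyper_unique_add(v, acc), rows, hyper_unique())
    for rows in segments
]
merged = reduce(lambda acc, p: hyper_unique_add(p, acc), partials, hyper_unique())
estimate = hyper_unique_estimate(merged)

print(estimate)
```

The accumulator stays a set-like object through `fold` and `combine`; only `finalize` turns it into a number.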

### JavaScript aggregator

Computes an arbitrary JavaScript function over a set of columns (both metrics and dimensions are allowed). Your
JavaScript functions are expected to return floating-point values.

```json
{ "type": "javascript",
"name": "<output_name>",
"fieldNames" : [ <column1>, <column2>, ... ],
"fnAggregate" : "function(current, column1, column2, ...) {
<updates partial aggregate (current) based on the current row values>
return <updated partial aggregate>
}",
"fnCombine" : "function(partialA, partialB) { return <combined partial results>; }",
"fnReset" : "function() { return <initial value>; }"
}
```

**Example**
Contributor

Suggested change
**Example**
#### Example


```json
{
"type": "javascript",
"name": "sum(log(x)*y) + 10",
"fieldNames": ["x", "y"],
"fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
"fnCombine" : "function(partialA, partialB) { return partialA + partialB; }",
"fnReset" : "function() { return 10; }"
}
```
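
For readers who want to check the arithmetic of this example, here is a Python translation of `fnReset` and `fnAggregate` applied to a single partial aggregate. It is only a sketch: the row values are invented, `fnCombine` is not exercised, and nothing here reflects how Druid schedules these functions.

```python
import math
from functools import reduce

def fn_reset():
    # Mirrors fnReset: the initial value of a partial aggregate.
    return 10.0

def fn_aggregate(current, a, b):
    # Mirrors fnAggregate: current + (Math.log(a) * b).
    return current + math.log(a) * b

# Hypothetical (x, y) rows.
rows = [(1.0, 5.0), (math.e, 2.0)]

# log(1) * 5 contributes 0; log(e) * 2 contributes 2.
result = reduce(lambda cur, row: fn_aggregate(cur, *row), rows, fn_reset())

print(result)
```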

> JavaScript-based functionality is disabled by default. Please refer to the Druid [JavaScript programming guide](../development/javascript.md) for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.
Contributor

Suggested change
> JavaScript-based functionality is disabled by default. Please refer to the Druid [JavaScript programming guide](../development/javascript.md) for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.
> JavaScript-based functionality is disabled by default. Refer to the Druid [JavaScript programming guide](../development/javascript.md) for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

Member Author

this isn't new, i just moved it, but i can adjust it...



## Miscellaneous aggregations

### Filtered aggregator