Introduce contract to know whether InternalAggs have computed results or not #34903

Closed
costin opened this issue Oct 26, 2018 · 6 comments

Comments

@costin
Member

costin commented Oct 26, 2018

It's a common concern for aggregations to not compute any results (empty bucket or null values) yet there is no general way to discover that when interacting with them. One needs to look into their internals but even then there are aggs where it's ambiguous whether any computation has happened or if it's just the default value (as aggs use primitives, there's no null).
Take SUM and MAX: if there's nothing to compute, SUM returns 0 while MAX returns Double.NEGATIVE_INFINITY. The infinity can serve as a sentinel, but 0 is a valid sum, so the caller has no idea whether it means null or an actual 0.

See #34896

@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@colings86
Contributor

Whilst I agree that this is something we should think about solving in a general way, especially as it has come up before when dealing with pipeline aggregations, I am wondering if, in the short term, #34896 could be solved by looking at the doc_count of the parent aggregation bucket (or total.hits if the agg is already a root aggregation) since this would determine with accuracy if the aggregation received any data?

One other thing to note here is that the Sum of no values is 0 (and not null) so I think it's right that the sum outputs 0 for this case.
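As a rough illustration of the doc_count check suggested in this comment, here is a minimal Java sketch. `BucketView` and `sumOrNull` are hypothetical names invented for the example and are not Elasticsearch classes; the sketch only assumes the consumer can read the parent bucket's document count next to the sum.

```java
// Hypothetical stand-in for a parent bucket plus its sum sub-aggregation result.
// Not an Elasticsearch class; it only models the two numbers the check needs.
final class BucketView {
    private final long docCount; // documents that fell into the parent bucket
    private final double sum;    // primitive result of the sum sub-aggregation

    BucketView(long docCount, double sum) {
        this.docCount = docCount;
        this.sum = sum;
    }

    /** Returns the sum, or null when the parent bucket matched no documents. */
    Double sumOrNull() {
        if (docCount == 0) {
            return null; // no documents reached the aggregation
        }
        return Double.valueOf(sum);
    }
}
```

For example, `new BucketView(0, 0.0).sumOrNull()` yields null, while `new BucketView(3, 0.0).sumOrNull()` yields a real 0.0.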

@astefan
Contributor

astefan commented Oct 26, 2018

@colings86 may I ask why sum has a different behavior than min, max, avg, from the 0/null point of view?

A sum of nulls (no values) for at least one document is 0.
A sum of 0 documents is 0.
A min of nulls (no values) is null, same goes for max. This is different than what sum does.

@colings86
Contributor

It works this way because max and min have no value if there is no data, but sum, cardinality and value_count are, by definition (outside of Elasticsearch, just the actual definition of the metric), initialised to 0, so the result of all three is 0 even if there is no data. max, min, avg and percentiles are all initialised without a value, hence they return null if there is no data.
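The same distinction can be seen outside of Elasticsearch in the JDK's own stream API, which is used here purely as an analogy. In this small sketch the class `EmptyInputDemo` is a hypothetical wrapper; the `DoubleStream` calls are standard library methods.

```java
import java.util.OptionalDouble;
import java.util.stream.DoubleStream;

// Analogy only: how the JDK treats the same metrics over an empty input.
final class EmptyInputDemo {
    // Sum and count have a natural starting value of zero.
    static double emptySum() {
        return DoubleStream.empty().sum();
    }

    static long emptyCount() {
        return DoubleStream.empty().count();
    }

    // Max, min and average have no result without at least one value,
    // so the JDK returns an empty OptionalDouble instead of a number.
    static OptionalDouble emptyMax() {
        return DoubleStream.empty().max();
    }

    static OptionalDouble emptyMin() {
        return DoubleStream.empty().min();
    }

    static OptionalDouble emptyAverage() {
        return DoubleStream.empty().average();
    }
}
```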

@costin
Member Author

costin commented Oct 29, 2018

> could be solved by looking at the doc_count of the parent aggregation bucket (or total.hits if the agg is already a root aggregation) since this would determine with accuracy if the aggregation received any data

It would work but it requires knowledge about the agg structure (its parent) in the consumer. This information can be encapsulated in the agg itself and in a good number of metric aggs it already is (to properly report values as null in doXContentBody).
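To make the idea of encapsulating this in the agg itself concrete, here is a minimal Java sketch of what such a contract might look like. The names `HasValueSketch` and `SumSketch` are hypothetical, and tracking a separate count is an assumption of this example rather than how any Elasticsearch aggregation is implemented.

```java
// Hypothetical contract: the metric itself reports whether it computed anything.
interface HasValueSketch {
    double value();      // primitive result, possibly just the default

    boolean hasValue();  // true only if at least one value was collected
}

// Toy sum that starts at 0.0 but also remembers how many values it has seen.
final class SumSketch implements HasValueSketch {
    private double sum;  // defaults to 0.0, indistinguishable from a real zero
    private long count;  // number of values actually collected

    void collect(double v) {
        sum += v;
        count++;
    }

    @Override
    public double value() {
        return sum;
    }

    @Override
    public boolean hasValue() {
        return count > 0;
    }
}
```

With this shape, a consumer can tell an untouched `SumSketch` (value 0.0, `hasValue()` false) from one that collected a single 0.0 (value 0.0, `hasValue()` true) without knowing anything about the parent bucket.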

@polyfractal
Contributor

A related issue that will need addressing is how this interacts with pipeline aggs, as also discovered by Costin. E.g. when pipelines resolve agg values, they get the primitive internal state from the agg. This means that pipelines like bucket_selector will get a value, but it's unclear what the value actually represents (is it NaN because it's not actually a number, or because the agg had no values?)

I think this can be fixed by providing the eventual mechanism (hasValue() or whatever) through the various pipeline painless contexts for use in the script itself.

Note: this is related to #27377 although somewhat different.
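As a rough sketch of what exposing such a check to a pipeline script could enable, the plain-Java model below passes each bucket's metric to a caller-supplied predicate that plays the role of the script. `MetricView` and `SelectorSketch` are hypothetical names; this is not the bucket_selector implementation or a Painless context, only an illustration of a script that consults `hasValue()` before trusting the primitive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical view of a metric as a script might see it.
final class MetricView {
    private final double value;
    private final boolean hasValue;

    MetricView(double value, boolean hasValue) {
        this.value = value;
        this.hasValue = hasValue;
    }

    double value() {
        return value;
    }

    boolean hasValue() {
        return hasValue;
    }
}

// Toy selector: keeps the buckets for which the "script" returns true.
final class SelectorSketch {
    static List<MetricView> select(List<MetricView> buckets, Predicate<MetricView> script) {
        List<MetricView> kept = new ArrayList<>();
        for (MetricView bucket : buckets) {
            if (script.test(bucket)) {
                kept.add(bucket);
            }
        }
        return kept;
    }
}
```

A script such as `m -> m.hasValue() && m.value() >= 0` would then keep a bucket whose sum is a real 0 while dropping one whose 0 is only the default.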
