Implement stats aggregation for string terms #47468

csoulios · 2019-10-02T19:03:02Z

This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following:

min_length: The length of the shortest term
max_length: The length of the longest term
avg_length: The average length of all terms
distribution: The probability distribution of all characters appearing in all terms
entropy: The total Shannon entropy value calculated for all terms

This aggregation has been implemented as an analytics plugin.
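
For readers unfamiliar with these metrics, here is a minimal standalone sketch of how the five values could be derived from a list of terms. It is not the plugin's aggregator code. `StringStatsSketch` and its fields are hypothetical names, lengths are counted in UTF-16 code units via `String.length()`, entropy is assumed to be base-2 Shannon entropy over the pooled characters of all terms, and the distribution map is ordered by character rather than by frequency.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical standalone sketch; not the aggregator added by this PR. */
final class StringStatsSketch {
    final int minLength;
    final int maxLength;
    final double avgLength;
    final Map<Character, Double> distribution;
    final double entropy;

    private StringStatsSketch(int minLength, int maxLength, double avgLength,
                              Map<Character, Double> distribution, double entropy) {
        this.minLength = minLength;
        this.maxLength = maxLength;
        this.avgLength = avgLength;
        this.distribution = distribution;
        this.entropy = entropy;
    }

    static StringStatsSketch of(List<String> terms) {
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("at least one term is required");
        }
        int min = Integer.MAX_VALUE;
        int max = 0;
        long totalChars = 0;
        Map<Character, Long> counts = new TreeMap<>();
        for (String term : terms) {
            min = Math.min(min, term.length());
            max = Math.max(max, term.length());
            totalChars += term.length();
            // Count every character occurrence across all terms
            for (char c : term.toCharArray()) {
                counts.merge(c, 1L, Long::sum);
            }
        }
        Map<Character, Double> distribution = new TreeMap<>();
        double entropy = 0.0;
        for (Map.Entry<Character, Long> e : counts.entrySet()) {
            // Probability of this character among all characters seen
            double p = (double) e.getValue() / totalChars;
            distribution.put(e.getKey(), p);
            // Assumption: Shannon entropy in bits, H = -sum(p * log2(p))
            entropy -= p * Math.log(p) / Math.log(2);
        }
        double avg = (double) totalChars / terms.size();
        return new StringStatsSketch(min, max, avg, distribution, entropy);
    }

    public static void main(String[] args) {
        StringStatsSketch s = of(List.of("ab", "a"));
        System.out.println(s.minLength + " " + s.maxLength + " " + s.avgLength
            + " " + s.distribution + " " + s.entropy);
    }
}
```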

elasticmachine · 2019-10-02T19:03:04Z

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

StringStatsAggregatorTests#testSingleValuedFieldFormatter fails because of elastic#47469

polyfractal

Left some comments, think it looks good!

Needs some documentation + doc tests. Let me know if you have questions about the doc tests, they are a bit funky :)

polyfractal · 2019-10-07T13:31:22Z

.../analytics/src/main/java/org/elasticsearch/xpack/analytics/AnalyticsAggregationBuilders.java


-public class DataScienceAggregationBuilders {
+public class AnalyticsAggregationBuilders {


Whoops, good catch, thanks for the fix :)

And the embarrassing typo below heh :)

polyfractal · 2019-10-07T13:35:23Z

...alytics/src/main/java/org/elasticsearch/xpack/analytics/stringstats/InternalStringStats.java

+    }
+
+    static class Fields {
+        public static final String COUNT = "count";


Let's use ParseField for these. ParseField has some extra functionality to handle deprecations/renaming, in case we ever decide to change the string values.

polyfractal · 2019-10-07T13:43:43Z

...alytics/src/main/java/org/elasticsearch/xpack/analytics/stringstats/InternalStringStats.java

+            case avg_length: return this.getAvgLength();
+            case entropy: return this.getEntropy();
+            default:
+                throw new IllegalArgumentException("Unknown value [" + name + "] in common stats aggregation");


"common stats" left over from a different agg?

Right, it should read string stats. Fixed.

polyfractal · 2019-10-07T14:00:29Z

...ytics/src/main/java/org/elasticsearch/xpack/analytics/stringstats/StringStatsAggregator.java

+                            // Parse string chars and count occurrences
+                            for (Character c : valueStr.toCharArray()) {
+                                LongArray occ = charOccurrences.get(c);
+                                final long overSize = BigArrays.overSize(bucket + 1);


It's not a terribly expensive call, but we should be able to move this up and out of the valuesCount loop I think? That way we only calculate the bigarrays size once instead of for each character in each value?
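
A minimal sketch of the hoisting suggested here, under simplifying assumptions: `HoistSketch` and its `overSize` method are hypothetical stand-ins (the growth rule is made up and is not the `BigArrays` implementation), plain `long[]` arrays replace `LongArray`, and every array is allocated fresh, so no resizing of existing arrays is modeled. The size depends only on `bucket`, so it can be computed once before iterating over values and characters.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of moving a loop-invariant size computation out of the loops. */
final class HoistSketch {
    // Counts how often the size computation runs, for illustration only
    static int overSizeCalls = 0;

    // Stand-in for BigArrays.overSize; the growth rule below is an assumption
    static long overSize(long minSize) {
        overSizeCalls++;
        return minSize + (minSize >>> 3) + 1;
    }

    static Map<Character, long[]> countChars(String[] values, int bucket) {
        Map<Character, long[]> charOccurrences = new HashMap<>();
        // Hoisted: depends only on the bucket, not on the value or the character
        final int size = Math.toIntExact(overSize(bucket + 1L));
        for (String value : values) {
            for (char c : value.toCharArray()) {
                long[] occ = charOccurrences.computeIfAbsent(c, k -> new long[size]);
                occ[bucket]++;
            }
        }
        return charOccurrences;
    }

    public static void main(String[] args) {
        Map<Character, long[]> result = countChars(new String[] {"ab", "a"}, 2);
        System.out.println(result.get('a')[2] + " occurrences of 'a', "
            + overSizeCalls + " size computation(s)");
    }
}
```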

csoulios · 2019-10-09T12:30:24Z

@elasticmachine run elasticsearch-ci/packaging-sample-matrix

polyfractal

Code LGTM, pending docs. /cc @not-napoleon who I volunteered to review the docs while I'm out 😁

csoulios · 2019-10-23T13:47:33Z

@elasticmachine run elasticsearch-ci/default-distro

csoulios · 2019-10-24T11:58:36Z

@elasticmachine run elasticsearch-ci/default-distro
@elasticmachine run elasticsearch-ci/bwc

not-napoleon

Couple of nits, but looks good to me overall.

not-napoleon · 2019-10-24T13:46:48Z

...alytics/src/main/java/org/elasticsearch/xpack/analytics/stringstats/InternalStringStats.java

+     * @return A map with the character as key and the probability of
+     * this character to occur as value. The map is ordered by frequency descending.
+     */
+    public Map<String, Double> getDistribution() {


Nit: Does this need to be public? Looked to me like it was only called within the package

not-napoleon · 2019-10-24T14:09:41Z

...alytics/src/main/java/org/elasticsearch/xpack/analytics/stringstats/InternalStringStats.java

+
+    public Object value(String name) {
+        Metrics metrics = Metrics.valueOf(name);
+        switch (metrics) {


Bit of a nit, but I don't love switching on an enum. If someone later adds a field to the enum, they need to remember to also update this switch. IMHO, a better solution would be to put a method on the enum, getFieldValue(InternalStringStats stats), which could then call the appropriate getter and return the value. That way, any new enum value would need to implement the method for it to compile.
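
A minimal sketch of the pattern described here, assuming a simplified stand-in for the stats class. `StringStatsValues` is a hypothetical name; the enum constants mirror the lowercase names visible in the switch above, and `getFieldValue` follows the method name proposed in the comment. Because the method is abstract, a newly added constant that omits it fails to compile.

```java
/** Hypothetical stand-in for InternalStringStats, holding only a few values. */
final class StringStatsValues {
    private final int minLength;
    private final int maxLength;
    private final double avgLength;
    private final double entropy;

    StringStatsValues(int minLength, int maxLength, double avgLength, double entropy) {
        this.minLength = minLength;
        this.maxLength = maxLength;
        this.avgLength = avgLength;
        this.entropy = entropy;
    }

    int getMinLength() { return minLength; }
    int getMaxLength() { return maxLength; }
    double getAvgLength() { return avgLength; }
    double getEntropy() { return entropy; }

    // Replaces the switch: the enum constant itself knows which getter to call.
    // Metrics.valueOf throws IllegalArgumentException for unknown names.
    Object value(String name) {
        return Metrics.valueOf(name).getFieldValue(this);
    }
}

enum Metrics {
    min_length {
        @Override Object getFieldValue(StringStatsValues stats) { return stats.getMinLength(); }
    },
    max_length {
        @Override Object getFieldValue(StringStatsValues stats) { return stats.getMaxLength(); }
    },
    avg_length {
        @Override Object getFieldValue(StringStatsValues stats) { return stats.getAvgLength(); }
    },
    entropy {
        @Override Object getFieldValue(StringStatsValues stats) { return stats.getEntropy(); }
    };

    // Every constant must implement this, so the compiler enforces completeness
    abstract Object getFieldValue(StringStatsValues stats);
}
```

The trade-off is a slightly more verbose enum in exchange for a compile-time guarantee that no metric is forgotten.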

not-napoleon · 2019-10-24T14:22:39Z

...ytics/src/main/java/org/elasticsearch/xpack/analytics/stringstats/StringStatsAggregator.java

+
+    @Override
+    public ScoreMode scoreMode() {
+        return valuesSource != null && valuesSource.needsScores() ? ScoreMode.COMPLETE : ScoreMode.COMPLETE_NO_SCORES;


Please add some parentheses around the predicate here. Having to remember that && is higher precedence than ?: is unnecessary cognitive load, so parentheses will make it more readable, even if they are technically redundant.
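
A small sketch of the parenthesized form, using hypothetical local stand-ins for `ScoreMode` and `ValuesSource` so that it compiles on its own. Behavior is identical to the unparenthesized version, since `&&` already binds tighter than `?:`.

```java
/** Hypothetical sketch; ScoreMode and ValuesSource here are local stand-ins. */
final class ScoreModeSketch {
    enum ScoreMode { COMPLETE, COMPLETE_NO_SCORES }

    interface ValuesSource {
        boolean needsScores();
    }

    static ScoreMode scoreMode(ValuesSource valuesSource) {
        // Parentheses make the predicate of the conditional explicit
        return (valuesSource != null && valuesSource.needsScores())
            ? ScoreMode.COMPLETE
            : ScoreMode.COMPLETE_NO_SCORES;
    }
}
```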

csoulios · 2019-11-13T12:11:06Z

@elasticmachine run elasticsearch-ci/bwc

Backport of #47468 to 7.x

This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following:

min_length: The length of the shortest term
max_length: The length of the longest term
avg_length: The average length of all terms
distribution: The probability distribution of all characters appearing in all terms
entropy: The total Shannon entropy value calculated for all terms

This aggregation has been implemented as an analytics plugin.

First commit of the implementation for string_stats aggregation

981807f

csoulios added >feature :Analytics/Aggregations Aggregations v8.0.0 v7.5.0 labels Oct 2, 2019

csoulios requested a review from polyfractal October 2, 2019 19:03

Muted unit test

fe35bf8

StringStatsAggregatorTests#testSingleValuedFieldFormatter fails because of elastic#47469

polyfractal suggested changes Oct 7, 2019

csoulios added 2 commits October 8, 2019 15:49

Addressed code review comments

f05eb48

Merge branch 'master' into feature/string_stats

5904d18

polyfractal approved these changes Oct 18, 2019

csoulios added 3 commits October 21, 2019 15:38

Merge branch 'master' into feature/string_stats

7f723c6

Added asciidoc for the string_stats aggregation

08cf5e7

Added missing reference/callout

82b6eb7

csoulios requested a review from not-napoleon October 23, 2019 13:33

not-napoleon approved these changes Oct 24, 2019

csoulios added 4 commits October 28, 2019 21:01

Merge branch 'master' into feature/string_stats

ee31f20

Addressed review comments

dc5077a

Fix serialization bug that failed caching results

3f4ef90

Addressed review comments

630701c

csoulios mentioned this pull request Nov 1, 2019

Closes #48469 - Refactor and DRY up Kahan Sum algorithm #48558

Merged

jimczi added v7.6.0 and removed v7.5.0 labels Nov 12, 2019

csoulios added 2 commits November 12, 2019 17:28

Merge branch 'master' into feature/string_stats

699a91e

Used CompensatedSum for computing Kahan Summation

9861659

Merge branch 'master' into feature/string_stats

b580cdb

csoulios merged commit b0e12c9 into elastic:master Nov 14, 2019

csoulios deleted the feature/string_stats branch November 14, 2019 15:23

csoulios mentioned this pull request Nov 14, 2019

[7.x] Implement stats aggregation for string terms #49097

Merged

wylieconlon mentioned this pull request Nov 22, 2019

[data.search.aggs] Support string statistics aggs in AggConfigs elastic/kibana#51510

Closed

10 tasks

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

Add String Stats Aggregation elastic/elasticsearch-net#4369

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
