Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement stats aggregation for string terms #47468

Merged
merged 14 commits into from
Nov 14, 2019

Conversation

csoulios
Copy link
Contributor

@csoulios csoulios commented Oct 2, 2019

This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following:

  • min_length: The length of the shortest term
  • max_length: The length of the longest term
  • avg_length: The average length of all terms
  • distribution: The probability distribution of all characters appearing in all terms
  • entropy: The total Shannon entropy value calculated for all terms

This aggregation has been implemented as an analytics plugin.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

StringStatsAggregatorTests#testSingleValuedFieldFormatter fails because
of elastic#47469
Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, think it looks good!

Needs some documentation + doc tests. Let me know if you have questions about the doc tests, they are a bit funky :)


public class DataScienceAggregationBuilders {
public class AnalyticsAggregationBuilders {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, good catch, thanks for the fix :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the embarrassing typo below heh :)

}

static class Fields {
public static final String COUNT = "count";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use ParseField for these. ParseField has some extra functionality to handle deprecations/renaming, in case we ever decide to change the string values.

case avg_length: return this.getAvgLength();
case entropy: return this.getEntropy();
default:
throw new IllegalArgumentException("Unknown value [" + name + "] in common stats aggregation");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"common stats" left over from a different agg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it should read string stats. Fixed.

// Parse string chars and count occurrences
for (Character c : valueStr.toCharArray()) {
LongArray occ = charOccurrences.get(c);
final long overSize = BigArrays.overSize(bucket + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a terribly expensive call, but we should be able to move this up and out of the valuesCount loop I think? That way we only calculate the bigarrays size once instead of for each character in each value?

@csoulios
Copy link
Contributor Author

csoulios commented Oct 9, 2019

@elasticmachine run elasticsearch-ci/packaging-sample-matrix

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, pending docs. /cc @not-napoleon who I volunteered to review the docs while I'm out 😁

@csoulios
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/default-distro

@csoulios
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/default-distro
@elasticmachine run elasticsearch-ci/bwc

Copy link
Member

@not-napoleon not-napoleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of nits, but looks good to me overall.

* @return A map with the character as key and the probability of
* this character to occur as value. The map is ordered by frequency descending.
*/
public Map<String, Double> getDistribution() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Does this need to be public? Looked to me like it was only called within the package


public Object value(String name) {
Metrics metrics = Metrics.valueOf(name);
switch (metrics) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit of a nit, but I don't love switching on an enum. If someone later adds a field to the enum, they need to remember to also update this switch. IMHO, a better solution would be to put a method on the enum getFieldValue(InternalStringStats stats) which could then call the appropriate getter and return the value. That way, any new enum value would need to implement the method for it to compile.


@Override
public ScoreMode scoreMode() {
return valuesSource != null && valuesSource.needsScores() ? ScoreMode.COMPLETE : ScoreMode.COMPLETE_NO_SCORES;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some parenthesis around the predicate here. Having to remember that && is higher precedence than ?: is unnecessary cognitive load, so parenthesis will make it more readable, even if they are technically redundant.

@csoulios
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

@csoulios csoulios merged commit b0e12c9 into elastic:master Nov 14, 2019
@csoulios csoulios deleted the feature/string_stats branch November 14, 2019 15:23
csoulios added a commit that referenced this pull request Nov 15, 2019
Backport of #47468 to 7.x

This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following:

min_length: The length of the shortest term
max_length: The length of the longest term
avg_length: The average length of all terms
distribution: The probability distribution of all characters appearing in all terms
entropy: The total Shannon entropy value calculated for all terms

This aggregation has been implemented as an analytics plugin.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants