Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Add new categorization stats to model_size_stats #989

Merged
merged 5 commits into from
Feb 7, 2020

Conversation

droberts195
Copy link
Contributor

@droberts195 droberts195 commented Feb 4, 2020

This change adds support for the following new model_size_stats
fields:

  • categorized_doc_count
  • total_category_count
  • frequent_category_count
  • rare_category_count
  • dead_category_count
  • categorization_status

Relates elastic/elasticsearch#50749

This change adds support for the following new model_size_stats
fields:

- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status

Relates #50749
@droberts195
Copy link
Contributor Author

droberts195 commented Feb 4, 2020

This PR will fail CI until elastic/elasticsearch#51879 is merged. The Java side change must be merged first, but I wanted to open this now so they can be reviewed together.

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a pass through: looks good Dave. I made some minor tidy up suggestions.

testIter != uniqueTokenIds.end()) {
if (commonIter->first < testIter->first) {
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It strikes me that this becomes somewhat cleaner using std::find, i.e.

while (commonIter != m_CommonUniqueTokenIds.end()) {
  testIter = std::find(testIter, uniqueTokenIds.end(), *commonIter,
                       [](const auto& lhs, const auto& rhs) { return lhs->first >= rhs->first; }); 
  if (testIter == uniqueTokenIds.end() || *testIter != *commonIter) {
      return false;
  }
  ++commonIter;
}

I guess this is just moving well tested code to a different place, so feel free to leave as if you'd rather not change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I more-or-less did this. The second part of the condition has to be testIter->first != commonIter->first || testIter->second != commonIter->second instead of simply *testIter != *commonIter because in some cases one is a std::pair<const std::size_t, std::size_t> and the other is a std::pair<std::size_t, std::size_t>.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, great. It's a shame comparison doesn't work with const. That feels like an error in the std::pair implementation.

std::size_t s_FrequentCategories;
std::size_t s_RareCategories;
std::size_t s_DeadCategories;
model_t::ECategorizationStatus s_CategorizationStatus;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you're only extending, but should this provide defaults? It seems like this object is a recipe for uninitialised variables as it stands.

BOOST_TEST_REQUIRE(modelSizeStats.HasMember("model_bytes_exceeded"));
BOOST_REQUIRE_EQUAL(int64_t(0), modelSizeStats["model_bytes_exceeded"].GetInt64());
BOOST_TEST_REQUIRE(modelSizeStats.HasMember("model_bytes_memory_limit"));
BOOST_REQUIRE_EQUAL(int64_t(50000),
modelSizeStats["model_bytes_memory_limit"].GetInt64());
BOOST_TEST_REQUIRE(modelSizeStats.HasMember("categorized_doc_count"));
BOOST_REQUIRE_EQUAL(int64_t(1000), modelSizeStats["categorized_doc_count"].GetInt64());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe include cstdint and std::int64_t based on our guidelines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was also dodgy before my changes because many expected results should have been unsigned rather than signed. I copied that bug into my new lines, but I'll correct that throughout the file at the same time as adding std::.

res.s_FrequentCategories = 0;
res.s_RareCategories = 0;
res.s_DeadCategories = 0;
res.s_CategorizationStatus = model_t::E_CategorizationStatusOk;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make all these defaults in the class definition.

// - At least 100 messages have been categorized
// and one of the following holds:
// - There is only 1 category
// - More than 90% of categories have 1 message
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// - More than 90% of categories have 1 message
// - More than 90% of categories are rare

(res.s_TotalCategories == 1 || 10 * res.s_RareCategories > 9 * res.s_TotalCategories ||
2 * res.s_TotalCategories > res.s_CategorizedMessages ||
res.s_FrequentCategories == 0 || 2 * res.s_DeadCategories > res.s_TotalCategories)) {
res.s_CategorizationStatus = model_t::E_CategorizationStatusPoor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like an odd place for this logic: could it perhaps be better encapsulated. Maybe a static member on some class related to categorisation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come to think of it I can just put this in CTokenListDataCategorizerBase::updateModelSizeStats(). I was thinking we'd only want to do this calculation once if we start doing per-partition categorization, but since the stats it's based on will accumulate over the partitions the last partition we calculated the status for would use the overall numbers and that's the value that would get reported. The cost is that we evaluate some 5 simple comparisons per-partition instead of once but that's not going to take long and I agree the encapsulation is much better moving this formula into categorization code.

(Plus of course while we're not doing per-partition categorization the cost concerns are all theoretical anyway as we only have one categorizer object.)

compCategory.maxMatchingStringLen() >= rawStringLen &&
compCategory.isMissingCommonTokenWeightZero(m_WorkTokenUniqueIds) &&
compCategory.containsCommonTokensInOrder(m_WorkTokenIds));
bool matchesSearch(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: braces initialization

@@ -539,6 +528,47 @@ std::size_t CTokenListDataCategorizerBase::memoryUsage() const {
return mem;
}

void CTokenListDataCategorizerBase::updateMemoryResults(CResourceMonitor::SResults& results) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about calling this updateMemoryResults. This includes other information now. Maybe updateTelemetry? I guess there is a slight issue because we've subverted CResourceMonitor to do more than monitor resource usage. Perhaps we should think about renaming this in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLion makes these renamings pretty trivial so how about updateModelSizeStats? Then naming will be consistent through C++, Java and JavaScript, which might make it easier to trace how these numbers flow through the code.

Many of the new stats are related to resource use - certainly total_category_count, rare_category_count and dead_category_count are indicators of how much memory categorization is using, and whether that usage is suboptimal. To put the other stats somewhere else would just create complexity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok cool. I definitely think that's better.

To put the other stats somewhere else would just create complexity.

Btw, I completely agree we don't want these somewhere else. It's more I wonder if we should think about renaming CResourceMonitor at some point. I'm happy for that not to happen in this PR though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ to not renaming CResourceMonitor in this PR. That would touch a very large number of files unrelated to categorization.

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you're waiting for Java changes to land, but LGTM when you're ready.

@droberts195 droberts195 merged commit 9ce5e7e into elastic:master Feb 7, 2020
@droberts195 droberts195 deleted the categorizer_model_stats branch February 7, 2020 12:49
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request Feb 7, 2020
This change adds support for the following new model_size_stats
fields:

- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status

Backport of elastic#989
droberts195 added a commit that referenced this pull request Feb 10, 2020
This change adds support for the following new model_size_stats
fields:

- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status

Backport of #989
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants