[ML] Add new categorization stats to model_size_stats #989

droberts195 · 2020-02-04T17:35:15Z

This change adds support for the following new model_size_stats
fields:

categorized_doc_count
total_category_count
frequent_category_count
rare_category_count
dead_category_count
categorization_status

This change adds support for the following new model_size_stats fields: - categorized_doc_count - total_category_count - frequent_category_count - rare_category_count - dead_category_count - categorization_status Relates #50749

droberts195 · 2020-02-04T17:36:28Z

This PR will fail CI until elastic/elasticsearch#51879 is merged. The Java side change must be merged first, but I wanted to open this now so they can be reviewed together.

tveasey

I've done a pass through: looks good Dave. I made some minor tidy up suggestions.

tveasey · 2020-02-05T13:13:58Z

include/model/CTokenListCategory.h

+               testIter != uniqueTokenIds.end()) {
+            if (commonIter->first < testIter->first) {
+                return false;
+            }


It strikes me that this becomes somewhat cleaner using std::find, i.e.

while (commonIter != m_CommonUniqueTokenIds.end()) { testIter = std::find(testIter, uniqueTokenIds.end(), *commonIter, [](const auto& lhs, const auto& rhs) { return lhs->first >= rhs->first; }); if (testIter == uniqueTokenIds.end() || *testIter != *commonIter) { return false; } ++commonIter; }

I guess this is just moving well tested code to a different place, so feel free to leave as if you'd rather not change.

I more-or-less did this. The second part of the condition has to be testIter->first != commonIter->first || testIter->second != commonIter->second instead of simply *testIter != *commonIter because in some cases one is a std::pair<const std::size_t, std::size_t> and the other is a std::pair<std::size_t, std::size_t>.

Ok, great. It's a shame comparison doesn't work with const. That feels like an error in the std::pair implementation.

tveasey · 2020-02-05T15:33:37Z

include/model/CResourceMonitor.h

+        std::size_t s_FrequentCategories;
+        std::size_t s_RareCategories;
+        std::size_t s_DeadCategories;
+        model_t::ECategorizationStatus s_CategorizationStatus;


I know you're only extending, but should this provide defaults? It seems like this object is a recipe for uninitialised variables as it stands.

tveasey · 2020-02-05T15:36:42Z

lib/api/unittest/CModelSnapshotJsonWriterTest.cc

    BOOST_TEST_REQUIRE(modelSizeStats.HasMember("model_bytes_exceeded"));
    BOOST_REQUIRE_EQUAL(int64_t(0), modelSizeStats["model_bytes_exceeded"].GetInt64());
    BOOST_TEST_REQUIRE(modelSizeStats.HasMember("model_bytes_memory_limit"));
    BOOST_REQUIRE_EQUAL(int64_t(50000),
                        modelSizeStats["model_bytes_memory_limit"].GetInt64());
+    BOOST_TEST_REQUIRE(modelSizeStats.HasMember("categorized_doc_count"));
+    BOOST_REQUIRE_EQUAL(int64_t(1000), modelSizeStats["categorized_doc_count"].GetInt64());


nit: maybe include cstdint and std::int64_t based on our guidelines.

This file was also dodgy before my changes because many expected results should have been unsigned rather than signed. I copied that bug into my new lines, but I'll correct that throughout the file at the same time as adding std::.

tveasey · 2020-02-05T15:37:34Z

lib/model/CResourceMonitor.cc

+    res.s_FrequentCategories = 0;
+    res.s_RareCategories = 0;
+    res.s_DeadCategories = 0;
+    res.s_CategorizationStatus = model_t::E_CategorizationStatusOk;


I'd make all these defaults in the class definition.

tveasey · 2020-02-05T15:38:17Z

lib/model/CResourceMonitor.cc

+    // - At least 100 messages have been categorized
+    // and one of the following holds:
+    // - There is only 1 category
+    // - More than 90% of categories have 1 message


Suggested change

// - More than 90% of categories have 1 message

// - More than 90% of categories are rare

tveasey · 2020-02-05T15:40:00Z

lib/model/CResourceMonitor.cc

+        (res.s_TotalCategories == 1 || 10 * res.s_RareCategories > 9 * res.s_TotalCategories ||
+         2 * res.s_TotalCategories > res.s_CategorizedMessages ||
+         res.s_FrequentCategories == 0 || 2 * res.s_DeadCategories > res.s_TotalCategories)) {
+        res.s_CategorizationStatus = model_t::E_CategorizationStatusPoor;


This feels like an odd place for this logic: could it perhaps be better encapsulated. Maybe a static member on some class related to categorisation.

Come to think of it I can just put this in CTokenListDataCategorizerBase::updateModelSizeStats(). I was thinking we'd only want to do this calculation once if we start doing per-partition categorization, but since the stats it's based on will accumulate over the partitions the last partition we calculated the status for would use the overall numbers and that's the value that would get reported. The cost is that we evaluate some 5 simple comparisons per-partition instead of once but that's not going to take long and I agree the encapsulation is much better moving this formula into categorization code.

(Plus of course while we're not doing per-partition categorization the cost concerns are all theoretical anyway as we only have one categorizer object.)

tveasey · 2020-02-05T15:42:45Z

lib/model/CTokenListDataCategorizerBase.cc

-                           compCategory.maxMatchingStringLen() >= rawStringLen &&
-                           compCategory.isMissingCommonTokenWeightZero(m_WorkTokenUniqueIds) &&
-                           compCategory.containsCommonTokensInOrder(m_WorkTokenIds));
+        bool matchesSearch(


nit: braces initialization

tveasey · 2020-02-05T15:48:19Z

lib/model/CTokenListDataCategorizerBase.cc

@@ -539,6 +528,47 @@ std::size_t CTokenListDataCategorizerBase::memoryUsage() const {
    return mem;
 }

+void CTokenListDataCategorizerBase::updateMemoryResults(CResourceMonitor::SResults& results) const {


I'm not sure about calling this updateMemoryResults. This includes other information now. Maybe updateTelemetry? I guess there is a slight issue because we've subverted CResourceMonitor to do more than monitor resource usage. Perhaps we should think about renaming this in a separate PR.

CLion makes these renamings pretty trivial so how about updateModelSizeStats? Then naming will be consistent through C++, Java and JavaScript, which might make it easier to trace how these numbers flow through the code.

Many of the new stats are related to resource use - certainly total_category_count, rare_category_count and dead_category_count are indicators of how much memory categorization is using, and whether that usage is suboptimal. To put the other stats somewhere else would just create complexity.

Ok cool. I definitely think that's better.

To put the other stats somewhere else would just create complexity.

Btw, I completely agree we don't want these somewhere else. It's more I wonder if we should think about renaming CResourceMonitor at some point. I'm happy for that not to happen in this PR though.

++ to not renaming CResourceMonitor in this PR. That would touch a very large number of files unrelated to categorization.

tveasey

I know you're waiting for Java changes to land, but LGTM when you're ready.

This change adds support for the following new model_size_stats fields: - categorized_doc_count - total_category_count - frequent_category_count - rare_category_count - dead_category_count - categorization_status Backport of elastic#989

This change adds support for the following new model_size_stats fields: - categorized_doc_count - total_category_count - frequent_category_count - rare_category_count - dead_category_count - categorization_status Backport of #989

[ML] Add new categorization stats to model_size_stats

4bde366

This change adds support for the following new model_size_stats fields: - categorized_doc_count - total_category_count - frequent_category_count - rare_category_count - dead_category_count - categorization_status Relates #50749

droberts195 added >enhancement :ml v8.0.0 v7.7.0 labels Feb 4, 2020

tveasey reviewed Feb 5, 2020

View reviewed changes

droberts195 added 3 commits February 6, 2020 12:53

Address review comments

e55b927

Changing categorization status "poor" to "warn"

9c2ede9

Add changelog

7047f85

tveasey approved these changes Feb 6, 2020

View reviewed changes

Add a test for "warn" status

7021ae1

droberts195 merged commit 9ce5e7e into elastic:master Feb 7, 2020

droberts195 deleted the categorizer_model_stats branch February 7, 2020 12:49

droberts195 mentioned this pull request Feb 7, 2020

[7.x][ML] Add new categorization stats to model_size_stats #992

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ML] Add new categorization stats to model_size_stats #989

[ML] Add new categorization stats to model_size_stats #989

droberts195 commented Feb 4, 2020 •

edited

Loading

droberts195 commented Feb 4, 2020 •

edited

Loading

tveasey left a comment

tveasey Feb 5, 2020

droberts195 Feb 6, 2020

tveasey Feb 6, 2020

tveasey Feb 5, 2020

tveasey Feb 5, 2020

droberts195 Feb 6, 2020

tveasey Feb 5, 2020

tveasey Feb 5, 2020

tveasey Feb 5, 2020

droberts195 Feb 6, 2020

tveasey Feb 5, 2020

tveasey Feb 5, 2020

droberts195 Feb 6, 2020

tveasey Feb 6, 2020

droberts195 Feb 6, 2020

tveasey left a comment

	// - More than 90% of categories have 1 message
	// - More than 90% of categories are rare

[ML] Add new categorization stats to model_size_stats #989

[ML] Add new categorization stats to model_size_stats #989

Conversation

droberts195 commented Feb 4, 2020 • edited Loading

droberts195 commented Feb 4, 2020 • edited Loading

tveasey left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tveasey left a comment

Choose a reason for hiding this comment

droberts195 commented Feb 4, 2020 •

edited

Loading

droberts195 commented Feb 4, 2020 •

edited

Loading