[DOCS] Add feature importance to classification example
lcawl committed Sep 23, 2020
1 parent ca66ab0 commit 5d8027e
Showing 9 changed files with 161 additions and 83 deletions.
2 changes: 1 addition & 1 deletion docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
@@ -185,4 +185,4 @@
testing. This split of the data set is the _testing data set_. Once the model has
been trained, you can let the model predict the value of the data points it has
never seen before and compare the prediction to the actual value by using the
evaluate {dfanalytics} API.
////
174 changes: 113 additions & 61 deletions docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
@@ -104,6 +104,9 @@
image::images/flights-classification-job-1.png["Creating a {dfanalytics-job} in {kib}"]
[role="screenshot"]
image::images/flights-classification-job-2.png["Creating a {dfanalytics-job} in {kib} – continued"]

[role="screenshot"]
image::images/flights-classification-job-3.png["Creating a {dfanalytics-job} in {kib} – advanced options"]

.. Choose `kibana_sample_data_flights` as the source index.
.. Choose `classification` as the job type.
.. Choose `FlightDelay` as the dependent variable, which is the field that we
@@ -116,15 +119,18 @@
recommended to exclude fields that either contain erroneous data or describe the
source data for training. While that value is low for this example, for many
large data sets using a small training sample greatly reduces runtime without
impacting accuracy.
.. If you want to experiment with <<ml-feature-importance,feature importance>>,
specify a value in the advanced configuration options. In this example, we
choose to return a maximum of 10 feature importance values per document. This
option affects the speed of the analysis, so by default it is disabled.
.. Use the default memory limit for the job. If the job requires more than this
amount of memory, it fails to start. If the available memory on the node is
limited, this setting makes it possible to prevent job execution.
.. Add a job ID and optionally a job description.
.. Add the name of the destination index that will contain the results of the
analysis. In {kib}, the index name matches the job ID by default. It will
contain a copy of the source index data where each document is annotated with
the results. If the index does not exist, it will be created automatically.

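Conceptually, the training percentage is a per-document random draw. The sketch below illustrates the idea only; `mark_training` is a hypothetical helper for this example, not part of any Elastic API, and the actual {ml} implementation may differ:

```python
import random

def mark_training(docs, training_percent=10, seed=42):
    """Randomly flag roughly training_percent% of documents for
    training, mimicking the per-document `is_training` annotation."""
    rng = random.Random(seed)
    return [
        {**doc, "is_training": rng.random() < training_percent / 100}
        for doc in docs
    ]

docs = mark_training([{"id": i} for i in range(10_000)])
training_fraction = sum(d["is_training"] for d in docs) / len(docs)
```

With a large source index, even a 10% sample leaves thousands of training documents, which is why a small training percentage can cut runtime without hurting accuracy.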

.API example
@@ -140,13 +146,15 @@
PUT _ml/data_frame/analytics/model-flight-delay-classification
]
},
"dest": {
"index": "model-flight-delay-classification",
"results_field": "ml" <1>
},
"analysis": {
"classification": {
"dependent_variable": "FlightDelay",
"training_percent": 10,
"num_top_classes": 10,
"num_top_feature_importance_values": 10 <2>
}
},
"analyzed_fields": {
@@ -160,7 +168,8 @@
}
--------------------------------------------------
// TEST[skip:setup kibana sample data]
<1> The field name in the `dest` index that contains the analysis results.
<2> To disable feature importance calculations, omit this option.
====
--

@@ -259,32 +268,31 @@
The API call returns the following response:
},
"analysis_stats" : {
"classification_stats" : {
"timestamp" : 1599684771114,
"iteration" : 18,
"hyperparameters" : {
"class_assignment_objective" : "maximize_minimum_recall",
"alpha" : 6.648298686326093,
"downsample_factor" : 0.7435400845721971,
"eta" : 0.039957516522980074,
"eta_growth_rate_per_tree" : 1.0168333294220058,
"feature_bag_fraction" : 0.49761652263010625,
"gamma" : 0.21224183609258152,
"lambda" : 0.2572621613644672,
"max_attempts_to_add_tree" : 3,
"max_optimization_rounds_per_hyperparameter" : 2,
"max_trees" : 590,
"num_folds" : 5,
"num_splits_per_feature" : 75,
"soft_tree_depth_limit" : 3.2719032647442443,
"soft_tree_depth_tolerance" : 0.14970565884872958
},
"timing_stats" : {
"elapsed_time" : 37915,
"iteration_time" : 2552
},
"validation_loss" : {
"loss_type" : "binomial_logistic"
}
}
}
@@ -322,15 +330,15 @@
can examine its probability and score (`ml.prediction_probability` and
`ml.prediction_score`). The higher these values, the more confident the
model is that the data point belongs to the named class. If you examine the
destination index more closely in the *Discover* app in {kib} or use the
standard {es} search command, you can see that the analysis predicts the
probability of all possible classes for the dependent variable. The
`top_classes` object contains the predicted classes with the highest scores.

.API example
[%collapsible]
====
[source,console]
--------------------------------------------------
GET model-flight-delay-classification/_search
--------------------------------------------------
// TEST[skip:TBD]
@@ -342,48 +350,91 @@
The snippet below shows a part of a document with the annotated results:
"FlightDelay" : false,
...
"ml" : {
"top_classes" : [ <1>
{
"class_name" : false,
"class_probability" : 0.3933807062505216,
"class_score" : 0.3933807062505216
},
{
"class_name" : true,
"class_probability" : 0.6066192937494784,
"class_score" : 0.22857258275913037
}
],
"FlightDelay_prediction" : false,
"prediction_probability" : 0.3933807062505216,
"prediction_score" : 0.3933807062505216,
"feature_importance" : [
{
"feature_name" : "FlightTimeMin",
"importance" : -2.823868829093038,
"classes" : [
{
"class_name" : false,
"importance" : -2.823868829093038
},
{
"class_name" : true,
"importance" : 2.823868829093038
}
]
},
{
"feature_name" : "DistanceMiles",
"importance" : 0.9872151818111125,
"classes" : [
{
"class_name" : false,
"importance" : 0.9872151818111125
},
{
"class_name" : true,
"importance" : -0.9872151818111125
}
]
},
...
],
"is_training" : false
}
----
<1> An array of values specifying the probability of the prediction and the
score for each class.
The class with the highest score is the prediction. In this example, `false` has
a `class_score` of 0.39 while `true` has only 0.22, so the prediction will be
`false`. For more details about these values, see
<<dfa-classification-interpret>>.
====
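The highest-score rule can be sketched in a few lines of Python. The values are copied from the example document above; this is an illustration of how to read the results, not client code:

```python
# Per-class results as returned in the `top_classes` object.
top_classes = [
    {"class_name": False, "class_probability": 0.3933807062505216,
     "class_score": 0.3933807062505216},
    {"class_name": True, "class_probability": 0.6066192937494784,
     "class_score": 0.22857258275913037},
]

# The prediction is the class with the highest score, not the highest
# probability: here `true` is more probable, yet `false` wins on score.
prediction = max(top_classes, key=lambda c: c["class_score"])["class_name"]

# The class probabilities still sum to 1.
total_probability = sum(c["class_probability"] for c in top_classes)
```

The gap between probability and score reflects the `maximize_minimum_recall` class assignment objective shown in the hyperparameters above.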

If you chose to calculate feature importance, the destination index also
contains `ml.feature_importance` objects. Every field that is included in the
{classanalysis} (known as a _feature_ of the data point) is assigned a feature
importance value. However, only the most significant values (in this case, the
top 10) are stored in the index. These values indicate which features had the
biggest impact (positive or negative) on each prediction. In {kib}, you can see
this information displayed in the form of a decision plot.
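Keeping only the most significant values amounts to ranking features by the magnitude of their importance, positive or negative. The helper below is hypothetical and the sample importance for `OriginWeather` is invented for illustration; only the two other values come from the example document:

```python
def top_importance(feature_importance, n=10):
    """Return the n features whose importance has the largest
    absolute value, whether positive or negative."""
    return sorted(
        feature_importance,
        key=lambda f: abs(f["importance"]),
        reverse=True,
    )[:n]

features = [
    {"feature_name": "FlightTimeMin", "importance": -2.823868829093038},
    {"feature_name": "DistanceMiles", "importance": 0.9872151818111125},
    {"feature_name": "OriginWeather", "importance": 0.1},  # invented value
]
top_two = top_importance(features, n=2)
```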

[role="screenshot"]
image::images/flights-classification-importance.png["A decision plot for feature importance values in {kib}"]

The sum of the feature importance values for a class (in this example, `false`)
in this data point approximates the logarithm of its odds
(or {wikipedia}/Logit[log-odds]).
While the probability of a class ranges between 0 and 1, its log-odds range
between negative and positive infinity. In {kib}, the decision path for each
class starts at the average probability for that class over the training data
set. From there, the feature importance values are added to the decision path.
The features with the most significant positive or negative impact appear at the
top. Thus in this example, the features related to flight time and distance had
the most significant influence on this prediction. This type of information can
help you to understand how models arrive at their predictions. It can also
indicate which aspects of your data set are most influential or least useful
when you are training and tuning your model.
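The log-odds relationship can be checked numerically. The sketch below assumes a baseline log-odds of 0 purely for illustration (the real baseline is the average for the class over the training data) and reuses the two importance values for the `false` class from the example document:

```python
import math

def class_probability(baseline_log_odds, importances):
    """Add feature importance values to a baseline log-odds and map
    the result back to a probability with the logistic function."""
    log_odds = baseline_log_odds + sum(importances)
    return 1.0 / (1.0 + math.exp(-log_odds))

# With no feature contributions, the probability equals the baseline.
p_baseline = class_probability(0.0, [])

# The net negative contribution pushes the probability of `false` down.
p_false = class_probability(0.0, [-2.823868829093038, 0.9872151818111125])
```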

[[flightdata-classification-evaluate]]
== Evaluating {classification} results
@@ -408,18 +459,18 @@
If you want to see the exact number of occurrences, select a quadrant in the
matrix. You can optionally filter the table to contain only testing data so you
can see how well the model performs on previously unseen data. In this example,
there are 2952 documents in the testing data that have the `true` class. 2109 of
them are predicted as `false`; this is called a _false negative_. 843 are
predicted correctly as `true`; this is called a _true positive_. The confusion
matrix therefore shows us that 29% of the actual `true` values were correctly
predicted and 71% were incorrectly predicted in the test data set.

Likewise if you select other quadrants in the matrix, it shows the number of
documents that have the `false` class as their actual value in the testing data
set. In this example, the model labeled 1544 documents out of 8802 correctly as
`false`; this is called a _true negative_. 7258 documents are predicted
incorrectly as `true`; this is called a _false positive_. Thus 18% of the actual
`false` values were correctly predicted and 82% were incorrectly predicted in
the test data set. When you perform {classanalysis} on your own data, it might
take multiple iterations before you are satisfied with the results and ready to
deploy the model.
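The quadrant percentages above follow from simple ratios over each actual class. A quick check in Python, using the counts from this example:

```python
# Counts from the confusion matrix in this example.
true_positives = 843    # actual `true`, predicted `true`
false_negatives = 2109  # actual `true`, predicted `false`
true_negatives = 1544   # actual `false`, predicted `false`
false_positives = 7258  # actual `false`, predicted `true`

# Per-class recall: the fraction of each actual class predicted correctly.
true_recall = true_positives / (true_positives + false_negatives)
false_recall = true_negatives / (true_negatives + false_positives)

print(f"{true_recall:.0%} of actual `true` values predicted correctly")
print(f"{false_recall:.0%} of actual `false` values predicted correctly")
```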
@@ -438,7 +489,7 @@
performed on the training data set.
--------------------------------------------------
POST _ml/data_frame/_evaluate
{
"index": "model-flight-delay-classification",
"query": {
"term": {
"ml.is_training": {
@@ -467,7 +518,7 @@
performed on previously unseen data:
--------------------------------------------------
POST _ml/data_frame/_evaluate
{
"index": "model-flight-delay-classification",
"query": {
"term": {
"ml.is_training": {
@@ -506,11 +557,11 @@
were misclassified (`actual_class` does not match `predicted_class`):
"predicted_classes" : [
{
"predicted_class" : "false", <3>
"count" : 1544 <4>
},
{
"predicted_class" : "true",
"count" : 7258
}
],
"other_predicted_class_doc_count" : 0
@@ -521,11 +572,11 @@
"predicted_classes" : [
{
"predicted_class" : "false",
"count" : 2109
},
{
"predicted_class" : "true",
"count" : 843
}
],
"other_predicted_class_doc_count" : 0
@@ -548,6 +599,7 @@
When you have trained a satisfactory model, you can deploy it to make predictions
about new data. Those steps are not covered in this example. See
<<ml-inference>>.

If you don't want to keep the {dfanalytics-job}, you can delete it in {kib} or
by using the {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When
you delete {dfanalytics-jobs} in {kib}, you have the option to also remove the
destination indices and index patterns.
(The remaining changed files are images and are not displayed.)