diff --git a/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc b/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
index 4bd5de6746..6d854ab4d2 100644
--- a/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
+++ b/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
@@ -185,4 +185,4 @@ testing.
 This split of the data set is the _testing data set_. Once the model has
 been trained, you can let the model predict the value of the data points it has
 never seen before and compare the prediction to the actual value by using the
 evaluate {dfanalytics} API.
-////
+////
\ No newline at end of file
diff --git a/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc b/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
index afbe4d5292..ebecefc51d 100644
--- a/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
+++ b/docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
@@ -104,6 +104,9 @@ image::images/flights-classification-job-1.png["Creating a {dfanalytics-job} in
 [role="screenshot"]
 image::images/flights-classification-job-2.png["Creating a {dfanalytics-job} in {kib} – continued"]
+[role="screenshot"]
+image::images/flights-classification-job-3.png["Creating a {dfanalytics-job} in {kib} – advanced options"]
+
 .. Choose `kibana_sample_data_flights` as the source index.
 .. Choose `classification` as the job type.
 .. Choose `FlightDelay` as the dependent variable, which is the field that we
@@ -116,15 +119,18 @@ recommended to exclude fields that either contain erroneous data or describe
 the source data for training. While that value is low for this example, for
 many large data sets using a small training sample greatly reduces runtime
 without impacting accuracy.
-.. Use the default feature importance values.
+.. If you want to experiment with <>,
+specify a value in the advanced configuration options. In this example, we
+choose to return a maximum of 10 feature importance values per document.
+This option affects the speed of the analysis, so by default it is disabled.
 .. Use the default memory limit for the job. If the job requires more than this
 amount of memory, it fails to start. If the available memory on the node is
 limited, this setting makes it possible to prevent job execution.
 .. Add a job ID and optionally a job description.
 .. Add the name of the destination index that will contain the results of the
-analysis. It will contain a copy of the source index data where each document is
-annotated with the results. If the index does not exist, it will be created
-automatically.
+analysis. In {kib}, the index name matches the job ID by default. It will
+contain a copy of the source index data where each document is annotated with
+the results. If the index does not exist, it will be created automatically.
 .API example
@@ -140,13 +146,15 @@ PUT _ml/data_frame/analytics/model-flight-delay-classification
     ]
   },
   "dest": {
-    "index": "df-flight-delayed",
+    "index": "model-flight-delay-classification",
     "results_field": "ml" <1>
   },
   "analysis": {
     "classification": {
       "dependent_variable": "FlightDelay",
-      "training_percent": 10
+      "training_percent": 10,
+      "num_top_classes": 10,
+      "num_top_feature_importance_values": 10 <2>
     }
   },
   "analyzed_fields": {
@@ -160,7 +168,8 @@ PUT _ml/data_frame/analytics/model-flight-delay-classification
 }
 --------------------------------------------------
 // TEST[skip:setup kibana sample data]
-<1> The field name in the `dest` index that contains the analysis results.
+<1> The field name in the `dest` index that contains the analysis results.
+<2> To disable feature importance calculations, omit this option.
 ====
 --
@@ -259,32 +268,31 @@ The API call returns the following response:
     },
     "analysis_stats" : {
       "classification_stats" : {
-        "timestamp" : 1597182490577,
+        "timestamp" : 1599684771114,
         "iteration" : 18,
         "hyperparameters" : {
           "class_assignment_objective" : "maximize_minimum_recall",
-          "alpha" : 11.630957564710283,
-          "downsample_factor" : 0.9418550623091531,
-          "eta" : 0.032382816833064335,
-          "eta_growth_rate_per_tree" : 1.0198807182688074,
-          "feature_bag_fraction" : 0.5504020748926737,
-          "gamma" : 0.08388388780939579,
-          "lambda" : 0.08628826657684924,
+          "alpha" : 6.648298686326093,
+          "downsample_factor" : 0.7435400845721971,
+          "eta" : 0.039957516522980074,
+          "eta_growth_rate_per_tree" : 1.0168333294220058,
+          "feature_bag_fraction" : 0.49761652263010625,
+          "gamma" : 0.21224183609258152,
+          "lambda" : 0.2572621613644672,
           "max_attempts_to_add_tree" : 3,
           "max_optimization_rounds_per_hyperparameter" : 2,
-          "max_trees" : 644,
+          "max_trees" : 590,
           "num_folds" : 5,
           "num_splits_per_feature" : 75,
-          "soft_tree_depth_limit" : 7.550606337307592,
-          "soft_tree_depth_tolerance" : 0.13448633124842999
+          "soft_tree_depth_limit" : 3.2719032647442443,
+          "soft_tree_depth_tolerance" : 0.14970565884872958
         },
         "timing_stats" : {
-          "elapsed_time" : 44206,
-          "iteration_time" : 1884
+          "elapsed_time" : 37915,
+          "iteration_time" : 2552
         },
         "validation_loss" : {
-          "loss_type" : "binomial_logistic",
-          "fold_values" : [ ]
+          "loss_type" : "binomial_logistic"
         }
       }
     }
@@ -322,7 +330,7 @@ can examine its probability and score (`ml.prediction_probability` and
 model is that the data point belongs to the named class. If you examine the
 destination index more closely in the *Discover* app in {kib} or use the
 standard {es} search command, you can see that the analysis predicts the
-probability of all possible classes for the dependent variable. The
+probability of all possible classes for the dependent variable. The
 `top_classes` object contains the predicted classes with the highest scores.
 .API example
@@ -330,7 +338,7 @@ probability of all possible classes for the dependent variable.
 ====
 [source,console]
 --------------------------------------------------
-GET df-flight-delayed/_search
+GET model-flight-delay-classification/_search
 --------------------------------------------------
 // TEST[skip:TBD]
@@ -342,48 +350,91 @@ The snippet below shows a part of a document with the annotated results:
   "FlightDelay" : false,
   ...
   "ml" : {
+    "FlightDelay_prediction" : false,
     "top_classes" : [ <1>
       {
-        "class_probability" : 0.9198146781161334,
-        "class_score" : 0.36964390728677926,
-        "class_name" : false
+        "class_name" : false,
+        "class_probability" : 0.3933807062505216,
+        "class_score" : 0.3933807062505216
       },
       {
-        "class_probability" : 0.08018532188386665,
-        "class_score" : 0.08018532188386665,
-        "class_name" : true
+        "class_name" : true,
+        "class_probability" : 0.6066192937494784,
+        "class_score" : 0.22857258275913037
       }
     ],
-    "prediction_score" : 0.36964390728677926,
-    "FlightDelay_prediction" : false,
-    "prediction_probability" : 0.9198146781161334,
+    "prediction_probability" : 0.3933807062505216,
+    "prediction_score" : 0.3933807062505216,
     "feature_importance" : [
       {
-        "feature_name" : "DistanceMiles",
-        "importance" : -3.039025449178423
+        "feature_name" : "FlightTimeMin",
+        "importance" : -2.823868829093038,
+        "classes" : [
+          {
+            "class_name" : false,
+            "importance" : -2.823868829093038
+          },
+          {
+            "class_name" : true,
+            "importance" : 2.823868829093038
+          }
+        ]
       },
       {
-        "feature_name" : "FlightTimeMin",
-        "importance" : 2.4980756273399045
-      }
+        "feature_name" : "DistanceMiles",
+        "importance" : 0.9872151818111125,
+        "classes" : [
+          {
+            "class_name" : false,
+            "importance" : 0.9872151818111125
+          },
+          {
+            "class_name" : true,
+            "importance" : -0.9872151818111125
+          }
+        ]
+      },
+      ...
     ],
     "is_training" : false
   }
 ----
-<1> An array of values specifying the probability of the prediction and the
+<1> An array of values specifying the probability of the prediction and the
 score for each class. The class with the highest score is the prediction.
 In this example, `false` has
-a `class_score` of 0.37 while `true` has only 0.08, so the prediction will be
+a `class_score` of 0.39 while `true` has only 0.22, so the prediction will be
 `false`. For more details about these values, see <>.
+====
+
+If you chose to calculate feature importance, the destination index also
+contains `ml.feature_importance` objects. Every field that is included in the
+{classanalysis} (known as a _feature_ of the data point) is assigned a feature
+importance value. However, only the most significant values (in this case, the
+top 10) are stored in the index. These values indicate which features had the
+biggest impact (positive or negative) on each prediction. In {kib}, you can see
+this information displayed in the form of a decision plot.
 ////
-It is chosen so that the decision to assign the
-data point to the class with the highest score maximizes the minimum recall of
-any class.
+[role="screenshot"]
+image::images/flights-classification-importance.png["A decision plot for feature importance values in {kib}"]
+
+The sum of the feature importance values for a class (in this example, `false`)
+in this data point approximates the logarithm of its odds
+(or {wikipedia}/Logit[log-odds]).
+
+While the probability of a class ranges between 0 and 1, its log-odds range
+between negative and positive infinity. In {kib}, the decision path for each
+class starts at the average probability for that class over the training data
+set. From there, the feature importance values are added to the decision path.
+The features with the most significant positive or negative impact appear at the
+top.
+Thus in this example, the features related to flight time and distance had
+the most significant influence on this prediction. This type of information can
+help you to understand how models arrive at their predictions. It can also
+indicate which aspects of your data set are most influential or least useful
+when you are training and tuning your model.
 ////
-====
 [[flightdata-classification-evaluate]]
 == Evaluating {classification} results
@@ -408,18 +459,18 @@ own results.
 If you want to see the exact number of occurrences, select a
 quadrant in the matrix. You can optionally filter the table to contain only
 testing data so you can see how well the model performs on previously unseen
 data. In this example,
-there are 2952 documents in the testing data that have the `true` class. 1893 of
-them are predicted as `false`; this is called a _false negative_. 1059 are
+there are 2952 documents in the testing data that have the `true` class. 2109 of
+them are predicted as `false`; this is called a _false negative_. 843 are
 predicted correctly as `true`; this is called a _true positive_. The confusion
-matrix therefore shows us that 36% of the actual `true` values were correctly
-predicted and 64% were incorrectly predicted in the test data set.
+matrix therefore shows us that 29% of the actual `true` values were correctly
+predicted and 71% were incorrectly predicted in the test data set.
 Likewise if you select other quadrants in the matrix, it shows the number of
 documents that have the `false` class as their actual value in the testing data
-set. In this example, the model labeled 1033 documents out of 8802 correctly as
-`false`; this is called a _true negative_. 7769 documents are predicted
-incorrectly as `true`; this is called a _false positive_. Thus 12% of the actual
-`false` values were correctly predicted and 88% were incorrectly predicted in
+set.
+In this example, the model labeled 1544 documents out of 8802 correctly as
+`false`; this is called a _true negative_. 7258 documents are predicted
+incorrectly as `true`; this is called a _false positive_. Thus 18% of the actual
+`false` values were correctly predicted and 82% were incorrectly predicted in
 the test data set.
 When you perform {classanalysis} on your own data, it might take multiple
 iterations before you are satisfied with the results and ready to deploy the
 model.
@@ -438,7 +489,7 @@ performed on the training data set.
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
 {
-  "index": "df-flight-delayed",
+  "index": "model-flight-delay-classification",
   "query": {
     "term": {
       "ml.is_training": {
@@ -467,7 +518,7 @@ performed on previously unseen data:
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
 {
-  "index": "df-flight-delayed",
+  "index": "model-flight-delay-classification",
   "query": {
     "term": {
       "ml.is_training": {
@@ -506,11 +557,11 @@ were misclassified (`actual_class` does not match `predicted_class`):
           "predicted_classes" : [
             {
               "predicted_class" : "false", <3>
-              "count" : 1033 <4>
+              "count" : 1544 <4>
             },
             {
               "predicted_class" : "true",
-              "count" : 7769
+              "count" : 7258
             }
           ],
           "other_predicted_class_doc_count" : 0
@@ -521,11 +572,11 @@ were misclassified (`actual_class` does not match `predicted_class`):
           "predicted_classes" : [
             {
               "predicted_class" : "false",
-              "count" : 1893
+              "count" : 2109
             },
             {
               "predicted_class" : "true",
-              "count" : 1059
+              "count" : 843
             }
           ],
           "other_predicted_class_doc_count" : 0
@@ -548,6 +599,7 @@ When you have trained a satisfactory model, you can deploy it to make predictions
 about new data. Those steps are not covered in this example. See <>.
-If you don't want to keep the {dfanalytics-job}, you can delete it by using the
-{ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When you delete
-{dfanalytics-jobs}, the destination indices remain intact.
+If you don't want to keep the {dfanalytics-job}, you can delete it in {kib} or
+by using the {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When
+you delete {dfanalytics-jobs} in {kib}, you have the option to also remove the
+destination indices and index patterns.
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-details.png b/docs/en/stack/ml/df-analytics/images/flights-classification-details.png
index 08cc74bc15..f19e04363e 100644
Binary files a/docs/en/stack/ml/df-analytics/images/flights-classification-details.png and b/docs/en/stack/ml/df-analytics/images/flights-classification-details.png differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.png b/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.png
index d6c8a2a4ef..5fff468e37 100644
Binary files a/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.png and b/docs/en/stack/ml/df-analytics/images/flights-classification-evaluation.png differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-importance.png b/docs/en/stack/ml/df-analytics/images/flights-classification-importance.png
new file mode 100644
index 0000000000..7187dd4176
Binary files /dev/null and b/docs/en/stack/ml/df-analytics/images/flights-classification-importance.png differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-job-3.png b/docs/en/stack/ml/df-analytics/images/flights-classification-job-3.png
new file mode 100644
index 0000000000..6126e9c1f8
Binary files /dev/null and b/docs/en/stack/ml/df-analytics/images/flights-classification-job-3.png differ
diff --git a/docs/en/stack/ml/df-analytics/images/flights-classification-results.png b/docs/en/stack/ml/df-analytics/images/flights-classification-results.png
index 31fe7a8072..d863f4c8ec 100644
Binary files a/docs/en/stack/ml/df-analytics/images/flights-classification-results.png and
b/docs/en/stack/ml/df-analytics/images/flights-classification-results.png differ
diff --git a/docs/en/stack/ml/df-analytics/images/regression-decision-plot.png b/docs/en/stack/ml/df-analytics/images/regression-decision-plot.png
new file mode 100644
index 0000000000..d588a0071b
Binary files /dev/null and b/docs/en/stack/ml/df-analytics/images/regression-decision-plot.png differ
diff --git a/docs/en/stack/ml/df-analytics/ml-feature-importance.asciidoc b/docs/en/stack/ml/df-analytics/ml-feature-importance.asciidoc
index 184b7a8a72..98a98a1e4e 100644
--- a/docs/en/stack/ml/df-analytics/ml-feature-importance.asciidoc
+++ b/docs/en/stack/ml/df-analytics/ml-feature-importance.asciidoc
@@ -2,33 +2,59 @@
 [[ml-feature-importance]]
 = {feat-imp-cap}
+experimental[]
+
 {feat-imp-cap} values indicate which fields had the biggest impact on each
-prediction that is generated by <> or
-<> analysis. The features of the data points are
-responsible for a particular prediction to varying degrees. {feat-imp-cap} shows
-to what degree a given feature of a data point contributes to the prediction.
-The {feat-imp} value can be either positive or negative depending on its effect
-on the prediction. If the feature reduces the prediction value, the {feat-imp}
-is negative, if it increases the prediction, then the {feat-imp} is positive.
-The magnitude of {feat-imp} shows how significantly the feature affects the
-prediction for a given data point.
+prediction that is generated by {classification} or {regression} analysis. Each
+field (or _feature_ of the data point) is responsible for the prediction to
+varying degrees. In {kib}, you can examine the most important features in JSON
+objects or decision plots:
-{feat-imp-cap} in the {stack} is calculated using the SHAP (SHapley Additive
-exPlanations) method as described in
-https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf[Lundberg, S. M., & Lee, S.-I.
A Unified Approach to Interpreting Model Predictions. In NeurIPS 2017].
+[role="screenshot"]
+image::images/regression-decision-plot.png["Feature importance values for a {regression} {dfanalytics-job} in {kib}"]
+
+A {feat-imp} value can be either positive or negative depending on its effect
+on the prediction. The magnitude of the {feat-imp} value shows how significantly
+the feature affects the prediction for a given data point.
+
+For {reganalysis}, each decision plot starts at a shared baseline, which is
+the average of the prediction values for all the data points in the training
+data set. When you add all of the feature importance values for a particular
+data point to that baseline, you arrive at the numeric prediction value. If a
+{feat-imp} value is negative, it reduces the prediction value. If a {feat-imp}
+value is positive, it increases the prediction value.
-By default, {feat-imp} values are not calculated when you configure the job via
-the API. To generate this information, when you create a {dfanalytics-job} you
-must specify the `num_top_feature_importance_values` property. When you
-configure the job in {kib}, {feat-imp} values are calculated automatically. The
-{feat-imp} values are stored in the {ml} results field for each document in the
-destination index.
+////
+For {classanalysis}, the baseline is the average of the probability values for a
+specific class across all the data points in the training data set. When you add
+the feature importance values for a particular data point to that baseline, you
+arrive at the prediction probability for that class. If a {feat-imp} value is
+negative, it reduces the prediction probability. If a {feat-imp} value is
+positive, it increases the prediction probability.
+////
-NOTE: The number of {feat-imp} values for each document might be less than the
-`num_top_feature_importance_values` property value. For example, it returns only
-features that had a positive or negative effect on the prediction.
+By default, {feat-imp} values are not calculated. To generate this information,
+when you create a {dfanalytics-job} you must specify the `num_top_feature_importance_values` property. For examples, see
+<> and <>.
+
+The {feat-imp} values are stored in the {ml} results field for each document in
+the destination index. The number of {feat-imp} values for each document might
+be less than the `num_top_feature_importance_values` property value. For example,
+it returns only features that had a positive or negative effect on the
+prediction.
+
+The purpose of {feat-imp} is to help you determine whether the predictions are
+sensible. Is the relationship between the dependent variable and the important
+features supported by your domain knowledge? The lessons you learn about the
+importance of specific features might also affect your decision to include them
+in future iterations of your trained model.
 [[ml-feature-importance-readings]]
 == Further reading
+{feat-imp-cap} in the {stack} is calculated using the SHAP (SHapley Additive
+exPlanations) method as described in
+https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf[Lundberg, S. M., & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In NeurIPS 2017].
+
+See also https://www.elastic.co/blog/feature-importance-for-data-frame-analytics-with-elastic-machine-learning[{feat-imp-cap} for {dfanalytics} with Elastic {ml}]
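
This patch updates several worked numbers at once: the confusion-matrix counts and percentages, the example class probabilities, and the log-odds discussion for feature importance. As an editorial sanity check (not part of the patch itself; the `logit` helper below is our own, not an Elasticsearch API), the following sketch recomputes the quoted percentages from the raw counts in the new text and the log-odds of the annotated `true`-class probability.

```python
import math

# Counts quoted in the updated example (testing data from the confusion matrix):
tp, fn = 843, 2109    # actual `true` (2952 docs): predicted `true` / `false`
tn, fp = 1544, 7258   # actual `false` (8802 docs): predicted `false` / `true`

# Rounded rates stated in the text: 29% / 71% and 18% / 82%.
print(round(100 * tp / (tp + fn)))  # 29
print(round(100 * fn / (tp + fn)))  # 71
print(round(100 * tn / (tn + fp)))  # 18
print(round(100 * fp / (tn + fp)))  # 82

# The feature importance text notes that the summed importance values for a
# class approximate the log-odds (logit) of its probability.
def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

# Class probability of `true` from the annotated example document:
print(logit(0.6066192937494784))  # log-odds of the `true` class
```

Running the sketch confirms the rounded rates quoted in the revised prose, which is a quick way to catch transposed counts when editing worked examples like this one.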