[DOCS] Add feature importance to classification example
lcawl committed Sep 23, 2020
1 parent ca66ab0 commit 5d8027e
Showing 9 changed files with 161 additions and 83 deletions.
2 changes: 1 addition & 1 deletion docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
@@ -185,4 +185,4 @@
testing. This split of the data set is the _testing data set_. Once the model has
been trained, you can let the model predict the value of the data points it has
never seen before and compare the prediction to the actual value by using the
evaluate {dfanalytics} API.
////
174 changes: 113 additions & 61 deletions docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
@@ -104,6 +104,9 @@
image::images/flights-classification-job-1.png["Creating a {dfanalytics-job} in {kib}"]
[role="screenshot"]
image::images/flights-classification-job-2.png["Creating a {dfanalytics-job} in {kib} – continued"]

[role="screenshot"]
image::images/flights-classification-job-3.png["Creating a {dfanalytics-job} in {kib} – advanced options"]

.. Choose `kibana_sample_data_flights` as the source index.
.. Choose `classification` as the job type.
.. Choose `FlightDelay` as the dependent variable, which is the field that we
@@ -116,15 +119,18 @@
recommended to exclude fields that either contain erroneous data or describe the
source data for training. While that value is low for this example, for many
large data sets using a small training sample greatly reduces runtime without
impacting accuracy.
.. If you want to experiment with <<ml-feature-importance,feature importance>>,
specify a value in the advanced configuration options. In this example, we
choose to return a maximum of 10 feature importance values per document. This
option affects the speed of the analysis, so by default it is disabled.
.. Use the default memory limit for the job. If the job requires more than this
amount of memory, it fails to start. If the available memory on the node is
limited, this setting makes it possible to prevent job execution.
.. Add a job ID and optionally a job description.
.. Add the name of the destination index that will contain the results of the
analysis. In {kib}, the index name matches the job ID by default. It will
contain a copy of the source index data where each document is annotated with
the results. If the index does not exist, it will be created automatically.

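Conceptually, the training percentage is a per-document random draw. The sketch below illustrates the idea only; `mark_training` is a hypothetical helper for this example, not part of any Elastic API, and the actual {ml} implementation may differ:

```python
import random

def mark_training(docs, training_percent=10, seed=42):
    """Randomly flag roughly training_percent% of documents for
    training, mimicking the per-document `is_training` annotation."""
    rng = random.Random(seed)
    return [
        {**doc, "is_training": rng.random() < training_percent / 100}
        for doc in docs
    ]

docs = mark_training([{"id": i} for i in range(10_000)])
training_fraction = sum(d["is_training"] for d in docs) / len(docs)
```

With a large source index, even a 10% sample leaves thousands of training documents, which is why a small training percentage can cut runtime without hurting accuracy.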

.API example
@@ -140,13 +146,15 @@
PUT _ml/data_frame/analytics/model-flight-delay-classification
]
},
"dest": {
"index": "model-flight-delay-classification",
"results_field": "ml" <1>
},
"analysis": {
"classification": {
"dependent_variable": "FlightDelay",
"training_percent": 10,
"num_top_classes": 10,
"num_top_feature_importance_values": 10 <2>
}
},
"analyzed_fields": {
@@ -160,7 +168,8 @@
}
--------------------------------------------------
// TEST[skip:setup kibana sample data]
<1> The field name in the `dest` index that contains the analysis results.
<2> To disable feature importance calculations, omit this option.
====
--

@@ -259,32 +268,31 @@
The API call returns the following response:
},
"analysis_stats" : {
"classification_stats" : {
"timestamp" : 1599684771114,
"iteration" : 18,
"hyperparameters" : {
"class_assignment_objective" : "maximize_minimum_recall",
"alpha" : 6.648298686326093,
"downsample_factor" : 0.7435400845721971,
"eta" : 0.039957516522980074,
"eta_growth_rate_per_tree" : 1.0168333294220058,
"feature_bag_fraction" : 0.49761652263010625,
"gamma" : 0.21224183609258152,
"lambda" : 0.2572621613644672,
"max_attempts_to_add_tree" : 3,
"max_optimization_rounds_per_hyperparameter" : 2,
"max_trees" : 590,
"num_folds" : 5,
"num_splits_per_feature" : 75,
"soft_tree_depth_limit" : 3.2719032647442443,
"soft_tree_depth_tolerance" : 0.14970565884872958
},
"timing_stats" : {
"elapsed_time" : 37915,
"iteration_time" : 2552
},
"validation_loss" : {
"loss_type" : "binomial_logistic"
}
}
}
@@ -322,15 +330,15 @@
can examine its probability and score (`ml.prediction_probability` and
`ml.prediction_score`). The higher these values, the more confident the
model is that the data point belongs to the named class. If you examine the
destination index more closely in the *Discover* app in {kib} or use the
standard {es} search command, you can see that the analysis predicts the
probability of all possible classes for the dependent variable. The
`top_classes` object contains the predicted classes with the highest scores.

.API example
[%collapsible]
====
[source,console]
--------------------------------------------------
GET model-flight-delay-classification/_search
--------------------------------------------------
// TEST[skip:TBD]
@@ -342,48 +350,91 @@
The snippet below shows a part of a document with the annotated results:
"FlightDelay" : false,
...
"ml" : {
"top_classes" : [ <1>
{
"class_name" : false,
"class_probability" : 0.3933807062505216,
"class_score" : 0.3933807062505216
},
{
"class_name" : true,
"class_probability" : 0.6066192937494784,
"class_score" : 0.22857258275913037
}
],
"FlightDelay_prediction" : false,
"prediction_probability" : 0.3933807062505216,
"prediction_score" : 0.3933807062505216,
"feature_importance" : [
{
"feature_name" : "FlightTimeMin",
"importance" : -2.823868829093038,
"classes" : [
{
"class_name" : false,
"importance" : -2.823868829093038
},
{
"class_name" : true,
"importance" : 2.823868829093038
}
]
},
{
"feature_name" : "DistanceMiles",
"importance" : 0.9872151818111125,
"classes" : [
{
"class_name" : false,
"importance" : 0.9872151818111125
},
{
"class_name" : true,
"importance" : -0.9872151818111125
}
]
},
...
],
"is_training" : false
}
----
<1> An array of values specifying the probability of the prediction and the
score for each class.
The class with the highest score is the prediction. In this example, `false` has
a `class_score` of 0.39 while `true` has only 0.22, so the prediction will be
`false`. For more details about these values, see
<<dfa-classification-interpret>>.
====
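The highest-score rule can be sketched in a few lines of Python. The values are copied from the example document above; this is an illustration of how to read the results, not client code:

```python
# Per-class results as returned in the `top_classes` object.
top_classes = [
    {"class_name": False, "class_probability": 0.3933807062505216,
     "class_score": 0.3933807062505216},
    {"class_name": True, "class_probability": 0.6066192937494784,
     "class_score": 0.22857258275913037},
]

# The prediction is the class with the highest score, not the highest
# probability: here `true` is more probable, yet `false` wins on score.
prediction = max(top_classes, key=lambda c: c["class_score"])["class_name"]

# The class probabilities still sum to 1.
total_probability = sum(c["class_probability"] for c in top_classes)
```

The gap between probability and score reflects the `maximize_minimum_recall` class assignment objective shown in the hyperparameters above.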

If you chose to calculate feature importance, the destination index also
contains `ml.feature_importance` objects. Every field that is included in the
{classanalysis} (known as a _feature_ of the data point) is assigned a feature
importance value. However, only the most significant values (in this case, the
top 10) are stored in the index. These values indicate which features had the
biggest impact (positive or negative) on each prediction. In {kib}, you can see
this information displayed in the form of a decision plot.
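Keeping only the most significant values amounts to ranking features by the magnitude of their importance, positive or negative. The helper below is hypothetical and the sample importance for `OriginWeather` is invented for illustration; only the two other values come from the example document:

```python
def top_importance(feature_importance, n=10):
    """Return the n features whose importance has the largest
    absolute value, whether positive or negative."""
    return sorted(
        feature_importance,
        key=lambda f: abs(f["importance"]),
        reverse=True,
    )[:n]

features = [
    {"feature_name": "FlightTimeMin", "importance": -2.823868829093038},
    {"feature_name": "DistanceMiles", "importance": 0.9872151818111125},
    {"feature_name": "OriginWeather", "importance": 0.1},  # invented value
]
top_two = top_importance(features, n=2)
```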

[role="screenshot"]
image::images/flights-classification-importance.png["A decision plot for feature importance values in {kib}"]

The sum of the feature importance values for a class (in this example, `false`)
in this data point approximates the logarithm of its odds
(or {wikipedia}/Logit[log-odds]).
While the probability of a class ranges between 0 and 1, its log-odds range
between negative and positive infinity. In {kib}, the decision path for each
class starts at the average probability for that class over the training data
set. From there, the feature importance values are added to the decision path.
The features with the most significant positive or negative impact appear at the
top. Thus in this example, the features related to flight time and distance had
the most significant influence on this prediction. This type of information can
help you to understand how models arrive at their predictions. It can also
indicate which aspects of your data set are most influential or least useful
when you are training and tuning your model.
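The log-odds relationship can be checked numerically. The sketch below assumes a baseline log-odds of 0 purely for illustration (the real baseline is the average for the class over the training data) and reuses the two importance values for the `false` class from the example document:

```python
import math

def class_probability(baseline_log_odds, importances):
    """Add feature importance values to a baseline log-odds and map
    the result back to a probability with the logistic function."""
    log_odds = baseline_log_odds + sum(importances)
    return 1.0 / (1.0 + math.exp(-log_odds))

# With no feature contributions, the probability equals the baseline.
p_baseline = class_probability(0.0, [])

# The net negative contribution pushes the probability of `false` down.
p_false = class_probability(0.0, [-2.823868829093038, 0.9872151818111125])
```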

[[flightdata-classification-evaluate]]
== Evaluating {classification} results
@@ -408,18 +459,18 @@
If you want to see the exact number of occurrences, select a quadrant in the
matrix. You can optionally filter the table to contain only testing data so you
can see how well the model performs on previously unseen data. In this example,
there are 2952 documents in the testing data that have the `true` class. 2109 of
them are predicted as `false`; this is called a _false negative_. 843 are
predicted correctly as `true`; this is called a _true positive_. The confusion
matrix therefore shows us that 29% of the actual `true` values were correctly
predicted and 71% were incorrectly predicted in the test data set.

Likewise if you select other quadrants in the matrix, it shows the number of
documents that have the `false` class as their actual value in the testing data
set. In this example, the model labeled 1544 documents out of 8802 correctly as
`false`; this is called a _true negative_. 7258 documents are predicted
incorrectly as `true`; this is called a _false positive_. Thus 18% of the actual
`false` values were correctly predicted and 82% were incorrectly predicted in
the test data set. When you perform {classanalysis} on your own data, it might
take multiple iterations before you are satisfied with the results and ready to
deploy the model.
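The quadrant percentages above follow from simple ratios over each actual class. A quick check in Python, using the counts from this example:

```python
# Counts from the confusion matrix in this example.
true_positives = 843    # actual `true`, predicted `true`
false_negatives = 2109  # actual `true`, predicted `false`
true_negatives = 1544   # actual `false`, predicted `false`
false_positives = 7258  # actual `false`, predicted `true`

# Per-class recall: the fraction of each actual class predicted correctly.
true_recall = true_positives / (true_positives + false_negatives)
false_recall = true_negatives / (true_negatives + false_positives)

print(f"{true_recall:.0%} of actual `true` values predicted correctly")
print(f"{false_recall:.0%} of actual `false` values predicted correctly")
```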
@@ -438,7 +489,7 @@
performed on the training data set.
--------------------------------------------------
POST _ml/data_frame/_evaluate
{
"index": "model-flight-delay-classification",
"query": {
"term": {
"ml.is_training": {
@@ -467,7 +518,7 @@
performed on previously unseen data:
--------------------------------------------------
POST _ml/data_frame/_evaluate
{
"index": "model-flight-delay-classification",
"query": {
"term": {
"ml.is_training": {
@@ -506,11 +557,11 @@
were misclassified (`actual_class` does not match `predicted_class`):
"predicted_classes" : [
{
"predicted_class" : "false", <3>
"count" : 1544 <4>
},
{
"predicted_class" : "true",
"count" : 7258
}
],
"other_predicted_class_doc_count" : 0
@@ -521,11 +572,11 @@
"predicted_classes" : [
{
"predicted_class" : "false",
"count" : 2109
},
{
"predicted_class" : "true",
"count" : 843
}
],
"other_predicted_class_doc_count" : 0
@@ -548,6 +599,7 @@
When you have trained a satisfactory model, you can deploy it to make predictions
about new data. Those steps are not covered in this example. See
<<ml-inference>>.

If you don't want to keep the {dfanalytics-job}, you can delete it in {kib} or
by using the {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When
you delete {dfanalytics-jobs} in {kib}, you have the option to also remove the
destination indices and index patterns.
(The remaining changed files are images and are not displayed.)