[DOCS] Add feature importance to classification example
lcawl committed Sep 30, 2020
1 parent 715c3ee commit cac81f1
Showing 4 changed files with 35 additions and 6 deletions.
2 changes: 1 addition & 1 deletion docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
@@ -196,4 +196,4 @@ testing. This split of the data set is the _testing data set_. Once the model has
been trained, you can let the model predict the value of the data points it has
never seen before and compare the prediction to the actual value by using the
evaluate {dfanalytics} API.
////
////
28 changes: 25 additions & 3 deletions docs/en/stack/ml/df-analytics/flightdata-classification.asciidoc
@@ -123,7 +123,7 @@ large data sets using a small training sample greatly reduces runtime without
impacting accuracy.
.. If you want to experiment with <<ml-feature-importance,{feat-imp}>>, specify
a value in the advanced configuration options. In this example, we choose to
return a maximum of 10 feature importance values per document. This option
return a maximum of 10 {feat-imp} values per document. This option
affects the speed of the analysis, so by default it is disabled.
.. Use the default memory limit for the job. If the job requires more than this
amount of memory, it fails to start. If the available memory on the node is
@@ -170,7 +170,7 @@ PUT _ml/data_frame/analytics/model-flight-delay-classification
--------------------------------------------------
// TEST[skip:setup kibana sample data]
<1> The field name in the `dest` index that contains the analysis results.
<2> To disable feature importance calculations, omit this option.
<2> To disable {feat-imp} calculations, omit this option.
====
--
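
For orientation, the following is a minimal sketch of the `analysis` section that such a request could contain. The destination index name, training percentage, and dependent variable shown here are assumptions based on the {kib} flight sample data rather than values quoted from this file; only `num_top_feature_importance_values` corresponds to the option discussed above.

[source,console]
--------------------------------------------------
PUT _ml/data_frame/analytics/model-flight-delay-classification
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "df-flight-delayed"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "FlightDelay",
      "training_percent": 10,
      "num_top_feature_importance_values": 10
    }
  }
}
--------------------------------------------------
// TEST[skip:setup kibana sample data]

Omitting `num_top_feature_importance_values` leaves {feat-imp} calculations disabled, as noted in callout <2>.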

@@ -331,7 +331,7 @@ can examine its probability and score (`ml.prediction_probability` and
model is that the data point belongs to the named class. If you examine the
destination index more closely in the *Discover* app in {kib} or use the
standard {es} search command, you can see that the analysis predicts the
probability of all possible classes for the dependent variable. The
probability of all possible classes for the dependent variable. The
`top_classes` object contains the predicted classes with the highest scores.
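
As a rough sketch only, a search along the following lines would surface those fields for an individual document; the destination index name `df-flight-delayed` is an assumption, while `ml.prediction_probability` and `ml.top_classes` follow the field names discussed above.

[source,console]
--------------------------------------------------
GET df-flight-delayed/_search
{
  "size": 1,
  "_source": [
    "ml.prediction_probability",
    "ml.top_classes"
  ]
}
--------------------------------------------------
// TEST[skip:setup kibana sample data]

Each hit then shows the predicted probability for the winning class alongside the `top_classes` object, which lists the classes with the highest scores for that document.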

.API example
@@ -417,6 +417,28 @@ summarized information in {kib}:
[role="screenshot"]
image::images/flights-classification-total-importance.png["Total {feat-imp} values in {kib}"]

You can also see the {feat-imp} values for each individual prediction in the
form of a decision plot:

[role="screenshot"]
image::images/flights-classification-importance.png["A decision plot for {feat-imp} values in {kib}"]
////
The sum of the {feat-imp} values for a class (in this example, `false`)
in this data point approximates the logarithm of its odds
(or {wikipedia}/Logit[log-odds]).
While the probability of a class ranges between 0 and 1, its log-odds range
between negative and positive infinity. In {kib}, the decision path for each
class starts at the average probability for that class over the training data
set. From there, the {feat-imp} values are added to the decision path.
The features with the most significant positive or negative impact appear at the
top. Thus in this example, the features related to flight time and distance had
the most significant influence on this prediction. This type of information can
help you to understand how models arrive at their predictions. It can also
indicate which aspects of your data set are most influential or least useful
when you are training and tuning your model.
////
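
As a rough worked illustration of that relationship (the numbers are invented): if the {feat-imp} values of the `false` class for a given data point sum to 1.1, the log-odds of that class is approximately 1.1, which corresponds to a probability of 1 / (1 + e^-1.1), or roughly 0.75; a sum of -1.1 would correspond to a probability of about 0.25.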

This type of information can help you to understand how models arrive at their
predictions. It can also indicate which aspects of your data set are most
influential or least useful when you are training and tuning your model.
Binary image file: not displayed.
11 changes: 9 additions & 2 deletions docs/en/stack/ml/df-analytics/ml-feature-importance.asciidoc
@@ -44,7 +44,14 @@ data point to that baseline, you arrive at the numeric prediction value. If a
{feat-imp} value is negative, it reduces the prediction value. If a {feat-imp}
value is positive, it increases the prediction value.

//TBD: Add section about classification analysis.
////
For {classanalysis}, the baseline is the average of the probability values for a
specific class across all the data points in the training data set. When you add
the feature importance values for a particular data point to that baseline, you
arrive at the prediction probability for that class. If a {feat-imp} value is
negative, it reduces the prediction probability. If a {feat-imp} value is
positive, it increases the prediction probability.
////
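
As a small worked illustration of the numeric case described above (the values are invented): if the baseline, that is, the average prediction over the training data set, is 180 and the {feat-imp} values of a data point are +30 and -5, the prediction for that data point is approximately 180 + 30 - 5 = 205.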

By default, {feat-imp} values are not calculated. To generate this information,
when you create a {dfanalytics-job} you must specify the
@@ -65,4 +72,4 @@ exPlanations) method as described in
https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf[Lundberg, S. M., & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In NeurIPS 2017].

See also
https://www.elastic.co/blog/feature-importance-for-data-frame-analytics-with-elastic-machine-learning[{feat-imp-cap} for {dfanalytics} with Elastic {ml}].
https://www.elastic.co/blog/feature-importance-for-data-frame-analytics-with-elastic-machine-learning[{feat-imp-cap} for {dfanalytics} with Elastic {ml}].
