
[DOCS] Add feature importance to classification example #1359

Merged (4 commits) on Oct 27, 2020

Conversation

lcawl (Contributor) commented Sep 10, 2020

Related to elastic/kibana#73561

This PR drafts changes to the classification example so that it includes feature importance explanations.

It will be backported to 7.10 and does not take into account the changes in 7.11 and later for elastic/kibana#77874.

Preview

https://stack-docs_1359.docs-preview.app.elstc.co/guide/en/machine-learning/master/flightdata-classification.html

@lcawl changed the title from "[DOCS] Add feature importance examples" to "[DOCS] Add feature importance to classification example" on Sep 14, 2020
in your destination index. See the
{ml-docs}/flightdata-classification.html#flightdata-classification-results[Viewing {classification} results]
section in the {classification} example.
in your destination index.

[[dfa-classification-class-score]]
=== `class_score`

The value of `class_score` controls the probability at which a class label is
Contributor

class_score is definitely not a probability: if I choose k very, very small, class_score may be arbitrarily large, while a probability is always between 0 and 1. It's better to call it a "likelihood". Also, it doesn't "control" the probability but simply "shows" it. It is controlled by the threshold k, which we estimate automagically based on the class_assignment_objective configuration.

values. A higher number means that the model is more confident.
If you want to understand how certain the model is about each prediction, you
can examine its probability and score (`ml.prediction_probability` and
`ml.prediction_score`). These values range between 0 and 1; the higher the
Contributor

Strictly speaking, class_score can be larger than 1 in some degenerate cases, so it's defined as greater than or equal to 0.

Comment on lines 426 to 427
//Does this mean the sum of the feature importance values for false in this
example should equal the logit(p), where p is the class_probability for false?
Contributor

This is correct up to a constant. There is also a data-point-independent constant -- the average log-odds over all training points -- which we add to the sum of the feature importances before taking the inverse logit to compute the probabilities.
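A minimal Python sketch of that relationship; the names `baseline_log_odds` and `feature_importances` are illustrative only and do not reflect the actual implementation:

```python
import math

def class_probability(baseline_log_odds, feature_importances):
    """Recover a class probability from feature importance values.

    baseline_log_odds: the data-point-independent average log-odds over all
    training points. feature_importances: the per-feature importance values
    reported for a single prediction. Both names are hypothetical.
    """
    log_odds = baseline_log_odds + sum(feature_importances)
    return 1.0 / (1.0 + math.exp(-log_odds))  # inverse logit (sigmoid)
```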

any class.
//Does this mean the sum of the feature importance values for false in this
example should equal the logit(p), where p is the class_probability for false?
//Does this imply that the feature importance value itself is the result of a logit function? Or that we use the function to merely represent the distribution of feature importance values?
Contributor

What happens is that the decision forest predicts the log-odds directly, and then we compute feature importance on the log-odds values. When we evaluate a data point, we take the log-odds predicted by the decision forest and apply the inverse of the logit function to get the class probability.
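In other words, the reported probability is the inverse logit (sigmoid) of the predicted log-odds. A small sketch of the two functions, assuming nothing about the actual code:

```python
import math

def logit(p):
    """Map a probability in (0, 1) to log-odds in (-inf, +inf)."""
    return math.log(p / (1.0 - p))

def inverse_logit(log_odds):
    """Map log-odds back to a probability (the sigmoid function)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# The two functions are inverses of each other, so a round trip is lossless.
assert abs(inverse_logit(logit(0.84)) - 0.84) < 1e-12
```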

====
While the probability of a class ranges between 0 and 1, its log-odds range
between negative and positive infinity. In {kib}, the decision path for each
class starts near zero, which represents a class probability of 0.5.
Contributor

This is unfortunately a bit more complicated: 0 represents the class probability of a constant baseline. It relates to the average class probability for the selected class (in the Kibana UI) over the entire training set.

If you select Canceled as the target variable in the flight data, this nuance becomes obvious. Since there are many more data points with Canceled = False, let's assume that the average class probability over the entire training set is something like 0.92. This means that if the class probability of a data point is larger than 0.92 (for example, 0.98), then the decision path goes to the right (the sum of the feature importances is positive). On the other hand, if the class probability is smaller than 0.92 (for example, 0.84), then the decision path goes to the left (the sum of the feature importances is negative).
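To make those numbers concrete (0.92, 0.98, and 0.84 are the illustrative values from above, not measurements from the flight data set):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

baseline = logit(0.92)            # ~ 2.44, the assumed training-set average
right = logit(0.98) - baseline    # ~ +1.45: importances sum to a positive value
left = logit(0.84) - baseline     # ~ -0.78: importances sum to a negative value
print(round(baseline, 2), round(right, 2), round(left, 2))  # 2.44 1.45 -0.78
```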

While the probability of a class ranges between 0 and 1, its log-odds range
between negative and positive infinity. In {kib}, the decision path for each
class starts near zero, which represents a class probability of 0.5.
// Is this true for multi-class classification or just binary classification?
Contributor

Yes, it's true for both multi-class and binary classification, since in the UI you select a class of interest from a drop-down menu.

@lcawl force-pushed the feature-importance branch 2 times, most recently from 5d8027e to 691f8df on September 29, 2020 at 16:19