
SHAP and orderedfeature minimizers #57

Open

Basharza wants to merge 20 commits into main
Conversation

Basharza

An implementation of a SHAP minimizer and the more general class OrderedFeatureMinimizer, which greedily prunes per-feature decision trees based on feature importance values in order to achieve generalizations.
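
A rough sketch of the greedy pruning loop described above. The helper names (order_features, prune_one_level, accuracy_after_pruning) and the target_accuracy stopping rule are illustrative assumptions, not the PR's actual internals:

def greedy_prune(feature_dts, X, y, order_features, prune_one_level,
                 accuracy_after_pruning, target_accuracy):
    """Prune per-feature decision trees, least important feature first,
    while overall accuracy stays at or above target_accuracy."""
    # e.g. ascending feature importance (global SHAP values for ShapMinimizer)
    for feature in order_features(feature_dts, X, y):
        while True:
            pruned = prune_one_level(feature_dts[feature])
            if pruned is None:  # nothing left to prune for this feature
                break
            if accuracy_after_pruning(feature_dts, feature, pruned, X, y) < target_accuracy:
                break  # further pruning costs too much accuracy
            feature_dts[feature] = pruned  # accept the coarser generalization
    return feature_dts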

@@ -13,4 +13,7 @@
It is also possible to export the generalizations as feature ranges.

"""
from apt.minimization.minimizer import GeneralizeToRepresentative
from .minimizer import GeneralizeToRepresentative
Member

Please use absolute imports instead of relative ones.

def importance_ordered_features(self):
return self._importance_ordered_features

def _get_ordered_features(self, estimator, encoder, X_train, y_train, numerical_features, categorical_features,
Member

It looks like the order is only computed when calling an internal method (name begins with _). Please change this to an external method (with documentation), and if it receives X and y data it would be better to call it fit.

Basharza

This method is a hook that is called by the fit method in the superclass (OrderedFeatureMinimizer). It is a protected method that should be overridden to define a custom pruning order used during fit.

As implemented, the order is accessible through a property only after fit() is called; otherwise the property returns None. The property has a different name in each class (here it is importance_ordered_features). I could add it to OrderedFeatureMinimizer as a generic property called features_order if you believe that is more suitable.

Member

Ok, I understand. I suggest using a consistent name for the property. Also, at the moment the fit code does not set self._ordered_features, which I think it should (instead of setting a local variable).
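
For reference, a minimal sketch of the suggested pattern (method bodies are illustrative, not the PR's exact code):

class OrderedFeatureMinimizer:
    def __init__(self):
        self._ordered_features = None

    @property
    def ordered_features(self):
        """Pruning order computed by fit(); None before fit() is called."""
        return self._ordered_features

    def fit(self, X, y):
        # store the order on self rather than in a local variable
        self._ordered_features = self._get_ordered_features(X, y)
        # ... pruning then proceeds over self._ordered_features ...
        return self

    def _get_ordered_features(self, X, y):
        """Hook: subclasses override this to define a custom pruning order."""
        raise NotImplementedError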

@@ -66,7 +66,7 @@ def __init__(self, estimator: Union[BaseEstimator, Model] = None, target_accurac
encoder: Optional[Union[OrdinalEncoder, OneHotEncoder]] = None,
features_to_minimize: Optional[Union[np.ndarray, list]] = None,
train_only_features_to_minimize: Optional[bool] = True,
is_regression: Optional[bool] = False):
is_regression: Optional[bool] = False, accuracy_score: Callable = None):
Member

Please add this new parameter to the class docstring, stating clearly what is expected from this callable (inputs and outputs).
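
For example, the docstring entry might read as follows; the (y_true, y_pred) -> float contract is an assumption based on scikit-learn's metric conventions, not taken from the PR:

:param accuracy_score: Callable used to score the model during pruning.
    Expected to accept two array-like arguments, (y_true, y_pred), and to
    return a single float where higher means better. If None, a default
    metric such as sklearn.metrics.accuracy_score could be used.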



# TODO: use mixins correctly
class OrderedFeatureMinimizer: # (BaseEstimator, MetaEstimatorMixin, TransformerMixin):
Member

Please add a docstring for the class and for all external (public) methods.

X[:, indices] = cls._transform_categorical_feature(dts[feature].tree_, X[:, indices],
generalizations_arrays[feature], depths[feature], 0)

@classmethod
Member

I think we agreed there is no need for class methods. These can be changed to static methods or moved out of the class into a utils file.

return 1 + max(cls._calculate_tree_depth(dt, dt.tree_.children_left[node_id]),
cls._calculate_tree_depth(dt, dt.tree_.children_right[node_id]))
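
For example, _calculate_tree_depth only needs the fitted tree, so it could become a plain module-level helper in a utils file. A sketch mirroring the recursion in the excerpt above; the -1 leaf check follows sklearn's TREE_LEAF convention:

def calculate_tree_depth(dt, node_id=0):
    """Depth of the subtree rooted at node_id in a fitted sklearn tree."""
    tree = dt.tree_
    if tree.children_left[node_id] == -1:  # leaf node (sklearn's TREE_LEAF)
        return 0
    return 1 + max(calculate_tree_depth(dt, tree.children_left[node_id]),
                   calculate_tree_depth(dt, tree.children_right[node_id]))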

def fit(self, X: pd.DataFrame, y=None):
Member

Please add a docstring, and state the assumptions about the data (order of features, etc.). Also, is it necessary that the input be a DataFrame? Could we use the generic data wrappers (e.g., ArrayDataset) instead?

Member

Please add a type annotation for y.
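
A sketch of what the documented signature could look like. Accepting ArrayDataset is the suggestion above, not something the PR currently does, and the import path apt.utils.datasets is an assumption:

from typing import Optional, Union

import pandas as pd

from apt.utils.datasets import ArrayDataset  # assumed import path

def fit(self, X: Union[pd.DataFrame, ArrayDataset],
        y: Optional[pd.Series] = None):
    """Fit the minimizer on training data.

    :param X: Training data. Feature columns are assumed to appear in the
        same order as when the wrapped estimator was trained.
    :param y: True labels for X, aligned with its rows; may be omitted if
        the estimator's own predictions are used as targets.
    :return: self
    """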

from sklearn.model_selection import train_test_split


class ShapMinimizer(OrderedFeatureMinimizer):
Member

Docstrings are missing here as well.

import pandas as pd
import numpy as np
from typing import List, Dict, Union
from . import OrderedFeatureMinimizer
Member

Please use an absolute import here as well.

background_size, feature_indices: Dict[str, List[int]], random_state, nsamples):
"""
Calculates global shap per feature.
:param nsamples:
Member

Missing param descriptions
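
Something along these lines could work; the meanings below are inferred from the parameter names and SHAP's KernelExplainer conventions, so please correct them where they do not match the implementation:

:param background_size: Number of background samples (or k-means centroids)
    used to initialize the explainer. (assumed meaning)
:param feature_indices: Maps each feature name to the list of encoded column
    indices it occupies after encoding.
:param random_state: Seed used when sampling/summarizing the background data.
:param nsamples: Number of model re-evaluations per explained sample,
    forwarded to the SHAP explainer. (assumed meaning)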

@@ -0,0 +1,105 @@
import numpy as np
Member

Please remove this file (which includes the Benchmark class). Instead, add some unit tests to the tests directory with appropriate asserts (see the examples in test_minimizer.py).
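
A minimal example of the kind of test that could replace the benchmark; the ShapMinimizer import path, constructor arguments, and the 0.9 threshold are placeholders to be adapted to the fixtures in test_minimizer.py:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from apt.minimization import ShapMinimizer  # assumed import path

def test_shap_minimizer_keeps_target_accuracy():
    data = load_iris()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    X_train, X_test, y_train, y_test = train_test_split(
        X, data.target, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

    minimizer = ShapMinimizer(estimator=model, target_accuracy=0.9)
    minimizer.fit(X_train, y_train)
    X_generalized = minimizer.transform(X_test)

    # the generalized data must still meet the accuracy target
    acc = accuracy_score(y_test, model.predict(X_generalized))
    assert acc >= 0.9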

from sklearn.tree._tree import Tree


# TODO: use mixins correctly
Member

Please remove TODOs and unused params (unless you still plan to add this).

Basharza and others added 19 commits January 2, 2023 13:40
…ed assumptions.

* Added general important documentation (not everything is documented).

* Explainer is commented out. Instead there is a placeholder so we could continue implementation.
* Added implementations
* Added splits in fit method
…ethodology, and a little bug in categorical feature transformation.
Implemented generalizations partially. (some helpers). - not tested
Models now depend on random state to create reproducible results.
* Cleaned pruning code and moved to a class method.
* Pruning now only calls transform once per iteration. It saves data from previous iteration to reverse a generalization.
* transform() now depends on generalizations.
* Run experiment on np.arange(0, .8, 0.1). The original minimizer seems not to work past 0.8 target accuracy (or takes too long).
* Fixed shapminimizer to only take true label values into account
* Added MLP to comparison.ipynb
* level can reach 0 in pruning
* discard category is out of the stopping condition
* discard category is given by split not by majority
* Implemented DT Importance minimizer
* Minimizers now support different accuracy metrics (supplied at init)
* SHAP minimizer now makes use of kmeans instead of random samples.