Contributing new estimator
REP estimators (classifiers and regressors) are almost scikit-learn compatible, which means that they can be used as ordinary scikit-learn estimators, subject to the restrictions listed below.
Read "API of sklearn" and "Rolling your own estimator" from the sklearn documentation before proceeding.
- estimator.__init__(features=None, <other parameters>)
  Initialization generally works the same way as in sklearn, but there is one mandatory argument, features: a list of strings naming the features to be used in training. The default value is None, in which case all of the training features are used.
  - All arguments of __init__ should have default values. Do not use mutable defaults (no lists or user-defined types), otherwise the same parameter value will be shared between all instances of the classifier.
  - An estimator with default parameters should work and should provide moderate quality.
  - Parameter validation should be done inside the fit method; the constructor just saves everything to attributes.
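The warning about mutable defaults can be illustrated with a minimal sketch. Both classes below are hypothetical and exist only for this demonstration; they are not part of REP.

```python
class SharedDefault:
    """Anti-pattern: the default list is created once, when the function is defined."""

    def __init__(self, layers=[10]):
        self.layers = layers


class SafeDefault:
    """An immutable default (a tuple) cannot be modified through any instance."""

    def __init__(self, layers=(10,)):
        self.layers = layers


a, b = SharedDefault(), SharedDefault()
a.layers.append(5)  # modifies the single default list shared by every instance

c, d = SafeDefault(), SafeDefault()
```

After this runs, b.layers also contains the appended value although b was never touched, while d.layers still holds the declared default.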
- estimator.fit(X, y, <optional arguments>) or estimator.fit(X, y, sample_weight=None, <optional arguments>)
  - If the classifier supports sample weights, they are passed as the sample_weight argument.
  - The function should return the estimator itself.
  - NB: X can be a pandas.DataFrame; self.features are the names of the features used from the DataFrame. If a numpy.array is passed, it should be treated as a pandas.DataFrame with default column names: Feature_0, Feature_1, etc. This provides compatibility with sklearn.
  - If self.features was None, the names of all columns in the DataFrame are saved to self.features.
  - NB: For classification, y is array-like with integers from [0, 1, ..., n_classes-1], while sklearn supports any labels. This restriction simplifies usage of predict_proba and preparing reports.
  - Fitting again erases the results of previous training (unless the user explicitly requested otherwise).
- pickle: a classifier is assumed to be picklable at any moment (before or after training). This is the default mechanism to save/load a classifier or to transfer it from one process/host to another.
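The fit rules above can be sketched with a dependency-free toy estimator. ConstantClassifier and its class_weights_ attribute are hypothetical, and a list of rows stands in for a numpy.array without column names; a real REP estimator would rely on the library's own helpers instead.

```python
class ConstantClassifier:
    """Hypothetical toy estimator illustrating the fit contract only."""

    def __init__(self, features=None):
        self.features = features

    def fit(self, X, y, sample_weight=None):
        # No column names are available, so default names are generated
        if self.features is None:
            self.features = ['Feature_{}'.format(i) for i in range(len(X[0]))]
        # Labels must be exactly 0, 1, ..., n_classes - 1
        classes = sorted(set(y))
        if classes != list(range(len(classes))):
            raise ValueError('labels must be 0..n_classes-1, got {}'.format(classes))
        # Build a new list instead of modifying the caller's arguments
        if sample_weight is None:
            sample_weight = [1.0] * len(y)
        # State is rebuilt from scratch, so a second fit erases the first
        totals = [0.0] * len(classes)
        for label, weight in zip(y, sample_weight):
            totals[label] += weight
        self.classes_ = classes
        self.class_weights_ = totals
        return self


clf = ConstantClassifier()
returned = clf.fit([[0.1, 2.0], [0.3, 1.0], [0.2, 0.5]], [0, 1, 1])
```

Returning self is what makes chained calls such as ConstantClassifier().fit(X, y).classes_ possible.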
- estimator.predict(X), estimator.staged_predict(X), estimator.predict_proba(X) and estimator.staged_predict_proba(X) work in the same way as in sklearn (and should return numpy.arrays), but X may have a different number or order of columns (though all variables listed in features should be present).
  Arguments passed to the classifier (X, y, sample_weight) in any of the methods should never be changed by the classifier. Create copies if needed.
- estimator.get_params / estimator.set_params follow sklearn's interface completely. Make sure that after cloning there are no common objects shared between the original and the clone (so no mutable objects are transferred directly).
  - set_params should be able to set any parameter named in the constructor.
  - If there is a complex parameter whose parts it is reasonable to change independently, this should be possible. For instance, if a neural network has a layers argument, both of the following calls should work, since it may be useful to modify layers independently:
    network.set_params(layers=[5, 7, 2])
    network.set_params(layers__0=3)
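One possible way to support names such as layers__0 is sketched below. The Network class and its parsing logic are assumptions made for illustration and do not reproduce REP's or sklearn's actual implementation.

```python
class Network:
    """Hypothetical estimator; only parameter handling is sketched."""

    def __init__(self, layers=(10,), learning_rate=0.1):
        self.layers = layers
        self.learning_rate = learning_rate

    def set_params(self, **params):
        for name, value in params.items():
            if name.startswith('layers__'):
                # 'layers__0' addresses a single element of the layers parameter
                index = int(name[len('layers__'):])
                layers = list(self.layers)  # copy, so no shared list is mutated
                layers[index] = value
                self.layers = layers
            elif name in ('layers', 'learning_rate'):
                setattr(self, name, value)
            else:
                # do as expected or fail: unknown names are rejected
                raise ValueError('Unknown parameter: {}'.format(name))
        return self


user_layers = [5, 7, 2]
network = Network().set_params(layers=user_layers)
network.set_params(layers__0=3)
```

Copying before the element update keeps the list supplied by the user untouched, in line with the rule that arguments are never changed by the estimator.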
- self.classes_ for classifiers has the same meaning as for sklearn classifiers (so it will always be equal to numpy.arange(n_classes)).
- self.feature_importances_ (if it is implemented) is expected to be a numpy.array with the importances of the used features (so its length and order of components correspond to self.features).
- self.get_feature_importances() (if it is implemented) should return a pandas.DataFrame with index=self.features. Usually the DataFrame has only one column, named effect, but if the classifier supports several ways of computing importances, the DataFrame may contain several columns.
Let's implement the simplest classifier, which predicts the same probabilities for all events (equal to the class proportions observed in the training dataset):
from rep.estimators.interface import Classifier
from rep.estimators.utils import check_inputs
import numpy


# we derive from `rep.estimators.Classifier`
# Classifier is derived from sklearn.BaseEstimator and sklearn.ClassifierMixin,
# so we meet expectations of sklearn.
class BasicClassifier(Classifier):
    """
    This dummy classifier returns the same probabilities for all events

    Parameters:
    -----------
    :param features: features used in training
    :type features: list[str] or None
    :param regularization: regularization, added to number of observed events in each class.
    :type regularization: float
    """

    def __init__(self, regularization=5., features=None):
        # init simply saves everything to fields (fields have same names, that's important!)
        self.regularization = regularization
        Classifier.__init__(self, features=features)

    def fit(self, X, y, sample_weight=None):
        # performing parameter validation
        assert isinstance(self.regularization, float), 'Regularization in BasicClassifier should be float!'
        # check inputs and sanitize them
        X, y, sample_weight = check_inputs(X, y, sample_weight=sample_weight, allow_none_weights=True)
        # taking only those features named in self.features
        # this function sets self.features if it was None
        X = self._get_train_features(X)
        # set self.classes_ and control that classes are enumerated as [0, 1, ...]
        self._set_classes(y)
        self._probabilities = numpy.bincount(y, weights=sample_weight) + self.regularization
        self._probabilities /= numpy.sum(self._probabilities)
        # features are not used, so:
        self.feature_importances_ = numpy.zeros(len(self.features))
        # don't forget to return self
        return self

    def predict_proba(self, X):
        # If it were a real classifier, this would be the first step
        X = self._get_train_features(X)
        # we return the same probabilities for all events
        result = numpy.zeros([len(X), len(self._probabilities)])
        result += self._probabilities[numpy.newaxis, :]
        return result

    # get_state, set_state are not implemented, because they are present in Classifier
    # predict is implemented in the Classifier base class, and it uses predict_proba, so no need to overload it
    # get_feature_importances is implemented in Classifier and uses feature_importances_, so it works.
    # Since all fields are simple, this classifier is picklable.
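To make the arithmetic inside fit concrete, here is a dependency-free sketch of the same computation: (weighted) class counts plus the regularization term, normalised to sum to one. The helper name regularized_proportions is made up for this illustration; the classifier itself uses numpy.bincount.

```python
def regularized_proportions(y, regularization=5.0, sample_weight=None):
    """Illustrative pure-Python version of the bincount-based computation."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y)
    n_classes = max(y) + 1
    # every class starts from the regularization value
    totals = [regularization] * n_classes
    for label, weight in zip(y, sample_weight):
        totals[label] += weight
    norm = sum(totals)
    return [total / norm for total in totals]


# class counts are [3, 1]; with regularization 5 they become [8, 6], normalised by 14
probabilities = regularized_proportions([0, 0, 0, 1], regularization=5.0)
```

With only four events the regularization dominates, so the predicted probabilities (8/14 and 6/14) stay much closer to uniform than the raw proportions 3/4 and 1/4.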
We are implementing multilayer networks
- Each network should have a layers parameter, which corresponds only to the hidden layers; the number of units in the input and output layers should be detected automatically.
- The user should be able to set the activation functions of all layers (for classification the output activation should be softmax, for regression identity).
- NB: the estimator should have a scaler parameter, which can be any sklearn transformer. By default this should be StandardScaler, but the user should be able to state explicitly that no scaler is needed. Three scenarios are proposed: network(), network(scaler=MinMaxScaler()), network(scaler=False); in the last case no scaling is used. [TODO: think this over]
  Parameters of the scaler should be accessible like this: network.set_params(scaler__with_mean=False); this is implemented in the default set_params.
- All the standard tests should be passed.
- Stacking with BaggingClassifier, AdaBoostClassifier (if weights are supported), Pipeline.
- Do as expected or fail. If there is an argument or input that you are not able to process (i.e. it is not supported), or there is no such argument, throw an exception during fit. It is better for the user to know that something doesn't work than for it to work, but not as the user thinks.
- Cloning should work. Note that the new object should not share any references with the original object (lists, dicts, and so on); use deepcopy in get_params if needed.
- Tests should contain examples for different numbers of hidden layers (0, 1, 2), different activation functions, and different trainers (if supported).
- All the parameters should be explained; for complex situations (like trainer parameters, which differ greatly between trainers), a link to a page in the library's documentation should be given.
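The cloning requirement can be checked with a small test along the following lines. ClonableNetwork, its trainer_parameters argument and the clone helper are hypothetical; clone here is a deliberately simplified stand-in and not sklearn's own function.

```python
import copy


class ClonableNetwork:
    """Hypothetical estimator; only parameter handling is shown."""

    def __init__(self, layers=(10,), trainer_parameters=None):
        self.layers = layers
        self.trainer_parameters = trainer_parameters

    def get_params(self, deep=True):
        # deep copies, so callers never receive references to our lists or dicts
        return {
            'layers': copy.deepcopy(self.layers),
            'trainer_parameters': copy.deepcopy(self.trainer_parameters),
        }


def clone(estimator):
    """Simplified stand-in for cloning: rebuild the estimator from its parameters."""
    return type(estimator)(**estimator.get_params())


original = ClonableNetwork(layers=[5, 7, 2], trainer_parameters={'momentum': 0.9})
duplicate = clone(original)
# modifying the clone must leave the original untouched
duplicate.layers[0] = 3
duplicate.trainer_parameters['momentum'] = 0.5
```

A test of this shape fails as soon as a list or dict is passed by reference, which is exactly the situation the cloning rule is meant to prevent.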