Fix prediction fails with MOO ensemble and dummy is best #1518

Conversation

eddiebergman
Contributor

@eddiebergman eddiebergman added the maintenance Internal maintenance label Jun 14, 2022
@eddiebergman eddiebergman added this to the V0.15 milestone Jun 14, 2022
@eddiebergman eddiebergman self-assigned this Jun 14, 2022
@codecov

codecov bot commented Jun 14, 2022

Codecov Report

Merging #1518 (d731952) into development (9d63cb5) will increase coverage by 0.25%.
The diff coverage is 92.85%.

@@               Coverage Diff               @@
##           development    #1518      +/-   ##
===============================================
+ Coverage        83.94%   84.19%   +0.25%     
===============================================
  Files              153      153              
  Lines            11654    11663       +9     
  Branches          2031     2033       +2     
===============================================
+ Hits              9783     9820      +37     
+ Misses            1326     1295      -31     
- Partials           545      548       +3     


@eddiebergman
Contributor Author

eddiebergman commented Jun 17, 2022

Reproducing script (make the data random and stop after one model):

import numpy as np
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection  # needed for train_test_split below

import autosklearn.classification

# Replace the breast cancer data with random data so that no real model
# can beat the dummy classifier
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
rand = np.random.RandomState(2)
X = rand.random((100, 50))
y = rand.randint(0, 2, (100,))
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)


def callback(smbo, run_info, result, time_left):
    # Stop the optimization after the first model has been evaluated
    if int(result.additional_info["num_run"]) > 0:
        return False


automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    get_trials_callback=callback,
    include={"classifier": ["bernoulli_nb"]},  # deliberately a weak model
    delete_tmp_folder_after_terminate=False,
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

for ens in automl.get_pareto_set():
    ens.predict(X_test)

Error:

Traceback (most recent call last):
  File "test.py", line 41, in <module>
    ens.predict(X_test)
  File "/home/skantify/code/asklearn/dev/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 309, in predict
    maj = np.argmax(self.predict_proba(X), axis=1)
  File "/home/skantify/code/asklearn/dev/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 330, in _predict_proba
    weights=self._weights_not_none)
  File "/home/skantify/code/asklearn/dev/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 56, in _weights_not_none
    return [w for est, w in zip(self.estimators, self.weights)
  File "/home/skantify/code/asklearn/dev/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 57, in <listcomp>
    if est[1] != 'drop']
TypeError: 'MyDummyClassifier' object is not subscriptable

@eddiebergman
Contributor Author

The culprit is that estimators are normally wrapped in a Pipeline, but the dummy is not:

if self.task_type in REGRESSION_TASKS:
    if not isinstance(self.configuration, Configuration):
        self.model_class = MyDummyRegressor
    else:
        self.model_class = (
            autosklearn.pipeline.regression.SimpleRegressionPipeline
        )
    self.predict_function = self._predict_regression
else:
    if not isinstance(self.configuration, Configuration):
        self.model_class = MyDummyClassifier
    else:
        self.model_class = (
            autosklearn.pipeline.classification.SimpleClassificationPipeline
        )
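
For context, my reading of the traceback above: sklearn's VotingClassifier filters its weights with `est[1] != 'drop'`. A Pipeline supports indexing, so the pipeline-wrapped estimators happen to pass this check, while the bare MyDummyClassifier raises the TypeError. A minimal sketch of the difference, using sklearn's own DummyClassifier as a stand-in for MyDummyClassifier:

from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A Pipeline is subscriptable: indexing returns one of its steps, so the
# `est[1] != 'drop'` check in VotingClassifier happens to work for it.
pipeline = Pipeline([("scaler", StandardScaler()), ("clf", DummyClassifier())])
print(pipeline[1])  # DummyClassifier()

# A bare estimator is not subscriptable, which reproduces the error above.
bare = DummyClassifier()
bare[1]  # TypeError: 'DummyClassifier' object is not subscriptable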

I opted to solve this by directly modifying automl.py to check for this case. It's not an ideal solution, but it keeps the special handling visible and explicit. The ideal solution is probably to not treat the dummy as anything special with respect to evaluation; it is just another classifier choice (with no hyperparameters). The only special treatment we want is to:

  • evaluate it first,
  • stop if it fails,
  • not evaluate it more than once.

The first of those three points is not so straightforward to achieve, I think. The second could be achieved with a callback to SMAC. The third I'm not sure we can directly enforce: will SMAC choose the configuration again if it has already been evaluated and there is only one valid configuration?

I think this might also fix some other failures where show_models reports that "data_preprocessor" is not available.

@mfeurer
Contributor

mfeurer commented Jun 17, 2022

The culprit is that estimators are normally wrapped in a Pipeline, but the dummy is not.

If it is easier, the code for wrapping the Dummy into a Pipeline could also go directly into the abstract evaluator. I'm not sure that adding the dummy to the search space will really help us, because it could then be combined with a preprocessing algorithm, which would enlarge the search space (or the number of forbidden configurations).
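
A minimal sketch of that idea, not code from this PR (the helper name and the wrapping step are assumptions): a bare dummy model would be wrapped in a one-step sklearn Pipeline so that it looks like the other ensemble members downstream.

from sklearn.pipeline import Pipeline


def ensure_wrapped_in_pipeline(model):
    """Hypothetical helper: wrap a bare (dummy) model in a one-step Pipeline
    so that every ensemble member exposes the same Pipeline interface."""
    if isinstance(model, Pipeline):
        return model  # already pipeline-like, nothing to do
    return Pipeline([("estimator", model)])

Note that, as a later comment describes, a plain sklearn.pipeline.Pipeline turned out not to be a drop-in replacement inside the AbstractEvaluator, which expects the configuration-related methods of autosklearn's BasePipeline.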

@eddiebergman
Contributor Author

eddiebergman commented Jun 17, 2022

I pushed some test cases that are AutoML instances with only a dummy model in them; they should hopefully pass all the tests since they use the fairly broad case tags ["classifier", "fitted"].

I validated that they are picked up and run with no issues now. If they pass the online tests, I'll see about moving the fix into the abstract evaluator.

@case(tags=["classifier", "fitted"])
def case_classifier_fitted_only_dummy(
    make_cache: Callable[[str], Cache],
    make_backend: Callable[..., Backend],
    make_automl_classifier: Callable[..., AutoMLClassifier],
) -> AutoMLClassifier:
    """Case of a fitted classifier but only dummy was found"""
    key = "case_classifier_fitted_only_dummy"

    # This locks the cache for this item while we check, required for pytest-xdist

    with make_cache(key) as cache:
        if "model" not in cache:
            model = make_automl_classifier(
                temporary_directory=cache.path("backend"),
                delete_tmp_folder_after_terminate=False,
                include={"classifier": ["bernoulli_nb"]},  # Just a meh model
                get_trials_callback=stop_at_first,
            )
            rand = np.random.RandomState(2)
            _X = rand.random((100, 50))
            _y = rand.randint(0, 2, (100,))
            X, Xt, y, yt = sklearn.model_selection.train_test_split(
                _X, _y, random_state=1  # Required to ensure dummy is best
            )
            model.fit(X, y, dataset_name="random")

            # We now validate that indeed, the only model is the Dummy
            members = list(model.models_.values())
            if len(members) != 1 or not isinstance(members[0], MyDummyClassifier):
                raise ValueError("Should only have one model, dummy\n", members)

            cache.save(model, "model")

    model = cache.load("model")
    model._backend = copy_backend(old=model._backend, new=make_backend())

    return model

@eddiebergman eddiebergman force-pushed the 1495-predict-can-fail-when-dummy-model-is-best-with-new-moo-updates branch from fde9940 to d8d3a4f on June 17, 2022 14:56
@eddiebergman
Contributor Author

I tried to move the Pipeline wrapping into the AbstractEvaluator, but it doesn't work easily: the evaluator expects an autosklearn.pipeline.BasePipeline and won't work with a standard sklearn.pipeline.Pipeline, which would then need to support many other configuration-related methods. I will leave it as future work.

@eddiebergman eddiebergman merged commit 5e21e9c into development Jun 23, 2022
@eddiebergman eddiebergman deleted the 1495-predict-can-fail-when-dummy-model-is-best-with-new-moo-updates branch June 23, 2022 12:31
eddiebergman added a commit that referenced this pull request Aug 18, 2022
* Init commit

* Fix DummyClassifiers in _load_pareto_set

* Add test for dummy only in classifiers

* Update no ensemble docstring

* Add automl case where automl only has dummy

* Remove tmp file

* Fix `include` statement to be regressor
Successfully merging this pull request may close these issues.

predict can fail when dummy model is best with new Moo updates