Multioutput regression with cv, incorrect predict shape? #1169

oasidorshin · 2021-07-06T08:51:09Z

Describe the bug

After fitting multioutput regression with cv, shape of predictions is constant (and equal to (number_of_targets, number_of_cv_folds)), regardless of prediction sample shape.

I'm not sure whether this is intended. In any case, I think it would be better if this behavior were explained more thoroughly in the manual.

To Reproduce

Please see attached notebook with code and output.
issue_cv.zip

Expected behavior

One dimension of predict() output is equal to sample length.

eddiebergman · 2021-08-10T11:47:58Z

Sorry for the delay, I can confirm this happens with the following code:

import numpy as np
import autosklearn.regression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    automl_cv = autosklearn.regression.AutoSklearnRegressor(
        time_left_for_this_task=60,  # In seconds
        disable_evaluator_output=False,
        resampling_strategy='cv',
        resampling_strategy_arguments={'folds': 5},
        n_jobs=2,
        memory_limit=3072,
    )
    automl_cv.fit(X_train, y_train)
    predictions = automl_cv.predict(X_test)

    print(y_test.shape)  # (250, 3)
    print(predictions.shape)  # (3, 5), expected (250, 3)

I will look into this!

eddiebergman · 2021-08-11T11:53:30Z

After some more digging, this turns out to be related to how we use sklearn.ensemble.VotingRegressor, which stores the cross-validation models and returns their averaged predictions. VotingRegressor does not actually support multi-output regression, as seen in the error log below and in its fit method, which checks that y is 1-D. This is specified in the fit documentation.

However, we fit the models beforehand and then manually set the estimators_ attribute, which skips this check.
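To make the shape problem concrete, here is a NumPy-only sketch of the stacking step. It assumes the voting ensemble stacks the per-model predictions and transposes them before averaging over axis 1, as the predict body quoted later in this comment suggests; the zero arrays are placeholders standing in for real model outputs.

```python
import numpy as np

n_models, n_samples, n_targets = 5, 250, 3

# Single-output case: each model returns a 1-D array of shape (n_samples,).
single = np.asarray([np.zeros(n_samples) for _ in range(n_models)]).T
print(single.shape)                       # (250, 5)
print(np.average(single, axis=1).shape)   # (250,)

# Multi-output case: each model returns shape (n_samples, n_targets).
# The transpose now reverses all three axes, so axis 1 is the sample axis.
multi = np.asarray([np.zeros((n_samples, n_targets)) for _ in range(n_models)]).T
print(multi.shape)                        # (3, 250, 5)
print(np.average(multi, axis=1).shape)    # (3, 5)

# Averaging over the model axis (axis 2) and transposing restores the layout.
print(np.average(multi, axis=2).T.shape)  # (250, 3)
```

In the single-output case axis 1 happens to be the model axis, which is why the same code works there.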

As this does not seem intended for multi-output regression, the two solutions I see for autosklearn are:

1. Write our own ensemble class for the cross-validation models, which does these predictions for all task types.
2. Manually do the averaging only if the task is multi-output regression, which raises the question of why we use the VotingRegressor at all.

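import traceback

import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import train_test_split

# Testing multioutput regression
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit beforehand and manually set estimators_, which skips the 1-D check in fit()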
models = [DummyRegressor().fit(X_train, y_train) for _ in range(5)]
vr = VotingRegressor(estimators=None)
vr.estimators_ = models

# Raw model outputs are there
print(vr.transform(X_test).shape) # shape (3, 250, 5)

# VotingRegressor averages on wrong dimension for us
print(vr.predict(X_test).shape) # shape (3, 5)
# def predict(...):
#   return np.average(self._predict(X), axis=1)

# Manual averaging solution
print(np.average(vr.transform(X_test), axis=2).T.shape) # shape (250, 3)

# Using it as intended causes error
models = [DummyRegressor() for _ in range(5)]
vr = VotingRegressor(estimators=models)
try:
    vr.fit(X_train, y_train)
except Exception:
    traceback.print_exc()

# python test_voting_regressor.py
(3, 250, 5)
(3, 5)
(250, 3)
Traceback (most recent call last):
  File "test_voting_regressor.py", line 33, in <module>
    vr.fit(X_train, y_train)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 484, in fit
    y = column_or_1d(y, warn=True)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 921, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape (750, 3) instead.
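For illustration, the first option (our own averaging over the cross-validation models) could look roughly like the sketch below. This is not the code from any actual fix: average_cv_predictions and ConstantModel are hypothetical names, and the only assumption about the models is that they are already fitted and expose predict(X).

```python
import numpy as np


def average_cv_predictions(models, X):
    """Average the predictions of already-fitted models over the model axis."""
    # Stacked shape: (n_models, n_samples) or (n_models, n_samples, n_targets).
    stacked = np.asarray([model.predict(X) for model in models])
    # Averaging over axis 0 keeps the per-sample (and per-target) layout.
    return np.average(stacked, axis=0)


class ConstantModel:
    """Hypothetical stand-in for a fitted regressor: predicts one constant row."""

    def __init__(self, row):
        self.row = np.asarray(row, dtype=float)

    def predict(self, X):
        return np.tile(self.row, (len(X), 1))


X_test = np.zeros((250, 10))
models = [ConstantModel([k, 2 * k, 3 * k]) for k in range(5)]

averaged = average_cv_predictions(models, X_test)
print(averaged.shape)  # (250, 3)
print(averaged[0])     # [2. 4. 6.]
```

Because the model axis is always axis 0 after stacking, the same function covers single-output predictions of shape (n_samples,) without a special case.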

eddiebergman · 2021-08-11T17:01:45Z

Hi @oasidorshin,

The issue has been fixed in PR #1217 and we now test for it and other related situations. This should be in the development branch next week and hopefully in a release in the following week :)

oasidorshin · 2021-08-13T05:50:26Z

Sounds good, thanks a lot!

eddiebergman added the bug label Jul 19, 2021

eddiebergman mentioned this issue Aug 11, 2021

Multioutput regression fix #1217

Merged

eddiebergman closed this as completed Aug 11, 2021
