Multioutput regression with cv, incorrect predict shape? #1169

Closed
oasidorshin opened this issue Jul 6, 2021 · 4 comments

@oasidorshin

Describe the bug

After fitting a multioutput regression with cv, the shape of the predictions is constant, equal to (number_of_targets, number_of_cv_folds), regardless of the shape of the samples passed to predict().

I'm not sure whether this is intended. In any case, I think this behavior should be explained more thoroughly in the manual.

To Reproduce

Please see attached notebook with code and output.
issue_cv.zip

Expected behavior

One dimension of the predict() output should equal the number of samples.

@eddiebergman
Contributor

Sorry for the delay, I can confirm this happens with the following code:

import numpy as np
import autosklearn.regression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    # Three targets, so y has shape (n_samples, 3)
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    automl_cv = autosklearn.regression.AutoSklearnRegressor(
        time_left_for_this_task=60,  # In seconds
        disable_evaluator_output=False,
        resampling_strategy='cv',
        resampling_strategy_arguments={'folds': 5},
        n_jobs=2,
        memory_limit=3072,
    )
    automl_cv.fit(X_train, y_train)
    predictions = automl_cv.predict(X_test)

    print(y_test.shape)  # (250, 3)
    print(predictions.shape)  # (3, 5)

I will look into this!

@eddiebergman
Contributor

eddiebergman commented Aug 11, 2021

After some more digging, this turns out to be related to how we use sklearn.ensemble.VotingRegressor, which stores the cross-validation models and averages their output. It does not actually support multioutput regression, as seen in the error log produced below and in its fit method, which checks that y is 1d. This is specified in the fit documentation.

However, we fit the models beforehand and then manually set the estimators_ value, which skips this check.
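To make the shape problem concrete, here is a NumPy-only sketch of what happens to the fold predictions. It assumes the regressor stacks one prediction array per estimator and transposes the result, which is consistent with the (3, 250, 5) transform output reported in this thread; the random arrays are stand-ins for real model predictions.

```python
import numpy as np

n_samples, n_targets, n_folds = 250, 3, 5
rng = np.random.default_rng(0)

# Stand-ins for the per-fold model predictions, each of shape (n_samples, n_targets).
fold_predictions = [rng.normal(size=(n_samples, n_targets)) for _ in range(n_folds)]

# Assumed stacking: one entry per estimator, followed by a full transpose.
stacked = np.asarray(fold_predictions).T
print(stacked.shape)  # (3, 250, 5)

# Averaging over axis=1 collapses the samples instead of the estimators.
wrong = np.average(stacked, axis=1)
print(wrong.shape)  # (3, 5)

# Averaging over the estimator axis keeps one row per sample.
right = np.average(stacked, axis=2).T
print(right.shape)  # (250, 3)
```

With a 1d target the transposed stack has shape (n_samples, n_estimators), so axis=1 is the estimator axis and the same average is correct there.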

As VotingRegressor does not seem intended for multioutput regression, the two solutions I see for autosklearn are:

  • Write our own ensemble class for the cross-validation models, which handles these predictions for all task types.
  • Do the averaging manually when the task is multioutput regression, which raises the question of why we would use the VotingRegressor at all.

import traceback

import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import train_test_split

# Testing multioutput regression
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit beforehand and manually set estimators_, which skips the 1d check in fit
models = [DummyRegressor().fit(X_train, y_train) for _ in range(5)]
vr = VotingRegressor(estimators=None)
vr.estimators_ = models

# Raw model outputs are there
print(vr.transform(X_test).shape)  # shape (3, 250, 5)

# VotingRegressor averages on the wrong dimension for us
print(vr.predict(X_test).shape)  # shape (3, 5)
# def predict(...):
#   return np.average(self._predict(X), axis=1)

# Manual averaging solution: average over the estimator axis instead
print(np.average(vr.transform(X_test), axis=2).T.shape)  # shape (250, 3)

# Using it as intended (unfitted, named estimators passed to fit) causes an error
models = [(f"dummy_{i}", DummyRegressor()) for i in range(5)]
vr = VotingRegressor(estimators=models)
try:
    vr.fit(X_train, y_train)
except ValueError:
    traceback.print_exc()
# python test_voting_regressor.py
(3, 250, 5)
(3, 5)
(250, 3)
Traceback (most recent call last):
  File "test_voting_regressor.py", line 33, in <module>
    vr.fit(X_train, y_train)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 484, in fit
    y = column_or_1d(y, warn=True)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 921, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape (750, 3) instead.
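
For illustration, the first option (a dedicated ensemble class for the cross-validation models) could look roughly like the sketch below. This is not the code auto-sklearn ended up with; `FoldAveragingRegressor` and `MeanModel` are hypothetical names, and `MeanModel` is only a NumPy stand-in for an already-fitted fold model so that the example runs without further dependencies.

```python
import numpy as np


class MeanModel:
    """Hypothetical stand-in for a fitted fold model: predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = np.asarray(y).mean(axis=0)
        return self

    def predict(self, X):
        n_samples = np.asarray(X).shape[0]
        # Shape (n_samples,) for 1d targets, (n_samples, n_targets) otherwise.
        return np.broadcast_to(self.mean_, (n_samples,) + self.mean_.shape).copy()


class FoldAveragingRegressor:
    """Hypothetical ensemble that averages predictions of already-fitted models."""

    def __init__(self, fitted_models):
        self.estimators_ = list(fitted_models)

    def predict(self, X):
        # Stack to (n_models, n_samples[, n_targets]) and average over models only,
        # so the sample axis is preserved for single- and multi-output tasks.
        stacked = np.stack([model.predict(X) for model in self.estimators_], axis=0)
        return stacked.mean(axis=0)


rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(750, 10)), rng.normal(size=(750, 3))
X_test = rng.normal(size=(250, 10))

# One model per "fold"; here each simply sees a different slice of the training data.
folds = np.array_split(np.arange(len(X_train)), 5)
models = [MeanModel().fit(X_train[idx], y_train[idx]) for idx in folds]

ensemble = FoldAveragingRegressor(models)
predictions = ensemble.predict(X_test)
print(predictions.shape)  # (250, 3)
```

Because the model axis is always axis 0 after stacking, the same `predict` works whether each model returns a 1d or a 2d array, which avoids a special case for multioutput regression.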

@eddiebergman
Contributor

Hi @oasidorshin,

The issue has been fixed in PR #1217 and we now test for it and other related situations. This should be in the development branch next week and hopefully in a release in the following week :)

@oasidorshin
Author

Sounds good, thanks a lot!
