
linearGAM with sklearn gridsearchCV #247

Open
hongkahjun opened this issue Jun 28, 2019 · 9 comments · May be fixed by #267
hongkahjun commented Jun 28, 2019

Hi

I tried using LinearGAM with sklearn's GridSearchCV and got an error when GridSearchCV tried to clone the estimator. The code is below:

import numpy as np
import pandas as pd
from pygam import LinearGAM
from sklearn.model_selection import GridSearchCV

def gam(x, y):
    # 10 random lambda vectors, one smoothing value per feature
    lams = np.exp(np.random.rand(10, x.shape[1]))
    linear_gam = LinearGAM(n_splines=10, max_iter=1000)
    parameters = {
        'lam': list(lams)
    }
    gam_cv = GridSearchCV(linear_gam, parameters, cv=5, iid=False, return_train_score=True,
                          refit=True, scoring='neg_mean_squared_error')
    gam_cv.fit(x, y)
    cv_results_df = pd.DataFrame(gam_cv.cv_results_).sort_values(by='mean_test_score', ascending=False)
    return gam_cv, cv_results_df

gam_rank, gam_cv_results = gam(x_all, y_all)

I get the error


RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>
----> 1 gam_rank, gam_cv_results = gam(x_all, y_all)

<ipython-input> in gam(x, y)
      7     }
      8     gam_cv = GridSearchCV(linear_gam, parameters, cv=5, iid=False, return_train_score=True, refit=True, scoring='neg_mean_squared_error')
----> 9     gam_cv.fit(x, y)
     10     cv_results_df = pd.DataFrame(gam_cv.cv_results_).sort_values(by='mean_test_score', ascending=False)
     11     return gam_cv, cv_results_df

C:\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    630         n_splits = cv.get_n_splits(X, y, groups)
    631
--> 632         base_estimator = clone(self.estimator)
    633
    634         parallel = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,

C:\Anaconda3\lib\site-packages\sklearn\base.py in clone(estimator, safe)
     73             raise RuntimeError('Cannot clone object %s, as the constructor '
     74                                'either does not set or modifies parameter %s' %
---> 75                                (estimator, name))
     76     return new_object
     77

RuntimeError: Cannot clone object LinearGAM(callbacks=['deviance', 'diffs'], fit_intercept=True,
max_iter=1000, n_splines=10, scale=None, terms='auto', tol=0.0001,
verbose=False), as the constructor either does not set or modifies parameter callbacks

The dataset I used was sklearn's california housing dataset.
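For anyone trying to understand the failure mode: sklearn's clone() rebuilds an estimator from get_params() and then checks that every constructor argument was stored unchanged. A hypothetical minimal reproduction (the class names here are illustrative, not pygam's actual code) is a subclass whose __init__ accepts a parameter but never forwards it, so the attribute silently falls back to the parent's shared default:

```python
from sklearn.base import BaseEstimator, clone

class Base(BaseEstimator):
    def __init__(self, callbacks=['deviance', 'diffs']):
        self.callbacks = callbacks  # shared default list

class Child(Base):
    def __init__(self, callbacks=['deviance', 'diffs'], scale=None):
        self.scale = scale
        # Bug: `callbacks` is accepted but dropped here, so the attribute
        # ends up being Base's default list, not the value clone() passed in.
        super().__init__()

try:
    clone(Child())
    print('clone succeeded')
except RuntimeError as e:
    print('clone failed:', e)
```

clone() detects that the `callbacks` it passed to the constructor is not the object the new instance reports back, and raises the same "constructor either does not set or modifies parameter" error as above.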

dswah (Owner) commented Jul 16, 2019

@hongkahjun Unfortunately this is a somewhat deep issue. When writing the new terms functionality, I diverged from sklearn's requirement that an estimator instance's parameters must not change after it has seen data (even though the fitted coefficients are allowed to change).

This will require a deeper fix right now :/

burchill commented Jul 29, 2019

Getting what looks like the same problem when using LinearGAM with sklearn's TransformedTargetRegressor.

Admittedly knowing absolutely nothing about the motivations for the changes you made, it seems like diverging from such an important package's requirements would be a bad idea? I know that if I can't get it to play nice with my company's existing sklearn infrastructure I'm probably going to have to abandon it.

Is there an older version that keeps to sklearn's requirements?


judahrand commented Apr 8, 2020

@dswah

I see two solutions to this issue:

  1. According to the sklearn developer guide, since the terms depend on the data, it could be argued that they are "estimated parameters". This would allow self.terms_ to be assigned from self.terms in _validate_data_dep_params, with self.terms_ then used throughout. This would, I believe, follow the contract set out by sklearn.

  2. Another option would be to override self.get_params() so that it returns the terms the estimator was initialized with, rather than the current self.terms.

Please, let me know if either of these options is preferable and I will happily throw together a PR to fix this issue. Also, let me know if I've missed something!
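Option 1 can be sketched in miniature (class and attribute names here are illustrative, not pygam's actual implementation): the constructor argument is stored verbatim and never touched, while the data-dependent expansion lives in a trailing-underscore attribute, which sklearn treats as a fitted attribute and which clone() deliberately discards:

```python
from sklearn.base import BaseEstimator, clone

class TermsSketch(BaseEstimator):
    def __init__(self, terms='auto'):
        self.terms = terms  # stored verbatim, never modified after __init__

    def fit(self, X, y=None):
        # Data-dependent expansion goes into the fitted attribute `terms_`
        if self.terms == 'auto':
            self.terms_ = ['s({})'.format(i) for i in range(len(X[0]))]
        else:
            self.terms_ = list(self.terms)
        return self

est = TermsSketch().fit([[1.0, 2.0]])
cloned = clone(est)  # no RuntimeError: constructor params are unchanged
print(est.terms_, cloned.terms)
```

Because get_params() only ever sees the pristine constructor argument, clone() succeeds, and the clone starts fresh without the fitted terms_.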

@judahrand judahrand linked a pull request Apr 8, 2020 that will close this issue

judahrand commented Apr 8, 2020

Actually, having looked at this further, it does not seem to have anything to do with terms. It instead occurs because the default value of the callbacks keyword argument for every GAM estimator is a list. This is a well-known pitfall in Python: because lists are mutable and a default value is created only once, the default can end up being modified, leading to confusing behavior.

I believe this was the root of this bug. When I instead apply the changes in #267, GridSearchCV succeeds with no issues.

Could you please look at this PR and see whether it is an acceptable fix?
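The pitfall in isolation, along with the standard remedy (a None sentinel, which is the usual way to remove mutable defaults of the kind #267 targets):

```python
# A mutable default is created once, at function definition time,
# and shared by every call that omits the argument.
def bad(callbacks=['deviance', 'diffs']):
    callbacks.append('extra')  # mutates the one shared default list
    return callbacks

# Standard fix: default to None and build a fresh list per call.
def good(callbacks=None):
    if callbacks is None:
        callbacks = ['deviance', 'diffs']
    callbacks.append('extra')
    return callbacks

print(bad())   # ['deviance', 'diffs', 'extra']
print(bad())   # ['deviance', 'diffs', 'extra', 'extra']  -- default leaked
print(good())  # ['deviance', 'diffs', 'extra']
print(good())  # ['deviance', 'diffs', 'extra']
```

With the shared-list version, every call that omits the argument sees the leftovers of all previous calls, which is exactly the kind of cross-instance state sklearn's clone() contract forbids.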

@ZebinYang

It seems this can be simply fixed by adding "callbacks=callbacks," to line 2267 of pygam.py.

@AlexandreRozier

Also having this issue

@Niousha-Dehestani

Is this not fixed yet?

@thistleknot

#291

@Aminem83

Aminem83 commented Jun 9, 2023

It seems this can be simply fixed by adding "callbacks=callbacks," to line 2267 of pygam.py.

here is complete correction to pygam.py (line 2461):

super(LinearGAM, self).__init__(
    callbacks=callbacks,
    terms=terms,
    distribution=NormalDist(scale=self.scale),
    link='identity',
    max_iter=max_iter,
    tol=tol,
    fit_intercept=fit_intercept,
    verbose=verbose,
    **kwargs,
)

9 participants