data leakage in predict_next_purchases #5

Open
datajanko opened this issue Apr 6, 2018 · 2 comments
Comments

@datajanko

Hey,

in the notebook, when using:

clf.fit(X, y)
top_features = utils.feature_importances(clf, features_encoded, n=20)

we introduce data leakage, because the features are selected on the whole data set, including the rows that later serve as validation folds.
With scikit-learn's pipelines it's possible to select the 20 best features (or rather a fraction of the features) separately for each fold, using SelectFromModel.
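
A minimal sketch of what I mean, on synthetic data rather than the notebook's X and y. The estimator choices, `n_keep`, and the fold count are assumptions for illustration; the point is only that the selection step lives inside the pipeline, so it is refit on the training part of every fold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the notebook's feature matrix and labels (assumption).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0
)

n_keep = 20  # number of features to keep, mirroring n=20 in the notebook

pipe = Pipeline([
    # threshold=-np.inf together with max_features keeps exactly the
    # n_keep features with the highest importances.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        max_features=n_keep,
        threshold=-np.inf,
    )),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# The selector is fit only on each fold's training rows, so the held-out
# rows never influence which features are chosen.
scores = cross_val_score(pipe, X, y, cv=3, scoring="roc_auc")
print(scores.mean())
```

The AUC from this setup should be a more honest estimate than one computed after selecting features on all rows.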

With the current setup, you are probably overestimating the AUC.
Besides, cross_val_score assumes IID samples. That is clearly not the case here, since one entity typically has several occurrences. I think something like TimeSeriesSplit, or rather an adaptation of it (since we don't have a time series in the classical sense but rather time slices), would be the correct thing to use here.
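
To make the two options concrete, here is a rough sketch on made-up data. The arrays `customer_id` and `time_slice` and the helper `time_slice_splits` are hypothetical names, not anything from the notebook. Option 1 keeps all rows of an entity in the same fold; option 2 always trains on earlier time slices and tests on the next one:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_customers, n_slices = 40, 5

# One row per (customer, time slice); all values are synthetic (assumption).
customer_id = np.repeat(np.arange(n_customers), n_slices)
time_slice = np.tile(np.arange(n_slices), n_customers)
X = rng.normal(size=(n_customers * n_slices, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=len(X)) > 0).astype(int)

# Option 1: all rows of one customer end up in the same fold.
group_scores = cross_val_score(
    LogisticRegression(), X, y,
    groups=customer_id, cv=GroupKFold(n_splits=4), scoring="roc_auc",
)

# Option 2: forward chaining over time slices.
def time_slice_splits(slices):
    """Yield (train_idx, test_idx), training only on slices before the test slice."""
    for t in np.unique(slices)[1:]:
        yield np.where(slices < t)[0], np.where(slices == t)[0]

splits = list(time_slice_splits(time_slice))
time_scores = cross_val_score(
    LogisticRegression(), X, y, cv=splits, scoring="roc_auc",
)
print(group_scores.mean(), time_scores.mean())
```

Which of the two fits better probably depends on whether the model will be applied to unseen entities or to known entities at a later time.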

Comments on those issues?

Currently, at work, I have the same issues, so I really appreciate the library you have developed. I haven't seen anything similar so far. So thumbs up in any case.

@lstmemery

I've noticed that a lot of feature engineering tutorials make this mistake (not just featuretools). I think featuretools handles this through cutoff times: https://docs.featuretools.com/automated_feature_engineering/handling_time.html
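
For anyone unfamiliar with the idea, here is a plain-pandas illustration of what a cutoff time does. This is not the featuretools API, just a sketch of the concept with made-up data and a hypothetical helper `features_at_cutoff`; treating rows exactly at the cutoff as "in the past" is also an assumption here:

```python
import pandas as pd

# Made-up transaction log (assumption, not the notebook's data).
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "time": pd.to_datetime(
        ["2018-01-01", "2018-01-10", "2018-02-01", "2018-01-05", "2018-02-03"]
    ),
    "amount": [10.0, 20.0, 30.0, 5.0, 7.0],
})

# One cutoff time per customer: features may only use data up to this point.
cutoffs = pd.DataFrame({
    "customer_id": [1, 2],
    "cutoff_time": pd.to_datetime(["2018-01-15", "2018-01-15"]),
})

def features_at_cutoff(transactions, cutoffs):
    """Aggregate each customer's transactions, ignoring rows after the cutoff."""
    merged = transactions.merge(cutoffs, on="customer_id")
    past = merged[merged["time"] <= merged["cutoff_time"]]
    return (
        past.groupby(["customer_id", "cutoff_time"])["amount"]
        .agg(total_amount="sum", n_transactions="count")
        .reset_index()
    )

features = features_at_cutoff(transactions, cutoffs)
print(features)
```

The February transactions never enter the features, which is what keeps information from after the prediction time out of the training rows.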

@datajanko
Author

You are right. Actually, the other examples show the correct usage of "backtesting". Shouldn't we at least add a remark here? Changing this completely would produce different Kaggle results and thus requires more effort.
