Classification vs regression #37

Open
benman1 opened this issue Feb 4, 2020 · 2 comments

benman1 commented Feb 4, 2020

Hi,
I think this package looks fantastic. I am wondering, however, whether there are any plans to implement SkopeRules for regression.

I've made a start on adding regression, and I had to make a lot of changes; I made this up as I went through the code, really. I had to come up with measures comparable to precision and recall: the precision-like measure is based on the expected reduction in standard deviation, and the recall-like measure is based on the z-score of the prediction versus the population of y. At the end, scores are integrated via softmax-weighted rules (see the sketch below). At the moment I still get a lot of NaNs in the predictions because there are not enough rules, and the overall MSE is still much worse than a linear regression baseline.
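For concreteness, here is a rough sketch of the kind of measures I mean (the names rule_precision_like, rule_recall_like and softmax_weights are just illustrative, not what is actually in my branch):

import numpy as np

def rule_precision_like(y, covered):
    # precision-like: expected reduction in standard deviation when
    # restricting to the samples covered by the rule
    return 1.0 - np.std(y[covered]) / np.std(y)

def rule_recall_like(y, covered):
    # recall-like: z-score of the rule's mean prediction versus the
    # population of y
    y_rule = y[covered]
    return abs(y_rule.mean() - y.mean()) / (y.std() / np.sqrt(len(y_rule)))

def softmax_weights(scores):
    # softmax weights used to integrate the predictions of the kept rules
    e = np.exp(scores - np.max(scores))
    return e / e.sum()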

I've also added comments and a test for regression. This is WIP, but I am happy for anyone to jump in.

Thanks!


benman1 commented Feb 5, 2020

After more testing, it seems that on the diabetes dataset I am using for benchmarking, the linear model actually outperforms the random forest regressor and the decision tree regressor (the latter by a lot); therefore I may have been too strict in judging the performance I was getting. I am now getting performance very similar to both the random forest and the linear model, although without rule filtering and without deduplication.
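For reference, a stripped-down version of the kind of benchmark I am running (a sketch only; the actual setup in my branch differs in the details):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
models = [LinearRegression(),
          RandomForestRegressor(random_state=0),
          DecisionTreeRegressor(random_state=0)]
for model in models:
    # 5-fold cross-validated MSE; lower is better
    mse = -cross_val_score(model, X, y,
                           scoring="neg_mean_squared_error", cv=5).mean()
    print(type(model).__name__, round(mse, 1))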

wjj5881005 commented

I think the OOB score computed in the fit function is wrong.

The authors get the OOB samples via mask = ~samples and then apply X[mask, :]. But samples is an array of integer indices, so ~samples is a bitwise NOT (it maps i to -(i + 1)), not a set complement. I tested this case and found that many of the same elements appear in both samples and X[mask, :].
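A quick way to see the overlap (an illustration; samples here stands for the array of bootstrap indices drawn in fit):

import numpy as np

rng = np.random.RandomState(0)
n_samples = 10
samples = rng.randint(0, n_samples, n_samples)  # in-bag (bootstrap) indices

# ~ on an integer array is bitwise NOT, i.e. -(i + 1), so X[~samples, :]
# just selects rows counted from the end of X, not the out-of-bag rows
wrong_oob = (~samples) % n_samples  # the rows X[~samples, :] actually picks
print(sorted(set(wrong_oob) & set(samples)))  # non-empty: in-bag rows leak in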

I also looked at the OOB implementation of scikit-learn's random forest and found logic along these lines (lightly cleaned up here so it runs standalone; n_samples is the number of rows of X and n_samples_bootstrap the number of bootstrap draws):

import numpy as np
from sklearn.utils import check_random_state

random_instance = check_random_state(random_state)
# draw the bootstrap (in-bag) indices
sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)
# count how often each index was drawn; a count of zero means out-of-bag
sample_counts = np.bincount(sample_indices, minlength=n_samples)
unsampled_mask = sample_counts == 0
indices_range = np.arange(n_samples)
unsampled_indices = indices_range[unsampled_mask]

unsampled_indices then contains the truly out-of-bag sample indices, so X[unsampled_indices, :] gives the genuine OOB samples.
