Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH permutation importance #142

Merged
merged 84 commits into from
Jan 30, 2023

Conversation

merveenoyan
Copy link
Collaborator

@merveenoyan merveenoyan commented Sep 16, 2022

I implemented permutation importance, will write tests if you like where it goes.

I have two problems:

  • Are we okay with matplotlib being a dependency here? I'll look for something better. Same applies for pandas.
  • I'm looking for a more standard way of getting feature names. For some base estimators or pipelines it varies, from feature_names_in_ to get_feature_names_out().

The example script looks like below (I'll change layout of plot to fit inside the plot itself):
Ekran Resmi 2022-09-16 19 37 37

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some specific comments for this PR and then a more general discussion.

For this PR, I would like to see some tests added. Also, I have a few comments for some lines of code, please check.

For the more general discussion, I want to ask the following question: Do we want to design skops model cards such that we add a bunch of narrow methods such as add_feature_importances or do we just want to have general methods like add_plot and require the user to do the work inside the method themselves?

E.g. for adding cross-validation results, we have put the burden on the user to process the data and then use the generic add_table method:

cv_results = model.cv_results_
clf_report = classification_report(
y_test, y_pred, output_dict=True, target_names=["malignant", "benign"]
)
# The classification report has to be transformed into a DataFrame first to have
# the correct format. This requires removing the "accuracy", which was added
# above anyway.
del clf_report["accuracy"]
clf_report = pd.DataFrame(clf_report).T.reset_index()
model_card.add_table(
folded=True,
**{
"Hyperparameter search results": cv_results,
"Classification report": clf_report,
},
)

For consistency, this should either be its own method,add_cv_results, or add_feature_importances should not be its own method.

The arguments for not providing specific methods are:

  • Less work for us (though we should still provide examples)
  • More flexibility for the user. E.g. here, the user cannot change the parameters of the plot, e.g. the what the title should be or what the name of the plot should be.
  • Don't need to worry about pandas dependency etc.

The argument for providing the specific methods:

  • Easier for the user if they don't want to modify the defaults.
  • Easier to discover than just providing examples.

skops/card/_model_card.py Outdated Show resolved Hide resolved
@@ -10,6 +10,8 @@
from reprlib import Repr
from typing import Any, Optional, Union

import matplotlib.pyplot as plt
import pandas as pd
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it is assumed that pandas is installed. However, pandas is not a strict dependency for skops. I think we specifically didn't want to add pandas because it is a "fat" dependency.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I thought of it as well, I will try to plot without it.

@merveenoyan
Copy link
Collaborator Author

@BenjaminBossan I had a quick huddle last week with @adrinjalali where we discussed this and he told me to create a method to add feature importance graphs. I can remove this and add it to model card example instead.

@BenjaminBossan
Copy link
Collaborator

he told me to create a method to add feature importance graphs. I can remove this and add it to model card example instead.

If both of you agree that this is a good thing to have, I'm also fine with it. I just wanted to make sure that this decision is made deliberately, after taking into account the tradeoffs. Especially, it will probably result in quite a few very similar methods being added in the future (like the example of adding CV results).

@merveenoyan
Copy link
Collaborator Author

merveenoyan commented Sep 19, 2022

@BenjaminBossan I sort of agree and think the dependencies are a bit much if we want to move forward (I'll handle stuff with numpy instead). Would you like to wait for @adrinjalali to come back from his vacay to make a decision?

@BenjaminBossan
Copy link
Collaborator

Would you like to wait for @adrinjalali to come back from his vacay to make a decision?

If you can manage to remove the pandas dependency and if you already decided with Adrin that this would be a good feature to have, I think we can move forward.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @merveenoyan

examples/plot_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
@@ -366,6 +368,33 @@ def add_metrics(self, **kwargs: str) -> "Card":
self._eval_results[metric] = value
return self

def add_feature_importances(self, feature_importances) -> "Card":
"""Visualize permutation importance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in terms of API, it would make sense to accept importances in a certain way, like accepting both a dict and a pandas dataframe, and then add it to the modelcard.

It doesn't have to be permutation importance.

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of comments, please take a look.

Also, I have a more general comment:

What do we do if a user wants to add more than 1 permutation importance graph? E.g. one for accuracy and one for recall? Right now, this would not work because the file name feature_importances.png would be re-used, so we would get the same graph twice. Also, the section name is hard-coded, so the two sections would have the exact same name. I think we should have a solution for this.

skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
@merveenoyan
Copy link
Collaborator Author

Fixes #104

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work, by adding the new arguments, my concerns could be addressed. There are still a few improvements, please see my comments. Also, I think it would be a good idea to update the plot_model_card.py and perhaps also add a sentence to docs/model_card.rst.

skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/_model_card.py Outdated Show resolved Hide resolved
skops/card/tests/test_card.py Outdated Show resolved Hide resolved
merveenoyan and others added 3 commits January 23, 2023 12:14
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@merveenoyan
Copy link
Collaborator Author

@BenjaminBossan codecov fails 😅🥲

@BenjaminBossan
Copy link
Collaborator

@merveenoyan Re-running solved the codecov issue. Not sure why codecov hates you specifically :D

Could you please merge the current main branch, as have now added CI for Python 3.11? Then I'll review again.

@merveenoyan
Copy link
Collaborator Author

@BenjaminBossan merged

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, again I have a few small comments. But I swear we're getting really close :)

@@ -16,3 +18,30 @@ def mock_import(name, *args, **kwargs):

with patch("builtins.__import__", side_effect=mock_import):
yield

import matplotlib # noqa
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this line. First of all, it's pandas here, not matplotlib, second, we don't delete pandas from the import cache, so re-importing is not necessary.

@@ -2,7 +2,7 @@

import io
from contextlib import contextmanager
from typing import Any, Generator, Literal, Sequence, Type, Union
from typing import Any, Generator, Literal, Sequence, Type, Union # type: ignore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the # type: ignore here really necessary?



def import_or_raise(module, feature_name):
"""Raise error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a better description here.

module = import_module(module)
except ImportError as e:
package = module.split(".")[0]
raise ModuleNotFoundError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding: Why not a simple ImportError?

@BenjaminBossan
Copy link
Collaborator

@merveenoyan Some tests are failing. This is mostly fixed after merging with current main branch. Could you please do that?

@merveenoyan
Copy link
Collaborator Author

@BenjaminBossan should work now (magically ✨) but I realized pytest doesn't collect tests of parser for some reason on mac (assuming it's same on windows) (that's why nothing failed on my local, I guess)

I checked github workflow and looked for needs in case the workflow doesn't run it unless ubuntu tests pass (since I know ubuntu tests are usually fast so it's good to add such prerequisite, but here it's not the case). Then I decided to run the tests in isolation on my local both for whole file and specifically on the test that is failing, yet it refuses to collect on my local too! 🤯

(py39) ➜  card git:(feature_importance) python3 -m pytest -sv tests/test_parser.py
================================================= test session starts =================================================
platform darwin -- Python 3.9.16, pytest-7.2.1, pluggy-1.0.0 -- /opt/anaconda3/envs/py39/bin/python3
cachedir: .pytest_cache
rootdir: /Users/mervenoyan/Desktop/skops/skops, configfile: pyproject.toml
plugins: flaky-3.7.0, cov-4.0.0, anyio-3.6.2
collected 0 items / 2 skipped  

do you know what's going on?

@BenjaminBossan
Copy link
Collaborator

I realized pytest doesn't collect tests of parser for some reason on mac (assuming it's same on windows) (that's why nothing failed on my local, I guess)

Probably it's just because you haven't installed pandoc locally? This line makes it so that tests are skipped when pandoc is not found:

try:
check_pandoc_installed()
except FileNotFoundError:
# not installed, skip
pytest.skip(reason="These tests require a recent pandoc", allow_module_level=True)

The reason why it runs on Ubuntu CI is:

if [ ${{ matrix.os }} == "ubuntu-latest" ];
then sudo apt install pandoc && pandoc --version;
fi

I haven't added pandoc to the other builds because it's less trivial to install it on Mac and Windows and, as you mentioned, Ubuntu is the fastest.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still a few # type: ignores in here which I think can be removed, apart from that it's ready to be merged from my perspective.

@merveenoyan
Copy link
Collaborator Author

@BenjaminBossan I re-ran for codecov but no chance, it's failing 🥲🥲🥲🥲

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now we're good 🎉

@BenjaminBossan BenjaminBossan merged commit 81558aa into skops-dev:main Jan 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants