Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MNT: Update repo URL of sklearn nightly build, fix sparse matrix persistence issue #375

Merged

Conversation

BenjaminBossan
Copy link
Collaborator

@BenjaminBossan
Copy link
Collaborator Author

BenjaminBossan commented Jun 27, 2023

I investigated further and the issue is not directly linked to sklearn, but to scipy. Sparse matrices in the scipy version used by sklearn nightly have an interesting behavior, where two attributes only exists after they are dynamically created by accessing another attribute:

>>> from scipy import sparse
>>> x = sparse.csr_matrix((3, 4))
>>> x._has_canonical_format
AttributeError: 'csr_matrix' object has no attribute '_has_canonical_format'
>>> x._has_sorted_indices
AttributeError: 'csr_matrix' object has no attribute '_has_sorted_indices'
>>> x.has_canonical_format  # this creates the missing attributes
True
>>> x._has_canonical_format
True
>>> x._has_sorted_indices
True

(code)

Our code fails because the original estimator will have those two attributes but after a save/load roundtrip, they no longer exist:

>>> from skops import io as sio
>>> y = sio.loads(sio.dumps(x), trusted=True)
y._has_canonical_format
AttributeError: 'csr_matrix' object has no attribute '_has_canonical_format'

The same error does not happen with pickle:

>>> import pickle
>>> z = pickle.loads(pickle.dumps(x))
>>> z._has_canonical_format
True

Therefore, I would assume that scipy devs won't fix the issue.

@skops-dev/maintainers What's your opinion on what to do? We could special case these attributes in our tests. In practice, the fact that they "vanish" after a roundtrip is probably irrelevant, as they are private. Alternatively, we could try to figure out why those attributes vanish when using skops but not when using pickle.

@BenjaminBossan
Copy link
Collaborator Author

Hmm, I'm really stumped about this error. First of all, even though pickle does not have the problem, saving and loading with scipy (which skops uses under the hood) has the same issue:

>>> from scipy.sparse import load_npz, save_npz
>>> from scipy import sparse
>>> x = sparse.csr_matrix((3, 4))
>>> x.has_canonical_format
True
>>> x._has_canonical_format
True
>>> save_npz('/tmp/sparse_matrix.npz', x)
>>> y = load_npz('/tmp/sparse_matrix.npz')
>>> y._has_canonical_format
AttributeError: 'csr_matrix' object has no attribute '_has_canonical_format'

So in a way it's already broken in scipy, but probably it's not considered a problem there, since it's a private attribute.

What surprises me, however, is that the issue only occurs now. The change in scipy is already 10 years old, so this is not caused by a new scipy version. So then I thought that there was a change in sklearn recently, but the corresponding attribute was also last touched a long time ago (5 years), therefore, this is also not an explanation.

The most likely explanation has thus to be some change in scipy.sparse.diags, which is what is being called by sklearn and has been changed in the last month. I suspect that somewhere inside that function, the attribute is indirectly created, but I cannot pinpoint out where and what recent change affected this.

Anyway, I'll push a solution that will ensure that the private attributes are present on the loaded object iff they were present on the original object. This is achieved by remembering if the attribute was there when storing the state. If it was, when loading the object, loaded.has_canonical_format is accessed, which, as a side effect, creates the private attributes. This is a somewhat ugly solution, but I think it is the "most correct" solution short of fixing the problem inside of scipy.

@BenjaminBossan
Copy link
Collaborator Author

Okay, CI is green, ready for review @skops-dev/maintainers. Until this, or a different fix, is merged, CI will be red for everyone :/

@BenjaminBossan BenjaminBossan changed the title MNT: Update repo URL of sklearn nightly build MNT: Update repo URL of sklearn nightly build, fix sparse matrix persistence issue Jul 3, 2023
@adrinjalali
Copy link
Member

This to me is similar to the flags issue we had with numpy, i.e. it doesn't affect the actual values in the matrix, and only gives an information about how it's stored.

I think we can safely just ignore this attribute instead of trying to save it.

@BenjaminBossan
Copy link
Collaborator Author

So the suggestion would be to add a special case to the tests to explicitly remove those two arguments?

@adrinjalali
Copy link
Member

yeah I'd just remove them.

Instead of restoring the private attributes after loading a sparse
matrix, just don't test for these attributes. They are not restored, but
that shouldn't affect users.
@BenjaminBossan BenjaminBossan force-pushed the MNT-update-sklearn-nightly-url branch from e0c098a to cb7d100 Compare July 5, 2023 10:16
@BenjaminBossan
Copy link
Collaborator Author

@adrinjalali pls take another look.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@adrinjalali adrinjalali merged commit 2e5632e into skops-dev:main Jul 5, 2023
@BenjaminBossan BenjaminBossan deleted the MNT-update-sklearn-nightly-url branch July 5, 2023 11:12
@jot-s-bindra
Copy link

Its not related to this commit
I just wanted your help in setting up environment in local machine

git clone git@github.com:jot-s-bindra/skops.git
git init
git checkout -b new-feature
conda create -c conda-forge -n skops python=3.10
conda activate skops
python -m pip install -e ".[tests,docs]"
conda install -c conda-forge pre-commit
pre-commit install
pytest

during pytest i am recieving variable declare errors , modules not installed errors like
ERROR examples/plot_california_housing.py - KeyError: 'HF_HUB_TOKEN'
ERROR examples/plot_hf_hub.py - KeyError: 'HF_HUB_TOKEN'
ERROR examples/plot_intelex.py - KeyError: 'HF_HUB_TOKEN'
ERROR scripts/clean_skops.py - OSError: pytest: reading from stdin while output is captured! Consider using -s.
ERROR spaces/skops_model_card_creator/app.py - ModuleNotFoundError: No module named 'streamlit'
ERROR spaces/skops_model_card_creator/create.py - ModuleNotFoundError: No module named 'streamlit'
ERROR spaces/skops_model_card_creator/edit.py - ModuleNotFoundError: No module named 'streamlit'
ERROR spaces/skops_model_card_creator/gethelp.py - ModuleNotFoundError: No module named 'streamlit'
ERROR spaces/skops_model_card_creator/start.py - ModuleNotFoundError: No module named 'streamlit'
ERROR spaces/skops_model_card_creator/tasks.py - ModuleNotFoundError: No module named 'streamlit'

@adrinjalali
Copy link
Member

In order to run hugging face related commands, you need to have your token present in HF_HUB_TOKEN environment variable.

The spaces folder is not for most users. If you want to run it locally, you need to have its dependencies including streamlit installed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants