
Persistence tests fail for LogisticRegression et al. with multiclass classification #233

Open
BenjaminBossan opened this issue Dec 1, 2022 · 7 comments
Labels
bug Something isn't working persistence Secure persistence feature
Milestone

Comments

@BenjaminBossan
Collaborator

BenjaminBossan commented Dec 1, 2022

Right now, when testing persistence of classifiers, we create a binary classification task, and the classifiers all pass. However, when switching to a multiclass classification task, LogisticRegression and related estimators fail (e.g. CalibratedClassifierCV, which uses LogisticRegression under the hood by default).

To reproduce, replace the following lines:

X, y = make_classification(
    n_samples=N_SAMPLES, n_features=N_FEATURES, random_state=0
)

by these lines:

X, y = make_classification(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    random_state=0,
    n_classes=3,
    n_redundant=1,
    n_informative=N_FEATURES - 1,
)

(note that the values of n_redundant and n_informative are irrelevant here; they only need to be adjusted so that make_classification accepts n_classes=3)

The error is that the contiguity flags of the coef_ attribute are not the same. Strangely enough, it is the original estimator that seems to be "wrong":

>>> estimator.coef_.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

>>> loaded.coef_.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

I haven't investigated further, but I suspect that lr uses a different algorithm under the hood when dealing with binary classification, which is why it only occurs in the multiclass setting. EDIT: See below, that's not the reason.

@BenjaminBossan BenjaminBossan added bug Something isn't working persistence Secure persistence feature labels Dec 1, 2022
@adrinjalali adrinjalali added this to the v0.4 milestone Dec 2, 2022
@BenjaminBossan
Collaborator Author

Okay, I could identify where the contiguity is "lost":

First of all, the test:

@pytest.mark.parametrize('fit_intercept', [True, False], ids=['w/ interc', 'w/o interc'])
@pytest.mark.parametrize('multi_class', ['auto', 'ovr', 'multinomial'])
@pytest.mark.parametrize('binary', [True, False], ids=['two classes', 'three classes'])
def test_lr(binary, multi_class, fit_intercept):
    X = [[0, 1], [2, 3], [4, 5]]
    if binary:
        y = [0, 1, 1]
    else:
        y = [0, 1, 2]

    estimator = LogisticRegression(multi_class=multi_class, fit_intercept=fit_intercept)
    estimator.fit(X, y)

    trusted = ['sklearn.linear_model._logistic.LogisticRegression']
    loaded = loads(dumps(estimator), trusted=trusted)
    assert_params_equal(loaded.__dict__, estimator.__dict__)

The error is:

FAILED skops/io/tests/test_persist.py::test_lr[three classes-auto-w/ interc] - assert True is False
FAILED skops/io/tests/test_persist.py::test_lr[three classes-ovr-w/ interc] - assert True is False
FAILED skops/io/tests/test_persist.py::test_lr[three classes-multinomial-w/ interc] - assert True is False

So it only happens when there are 3 (or more) classes and fit_intercept=True.

The "offending" line is:

https://github.com/scikit-learn/scikit-learn/blob/4dde64d61e6e172da466cc7b7eaa21b832b96fbb/sklearn/linear_model/_logistic.py#L1330

As you can see, this line is only called with fit_intercept=True. So that explains this part. But why does it only occur with 3 or more classes? Let me demonstrate:

import numpy as np

# FIRST: show that (m, n) and (1, n) matrices behave differently for contiguity

x = np.random.random((5, 3))  # coef_ with > 2 classes is 2d
x.flags
#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : False

x[:, :-1].flags
#  C_CONTIGUOUS : False
#  F_CONTIGUOUS : False

y = np.random.random((1, 3))  # coef with 2 classes is "1d"
y.flags
#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : True

y[:, :-1].flags
#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : True

# SECOND: show that "neither C nor F" is lost after saving,
# while "both C and F" is preserved

np.save('arr.npy', x[:, :-1])
xl = np.load('arr.npy')
xl.flags
#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : False

np.save('arr.npy', y[:, :-1])
yl = np.load('arr.npy')
yl.flags
#  C_CONTIGUOUS : True
#  F_CONTIGUOUS : True

I bow to you if you already knew that :D

Anyway, how can we resolve this bug?

  1. I don't think that anything on the numpy side will be changed about how flags change with slicing, and I'm sure there is some deep logic about why this is the correct behavior.

  2. It's probably also unlikely that np.save/np.load will be changed to conserve contiguity when the array is neither C nor F contiguous.

  3. On the sklearn side, maybe one could argue that this behavior could result in a lower performance, not sure.

  4. On the skops side, we could ignore the contiguity in the unit test, but again, an error there could have performance implications.

  5. On the skops side, we could store the contiguity of numpy arrays and then, after loading, enforce it -- but I'm not sure if it's possible to tell numpy to make an array neither C nor F contiguous / both C and F contiguous; AFAICT it's one or the other. Maybe it's possible in some hacky fashion.
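For what it's worth, the "hacky fashion" for the neither-C-nor-F case could only work by handing back a view: the loaded data would have to be embedded in a larger buffer and then sliced. A sketch (`load_as_noncontiguous` is a hypothetical helper, not anything in skops):

```python
import numpy as np

def load_as_noncontiguous(arr_2d):
    # Allocate a buffer one column wider, copy the data in, and return
    # a sliced view: the values are identical, but the view is neither
    # C- nor F-contiguous (for shapes with more than one row and column).
    buf = np.empty((arr_2d.shape[0], arr_2d.shape[1] + 1), dtype=arr_2d.dtype)
    buf[:, :-1] = arr_2d
    return buf[:, :-1]
```

The obvious downside is that the result drags along a hidden extra column and does not own its data, which is arguably worse than just returning a C-contiguous copy.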

@E-Aho
Collaborator

E-Aho commented Dec 4, 2022

  4. On the skops side, we could ignore the contiguity in the unit test, but again, an error there could have performance implications.

I think this is an option, but it would definitely be a performance hit for users who depend on contiguous arrays. I also think that if a user stores something which is explicitly contiguous, they would expect it to persist that way, so I feel it's something we should solve in any case.

  5. On the skops side, we could store the contiguity of numpy arrays and then after loading, enforce the contiguity -- but I'm not sure if it's possible to tell numpy to make an array neither C nor F contiguous / both C and F contiguous, AFAICT it's one or the other. Maybe it's possible in some hacky fashion.

Arrays can definitely be both simultaneously (e.g. 1D arrays):

>>> x = np.arange(9)
>>> x.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True

but outside of trivial examples, they can't be simultaneously C and F contiguous, as C contiguous is row-major, and F contiguous is column-major.

Arrays can also be neither C- nor F-contiguous if they are strided views, e.g.:

>>> y = np.arange(9).reshape(3,3)[:,:2]
>>> y.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False

Though I don't know of a use case where a user would need an array to be discontiguous...

I will need to play around with the Numpy APIs a bit to see exactly how it handles trying to force either C or F contiguity. I think 1D arrays will always be either both C and F contiguous or discontiguous, and for non-trivial arrays, we could force either one or the other.

I think it would make sense to store the contiguity flags during persistence, then format arrays correctly on loading. I can play around with that a bit this weekend to see if I can get something working.

@BenjaminBossan
Collaborator Author

I think this is an option, but it would definitely be a performance hit for users who depend on contiguous arrays.

I'm not an expert on numpy low level stuff, but purely conceptually, if you have code like:

self.coef_ = self.coef_[:, :-1]

where the last column is dropped: if it was C-contiguous before, would it not still be "almost" contiguous? Or would it be treated like a randomly shuffled array? And if we load the array with skops and it is now C-contiguous, could there be any disadvantage? I would think probably not.

Investigating a bit further, I also found that pickling results in the same loss of preserving the contiguity:

estimator.coef_.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False

p = pickle.loads(pickle.dumps(estimator))
p.coef_.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
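For what it's worth, the same flag change can be reproduced with a plain NumPy array, with no scikit-learn involved:

```python
import pickle
import numpy as np

# A sliced view of a 2d array is neither C- nor F-contiguous ...
x = np.random.random((5, 3))[:, :-1]
assert not x.flags['C_CONTIGUOUS']
assert not x.flags['F_CONTIGUOUS']

# ... but the pickle round trip copies the buffer, so the loaded
# array owns its data and comes back C-contiguous.
xl = pickle.loads(pickle.dumps(x))
assert xl.flags['C_CONTIGUOUS']
```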

Therefore, I think that we should be fine if going from neither C nor F contiguous to C contiguous.

Basically, what I think would be a good result is:

| C contiguous before | F contiguous before | C contiguous after | F contiguous after |
|---------------------|---------------------|--------------------|--------------------|
| T                   | T                   | T                  | T                  |
| T                   | F                   | T                  | F                  |
| F                   | T                   | F                  | T                  |
| F                   | F                   | T                  | F                  |

I haven't checked yet if that needs any change in our code to make it happen, maybe only our testing code would need adjusting.
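A minimal sketch of that truth table (`restore_contiguity` is a hypothetical helper, not existing skops code):

```python
import numpy as np

def restore_contiguity(arr, was_c, was_f):
    # Re-impose the recorded contiguity flags after loading. The one
    # lossy case is "neither C nor F" (a strided view), which cannot be
    # reconstructed from a freshly loaded buffer; per the last row of
    # the table, we fall back to C order.
    if was_f and not was_c:
        return np.asfortranarray(arr)
    return np.ascontiguousarray(arr)
```

For trivial shapes (1-D, a single row or column) both functions return an array that is C- and F-contiguous at once, which covers the first row of the table for free.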

@E-Aho
Collaborator

E-Aho commented Dec 5, 2022

I think we should persist as above, and I think the truth table you've provided would be a good idea to implement.

I really can't think of any use case where a user would need something to not be contiguous, so I think having the final row in your table (falling back to C contiguity) should be fine.

@adrinjalali
Member

I'll have a deeper look here soon, but in the meantime, I just realized numpy uses numpy.core.multiarray._reconstruct to construct things during unpickling.

@adrinjalali
Member

I was gonna report a bug on the numpy side, but I realized this makes total sense. AFAICT, every time the contiguity flags are different after load, it's also the case that OWNDATA is False before load and True after load, which makes sense, since such arrays are views on an existing array, and changing values on them would change values on the original array.

Which means we can have a weak test: check OWNDATA, and if it's True, do a strict check; if not, then we can't really know what comes out of numpy.load.
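That weak/strict logic could look roughly like this in a test helper (a sketch, not the actual skops test code):

```python
import numpy as np

def assert_array_roundtrip(before, after):
    # The data itself must always survive the round trip.
    np.testing.assert_array_equal(before, after)
    if before.flags['OWNDATA']:
        # Strict check: an array that owns its data should keep its
        # contiguity flags across save/load.
        assert before.flags['C_CONTIGUOUS'] == after.flags['C_CONTIGUOUS']
        assert before.flags['F_CONTIGUOUS'] == after.flags['F_CONTIGUOUS']
    # Otherwise (a view), we can't know what numpy.load returns,
    # so only the weak value check above applies.
```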

I'd love it if @rgommers could confirm this.

@rgommers

That sounds correct to me. You should be able to rely on the data values in pickled arrays, but AFAIK there is no way with pickling to preserve the view relationships between different arrays.
