Persistence tests fail for LogisticRegression et al. with multiclass classification #233
Okay, I could identify where the contiguity is "lost". First of all, the test:

```python
import pytest
from sklearn.linear_model import LogisticRegression
from skops.io import dumps, loads
# assert_params_equal is a helper from the skops test suite

@pytest.mark.parametrize('fit_intercept', [True, False], ids=['w/ interc', 'w/o interc'])
@pytest.mark.parametrize('multi_class', ['auto', 'ovr', 'multinomial'])
@pytest.mark.parametrize('binary', [True, False], ids=['two classes', 'three classes'])
def test_lr(binary, multi_class, fit_intercept):
    X = [[0, 1], [2, 3], [4, 5]]
    if binary:
        y = [0, 1, 1]
    else:
        y = [0, 1, 2]
    estimator = LogisticRegression(multi_class=multi_class, fit_intercept=fit_intercept)
    estimator.fit(X, y)
    trusted = ['sklearn.linear_model._logistic.LogisticRegression']
    loaded = loads(dumps(estimator), trusted=trusted)
    assert_params_equal(loaded.__dict__, estimator.__dict__)
```

The error is:
So it only happens when there are 3 (or more) classes, and the "offending" line is only reached in that case. Here is a demonstration of what is going on:

```python
import numpy as np

# FIRST: show that (m, n) and (1, n) matrices behave differently for contiguity
x = np.random.random((5, 3))  # coef with > 2 classes is 2d
x.flags
# C_CONTIGUOUS : True
# F_CONTIGUOUS : False
x[:, :-1].flags
# C_CONTIGUOUS : False
# F_CONTIGUOUS : False

y = np.random.random((1, 3))  # coef with 2 classes is "1d"
y.flags
# C_CONTIGUOUS : True
# F_CONTIGUOUS : True
y[:, :-1].flags
# C_CONTIGUOUS : True
# F_CONTIGUOUS : True

# SECOND: show that "neither C nor F" is lost after saving,
# while "both C and F" is conserved
np.save('arr.npy', x[:, :-1])
xl = np.load('arr.npy')
xl.flags
# C_CONTIGUOUS : True
# F_CONTIGUOUS : False
np.save('arr.npy', y[:, :-1])
yl = np.load('arr.npy')
yl.flags
# C_CONTIGUOUS : True
# F_CONTIGUOUS : True
```

I bow to you if you already knew that :D Anyway, how can we resolve this bug?
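For what it's worth, this matches how the `.npy` format works: its header stores a single `fortran_order` flag, so an array on disk is always either plain C order or plain F order, and a "neither" view gets densified. A small self-contained check (using an in-memory buffer instead of a file):

```python
import io
import numpy as np

x = np.random.random((5, 3))
view = x[:, :-1]                       # neither C- nor F-contiguous
assert not view.flags['C_CONTIGUOUS'] and not view.flags['F_CONTIGUOUS']

buf = io.BytesIO()
np.save(buf, view)
buf.seek(0)
loaded = np.load(buf)
assert loaded.flags['C_CONTIGUOUS']    # collapsed to C order on save
assert np.array_equal(loaded, view)    # values are intact

# A fully F-contiguous array round-trips as F-contiguous,
# thanks to the fortran_order header field.
f_arr = np.asfortranarray(x)
buf = io.BytesIO()
np.save(buf, f_arr)
buf.seek(0)
assert np.load(buf).flags['F_CONTIGUOUS']
```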
I think this is an option, but it would definitely be a performance hit for users who depend on contiguous arrays. I also think that if a user stores something which is explicitly contiguous, they would expect it to persist that way, so I feel it's something we should solve in any case.
Arrays can definitely be both C- and F-contiguous simultaneously (e.g. 1D arrays), but outside of trivial examples they can't be both at once. Arrays can also be discontiguous if they are views with non-trivial strides.
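To illustrate those cases concretely (a small sketch of my own, not from skops):

```python
import numpy as np

# A 1D array is trivially both C- and F-contiguous.
one_d = np.arange(6)
assert one_d.flags['C_CONTIGUOUS'] and one_d.flags['F_CONTIGUOUS']

# So is an array with a size-1 dimension, since only one axis
# actually "moves" through memory.
row = np.ones((1, 4))
assert row.flags['C_CONTIGUOUS'] and row.flags['F_CONTIGUOUS']

# A non-trivial 2D array is one or the other, never both.
mat = np.zeros((3, 4))
assert mat.flags['C_CONTIGUOUS'] and not mat.flags['F_CONTIGUOUS']

# Slicing with a step yields strides that match neither layout:
# the array is discontiguous.
strided = mat[:, ::2]
assert not strided.flags['C_CONTIGUOUS'] and not strided.flags['F_CONTIGUOUS']
```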
Though I don't know of a use case where a user would need an array to be discontiguous... I will need to play around with the NumPy APIs a bit to see exactly how they handle trying to force either C or F contiguity. I think 1D arrays will always be either both C- and F-contiguous or discontiguous, and for non-trivial arrays we could force one or the other. I think it would make sense to store the contiguity flags during persistence, then format arrays correctly on loading. I can play around with that a bit this weekend to see if I can get something working.
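A minimal sketch of that "store the flags, restore on load" idea (the helper names here are hypothetical, not actual skops API):

```python
import io
import json
import numpy as np

def save_with_flags(arr):
    """Serialize an array together with its contiguity flags (hypothetical helper)."""
    buf = io.BytesIO()
    np.save(buf, arr)
    flags = {'C': bool(arr.flags['C_CONTIGUOUS']),
             'F': bool(arr.flags['F_CONTIGUOUS'])}
    return buf.getvalue(), json.dumps(flags)

def load_with_flags(data, flags_json):
    """Restore the array and re-apply the requested memory layout."""
    arr = np.load(io.BytesIO(data))
    flags = json.loads(flags_json)
    if flags['F'] and not flags['C']:
        return np.asfortranarray(arr)
    # "Neither C nor F" can't be restored faithfully without the original
    # strides, so fall back to C order (the values are unaffected).
    return np.ascontiguousarray(arr)

x = np.random.random((5, 3))
data, meta = save_with_flags(np.asfortranarray(x))
restored = load_with_flags(data, meta)
assert restored.flags['F_CONTIGUOUS'] and np.array_equal(restored, x)
```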
I'm not an expert on numpy low-level stuff, but purely conceptually: in code where the last column is dropped, if the array was C-contiguous before, would it not still be "almost" contiguous? Or would it be treated like a randomly shuffled array? And if we load the array with skops and it is now C-contiguous, could there be any disadvantage? I would think probably not.

Investigating a bit further, I also found that pickling results in the same loss of contiguity:

```python
estimator.coef_.flags
# C_CONTIGUOUS : False
# F_CONTIGUOUS : False
p = pickle.loads(pickle.dumps(estimator))
p.coef_.flags
# C_CONTIGUOUS : True
# F_CONTIGUOUS : False
```

Therefore, I think we should be fine going from "neither C nor F contiguous" to C-contiguous. Basically, what I think would be a good result is:

I haven't checked yet whether that needs any change in our code to make it happen; maybe only our testing code would need adjusting.
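As a sanity check that going from "neither" to C-contiguous only changes the memory layout, never the values:

```python
import numpy as np

coef = np.random.random((3, 3))[:, :-1]   # neither C- nor F-contiguous
assert not coef.flags['C_CONTIGUOUS'] and not coef.flags['F_CONTIGUOUS']

fixed = np.ascontiguousarray(coef)        # force C order
assert fixed.flags['C_CONTIGUOUS']
assert np.array_equal(fixed, coef)        # identical values, new layout
```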
I think we should persist as above, and I think the truth table you've provided would be a good idea to implement. I really can't think of any use case where a user would need something to not be contiguous, so I think having the final column in your table should be fine.
I'll have a deeper look here soon, but in the meantime, I just realized numpy uses …
I was gonna report a bug on the numpy side, but I realized this makes total sense. AFAICT, every time the contiguity flag is different after load, it's also the case that … Which means we can have a weak test and check if … I'd love it if @rgommers could confirm this.
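The exact condition referred to here was stripped out of the page, but one plausible reading (my assumption, not confirmed in the thread) is that the flags only change for arrays that are views. A weak test along those lines might look like:

```python
import io
import numpy as np

def flags_survive_roundtrip(arr):
    """Hypothetical weak check: do the contiguity flags survive np.save/np.load?"""
    buf = io.BytesIO()
    np.save(buf, arr)
    buf.seek(0)
    loaded = np.load(buf)
    return (arr.flags['C_CONTIGUOUS'] == loaded.flags['C_CONTIGUOUS']
            and arr.flags['F_CONTIGUOUS'] == loaded.flags['F_CONTIGUOUS'])

owner = np.random.random((4, 4))          # owns its data (base is None)
view = owner[:, :-1]                       # a view: neither C- nor F-contiguous
assert owner.base is None and flags_survive_roundtrip(owner)
assert view.base is not None and not flags_survive_roundtrip(view)
```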
That sounds correct to me. You should be able to rely on the data values in pickled arrays, but AFAIK there is no way with pickling to preserve the view relationships between different arrays.
Right now, when testing persistence of classifiers, we create a binary classification task, and the classifiers all pass. However, when switching to a multiclass classification task, `LogisticRegression` and related estimators fail (e.g. `CalibratedClassifierCV`, which uses logistic regression under the hood by default).

To reproduce, replace the following lines:

skops/skops/io/tests/test_persist.py, lines 421 to 423 in 2c2dd6e

by these lines:

(note that `n_redundant` and `n_informative` are irrelevant here, they just need to be changed for `make_classification` to work)

The error is that the contiguity of the `coef_` attributes is not the same. Strangely enough, it is the original estimator that seems to be "wrong". I haven't investigated further, but I suspect that logistic regression uses a different algorithm under the hood when dealing with binary classification, which is why the failure only occurs in the multiclass setting. EDIT: See below, that's not the reason.