MNT Refactor with a SaveState and avoid duplicate numpy arrays #173

BenjaminBossan · 2022-10-05T11:07:34Z

Supersedes #150

Description

Previously, we passed only the directory path. Now we pass a dataclass that contains the path as well as the protocol and a way to memoize objects.

This way, functions that need to memoize an object can do so. This allows us, for instance, to make sure that the same numpy array is not saved to disk multiple times (see discussion in #150).

I went ahead and implemented memoization for numpy arrays (and scipy sparse matrices) in order to ensure that this refactor actually is sufficient to solve the initial problem.
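As a rough illustration of the idea, here is a minimal sketch of a state object that carries the path, the protocol, and a memo, plus a get_state-style function that uses it. All names (`SaveState` fields, `memoize`, `array_get_state`, the returned dict keys) are assumptions made for this sketch and not necessarily what the PR implements; the standard library's `array.array` stands in for a numpy array so the example has no third-party dependencies.

```python
from __future__ import annotations

import tempfile
from array import array
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class SaveState:
    """Hypothetical state object passed to every get_state function."""

    path: Path
    protocol: int
    # id(obj) -> obj; holding a reference keeps the id from being reused
    memo: dict[int, Any] = field(default_factory=dict)

    def memoize(self, obj: Any) -> tuple[int, bool]:
        """Return the object's id and whether it had been seen before."""
        obj_id = id(obj)
        seen = obj_id in self.memo
        self.memo[obj_id] = obj
        return obj_id, seen


def array_get_state(obj: array, save_state: SaveState) -> dict[str, Any]:
    """Write the array to disk at most once and return a dict describing it."""
    obj_id, seen = save_state.memoize(obj)
    file_name = f"{obj_id}.bin"
    if not seen:
        with open(save_state.path / file_name, "wb") as f:
            obj.tofile(f)
    return {"__class__": "array", "typecode": obj.typecode, "file": file_name}


shared = array("d", [1.0, 2.0, 3.0])
with tempfile.TemporaryDirectory() as tmp:
    state = SaveState(path=Path(tmp), protocol=0)
    first = array_get_state(shared, state)
    second = array_get_state(shared, state)  # same object, nothing new written
    files_written = sorted(p.name for p in Path(tmp).iterdir())
```

Both calls return the same description, and only one file ends up in the directory, because the second call finds the object in the memo.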

Design

The design goal here was to make as few code changes as possible while still enabling much greater flexibility. E.g., I wanted to leave the individual get_state/get_instance methods as they are (except for renaming one of the parameters), since I find the current structure very readable. That's why I didn't go with a giant class with a high number of get_state/get_instance methods, which is another solution we discussed.

The SaveState object can serve a similar purpose as self would have served had we chosen a class-based approach, i.e. it can hold the state the functions need to perform their work, and we can add more state in the future if necessary.

For the time being, I didn't change the get_instance methods at all, even though we could make an analogous change there, passing a LoadState instead of src everywhere. Let me know if I should make that change for consistency or if we should leave those functions untouched until it becomes necessary.

The memoization part has been modeled to be similar to what pickle does but tailored to our needs.

Coincidental changes

While working on #150, I discovered a minor bug where trying to store an object numpy array left a broken .npy file behind. This is because numpy writes to the file until it encounters an error and raises, but then doesn't clean up said file. Before this bugfix, we would include that broken file in the zip archive, although we wouldn't do anything with it. Now, no such file is created. Since #150 won't be merged, I added the bugfix and a corresponding test here.
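One general way to avoid such leftovers is to delete the partially written file when the writer raises. The sketch below shows that pattern with the standard library only; `write_or_cleanup` and `failing_writer` are hypothetical names, `ValueError` stands in for whatever the real writer raises, and the PR's actual fix may work differently.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import BinaryIO, Callable


def write_or_cleanup(target: Path, writer: Callable[[BinaryIO], None]) -> bool:
    """Run writer on target; delete the file again if the writer raises."""
    try:
        with open(target, "wb") as f:
            writer(f)
    except ValueError:
        # The writer may already have emitted some bytes before failing,
        # so remove the partial file instead of leaving it behind.
        target.unlink()
        return False
    return True


def failing_writer(f: BinaryIO) -> None:
    """Simulate a writer that fails after it has started writing."""
    f.write(b"HEADER")  # some bytes reach the file first
    raise ValueError("cannot serialize this payload")  # then it fails


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "broken.npy"
    ok = write_or_cleanup(target, failing_writer)
    left_over = target.exists()
```

The caller can use the boolean result to fall back to another serialization path without worrying about a stray file ending up in the archive.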

Moreover, since I had to touch the signature of the get_state functions while working on this, I also added type hints in accordance with the rest of the code base ("light" types).

While working on that, I found that we have a single case where the get_state function would not return a dict, namely when the object is a primitive type that can be json-serialized, in which case we return a string. I changed the return type there to always be dict for consistency (so in this case a dict that contains the json-serialized string).
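To illustrate the consistent return type, here is a small sketch in which a JSON-serializable primitive is wrapped in a dict instead of being returned as a bare string. The function name `primitive_get_state` and the keys `__class__`, `is_json`, and `content` are assumptions for this example, not necessarily the keys used in the code base.

```python
from __future__ import annotations

import json
from typing import Any


def primitive_get_state(obj: Any) -> dict[str, Any]:
    """Always return a dict, wrapping the JSON string rather than returning it bare."""
    return {
        "__class__": type(obj).__name__,
        "is_json": True,  # hypothetical flag telling the loader how to decode
        "content": json.dumps(obj),
    }


state = primitive_get_state(3.5)
restored = json.loads(state["content"])
```

With this shape, callers can treat every get_state result uniformly and only inspect the dict to decide how to decode the content.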


BenjaminBossan · 2022-10-05T11:26:08Z

@skops-dev/maintainers ready for review


adrinjalali

Other than the types, LGTM. The type hints really make this less readable and I don't like the repeated types all over the place. Types are evil :D

adrinjalali · 2022-10-06T15:28:12Z

skops/io/_numpy.py

    # we use numpy's internal save mechanism to store the dtype by
    # saving/loading an empty array with that dtype.
-    tmp = np.ndarray(0, dtype=obj)
+    tmp: np.typing.NDArray = np.ndarray(0, dtype=obj)


Not a fan of typing local variables. If mypy is failing on such a line, we should just disable those checks if possible.

BenjaminBossan · 2022-10-06T15:51:00Z

> Other than the types, LGTM. The type hints really make this less readable and I don't like the repeated types all over the place. Types are evil :D

I added them here for consistency and because it makes working with the new SaveState object a bit easier.

Since you merged: Do you want them removed or did you accept your fate? ^^

adrinjalali · 2022-10-06T18:34:51Z

> Since you merged: Do you want them removed or did you accept your fate? ^^

I have accepted my fate for now to see how I feel about it while coding it lol. The same needs to happen for get_instance methods.

BenjaminBossan · 2022-10-07T11:25:56Z

> The same needs to happen for get_instance methods.

Do you mean a refactor of get_instance that's analogous to this one? So what I wrote earlier:

> For the time being, I didn't change the get_instance methods at all, even though we could make an analogous change there, passing a LoadState instead of src everywhere. Let me know if I should make that change for consistency or if we should leave those functions untouched until it becomes necessary.

adrinjalali · 2022-10-07T11:56:20Z

Yes, I thought I can leave that to the audit PR, since that's where we need it.

BenjaminBossan · 2022-10-07T14:41:56Z

> Yes, I thought I can leave that to the audit PR, since that's where we need it.

Okay, LMK if you want me to take a look.

BenjaminBossan added 2 commits October 5, 2022 13:03

Add from future import for older Py versions

61121f5

adrinjalali approved these changes Oct 6, 2022


adrinjalali changed the title ~~Refactor saving to use a SaveState object~~ MNT Refactor with a SaveState and avoid duplicate numpy arrays Oct 6, 2022

adrinjalali merged commit 4d8d70f into skops-dev:main Oct 6, 2022

BenjaminBossan deleted the persist-refactor-save-state branch October 6, 2022 15:51

adrinjalali mentioned this pull request Oct 7, 2022

Avoid duplicating numpy arrays #150

Closed

BenjaminBossan mentioned this pull request Oct 26, 2022

Secure persistence: Avoid duplicate values #135

Closed
