Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MNT Refactor with a SaveState and avoid duplicate numpy arrays #173

Merged

Conversation

BenjaminBossan
Copy link
Collaborator

Supersedes #150

Description

Previously, we were passing only the directory path, now we pass a dataclass that contains the path but also the protocol and a way to memoize objects.

This way, those functions that, for some reason, need to memoize the object, can now do so. This allows us, for instance, to make sure that the same numpy array is not saved multiple times on disk (see discussion in #150).

I went ahead and implemented memoization for numpy arrays (and scipy sparse matrices) in order to ensure that this refactor actually is sufficient to solve the initial problem.

Design

The design goal here was to make as few code changes as possible while still enabling much greater flexibility. E.g., I wanted to leave the individual get_state/get_instance methods as they are (except for renaming one of the parameters), since I find the current structure very readable. That's why I didn't go with a giant class with a high number of get_state/get_instance methods, which is another solution we discussed.

The SaveState object can serve a similar purpose as self would have served if we choose a class based approach, i.e. it can hold the state necessary for the functions to perform their work and in the future, we can add more state if necessary.

For the time being, I didn't change the get_instance methods at all, even though we could make an analogous change there, passing a LoadState instead of src everywhere. Let me know if I should make that change for consistency or if we should leave those functions untouched until it becomes necessary.

The memoization part has been modeled to be similar to what pickle does but tailored to our needs.

Coincidental changes

While working on #150, I discovered a minor bug where trying to store an object numpy array resulted in the creation of a broken .npy file being left over. This is because numpy tries to write to the file until it encounters an error and raises, but then doesn't clean up said file. Before this bugfix, we would include that broken file in the zip archive, although we wouldn't do anything with it. Now, no such file is being created. Since #150 won't be merged, I added the bugfix and a corresponding test here.

Moreover, while working on this, since I had to touch the signature of the get_state functions, I also added type hints in accordance to the rest of the code base ("light" types).

While working on that, I found that we have a single case where the get_state function would not return a dict, namely when the object is a primitive type that can be json-serialized, in which case we return a string. I changed the return type there to always be dict for consistency (so in this case a dict that contains the json-serialized string).

Previously, we were passing only the directory path, now we pass a
dataclass that contains the path but also the protocol and a way to
memoize objects.

This way, those functions that, for some reason, need to memoize the
object, can now do so. This allows us, for instance, to make sure that
the same numpy array is not saved multiple times on disk (see discussion
in skops-dev#150).

I went ahead and implemented memoization for numpy arrays (and scipy
sparse matrices) in order to ensure that this refactor actually is
sufficient to solve the initial problem.

Design

The design goal here was to make as few code changes as possible while
still enabling much greater flexibility. E.g., I wanted to leave the
individual get_state/get_instance methods as they are (except for
renaming one of the parameters), since I find the current structure very
readable. That's why I didn't go with a giant class with a high number
of get_state/get_instance methods, which is another solution we
discussed.

The SaveState object can serve a similar purpose as self would have
served if we choose a class based approach, i.e. it can hold the state
necessary for the functions to perform their work and in the future, we
can add more state if necessary.

For the time being, I didn't change the get_instance methods at all,
even though we could make an analogous change there, passing a LoadState
instead of src everywhere. Let me know if I should make that change for
consistency or if we should leave those functions untouched until it
becomes necessary.

The memoization part has been modeled to be similar to what pickle does
but tailored to our needs.

Coincidental changes

While working on skops-dev#150, I discovered a minor bug where trying to store an
object numpy array resulted in the creation of a broken .npy file being
left over. This is because numpy tries to write to the file until it
encounters an error and raises, but then doesn't clean up said file.
Before this bugfix, we would include that broken file in the zip
archive, although we wouldn't do anything with it. Now, no such file is
being created. Since skops-dev#150 won't be merged, I added the bugfix and a
corresponding test here.

Moreover, while working on this, since I had to touch the signature of
the get_state functions, I also added type hints in accordance to the
rest of the code base ("light" types).

While working on that, I found that we have a single case where the
get_state function would not return a dict, namely when the object is a
primitive type that can be json-serialized, in which case we return a
string. I changed the return type there to always be dict for
consistency.
@BenjaminBossan
Copy link
Collaborator Author

@skops-dev/maintainers ready for review

Codecov:

Screenshot from 2022-10-05 13-25-19

Hmm...

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the types, LGTM. The typehints really makes this less readable and I don't like the repeated types all over the place. Types are evil :D

# we use numpy's internal save mechanism to store the dtype by
# saving/loading an empty array with that dtype.
tmp = np.ndarray(0, dtype=obj)
tmp: np.typing.NDArray = np.ndarray(0, dtype=obj)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a fan of typing local variables. if mypy is failing on such a line, we should just disable those checks if possible.

@adrinjalali adrinjalali changed the title Refactor saving to use a SaveState object MNT Refactor with a SaveState and avoid duplicate numpy arrays Oct 6, 2022
@adrinjalali adrinjalali merged commit 4d8d70f into skops-dev:main Oct 6, 2022
@BenjaminBossan
Copy link
Collaborator Author

Other than the types, LGTM. The typehints really makes this less readable and I don't like the repeated types all over the place. Types are evil :D

I added them here for consistency and because it makes working with the new SaveState object a bit easier.

Since you merged: Do you want them removed or did you accept your fate? ^^

@BenjaminBossan BenjaminBossan deleted the persist-refactor-save-state branch October 6, 2022 15:51
@adrinjalali
Copy link
Member

Since you merged: Do you want them removed or did you accept your fate? ^^

I have accepted my fate for now to see how I feel about it while coding it lol. The same needs to happen for get_instance methods.

@BenjaminBossan
Copy link
Collaborator Author

The same needs to happen for get_instance methods.

Do you mean a refactor of get_instance that's analogous to this one? So what I wrote earlier:

For the time being, I didn't change the get_instance methods at all, even though we could make an analogous change there, passing a LoadState instead of src everywhere. Let me know if I should make that change for consistency or if we should leave those functions untouched until it becomes necessary.

@adrinjalali
Copy link
Member

Yes, I thought I can leave that to the audit PR, since that's where we need it.

@BenjaminBossan
Copy link
Collaborator Author

Yes, I thought I can leave that to the audit PR, since that's where we need it.

Okay, LMK if you want me to take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants