Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH trusting np.ufuncs and np.dtypes #336

Merged
merged 13 commits into from
May 15, 2023

Conversation

omar-araboghli
Copy link
Contributor

Reference Issues/PRs

closes #224

What does this implement/fix? Explain your changes.

Trusting numpy.ufuncs and numpy.dtypes by default.

Any other comments?

  1. I compared the ufuncs list obtained in this PR with the available ufuncs. They are almost the same except the aliases. Take true_divide as an example that is an alias for divide and unfortunately cannot be found with the dir(...) and isinstance(...) approach. Is there a way to find numpy aliases ?
  2. test_can_persist_fitted didn't run successfully until I added NUMPY_DTYPE_TYPE_NAMES to the default trusted list of NdArrayNode. I am wondering why isn't DTypeNode the correct place ? Is there any other Nodes I am missing ?

@omar-araboghli omar-araboghli changed the title #224: trusting np.ufuncs and np.dtypes ENH trusting np.ufuncs and np.dtypes Apr 1, 2023
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skops/io/tests/test_audit.py Outdated Show resolved Hide resolved
skops/io/tests/test_audit.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take true_divide as an example that is an alias for divide and unfortunately cannot be found with the dir(...) and isinstance(...) approach. Is there a way to find numpy aliases

Not that I know of, but it's better to trust too little than too much, so it's fine for now.

test_can_persist_fitted didn't run successfully until I added NUMPY_DTYPE_TYPE_NAMES to the default trusted list of NdArrayNode

This might be a bug, let me investigate before we merge this.

skops/io/_trusted_types.py Outdated Show resolved Hide resolved
skops/io/_trusted_types.py Outdated Show resolved Hide resolved
skops/io/tests/test_audit.py Outdated Show resolved Hide resolved
@BenjaminBossan
Copy link
Collaborator

test_can_persist_fitted didn't run successfully until I added NUMPY_DTYPE_TYPE_NAMES to the default trusted list of NdArrayNode

This might be a bug, let me investigate before we merge this.

Okay, I believe this is not a bug. The reason is that we use NdArrayNode not only for numpy arrays, but also for numpy generics. If e.g. a numpy float is saved, it will use NdArrayNode, which is why the numpy dtypes have to be trusted there. This is indeed a bit confusing, maybe another name for the class would have been better.

Another source of confusion (at least for me) is what is meant by dtype. Here we mean e.g. a numpy.float64. However, there is also the "dtype" type, e.g. numpy.dtype[float64], which is persisted using the DtypeNode.

At the end of the day, I think everything is working here as expected, but the naming can cause confusion. What's your opinion @adrinjalali?

@adrinjalali
Copy link
Member

Changing the names here wouldn't be backward compatible, and will probably take some time, and I also don't have very good ideas for their names. So I think we can leave that for a future iteration.

@BenjaminBossan
Copy link
Collaborator

Changing the names here wouldn't be backward compatible, and will probably take some time, and I also don't have very good ideas for their names. So I think we can leave that for a future iteration.

True, we should probably add a comment though.

@omar-araboghli I think the docs also need to be updated here: https://skops.readthedocs.io/en/stable/persistence.html#usage

@omar-araboghli
Copy link
Contributor Author

@adrinjalali and @BenjaminBossan thanks for the review and clarification! Please let me know in case something still has to be changed.

I defined the function get_public_type_names that does the thing we need to do twice for now and used walrus operator as suggested, also multiple times. Does its name make sense ?

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise it's looking great I think.

@@ -200,3 +201,36 @@ def get_type_paths(types: Any) -> list[str]:
types = [types]

return [get_type_name(t) if not isinstance(t, str) else t for t in types]


def get_public_type_names(module: ModuleType, _type: Type) -> list[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename _type to oftype?

{
type_name
for attr in dir(module)
if (isinstance(obj := getattr(module, attr), _type))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isinstance or issubclass?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing it to

Suggested change
if (isinstance(obj := getattr(module, attr), _type))
if (issubclass((obj := getattr(module, attr)).__class__, _type))

gets the same results for both numpy and scipy. Since the ufuncs are classes and not objects (instances), issubclass makes more sense to use here. Thanks for the suggestion!

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good, thx. Only two minor issues for me, otherwise the PR can be merged.

@@ -52,6 +53,8 @@ def ndarray_get_state(obj: Any, save_context: SaveContext) -> dict[str, Any]:


class NdArrayNode(Node):
# TODO: NdArrayNode and DtypeNode names lead to confusion, see PR-336
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of sending users on a chase, let's just clear the confusion right away: That this is not only responsible for np arrays, but also for np generics.

Comment on lines 233 to 234
if (issubclass((obj := getattr(module, attr)).__class__, oftype))
and ((type_name := get_type_name(obj)).startswith(module_name))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For both lines, are the outer parens necessary? It seems to me that they could be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wasn't easy to use walrus operator in such "complex" statement 😄 Thanks for the catch - removed them!

@BenjaminBossan
Copy link
Collaborator

@omar-araboghli is this ready for review?

@omar-araboghli
Copy link
Contributor Author

omar-araboghli commented Apr 20, 2023

@omar-araboghli is this ready for review?

@BenjaminBossan Indeed! Sorry for not pingging you.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing the suggestions, this looks good now. You'd need to merge the current main and fix the potential merge conflict in changes.rst, then this is good to go.

It would have been nice if NUMPY_DTYPE_TYPE_NAMES could also use get_public_type_names but it wouldn't work as is. If you happen to come up with a clever solution to that works for both, it would be good to make that change, but it's not a blocker.

@BenjaminBossan
Copy link
Collaborator

I just noticed that test_external.py can also be cleaned up, as it contains a couple of numpy dtypes that no longer need to be trusted explicitly.

@omar-araboghli
Copy link
Contributor Author

@BenjaminBossan just resolved the conflict and cleaned test_externals.py as well as test_visualize.py up. I unfortunately couldn't come up with a generic solution to have NUMPY_DTYPE_TYPE_NAMES use get_public_type_names since numpy doesn't have a superclass covering them. Getting them from np.sctypes seems to be the simplest way to go with. Any hint on that or shall we keep it as it as for now ?

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work cleaning up the remaining instances, this looks very good now. There is a small typo in one of the docstrings, besides that this can be merged.

I unfortunately couldn't come up with a generic solution

Then it's fine, it would have been nice but it's not a requirement.

skops/io/_utils.py Outdated Show resolved Hide resolved
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@adrinjalali adrinjalali enabled auto-merge (squash) May 15, 2023 10:18
@omar-araboghli
Copy link
Contributor Author

@BenjaminBossan Thanks for the review and great catch! The typo is now resolved, thanks to @adrinjalali :)

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thx

@adrinjalali adrinjalali merged commit c68141e into skops-dev:main May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add {numpy, scipy}.ufuncs to the trusted list
3 participants