Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT Function to visualize skops files #317

Merged

Conversation

BenjaminBossan
Copy link
Collaborator

Resolves #301

This is not finished, just a basis for discussion.

Here is the output for a simple MinMaxScaler(feature_range=(-555, 123)):

sklearn.preprocessing._data.MinMaxScaler-attrs: builtins.dict---feature_range: builtins.tuple-----content: json-type(-555)
├-----content: json-type(123)
├---copy: json-type(true)
├---clip: json-type(false)
├---_sklearn_version: json-type("1.2.0")

And below is the output for a bigger Pipeline discussed here, but with the option show="untrusted", i.e. only showing untrusted types:

sklearn.pipeline.Pipeline [UNSAFE]
├-attrs: builtins.dict [UNSAFE]
├---steps: builtins.list [UNSAFE]
├-----content: builtins.tuple [UNSAFE]
├-------content: sklearn.pipeline.FeatureUnion [UNSAFE]
├---------attrs: builtins.dict [UNSAFE]
├-----------transformer_list: builtins.list [UNSAFE]
├-------------content: builtins.tuple [UNSAFE]
├---------------content: sklearn.preprocessing._data.StandardScaler [UNSAFE]
├-----------------attrs: builtins.dict [UNSAFE]
├-------------------n_samples_seen_: numpy.int64 [UNSAFE]
├-------------content: builtins.tuple [UNSAFE]
├---------------content: sklearn.pipeline.Pipeline [UNSAFE]
├-----------------attrs: builtins.dict [UNSAFE]
├-------------------steps: builtins.list [UNSAFE]
├---------------------content: builtins.tuple [UNSAFE]
├-----------------------content: sklearn.preprocessing._function_transformer.FunctionTransformer [UNSAFE]
├-------------------------attrs: builtins.dict [UNSAFE]
├---------------------------func: numpy.ufunc => numpy.core._multiarray_umath.sqrt [UNSAFE]

I.e. all the parent nodes are considered untrusted if a child is untrusted. This is now achieved by leveraging node.get_unsafe_set().

Before going too deep with this, I would like to discuss what other users and @skops-dev/maintainers think about this. Some open points that come to my mind:

  1. Should we check schema of nodes or not? Especially if they are nodes whose children are not visited*.
  2. Are we fine with the visual representation? Could it be prettier? Json-types are currently shown as json-type(123) for instance.
  3. Right now, some nodes have special dispatch for visualization, e.g. for functions (or else we would only show numpy.ufunc, not what ufunc is actually used). I wonder if we should move this to the Node classes?

*Why are some children not visited? Because they can't. E.g. for ndarray, the children are io.BytesIO, not nodes.

Resolves skops-dev#301

This is not finished, just a basis for discussion.
Don't add it to parent nodes.
@BenjaminBossan
Copy link
Collaborator Author

Okay, I made an update to only mark nodes as [UNSAFE] if the nodes themselves are unsafe, not if they are safe but one of their children is unsafe. This is how the example above is now shown:

root: sklearn.pipeline.Pipeline
├-attrs: builtins.dict
├---steps: builtins.list
├-----content: builtins.tuple
├-------content: sklearn.pipeline.FeatureUnion
├---------attrs: builtins.dict
├-----------transformer_list: builtins.list
├-------------content: builtins.tuple
├---------------content: sklearn.preprocessing._data.StandardScaler
├-----------------attrs: builtins.dict
├-------------------n_samples_seen_: numpy.int64 [UNSAFE]
├-------------content: builtins.tuple
├---------------content: sklearn.pipeline.Pipeline
├-----------------attrs: builtins.dict
├-------------------steps: builtins.list
├---------------------content: builtins.tuple
├-----------------------content: sklearn.preprocessing._function_transformer.FunctionTransformer
├-------------------------attrs: builtins.dict
├---------------------------func: numpy.ufunc => numpy.core._multiarray_umath.sqrt [UNSAFE]

@BenjaminBossan
Copy link
Collaborator Author

BenjaminBossan commented Mar 14, 2023

A variant using ansi colors and cleaner lines:

image

Code should be much more straightforward now.

Moreover, it is refactored so that the sink function now gets an
iterator of all nodes, not just of a single node. This is better because
printing of nodes can now happen in context. Just as an example, a node
could now be printed differently if it is the first node of its level.
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is great, and better than what I originally had. I personally like this version the best: #317 (comment), but that's because I'm colorblind-ish and the colors in the last version bother me a bit, but it'd be nice to have that as an argument on how to format things.

return should_print


def _check_node_and_children_safe(node: Node, trusted: bool | Sequence[str]) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be okay to move this to nodes and cache them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, there is already node.is_safe() >__<


# use singledispatch so that we can register specialized visualization functions
@singledispatch
def format_node(node: Node) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also don't mind moving these to nodes.

@BenjaminBossan BenjaminBossan changed the title [WIP] Function to visualize skops files Function to visualize skops files Mar 17, 2023
@BenjaminBossan
Copy link
Collaborator Author

@skops-dev/maintainers This is now ready for review

As an example, here is the top part of the visualization of the following estimator:

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("scaler", StandardScaler()),
        ("scaled-poly", Pipeline([
            ("polys", FeatureUnion([
                ("poly1", PolynomialFeatures()),
                ("poly2", PolynomialFeatures(degree=3, include_bias=False))
            ])),
            ("square-root", FunctionTransformer(unsafe_function)),
            ("scale", MinMaxScaler()),
        ])),
    ])),
    ("clf", LogisticRegression(random_state=0, solver="liblinear")),
]).fit([[0, 1], [2, 3], [4, 5]], [0, 1, 2])

file = sio.dumps(pipeline)
sio.visualize_tree(file)

image

Lots of stuff can be customized, e.g. if colors should be used (and which), how to mark untrusted types, etc. The visualization requires rich, but users can provide custom printing functions if they want to avoid that.

@BenjaminBossan BenjaminBossan marked this pull request as ready for review March 17, 2023 14:01
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty nice. We do need to add rich to setup.py with a key though, somehow in extras. I would have liked to keep your own implementation of sink, so that if rich is not installed, or a parameter is passed in a certain way, we just use that.

skops/io/_visualize.py Show resolved Hide resolved
This function visits all nodes of the object tree and determines:

- level: how nested the node is
- key: the key of the node, e.g. the key of a dict.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I understand 'key' here

Comment on lines +173 to +174
The key to the current node. If "key_types" is encountered, it is
skipped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be skipped? can it not be unsafe?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is checked now

# key_types is not helpful, as it is artificially added by skops to
# circumvent the fact that json only allows keys to be strings. It is not
# useful to the user and adds a lot of noise, thus skip key_types.
# TODO: check that no funny business is going on in key types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could always hide them, unless they're unsafe maybe?

)


def visualize_tree(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it matters that it's a tree. What about just visualize? Or inspect (although there's the inspect module)?

@adrinjalali adrinjalali added this to the v0.6 milestone Mar 21, 2023
@BenjaminBossan
Copy link
Collaborator Author

I would have liked to keep your own implementation of sink, so that if rich is not installed, or a parameter is passed in a certain way, we just use that.

@adrinjalali I experimented a bit and could create a custom implementation that does not rely on rich.Tree. It produces the exact same output, so I'd switch over to it. However, I would not go down the road of reimplementing the colorization stuff, since that can become quite tricky. Therefore, I would propose that we don't use colors if rich is not installed and do use colors if it is installed. Of course, users can still pass use_colors=False to prevent colorization even if rich is installed. I think in that case, we don't need an optional install flag for rich, as there would be no error if rich is not installed. WDYT?

@adrinjalali
Copy link
Member

It produces the exact same output, so I'd switch over to it.

Perfect.

I think in that case, we don't need an optional install flag for rich, as there would be no error if rich is not installed. WDYT?

That's fair, but I'd still add rich to an extras group kinda thing in setup.py, so that users can easily have a way of installing skops with all optional dependencies.

- rename visualize_tree => visualize
- more explanatory comments
- when encountering 'key_types', check type and safety
- possibility to visualize without rich
- make rich an extra install
- more documentation
Copy link
Collaborator Author

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed your points, please review again.

skops/io/_visualize.py Show resolved Hide resolved
Comment on lines +173 to +174
The key to the current node. If "key_types" is encountered, it is
skipped.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is checked now

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome!

@adrinjalali adrinjalali changed the title Function to visualize skops files FEAT Function to visualize skops files Mar 22, 2023
@adrinjalali adrinjalali merged commit 1646b18 into skops-dev:main Mar 22, 2023
@BenjaminBossan BenjaminBossan deleted the issue-301-visualize-skops-file branch March 23, 2023 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Utility function to better understand schema of a skops file
2 participants