Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Decision Tree Visualization #386

Merged
merged 12 commits into from
Aug 1, 2023
Merged

FIX: Decision Tree Visualization #386

merged 12 commits into from
Aug 1, 2023

Conversation

reidjohnson
Copy link
Contributor

@reidjohnson reidjohnson commented Jul 24, 2023

Fixes issue #385.

The root cause of the bug appears to be that the base sklearn tree constructor object is not wrapped as a valid skops Node type, but is still included in the visualized object tree.

This PR fixes the issue by appropriately processing this node.

Return empty if node is a constructor.
Swap LogisticRegression or RandomForestRegressor
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

("clf", LogisticRegression(random_state=0, solver="liblinear")),
("clf", RandomForestRegressor(random_state=0)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could test both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. One challenge here is that I don't think it'll be straightforward without a minor refactor of the pipeline fixture. The simplest approaches seem to be to parameterize the fixture or to use a GridSearchCV. Any preferences or suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also simply have a separate test only for RandomForestRegressor and RandomForestClsasifier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added separate test for DecisionTreeClassifier and DecisionTreeRegressor.

Comment on lines 235 to 236
if node_name == "constructor":
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems a bit shady, it might skip things that the user should see. Can you show here an example of what's inside this node in our case?

One would also need to update the node_name docstring if we're skipping here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very sympathetic to the concerns. Suspect you all know better here, so would defer to your recommendations.

As far as I am aware, sklearn tree-based models are the only ones that produce a "constructor" object of this type (presumably as an artifact of how skops handles C Extension types).

Taking the example:

from sklearn.tree import DecisionTreeClassifier
from skops.io import dumps, loads, visualize
dumped = dumps(DecisionTreeClassifier().fit([[0, 1], [2, 3], [4, 5]], [0, 1, 2]))
loaded = loads(dumped)
visualize(dumped)

The skipped node is <class 'sklearn.tree._tree.Tree'>. Skipping processing of this object, the code outputs:

root: sklearn.tree._classes.DecisionTreeClassifier
└──
attrs: builtins.dict

├──
criterion: json-type("gini")

├──
splitter: json-type("best")

├──
max_depth: json-type(null)

├──
min_samples_split: json-type(2)

├──
min_samples_leaf: json-type(1)

├──
min_weight_fraction_leaf: json-type(0.0)

├──
max_features: json-type(null)

├──
max_leaf_nodes: json-type(null)

├──
random_state: json-type(null)

├──
min_impurity_decrease: json-type(0.0)

├──
class_weight: json-type(null)

├──
ccp_alpha: json-type(0.0)

├──
n_features_in_: json-type(2)

├──
n_outputs_: json-type(1)

├──
classes_: numpy.ndarray

├──
n_classes_: numpy.int64

├──
max_features_: json-type(2)

├──
tree_: sklearn.tree._tree.Tree

├──
attrs: builtins.dict
│ │
├──
max_depth: json-type(2)
│ │
├──
node_count: json-type(5)
│ │
├──
nodes: numpy.ndarray
│ │
└──
values: numpy.ndarray

├──
args: builtins.tuple
│ │
├──
content: json-type(2)
│ │
├──
content: numpy.ndarray
│ │
└──
content: json-type(1)

└──
_sklearn_version: json-type("1.3.0")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think the best solution here would be to actually handle type, as in, in the visualized tree print what the type is, rather than skipping it. I'm sure there would be other cases where we'd encounter types as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this a first pass. So if isinstance(node, type), then print information about the type rather than skipping.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs a changelog.

Comment on lines 240 to 241
is_self_safe=False,
is_safe=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we should defer to the information in the Node instead of assuming it to be unsafe all the time. the Tree constructor for instance, is safe.

Copy link
Contributor Author

@reidjohnson reidjohnson Jul 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I agree. However the challenge I'm running into is that it's not a valid Node object, so it does not have the necessary attributes:

AttributeError: type object 'sklearn.tree._tree.Tree' has no attribute 'is_self_safe'

If I understand correctly, it's because the constructor class is directly set as a child node here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so probably the parent when doing the walk_tree should send this info to the child.

Comment on lines 225 to 228
if "is_self_safe" in kwargs:
is_self_safe = kwargs["is_self_safe"]
if "is_safe" in kwargs:
is_safe = kwargs["is_safe"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd make sense to pass these as actual args rather than **kwargs, and they'd need to be added to the docstring

@reidjohnson reidjohnson changed the title Fix Decision Tree Visualization FIX: Decision Tree Visualization Jul 28, 2023
@reidjohnson reidjohnson marked this pull request as ready for review July 28, 2023 01:28
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite nice, thank you @reidjohnson

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the extension of the walk_tree API to include is_self_safe and is_safe made me skeptical, as it looks like we have a leaky abstraction here. I tried to understand why that is necessary and I think I found a deeper underlying issue.

I think the correct solution right now is that we should not visualize the constructor. It is not an attribute of the object, just a utility class we add to construct it, so its presence is not really needed. Therefore, I think we should actually just skip the node info on the constructor.

This should be perfectly safe with the current code, since we only have two examples where we should have type(node) is type, namely TreeNode and SGDNode, where we ensure that the constructor is indeed safe. However, this may of course change in the future. Therefore, we could do a check that the type is either a Tree or in ALLOWED_SGD_LOSSES.

If we make this change, then we can roll back the changes in the signature of walk_tree and avoid the leaky abstraction.

As a more long term solution, I think we might want to consider something like this: the children of a ReduceNode should not directly contain the constructor or any type. Instead, we should wrap the constructor into a TypeNode, which is capable of determining if the constructor is safe or not.

If we make that change, the walk_tree function would, if it encounters a ReduceNode, just rely on ReduceNode.is_safe, which under the hood calls is_safe on the TypeNode.

WDYT @adrinjalali?

@reidjohnson
Copy link
Contributor Author

@BenjaminBossan Thanks for the comments. Updated with wrapping constructor into a TypeNode; reverted changes to walk_tree.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much nicer, but we should handle this change in protocol the same way it's done via the files in the io.old folder.

"constructor": TypeNode(
{
"__class__": constructor.__name__,
"__module__": constructor.__module__,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__module__ doesn't always exist, we should use get_module instead.

@BenjaminBossan
Copy link
Collaborator

The failing test seems to be that in the latest sklearn version, they added a new parameter:

  • ├── monotonic_cst: json-type(null)
    

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, this LGTM.

I think the codecov complaints are false positives, so if @adrinjalali approves of this final version, we can merge.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for the PR @reidjohnson , and thank you for being patient with our reviews.

@adrinjalali adrinjalali merged commit e8523d2 into skops-dev:main Aug 1, 2023
14 of 15 checks passed
@reidjohnson reidjohnson deleted the fix-decisiontree-visualize branch August 1, 2023 16:24
@reidjohnson
Copy link
Contributor Author

Thank you for the reviews, have been happy to contribute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants