Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ctree_leaf_distributions() label_all_leaves parameter #220

Closed
wants to merge 3 commits into from

Conversation

mepland
Copy link
Collaborator

@mepland mepland commented Dec 28, 2022

Added label_all_leaves parameter which turns on (off) x axis ticks for each leaf, useful when there are few (many) leaves.

dtreeviz/trees.py Outdated Show resolved Hide resolved
dtreeviz/trees.py Outdated Show resolved Hide resolved
@parrt
Copy link
Owner

parrt commented Dec 28, 2022

If we decide this is the right approach, it seems like we need it on all of the leaf information functions to be consistent:

  • leaf_sizes
  • leaf_purity
  • rtree_leaf_distributions
  • ctree_leaf_distributions

@mepland
Copy link
Collaborator Author

mepland commented Dec 28, 2022

Yeah, that's a good point. I just find it useful to turn off the x-axis ticks if you are plotting a deep tree with lots of nodes. In that case, having each node getting its own tick and ticklabel makes the x-axis a mess of overlapped text.

Ideally these leaf info functions would automatically handle > ~20 nodes by making some node ids into major ticks and ignoring the rest. However, I'm not sure how to do this cleanly with the current approach of:

ax.set_xticks(range(0, len(index)))
ax.set_xticklabels(index)

We could code up our own logic around how to pick the major / minor node ids, but it would be nice to leverage any built-in matplotlib functionality.

@parrt parrt added this to the 2.0 milestone Dec 28, 2022
@parrt parrt added the enhancement New feature or request label Dec 28, 2022
@parrt
Copy link
Owner

parrt commented Dec 28, 2022

We can maybe clean that up later.

dtreeviz/trees.py Outdated Show resolved Hide resolved
@tlapusan
Copy link
Collaborator

tlapusan commented Jan 3, 2023

@mepland indeed, when the tree has a lot of leaves, the plot is not very nicely displayed.

@parrt I do remember that we discuss recently a similar issue and one possible solution was to either make a vertical plot or to plot the labels at a 45 degree.

If we look at the leaf_sizes() viz, we can see that we have the min_size and max_size arguments, which can help us to filter out the leaves which we are not interested. Would help to have it implemented for ctree_leaf_distributions.

@mepland to label only the major ticks I think this sample code can help, at least I tried it and put the tick and label on leaf 1 and 4.

ax.set_xticks([1,4])
ax.set_xticklabels(["leaf 1", "leaf 4"])

@mepland do you have an example of a messy plot ? would help to work on it and to see how the changes will reflect in display.

@parrt parrt modified the milestones: 2.0, 2.1 Jan 3, 2023
@mepland
Copy link
Collaborator Author

mepland commented Jan 4, 2023

@parrt I do remember that we discuss recently a similar issue and one possible solution was to either make a vertical plot or to plot the labels at a 45 degree.

I don't think this is a sustainable solution, there will always be a bigger tree with more nodes!

If we look at the leaf_sizes() viz, we can see that we have the min_size and max_size arguments, which can help us to filter out the leaves which we are not interested. Would help to have it implemented for ctree_leaf_distributions.

While a useful addition, I also don't think this is a full solution as users may want to have one plot with all the leaves.

@mepland to label only the major ticks I think this sample code can help, at least I tried it and put the tick and label on leaf 1 and 4.

ax.set_xticks([1,4])
ax.set_xticklabels(["leaf 1", "leaf 4"])

I can try something like this, it was just simpler to turn off the IDs to start 😄

@mepland do you have an example of a messy plot ? would help to work on it and to see how the changes will reflect in display.

Any significantly deep tree, here DecisionTreeClassifier(max_depth=6), will break it. See here for an example.

@parrt
Copy link
Owner

parrt commented Jan 6, 2023

That example makes it look like the leaf ideas actually aren't that useful. Are people really asking for leaves by id?

@mepland
Copy link
Collaborator Author

mepland commented Jan 7, 2023

It could be useful I suppose, though I am fine turning them off. I just think we should have a better solution in place to handle this mess of labels.

@mepland
Copy link
Collaborator Author

mepland commented Jan 9, 2023

@parrt where should we go from here? Should I make an issue and start a new PR that tries to do this more automatically, or are we happy with the brute force parameter that turns on/off all of the tick labels?

@tlapusan
Copy link
Collaborator

tlapusan commented Jan 9, 2023

@parrt the leaf id would be important to know when the viz will contain an 'interesting' samples distribution. In our titanic dataset would be a leaf where majority of people survived. Having the id of the leaf, I could use the note_stats(node_id) to get all the samples from that leaf and to investigate them.

@mepland when the tree will be very big and with a lot of leaves, even displaying without the leaf id will be a challange for the viz. Let's don't forget that the user has the possibility to specify the figsize, which will help. In general, when the tree will have a lot of nodes, dtreeviz will have challenges to present them nicely into a single plot.

Screenshot 2023-01-09 at 08 25 00

btw, I agree that we can have a parameter to turn off the leaf ids if the user wants to.

@mepland have you thought to other ideas when the number of leaves is large ?

@mepland
Copy link
Collaborator Author

mepland commented Jan 10, 2023

@tlapusan good point just making the plot bigger with figsize.

We could also plot the leaves on a "normal" axis and let matplotlib take care of the major / minor tick scaling to fit the available space. There will be potential gaps (in your plot between 5 and 10, 65 and 70, etc) if the node_ids are not sequential - but that should be fine.

Perhaps we have the current behavior, no labels, and my proposed normal matplotlib axis behavior as options?

@parrt
Copy link
Owner

parrt commented Jan 14, 2023

Having the id of the leaf, I could use the note_stats(node_id) to get all the samples from that leaf and to investigate them.

Good point. Yes, and they can simply make the figure wider when there are lots of IDs.

@tlapusan what do you think of allowing the plotting library to decide where the ticks go depending on the figure size? Might make it hard to identify an individual node...

I'd also like to avoid an option where we can display them and not display them. I think we should always display and then let them change the figure size. Also we need to decide whether each node gets an individual label or we let the Library decide.

@tlapusan
Copy link
Collaborator

When I proposed this viz, I had the usecase in my mind that some leaves could contain some interesting labels -> investigate them (using the leafid and node_stats(id)) -> improve domain knowledge about the dataset which could help to create new features.

Indeed, there could be another usecase when the user wants to have a general overview how the leaf distributions looks. In this case I think the leaf ids are not important and the user could have the option of not displaying them (especially when we have a lot of leaves and there is a mess for leaf labels).

About the implementing our own logic of choosing what leaf ids to present...I think that is hard to implement because there could be an infinity combinations of what a user think it's important for him to see. Better to let the user to choose what he wants to see, otherwise could be frustrating for him (at least for me :)) ).

That's why I still think that using two parameters (see the leaf_size viz) like min_size (or a better name could be min_sample) and max_size could help. I have in mind this scenario: the tree has a lot of leaves, but I can see from the viz (general overview) that some leaves have a lot of samples. In this case, I could set the min_size to a value to force the viz to show me only the leaves on which I'm interested and afterword to take some leaf ids and investigate them.

Bellow is an example for leaf_size() which could be applied to ctree_leaf_distributions() also:

Screenshot 2023-01-15 at 18 54 47

@parrt
Copy link
Owner

parrt commented Jan 15, 2023

Thanks for the explanation. How about something more general and instead of min_size, we pass in a function or lambda that takes a size parameter and returns to or false as to whether it should be included?

Then, we need an option to turn off/on the leaf ids.

Does that make sense? Seems we would then make an analogous change for the other leaf stats methods, right?

@mepland
Copy link
Collaborator Author

mepland commented Jan 15, 2023

I'm going to close this PR for now, let's make a new branch after 2.1 is released and we decide what changes we'd like to make.

@mepland
Copy link
Collaborator Author

mepland commented Jan 15, 2023

Started an issue to track: #247

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants