ctree_leaf_distributions() label_all_leaves parameter #220

mepland · 2022-12-28T21:27:35Z

Added label_all_leaves parameter which turns on (off) x axis ticks for each leaf, useful when there are few (many) leaves.

dtreeviz/trees.py

parrt · 2022-12-28T23:33:28Z

If we decide this is the right approach, it seems like we need it on all of the leaf information functions to be consistent:

leaf_sizes
leaf_purity
rtree_leaf_distributions
ctree_leaf_distributions

…r each leaf, useful when there are few (many) leaves.

mepland · 2022-12-28T23:41:28Z

Yeah, that's a good point. I just find it useful to turn off the x-axis ticks if you are plotting a deep tree with lots of nodes. In that case, having each node getting its own tick and ticklabel makes the x-axis a mess of overlapped text.

Ideally these leaf info functions would automatically handle > ~20 nodes by making some node ids into major ticks and ignoring the rest. However, I'm not sure how to do this cleanly with the current approach of:

ax.set_xticks(range(0, len(index)))
ax.set_xticklabels(index)

We could code up our own logic around how to pick the major / minor node ids, but it would be nice to leverage any built-in matplotlib functionality.

parrt · 2022-12-28T23:53:19Z

We can maybe clean that up later.

dtreeviz/trees.py

tlapusan · 2023-01-03T08:32:07Z

@mepland indeed, when the tree has a lot of leaves, the plot is not very nicely displayed.

@parrt I do remember that we discuss recently a similar issue and one possible solution was to either make a vertical plot or to plot the labels at a 45 degree.

If we look at the leaf_sizes() viz, we can see that we have the min_size and max_size arguments, which can help us to filter out the leaves which we are not interested. Would help to have it implemented for ctree_leaf_distributions.

@mepland to label only the major ticks I think this sample code can help, at least I tried it and put the tick and label on leaf 1 and 4.

ax.set_xticks([1,4])
ax.set_xticklabels(["leaf 1", "leaf 4"])

@mepland do you have an example of a messy plot ? would help to work on it and to see how the changes will reflect in display.

mepland · 2023-01-04T23:32:38Z

@parrt I do remember that we discuss recently a similar issue and one possible solution was to either make a vertical plot or to plot the labels at a 45 degree.

I don't think this is a sustainable solution, there will always be a bigger tree with more nodes!

If we look at the leaf_sizes() viz, we can see that we have the min_size and max_size arguments, which can help us to filter out the leaves which we are not interested. Would help to have it implemented for ctree_leaf_distributions.

While a useful addition, I also don't think this is a full solution as users may want to have one plot with all the leaves.

@mepland to label only the major ticks I think this sample code can help, at least I tried it and put the tick and label on leaf 1 and 4.

ax.set_xticks([1,4])
ax.set_xticklabels(["leaf 1", "leaf 4"])

I can try something like this, it was just simpler to turn off the IDs to start 😄

@mepland do you have an example of a messy plot ? would help to work on it and to see how the changes will reflect in display.

Any significantly deep tree, here DecisionTreeClassifier(max_depth=6), will break it. See here for an example.

parrt · 2023-01-06T22:33:25Z

That example makes it look like the leaf ideas actually aren't that useful. Are people really asking for leaves by id?

mepland · 2023-01-07T00:26:28Z

It could be useful I suppose, though I am fine turning them off. I just think we should have a better solution in place to handle this mess of labels.

mepland · 2023-01-09T02:53:05Z

@parrt where should we go from here? Should I make an issue and start a new PR that tries to do this more automatically, or are we happy with the brute force parameter that turns on/off all of the tick labels?

tlapusan · 2023-01-09T06:29:50Z

@parrt the leaf id would be important to know when the viz will contain an 'interesting' samples distribution. In our titanic dataset would be a leaf where majority of people survived. Having the id of the leaf, I could use the note_stats(node_id) to get all the samples from that leaf and to investigate them.

@mepland when the tree will be very big and with a lot of leaves, even displaying without the leaf id will be a challange for the viz. Let's don't forget that the user has the possibility to specify the figsize, which will help. In general, when the tree will have a lot of nodes, dtreeviz will have challenges to present them nicely into a single plot.

btw, I agree that we can have a parameter to turn off the leaf ids if the user wants to.

@mepland have you thought to other ideas when the number of leaves is large ?

mepland · 2023-01-10T00:36:07Z

@tlapusan good point just making the plot bigger with figsize.

We could also plot the leaves on a "normal" axis and let matplotlib take care of the major / minor tick scaling to fit the available space. There will be potential gaps (in your plot between 5 and 10, 65 and 70, etc) if the node_ids are not sequential - but that should be fine.

Perhaps we have the current behavior, no labels, and my proposed normal matplotlib axis behavior as options?

parrt · 2023-01-14T19:08:53Z

Having the id of the leaf, I could use the note_stats(node_id) to get all the samples from that leaf and to investigate them.

Good point. Yes, and they can simply make the figure wider when there are lots of IDs.

@tlapusan what do you think of allowing the plotting library to decide where the ticks go depending on the figure size? Might make it hard to identify an individual node...

I'd also like to avoid an option where we can display them and not display them. I think we should always display and then let them change the figure size. Also we need to decide whether each node gets an individual label or we let the Library decide.

tlapusan · 2023-01-15T16:57:05Z

When I proposed this viz, I had the usecase in my mind that some leaves could contain some interesting labels -> investigate them (using the leafid and node_stats(id)) -> improve domain knowledge about the dataset which could help to create new features.

Indeed, there could be another usecase when the user wants to have a general overview how the leaf distributions looks. In this case I think the leaf ids are not important and the user could have the option of not displaying them (especially when we have a lot of leaves and there is a mess for leaf labels).

About the implementing our own logic of choosing what leaf ids to present...I think that is hard to implement because there could be an infinity combinations of what a user think it's important for him to see. Better to let the user to choose what he wants to see, otherwise could be frustrating for him (at least for me :)) ).

That's why I still think that using two parameters (see the leaf_size viz) like min_size (or a better name could be min_sample) and max_size could help. I have in mind this scenario: the tree has a lot of leaves, but I can see from the viz (general overview) that some leaves have a lot of samples. In this case, I could set the min_size to a value to force the viz to show me only the leaves on which I'm interested and afterword to take some leaf ids and investigate them.

Bellow is an example for leaf_size() which could be applied to ctree_leaf_distributions() also:

parrt · 2023-01-15T18:19:33Z

Thanks for the explanation. How about something more general and instead of min_size, we pass in a function or lambda that takes a size parameter and returns to or false as to whether it should be included?

Then, we need an option to turn off/on the leaf ids.

Does that make sense? Seems we would then make an analogous change for the other leaf stats methods, right?

mepland · 2023-01-15T19:06:42Z

I'm going to close this PR for now, let's make a new branch after 2.1 is released and we decide what changes we'd like to make.

mepland · 2023-01-15T21:46:30Z

Started an issue to track: #247

parrt reviewed Dec 28, 2022

View reviewed changes

dtreeviz/trees.py Outdated Show resolved Hide resolved

parrt reviewed Dec 28, 2022

View reviewed changes

dtreeviz/trees.py Outdated Show resolved Hide resolved

mepland added 2 commits December 28, 2022 18:34

Added label_all_leaves parameter which turns on (off) x axis ticks fo…

cde707a

…r each leaf, useful when there are few (many) leaves.

reorder params

45c2b81

mepland force-pushed the label_all_leaves branch from b2fc8ef to 45c2b81 Compare December 28, 2022 23:34

parrt added this to the 2.0 milestone Dec 28, 2022

parrt added the enhancement New feature or request label Dec 28, 2022

parrt reviewed Dec 28, 2022

View reviewed changes

dtreeviz/trees.py Outdated Show resolved Hide resolved

mepland mentioned this pull request Dec 30, 2022

Graphical tweaks and bug fixes #200

Closed

label_all_leaves -> show_leaf_labels

54860c0

parrt modified the milestones: 2.0, 2.1 Jan 3, 2023

mepland closed this Jan 15, 2023

mepland mentioned this pull request Jan 15, 2023

Handle numerous leaves better in leaf stats plots - no overlaps #247

Open

tlapusan mentioned this pull request Jan 29, 2023

Add different display types for leaf_distributions x-axis. #254

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ctree_leaf_distributions() label_all_leaves parameter #220

ctree_leaf_distributions() label_all_leaves parameter #220

mepland commented Dec 28, 2022

parrt commented Dec 28, 2022

mepland commented Dec 28, 2022

parrt commented Dec 28, 2022

tlapusan commented Jan 3, 2023 •

edited

Loading

mepland commented Jan 4, 2023

parrt commented Jan 6, 2023

mepland commented Jan 7, 2023

mepland commented Jan 9, 2023

tlapusan commented Jan 9, 2023

mepland commented Jan 10, 2023

parrt commented Jan 14, 2023

tlapusan commented Jan 15, 2023

parrt commented Jan 15, 2023

mepland commented Jan 15, 2023

mepland commented Jan 15, 2023

ctree_leaf_distributions() label_all_leaves parameter #220

ctree_leaf_distributions() label_all_leaves parameter #220

Conversation

mepland commented Dec 28, 2022

parrt commented Dec 28, 2022

mepland commented Dec 28, 2022

parrt commented Dec 28, 2022

tlapusan commented Jan 3, 2023 • edited Loading

mepland commented Jan 4, 2023

parrt commented Jan 6, 2023

mepland commented Jan 7, 2023

mepland commented Jan 9, 2023

tlapusan commented Jan 9, 2023

mepland commented Jan 10, 2023

parrt commented Jan 14, 2023

tlapusan commented Jan 15, 2023

parrt commented Jan 15, 2023

mepland commented Jan 15, 2023

mepland commented Jan 15, 2023

tlapusan commented Jan 3, 2023 •

edited

Loading