Add different display types for leaf_distributions x-axis. #254

mepland · 2023-01-22T19:16:39Z

No description provided.

mepland · 2023-01-22T19:18:10Z

Here is a demo for ctree_leaf_distributions with a small tree where xaxis_display_type='individual' would be preferred:

mepland · 2023-01-22T19:20:20Z

And here is a demo for ctree_leaf_distributions with a large / deep tree where xaxis_display_type='auto' would be preferred:

mepland · 2023-01-22T19:24:25Z

@parrt @tlapusan LMK what you think! I like this as one potential solution, the gaps between the node IDs aren't too distracting if it there are a lot of leaves. We can add other options to xaxis_display_type too, if you want to try other solutions.

Also, when we are happy with the path forward, the changes will also need to be made to leaf_sizes, leaf_purity, rtree_leaf_distributions.

mepland · 2023-01-22T19:29:49Z

Probably should standardize the x-axis title to "Leaf IDs" in rtree_leaf_distributions as well.

parrt · 2023-01-22T19:31:05Z

Seems to me that node IDs just aren't useful unless you can identify a specific node, so for the auto I just don't think I would ever find it useful.

I think we should order the information by its usefulness not the arbitrary node ids. Too bad we can't simply have a hover over the bar that would pop up the node id if you want it. I don't think we can do interactive stuff with matplotlib.

mepland · 2023-01-22T19:42:57Z

I know I would find xaxis_display_type='auto' useful to make an overview plot of all the nodes.

For example, I would use it to make a first figure of all the nodes to say "here are all of the nodes of the tree, as you can see nodes X, Y, Z are interesting because of ...", and then include a second figure with xaxis_display_type='individual' cropped down to those nodes of interest. The major axis ticks will give a reader a rough idea of what nodes are interesting in the first plot while contextualizing them, and the second plot would give precise info on each node's properties.

parrt · 2023-01-22T21:34:01Z

I suppose the node ID General area could be useful and it doesn't hurt to include but fewer options is always good.

What do you think about ordering the graph by magnitude of y axis to get a better idea of what's important?

mepland · 2023-01-23T03:57:06Z

I think we should order the information by its usefulness not the arbitrary node ids. Too bad we can't simply have a hover over the bar that would pop up the node id if you want it. I don't think we can do interactive stuff with matplotlib.

What do you think about ordering the graph by magnitude of y axis to get a better idea of what's important?

Yeah, we could do interactive stuff with something like plotly, but then we couldn't save it to a flat file.

I'll add a xaxis_display_type='y_sorted' which will display the nodes in decreasing y value, without any x-axis tick labels.

mepland · 2023-01-23T04:24:36Z

What do you think about ordering the graph by magnitude of y axis to get a better idea of what's important?

@parrt how does this look for xaxis_display_type='y_sorted'?

tlapusan · 2023-01-29T08:53:49Z

I know I would find xaxis_display_type='auto' useful to make an overview plot of all the nodes.

For example, I would use it to make a first figure of all the nodes to say "here are all of the nodes of the tree, as you can see nodes X, Y, Z are interesting because of ...", and then include a second figure with xaxis_display_type='individual' cropped down to those nodes of interest. The major axis ticks will give a reader a rough idea of what nodes are interesting in the first plot while contextualizing them, and the second plot would give precise info on each node's properties.

@mepland curios what do you mean by "cropped down to those nodes of interes" :)
If the plot will contain a lot of leaves, by using 'individual' will still be useless since the user couldn't clearly get the node id.

I see useful the utility of "auto" in this case :

user get an idea in which range of ids are the leaves he wants to investigate
switch back to "individual", BUT the ctree_leaf_distributions() should have a parameter, like show_leaf_id_range, which will display only the leaves from that range. In this case, the leaf ids could be clearly seen. This is kind of the same leaf filtering strategy which I described here ctree_leaf_distributions() label_all_leaves parameter #220 (comment), but it seems it doesn't have that much attention :D

mepland · 2023-02-02T03:36:49Z

@tlapusan I meant basically what you suggested. We could include any combo of:

Specify the range of node_id values to include via show_leaf_id_range as you suggest.
Specify the exact node_id values to include via show_leaf_id_list
Specify for the leafs in terms of the range of y-value via show_leaf_y_range

I think 1 and 2 are the most useful, with 2 anything, such as 3, can be added on by the user.

mepland · 2023-02-20T19:17:39Z

Hi @parrt and @tlapusan, circling back to this. I think if I implement show_leaf_id_list, 2 above, this should give the end user the flexibility to do whatever they want without making the dtreeviz code overly complex. What do you think?

Would love to get this merged this month!

tlapusan · 2023-02-25T10:54:48Z

Hi @mepland,

I pull your code and ran the ctree_leaf_distributions for each xaxis_display_type value.

It seems to be some issues (worked in sklearn notebook). For auto, the x values are not correct distributed and for "y_sorted" there is a code issue (maybe because meanwhile we changed the code to support multi class labels (not only binary classification)).

mepland · 2023-02-25T17:05:44Z

@tlapusan @parrt I fixed the code so now it works with multiclass, and added the show_leaf_id_list parameter:

tlapusan · 2023-02-26T06:42:38Z

@mepland nice, those bugs were fixed :)

few suggestions :

when the plot will contain a lot of leaves, still hard to identify a node id by following the xaxis_display_type="auto" -> show_leaf_id_list=[x, y, z].
As we discussed here, Add different display types for leaf_distributions x-axis. #254 (comment), the step 1 and 3 could be more helpful. In the end, if someone could get the node id from the 'raw' visualisation... why would he need to use the show_leaf_id_list ?
xaxis_display_type="auto"
Would be helpful to have int values on X axis... seeing floating numbers is a little odd since the values should represent leaf ids.
xaxis_display_type="y_sorted"
Looks nice, do you think would be helpful to have also the leaf ids on X axis ?

parrt · 2023-02-26T19:09:24Z

@parrt how does this look for xaxis_display_type='y_sorted'?

Looks pretty cool to me!

parrt · 2023-02-26T19:14:41Z

Would it be possible to get the bars the same width (maybe only when there is room)? I guess it's no big deal and they aren't uniform as they are now.

I'm still not convinced we'll get a good way to identify leaves of interest looking at x axis labels. What about tooltip text? Unlike the tree view, these are just matplotlib plots so we can do tooltips:

tlapusan · 2023-02-27T18:21:34Z

tooltip could also help :)

I tried the first example from https://mplcursors.readthedocs.io/en/stable/ and it's not working for me... and I do meet the requirements "Python 3, and Matplotlib≥3.1."

I see tooltip as a 'hidden' feature... not sure that majority of people will discover it.

parrt · 2023-02-28T19:57:04Z

haha. figures mplcursors didn't work.

I guess I am OK with this PR; if it's useful to you guys that's good enough for me!

parrt · 2023-02-28T19:57:31Z

@mepland let me know when you are happy with it and I will merge.

mepland · 2023-03-03T01:08:04Z

Hi @parrt and @tlapusan, I just added a show_leaf_filter parameter that takes a function which a user can configure to implement any filtering criteria they would like on the leaf values, per class. Here is the doc string and a demo:

:param show_leaf_filter: Callable[[np.ndarray], bool], optional
   The filtering function to apply to leaf values before displaying the leaves.
   The function is applied to a numpy array with the class i sample value in row i.
   For example, to filter to leaves with 100 < total samples, and 5 < class 1 samples, use show_leaf_filter = lambda x: (100 < np.sum(x)) & (5 < x[1])

parrt · 2023-03-03T01:15:36Z

cool! Typo here though: "to filter to leaves with 100". Should that be "filtering out"?

mepland · 2023-03-03T01:20:43Z

Replying to your comments @tlapusan:

As we discussed here, Add different display types for leaf_distributions x-axis. #254 (comment), the step 1 and 3 could be more helpful. In the end, if someone could get the node id from the 'raw' visualisation... why would he need to use the show_leaf_id_list ?

show_leaf_filter gives you option 3, and more! Option 1, giving a range of node_id values, can be implemented easily by the user with an expansive show_leaf_id_list, so I do not think we should implement it specifically.

xaxis_display_type="auto"
Would be helpful to have int values on X axis... seeing floating numbers is a little odd since the values should represent leaf ids.

I agree it doesn't look the best, but if we force the x-axis ticks to have an int format then the tick labels do not make sense. In this example it would be 2, 5, 7, 10, 12, 15. I think it is better to leave as is, especially as users will be using auto when there are a lot of leaves and you are in more of this case.

xaxis_display_type="y_sorted"
Looks nice, do you think would be helpful to have also the leaf ids on X axis ?

When sorted by y value the leaf node_id values are all out of order, and would need to be individually labeled, like individual. However, when y_sorted is useful there are probably too many leaves for this to happen without overlapping text - putting us back in this situation.

mepland · 2023-03-03T01:23:27Z

cool! Typo here though: "to filter to leaves with 100". Should that be "filtering out"?

@parrt, not really a typo, the filter keeps leaves that pass its criteria. I'd be open to rewording it though, if it's not clear. Any suggestions?

mepland · 2023-03-03T01:27:13Z

Would it be possible to get the bars the same width (maybe only when there is room)? I guess it's no big deal and they aren't uniform as they are now.

@parrt, aren't the bars all the same width now? They just expand and contract to fit the available horizontal space. I guess I'm not sure what you mean by this.

And yeah, tooltip text is nice in theory - but frequently breaks and can't be printed to a flat image anyway 😄

parrt · 2023-03-03T02:02:42Z

aren't the bars all the same width now? They just expand and contract to fit the available horizontal space. I guess I'm not sure what you mean by this.

no worries. i just meant two diff nodes have diff width bars.

parrt · 2023-03-03T02:05:40Z

to filter to leaves

yeah, that doesn't make sense to me. Maybe: "to filter for"?

For example, to filter for leaves with 100 < total samples, and 5 < class 1 samples, use show_leaf_filter = lambda x: (100 < np.sum(x)) & (5 < x[1])

Or,

For example, to view only those leaves with more than 100 total samples and less than 5 class 1 samples, use show_leaf_filter = lambda x: (100 < np.sum(x)) & (5 < x[1])

Why do you use "100 < total" btw? Ain't that backwards? Never seen that except C programmers in 1980s.

mepland · 2023-03-03T04:33:23Z

For example, to view only those leaves with more than...

Updated to this.

Why do you use "100 < total" btw? Ain't that backwards? Never seen that except C programmers in 1980s.

Well, I did learn to program in C++ from particle physicists mostly trained in the 80s and early 90s so that might be part of it 😃, but I also like when the inequalities increase from left to right, like a number line.

mepland · 2023-03-03T04:34:15Z

@parrt I think it's ready to merge!

parrt · 2023-03-03T17:14:11Z

Ah with that interpretation of an axis, I can understand why that makes sense. Unfortunately no one expects that ;)

parrt · 2023-03-03T17:14:57Z

yay! Thanks very much for these important changes...

Added xaxis_display_type for ctree_leaf_distributions

b37d3bd

mepland mentioned this pull request Jan 22, 2023

Handle numerous leaves better in leaf stats plots - no overlaps #247

Open

mepland added 2 commits January 22, 2023 23:22

added y_sorted option

c86b1c9

Merge branch 'dev' into leaf_distributions_xaxis

a262bef

mepland added 2 commits January 22, 2023 23:25

remove commented out sort

3e564d0

Standardized x-axis title to Leaf IDs

fc4f95b

merged dev

7d8b82b

mepland added 2 commits February 25, 2023 11:33

fix for multiclass

e51b316

added show_leaf_id_list

e80f002

Merge branch 'dev' into leaf_distributions_xaxis

3bef3e4

mepland marked this pull request as ready for review February 25, 2023 17:06

Added show_leaf_filter

3849e95

updated doc string

7c721c7

parrt added this to the 2.2.1 milestone Mar 3, 2023

parrt added the enhancement New feature or request label Mar 3, 2023

parrt merged commit abdc697 into dev Mar 3, 2023

mepland deleted the leaf_distributions_xaxis branch March 3, 2023 17:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add different display types for leaf_distributions x-axis. #254

Add different display types for leaf_distributions x-axis. #254

mepland commented Jan 22, 2023

mepland commented Jan 22, 2023

mepland commented Jan 22, 2023

mepland commented Jan 22, 2023

mepland commented Jan 22, 2023

parrt commented Jan 22, 2023

mepland commented Jan 22, 2023

parrt commented Jan 22, 2023

mepland commented Jan 23, 2023

mepland commented Jan 23, 2023

tlapusan commented Jan 29, 2023

mepland commented Feb 2, 2023

mepland commented Feb 20, 2023

tlapusan commented Feb 25, 2023

mepland commented Feb 25, 2023

tlapusan commented Feb 26, 2023

parrt commented Feb 26, 2023

parrt commented Feb 26, 2023 •

edited

Loading

tlapusan commented Feb 27, 2023

parrt commented Feb 28, 2023

parrt commented Feb 28, 2023

mepland commented Mar 3, 2023

parrt commented Mar 3, 2023

mepland commented Mar 3, 2023

mepland commented Mar 3, 2023

mepland commented Mar 3, 2023

parrt commented Mar 3, 2023

parrt commented Mar 3, 2023

mepland commented Mar 3, 2023

mepland commented Mar 3, 2023

parrt commented Mar 3, 2023

parrt commented Mar 3, 2023

Add different display types for leaf_distributions x-axis. #254

Add different display types for leaf_distributions x-axis. #254

Conversation

mepland commented Jan 22, 2023

mepland commented Jan 22, 2023

mepland commented Jan 22, 2023

mepland commented Jan 22, 2023

mepland commented Jan 22, 2023

parrt commented Jan 22, 2023

mepland commented Jan 22, 2023

parrt commented Jan 22, 2023

mepland commented Jan 23, 2023

mepland commented Jan 23, 2023

tlapusan commented Jan 29, 2023

mepland commented Feb 2, 2023

mepland commented Feb 20, 2023

tlapusan commented Feb 25, 2023

mepland commented Feb 25, 2023

tlapusan commented Feb 26, 2023

parrt commented Feb 26, 2023

parrt commented Feb 26, 2023 • edited Loading

tlapusan commented Feb 27, 2023

parrt commented Feb 28, 2023

parrt commented Feb 28, 2023

mepland commented Mar 3, 2023

parrt commented Mar 3, 2023

mepland commented Mar 3, 2023

mepland commented Mar 3, 2023

mepland commented Mar 3, 2023

parrt commented Mar 3, 2023

parrt commented Mar 3, 2023

mepland commented Mar 3, 2023

mepland commented Mar 3, 2023

parrt commented Mar 3, 2023

parrt commented Mar 3, 2023

parrt commented Feb 26, 2023 •

edited

Loading