Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add different display types for leaf_distributions x-axis. #254

Merged
merged 11 commits into from
Mar 3, 2023

Conversation

mepland
Copy link
Collaborator

@mepland mepland commented Jan 22, 2023

No description provided.

@mepland
Copy link
Collaborator Author

mepland commented Jan 22, 2023

Here is a demo for ctree_leaf_distributions with a small tree where xaxis_display_type='individual' would be preferred:

image

@mepland
Copy link
Collaborator Author

mepland commented Jan 22, 2023

And here is a demo for ctree_leaf_distributions with a large / deep tree where xaxis_display_type='auto' would be preferred:

image

@mepland
Copy link
Collaborator Author

mepland commented Jan 22, 2023

@parrt @tlapusan LMK what you think! I like this as one potential solution, the gaps between the node IDs aren't too distracting if it there are a lot of leaves. We can add other options to xaxis_display_type too, if you want to try other solutions.

Also, when we are happy with the path forward, the changes will also need to be made to leaf_sizes, leaf_purity, rtree_leaf_distributions.

@mepland
Copy link
Collaborator Author

mepland commented Jan 22, 2023

Probably should standardize the x-axis title to "Leaf IDs" in rtree_leaf_distributions as well.

@parrt
Copy link
Owner

parrt commented Jan 22, 2023

Seems to me that node IDs just aren't useful unless you can identify a specific node, so for the auto I just don't think I would ever find it useful.

I think we should order the information by its usefulness not the arbitrary node ids. Too bad we can't simply have a hover over the bar that would pop up the node id if you want it. I don't think we can do interactive stuff with matplotlib.

@mepland
Copy link
Collaborator Author

mepland commented Jan 22, 2023

I know I would find xaxis_display_type='auto' useful to make an overview plot of all the nodes.

For example, I would use it to make a first figure of all the nodes to say "here are all of the nodes of the tree, as you can see nodes X, Y, Z are interesting because of ...", and then include a second figure with xaxis_display_type='individual' cropped down to those nodes of interest. The major axis ticks will give a reader a rough idea of what nodes are interesting in the first plot while contextualizing them, and the second plot would give precise info on each node's properties.

@parrt
Copy link
Owner

parrt commented Jan 22, 2023

I suppose the node ID General area could be useful and it doesn't hurt to include but fewer options is always good.

What do you think about ordering the graph by magnitude of y axis to get a better idea of what's important?

@mepland
Copy link
Collaborator Author

mepland commented Jan 23, 2023

I think we should order the information by its usefulness not the arbitrary node ids. Too bad we can't simply have a hover over the bar that would pop up the node id if you want it. I don't think we can do interactive stuff with matplotlib.

What do you think about ordering the graph by magnitude of y axis to get a better idea of what's important?

Yeah, we could do interactive stuff with something like plotly, but then we couldn't save it to a flat file.

I'll add a xaxis_display_type='y_sorted' which will display the nodes in decreasing y value, without any x-axis tick labels.

@mepland
Copy link
Collaborator Author

mepland commented Jan 23, 2023

What do you think about ordering the graph by magnitude of y axis to get a better idea of what's important?

@parrt how does this look for xaxis_display_type='y_sorted'?

image

@tlapusan
Copy link
Collaborator

I know I would find xaxis_display_type='auto' useful to make an overview plot of all the nodes.

For example, I would use it to make a first figure of all the nodes to say "here are all of the nodes of the tree, as you can see nodes X, Y, Z are interesting because of ...", and then include a second figure with xaxis_display_type='individual' cropped down to those nodes of interest. The major axis ticks will give a reader a rough idea of what nodes are interesting in the first plot while contextualizing them, and the second plot would give precise info on each node's properties.

@mepland curios what do you mean by "cropped down to those nodes of interes" :)
If the plot will contain a lot of leaves, by using 'individual' will still be useless since the user couldn't clearly get the node id.

I see useful the utility of "auto" in this case :

  1. user get an idea in which range of ids are the leaves he wants to investigate
  2. switch back to "individual", BUT the ctree_leaf_distributions() should have a parameter, like show_leaf_id_range, which will display only the leaves from that range. In this case, the leaf ids could be clearly seen. This is kind of the same leaf filtering strategy which I described here ctree_leaf_distributions() label_all_leaves parameter  #220 (comment), but it seems it doesn't have that much attention :D

@mepland
Copy link
Collaborator Author

mepland commented Feb 2, 2023

@tlapusan I meant basically what you suggested. We could include any combo of:

  1. Specify the range of node_id values to include via show_leaf_id_range as you suggest.
  2. Specify the exact node_id values to include via show_leaf_id_list
  3. Specify for the leafs in terms of the range of y-value via show_leaf_y_range

I think 1 and 2 are the most useful, with 2 anything, such as 3, can be added on by the user.

@mepland
Copy link
Collaborator Author

mepland commented Feb 20, 2023

Hi @parrt and @tlapusan, circling back to this. I think if I implement show_leaf_id_list, 2 above, this should give the end user the flexibility to do whatever they want without making the dtreeviz code overly complex. What do you think?

Would love to get this merged this month!

@tlapusan
Copy link
Collaborator

Hi @mepland,

I pull your code and ran the ctree_leaf_distributions for each xaxis_display_type value.

Screenshot 2023-02-25 at 12 50 46

It seems to be some issues (worked in sklearn notebook). For auto, the x values are not correct distributed and for "y_sorted" there is a code issue (maybe because meanwhile we changed the code to support multi class labels (not only binary classification)).

@mepland
Copy link
Collaborator Author

mepland commented Feb 25, 2023

@tlapusan @parrt I fixed the code so now it works with multiclass, and added the show_leaf_id_list parameter:
image

@mepland mepland marked this pull request as ready for review February 25, 2023 17:06
@tlapusan
Copy link
Collaborator

@mepland nice, those bugs were fixed :)

few suggestions :

  1. when the plot will contain a lot of leaves, still hard to identify a node id by following the xaxis_display_type="auto" -> show_leaf_id_list=[x, y, z].
    As we discussed here, Add different display types for leaf_distributions x-axis. #254 (comment), the step 1 and 3 could be more helpful. In the end, if someone could get the node id from the 'raw' visualisation... why would he need to use the show_leaf_id_list ?

  2. xaxis_display_type="auto"
    Would be helpful to have int values on X axis... seeing floating numbers is a little odd since the values should represent leaf ids.

  3. xaxis_display_type="y_sorted"
    Looks nice, do you think would be helpful to have also the leaf ids on X axis ?

@parrt
Copy link
Owner

parrt commented Feb 26, 2023

@parrt how does this look for xaxis_display_type='y_sorted'?

Looks pretty cool to me!

@parrt
Copy link
Owner

parrt commented Feb 26, 2023

Would it be possible to get the bars the same width (maybe only when there is room)? I guess it's no big deal and they aren't uniform as they are now.

I'm still not convinced we'll get a good way to identify leaves of interest looking at x axis labels. What about tooltip text? Unlike the tree view, these are just matplotlib plots so we can do tooltips:

@tlapusan
Copy link
Collaborator

tooltip could also help :)

I tried the first example from https://mplcursors.readthedocs.io/en/stable/ and it's not working for me... and I do meet the requirements "Python 3, and Matplotlib≥3.1."

I see tooltip as a 'hidden' feature... not sure that majority of people will discover it.

@parrt
Copy link
Owner

parrt commented Feb 28, 2023

haha. figures mplcursors didn't work.

I guess I am OK with this PR; if it's useful to you guys that's good enough for me!

@parrt
Copy link
Owner

parrt commented Feb 28, 2023

@mepland let me know when you are happy with it and I will merge.

@mepland
Copy link
Collaborator Author

mepland commented Mar 3, 2023

Hi @parrt and @tlapusan, I just added a show_leaf_filter parameter that takes a function which a user can configure to implement any filtering criteria they would like on the leaf values, per class. Here is the doc string and a demo:

:param show_leaf_filter: Callable[[np.ndarray], bool], optional
   The filtering function to apply to leaf values before displaying the leaves.
   The function is applied to a numpy array with the class i sample value in row i.
   For example, to filter to leaves with 100 < total samples, and 5 < class 1 samples, use show_leaf_filter = lambda x: (100 < np.sum(x)) & (5 < x[1])

image

@parrt
Copy link
Owner

parrt commented Mar 3, 2023

cool! Typo here though: "to filter to leaves with 100". Should that be "filtering out"?

@mepland
Copy link
Collaborator Author

mepland commented Mar 3, 2023

Replying to your comments @tlapusan:

As we discussed here, Add different display types for leaf_distributions x-axis. #254 (comment), the step 1 and 3 could be more helpful. In the end, if someone could get the node id from the 'raw' visualisation... why would he need to use the show_leaf_id_list ?

show_leaf_filter gives you option 3, and more! Option 1, giving a range of node_id values, can be implemented easily by the user with an expansive show_leaf_id_list, so I do not think we should implement it specifically.

  1. xaxis_display_type="auto"
    Would be helpful to have int values on X axis... seeing floating numbers is a little odd since the values should represent leaf ids.

I agree it doesn't look the best, but if we force the x-axis ticks to have an int format then the tick labels do not make sense. In this example it would be 2, 5, 7, 10, 12, 15. I think it is better to leave as is, especially as users will be using auto when there are a lot of leaves and you are in more of this case.

  1. xaxis_display_type="y_sorted"
    Looks nice, do you think would be helpful to have also the leaf ids on X axis ?

When sorted by y value the leaf node_id values are all out of order, and would need to be individually labeled, like individual. However, when y_sorted is useful there are probably too many leaves for this to happen without overlapping text - putting us back in this situation.

@mepland
Copy link
Collaborator Author

mepland commented Mar 3, 2023

cool! Typo here though: "to filter to leaves with 100". Should that be "filtering out"?

@parrt, not really a typo, the filter keeps leaves that pass its criteria. I'd be open to rewording it though, if it's not clear. Any suggestions?

@mepland
Copy link
Collaborator Author

mepland commented Mar 3, 2023

Would it be possible to get the bars the same width (maybe only when there is room)? I guess it's no big deal and they aren't uniform as they are now.

@parrt, aren't the bars all the same width now? They just expand and contract to fit the available horizontal space. I guess I'm not sure what you mean by this.

And yeah, tooltip text is nice in theory - but frequently breaks and can't be printed to a flat image anyway 😄

@parrt
Copy link
Owner

parrt commented Mar 3, 2023

aren't the bars all the same width now? They just expand and contract to fit the available horizontal space. I guess I'm not sure what you mean by this.

no worries. i just meant two diff nodes have diff width bars.

@parrt
Copy link
Owner

parrt commented Mar 3, 2023

to filter to leaves

yeah, that doesn't make sense to me. Maybe: "to filter for"?

For example, to filter for leaves with 100 < total samples, and 5 < class 1 samples, use show_leaf_filter = lambda x: (100 < np.sum(x)) & (5 < x[1])

Or,

For example, to view only those leaves with more than 100 total samples and less than 5 class 1 samples, use show_leaf_filter = lambda x: (100 < np.sum(x)) & (5 < x[1])

Why do you use "100 < total" btw? Ain't that backwards? Never seen that except C programmers in 1980s.

@mepland
Copy link
Collaborator Author

mepland commented Mar 3, 2023

For example, to view only those leaves with more than...

Updated to this.

Why do you use "100 < total" btw? Ain't that backwards? Never seen that except C programmers in 1980s.

Well, I did learn to program in C++ from particle physicists mostly trained in the 80s and early 90s so that might be part of it 😃, but I also like when the inequalities increase from left to right, like a number line.

@mepland
Copy link
Collaborator Author

mepland commented Mar 3, 2023

@parrt I think it's ready to merge!

@parrt
Copy link
Owner

parrt commented Mar 3, 2023

Ah with that interpretation of an axis, I can understand why that makes sense. Unfortunately no one expects that ;)

@parrt parrt added this to the 2.2.1 milestone Mar 3, 2023
@parrt parrt added the enhancement New feature or request label Mar 3, 2023
@parrt
Copy link
Owner

parrt commented Mar 3, 2023

yay! Thanks very much for these important changes...

@parrt parrt merged commit abdc697 into dev Mar 3, 2023
@mepland mepland deleted the leaf_distributions_xaxis branch March 3, 2023 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants