Pie charts show data that the classifier was trained on and not new data #269

Closed
HannahAlexander opened this issue Feb 23, 2023 · 28 comments

@HannahAlexander

We fit a model using a DecisionTreeClassifier on upsampled data. We then wanted to visualise this model using the original data. In the plot, all the numbered labels are correct, but the ratios in the final pie charts are for the upsampled data.
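To make the report concrete, here is a minimal sketch of that setup. The synthetic data, the RandomOverSampler upsampling step, and the feature/class names are illustrative stand-ins for the original code; the dtreeviz 2.x dtreeviz.model() API is assumed.

```python
# Illustrative sketch of the reported setup (synthetic data; the upsampler
# and names are stand-ins for whatever the original code used).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler
import dtreeviz

X, y = make_classification(n_samples=500, n_classes=2, weights=[0.9, 0.1],
                           random_state=42)
X_up, y_up = RandomOverSampler(random_state=42).fit_resample(X, y)

clf = DecisionTreeClassifier(max_depth=3).fit(X_up, y_up)  # trained on upsampled data

# Visualising with the *original* data: the numbered node labels follow X/y,
# but the leaf pie charts still reflect the upsampled training distribution.
viz_model = dtreeviz.model(clf, X_train=X, y_train=y,
                           feature_names=[f"f{i}" for i in range(X.shape[1])],
                           target_name="target",
                           class_names=["majority", "minority"])
viz_model.view()
```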

@tlapusan
Collaborator

It would be better to use the same dataset for both training and dtreeviz visualisations.
Internally, dtreeviz uses both the tree metadata and the dataset sent as a parameter.

btw, what ML library are you using?

@HannahAlexander
Author

HannahAlexander commented Feb 23, 2023 via email

@tlapusan
Collaborator

Based on the current dtreeviz implementation, I think no :(
But it's good that you raised this issue; we could take it into consideration as a possible next feature.

@HannahAlexander
Author

HannahAlexander commented Feb 25, 2023 via email

@thomsentner
Contributor

I'm running into a similar problem. I trained a decision tree on my train set, but would like to visualize its performance on the test set. When I use the test set as input for dtreeviz.model, the plots are incorrect: they show a weird mix of the data I gave it and the data the model was trained on.

As a train/test split is a very common procedure, is there no workaround for this?
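The call pattern in question presumably looks something like the following sketch (iris standing in for the real data; dtreeviz 2.x API assumed):

```python
# Sketch of the train/test pattern described above (iris stands in for the
# real data). At the time of this comment, the node plots mixed the supplied
# test samples with statistics baked into the tree at training time.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import dtreeviz

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    random_state=0)
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# The only way to "show" the test set is to pass it where the training data
# is expected, which is exactly what produced the mixed plots.
viz_model = dtreeviz.model(clf, X_train=X_test, y_train=y_test,
                           feature_names=iris.feature_names,
                           target_name="species",
                           class_names=list(iris.target_names))
viz_model.view()
```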

@tlapusan tlapusan self-assigned this Mar 16, 2023
@tlapusan
Collaborator

hi @thomsentner, I put some limitations here: #269 (comment)

btw, what library are you using?
I will try to take a look in the next few days; for sklearn I guess we have a chance to make it work for new data.

@thomsentner
Contributor

Thanks. I'm using sklearn as well.

@tlapusan
Collaborator

Looking into the code, the change should be small... but things get a little complicated when the class_weight parameter is used at model training.
I have to spend more time to better understand the overall picture.

@thomsentner
Contributor

I created a PR with some very quick and dirty changes that already seem to solve the issue I faced personally. Hopefully this helps in resolving this issue.

Original behavior is preserved:
[screenshot]

X_test is displayed correctly:
[screenshot]

@tlapusan
Collaborator

@thomsentner thanks for the PR, I only spotted it while creating my own PR for this issue :D
#282

I will also take a look at yours ;)

@parrt
Owner

parrt commented Mar 18, 2023

I'm not sure how I feel about this. The point of this library is to visualize how a decision tree carves up feature space and makes decisions based upon the training data. The only role for testing data is to see how a specific test case would run down the tree, right? How would you show a decision tree for data that was not part of the construction of that tree? To me that means you simply train a new tree on the testing data and show that. Sorry if I am misunderstanding.

@thomsentner
Contributor

@parrt for me it would be to visualize the validation dataset, and as such, visualize the real-world performance I can expect from the tree, not so much to test any specific test case. Looking at training samples will give me a very biased view of what will be happening.

@tlapusan
Collaborator

@HannahAlexander your feedback would also help :)

@tlapusan
Collaborator

@parrt the plan is to use the tree structure/metadata learned from the training set and make the plots based on another dataset, like validation.

As we know, an important step in any ML project is to do a good train/validation (and even test) split, which should reflect the production data. Interpreting the tree based only on the train dataset doesn't mean the model will perform the same in production.

For example, let's say we have 92% accuracy on train and 80% on validation (or even 99%). The question is why? Interpreting the tree structure (learned from train) and making visualisations based on validation data should help to get the answers.
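As a concrete illustration of that gap, here is a hedged sketch; the breast-cancer dataset and the deliberately unpruned tree are stand-ins for a real project:

```python
# Sketch of the motivating scenario: an overfit tree whose train accuracy
# sits far above its validation accuracy. The dataset is a stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import dtreeviz

data = load_breast_cancer()
X_train, X_valid, y_train, y_valid = train_test_split(data.data, data.target,
                                                      random_state=1)
clf = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)  # unpruned

print("train accuracy:     ", accuracy_score(y_train, clf.predict(X_train)))  # 1.0
print("validation accuracy:", accuracy_score(y_valid, clf.predict(X_valid)))  # lower

# With the proposed feature, the same fitted tree could be rendered twice,
# once per dataset, and each viz_model rendered with .view() for comparison.
viz_train = dtreeviz.model(clf, X_train=X_train, y_train=y_train,
                           feature_names=list(data.feature_names),
                           target_name="diagnosis",
                           class_names=["malignant", "benign"])
viz_valid = dtreeviz.model(clf, X_train=X_valid, y_train=y_valid,
                           feature_names=list(data.feature_names),
                           target_name="diagnosis",
                           class_names=["malignant", "benign"])
```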

@tlapusan
Collaborator

tlapusan commented Mar 21, 2023

Here are some visualisations which could help to understand the purpose.
The first viz is based on train data, the second on validation, and the third on the same validation data but with randomly changed target values (it's an exaggeration :) but it serves our purpose of showing what happens when a train/validation split is not correct).

[screenshot: train, validation, and shuffled-validation visualisations]

@tlapusan
Collaborator

Another useful thing would be to compare the visualisations for train and validation in parallel and check for differences. Ideally the node value distributions/ranges should be the same, but sometimes they are not.
[screenshot: train and validation visualisations side by side]

@tlapusan
Collaborator

tlapusan commented Apr 1, 2023

@parrt any thoughts on this? :)

@parrt
Owner

parrt commented Apr 1, 2023

Sorry for the delay. I'm 100% focused on some machine learning stuff at work haha.

Ok, I think I understand the purpose now. You want a mechanism to visualize how the tree structure interprets the validation set in a large sense, instead of where we run a single test instance down the tree now. In other words, the tree structure does not change but the distributions in the decision nodes and the leaf nodes do, according to the information in the validation set. Do I have that correct?

@tlapusan
Collaborator

tlapusan commented Apr 1, 2023

Indeed, it's for the entire validation set, and the tree structure (decision split nodes learned during training) doesn't change. In other words, it's how the tree sees/interprets/predicts a new dataset (which is different from training).

It would be a pretty powerful feature for the library, I think. No other library allows this, from what I know :)

@parrt
Owner

parrt commented Apr 2, 2023

I'm a bit nervous about the feature, even though I see the utility in understanding a large validation set. Would this require a lot of changes or add complexity to the code base?

@tlapusan
Collaborator

tlapusan commented Apr 3, 2023

It should be a minimal change in the code. For sklearn there is this PR: 648c24b, which as we can see changes just a few lines of code.
For the other libraries it should be the same, but I have to double-check.

Still, I propose to do this change for sklearn first and see the community feedback.
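For readers without the PR at hand, the sklearn side is conceptually just "run the new data down the fitted tree and recount the classes at each node", which sklearn's decision_path already supports. The sketch below is not the PR's code, only an illustration of why the change can be a few lines:

```python
# Conceptual illustration (not the PR itself): recount class membership at
# every node of a fitted sklearn tree using new data.
import numpy as np

def node_class_counts(clf, X, y):
    paths = clf.decision_path(X)     # CSR matrix, shape (n_samples, n_nodes)
    counts = np.zeros((clf.tree_.node_count, len(clf.classes_)), dtype=int)
    class_idx = np.searchsorted(clf.classes_, y)
    for i in range(len(y)):
        visited = paths.indices[paths.indptr[i]:paths.indptr[i + 1]]
        counts[visited, class_idx[i]] += 1
    return counts  # row 0 is the root: the class distribution of all of X
```

This also hints at where class_weight complicates things: the train-time statistics stored inside the tree are weighted counts (and their exact representation varies across sklearn versions), so raw recounts from new data are not directly comparable without applying the same weights.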

@parrt
Owner

parrt commented Apr 3, 2023

Still, I propose to do this change for sklearn first and see the community feedback

Agreed, let's give it a try and let people report back.

@thomsentner
Contributor

thomsentner commented Apr 6, 2023

@tlapusan tried working with your PR already and it seems to work great! Just running into one problem still: if a non-leaf node contains only one class in the test set, the plot of that node will fail. dtreeviz tries to plot both classes in all non-leaf nodes, but cannot find any data for the opposite class in this case. For me it occurs for the node just above these leaves:
[screenshot of the failing node]
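For reference, a contrived sketch that provokes the same failure mode: feed the fitted tree a dataset containing a single class, so every node (non-leaf nodes included) sees only one class. Iris stands in for the real data; #284, mentioned below, is the fix.

```python
# Contrived repro sketch (iris stands in for the real data): a one-class
# dataset makes every node, non-leaf nodes included, single-class.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import dtreeviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

mask = iris.target == 0  # keep only one class
viz_model = dtreeviz.model(clf, X_train=iris.data[mask],
                           y_train=iris.target[mask],
                           feature_names=iris.feature_names,
                           target_name="species",
                           class_names=list(iris.target_names))
viz_model.view()  # failed on one-class non-leaf nodes before the fix (#284)
```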

@tlapusan
Collaborator

tlapusan commented Apr 6, 2023

@thomsentner nice that you had time to check it! Thanks.

The issue you mentioned also happened to me, and we just merged a PR for it into master a few days ago: #284

@parrt we have #282 for this issue. If we can merge it, @thomsentner should have the mentioned issue solved.

@parrt
Owner

parrt commented Apr 6, 2023

@tlapusan merged! thanks :)

@parrt parrt added this to the 2.2.1 milestone Apr 6, 2023
@parrt parrt added the enhancement New feature or request label Apr 6, 2023
@parrt
Owner

parrt commented Apr 6, 2023

Resolved by #282

@parrt parrt closed this as completed Apr 6, 2023
@mepland
Collaborator

mepland commented Apr 11, 2023

Sorry, I've been busy getting up to speed in my new position, but I'm very glad this was implemented - a great addition to dtreeviz!

@parrt
Owner

parrt commented Apr 13, 2023

no prob! I'm super busy too!
