Pie charts show data that the classifier was trained on and not new data #269

Closed
HannahAlexander opened this issue Feb 23, 2023 · 28 comments

@HannahAlexander

We fit a model using a DecisionTreeClassifier on upsampled data. We then wanted to visualise this model using the original data. In the plot, all the numbered labels are correct, but the ratios in the final pie charts are for the upsampled data.
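To make the report concrete, here is a minimal sketch of that setup. The synthetic data, the RandomOverSampler upsampling step, and the feature/class names are illustrative stand-ins for the original code; the dtreeviz 2.x dtreeviz.model() API is assumed.

```python
# Illustrative sketch of the reported setup (synthetic data; the upsampler
# and names are stand-ins for whatever the original code used).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler
import dtreeviz

X, y = make_classification(n_samples=500, n_classes=2, weights=[0.9, 0.1],
                           random_state=42)
X_up, y_up = RandomOverSampler(random_state=42).fit_resample(X, y)

clf = DecisionTreeClassifier(max_depth=3).fit(X_up, y_up)  # trained on upsampled data

# Visualising with the *original* data: the numbered node labels follow X/y,
# but the leaf pie charts still reflect the upsampled training distribution.
viz_model = dtreeviz.model(clf, X_train=X, y_train=y,
                           feature_names=[f"f{i}" for i in range(X.shape[1])],
                           target_name="target",
                           class_names=["majority", "minority"])
viz_model.view()
```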

@tlapusan
Collaborator

It would be better to use the same dataset for both training and dtreeviz visualisations.
Internally, dtreeviz uses both the tree metadata and the dataset sent as a parameter.

btw, what ML library are you using?

@HannahAlexander
Author

HannahAlexander commented Feb 23, 2023 via email

@tlapusan
Collaborator

Based on the current dtreeviz implementation, I think no :(
But it's good that you raised this issue; we could take it into consideration as a possible next feature.

@HannahAlexander
Author

HannahAlexander commented Feb 25, 2023 via email

@thomsentner
Contributor

I'm running into a similar problem. I trained a decision tree on my train set, but would like to visualize its performance on the test set. When I use the test set as input for dtreeviz.model, the plots are incorrect: they show a weird mix of the data I gave it and the data the model was trained on.

As a train/test split is a very common procedure, is there no workaround for this?
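The call pattern in question presumably looks something like the following sketch (iris standing in for the real data; dtreeviz 2.x API assumed):

```python
# Sketch of the train/test pattern described above (iris stands in for the
# real data). At the time of this comment, the node plots mixed the supplied
# test samples with statistics baked into the tree at training time.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import dtreeviz

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    random_state=0)
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# The only way to "show" the test set is to pass it where the training data
# is expected, which is exactly what produced the mixed plots.
viz_model = dtreeviz.model(clf, X_train=X_test, y_train=y_test,
                           feature_names=iris.feature_names,
                           target_name="species",
                           class_names=list(iris.target_names))
viz_model.view()
```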

@tlapusan tlapusan self-assigned this Mar 16, 2023
@tlapusan
Collaborator

hi @thomsentner, I put some limitations here: #269 (comment)

btw, what library are you using?
I will try to take a look in the next few days; for sklearn I guess we have a chance to make it work for new data.

@thomsentner
Contributor

Thanks. I'm using sklearn as well.

@tlapusan
Collaborator

Looking into the code, the change should be small... but things get a little complicated when the class_weight parameter is used at model training.
I have to spend more time to better understand the overall picture.

@thomsentner
Contributor

I created a PR with some very quick and dirty changes that already seem to solve the issue I faced personally. Hopefully this helps in resolving this issue.

Original behavior is preserved:
[screenshot]

X_test is displayed correctly:
[screenshot]

@tlapusan
Collaborator

@thomsentner thanks for the PR, I only spotted it while creating my own PR for this issue :D
#282

I will also take a look at yours ;)

@parrt
Owner

parrt commented Mar 18, 2023

I'm not sure how I feel about this. The point of this library is to visualize how a decision tree carves up feature space and makes decisions based upon the training data. The only role for testing data is to see how a specific test case would run down the tree, right? How would you show a decision tree for data that was not part of the construction of that tree? To me that means you simply train a new tree on the testing data and show that. Sorry if I am misunderstanding.

@thomsentner
Contributor

@parrt for me it would be to visualize the validation dataset, and as such, visualize the real-world performance I can expect from the tree, not so much to test any specific test case. Looking at training samples will give me a very biased view of what will be happening.

@tlapusan
Collaborator

@HannahAlexander your feedback would also help :)

@tlapusan
Collaborator

@parrt the plan is to use the tree structure/metadata learned from the training set and make the plots based on another dataset, like validation.

As we know, an important step in any ML project is to do a good train/validation (and even test) split, which should reflect the production data. Interpreting the tree based only on the train dataset doesn't mean the model will perform the same in production.

For example, let's say we have 92% accuracy on train and 80% on validation (or even 99%). The question is why? Interpreting the tree structure (learned from train) and making visualisations based on validation data should help to get the answers.
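As a concrete illustration of that gap, here is a hedged sketch; the breast-cancer dataset and the deliberately unpruned tree are stand-ins for a real project:

```python
# Sketch of the motivating scenario: an overfit tree whose train accuracy
# sits far above its validation accuracy. The dataset is a stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import dtreeviz

data = load_breast_cancer()
X_train, X_valid, y_train, y_valid = train_test_split(data.data, data.target,
                                                      random_state=1)
clf = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)  # unpruned

print("train accuracy:     ", accuracy_score(y_train, clf.predict(X_train)))  # 1.0
print("validation accuracy:", accuracy_score(y_valid, clf.predict(X_valid)))  # lower

# With the proposed feature, the same fitted tree could be rendered twice,
# once per dataset, and each viz_model rendered with .view() for comparison.
viz_train = dtreeviz.model(clf, X_train=X_train, y_train=y_train,
                           feature_names=list(data.feature_names),
                           target_name="diagnosis",
                           class_names=["malignant", "benign"])
viz_valid = dtreeviz.model(clf, X_train=X_valid, y_train=y_valid,
                           feature_names=list(data.feature_names),
                           target_name="diagnosis",
                           class_names=["malignant", "benign"])
```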

@tlapusan
Collaborator

tlapusan commented Mar 21, 2023

Here are some visualisations which could help to understand the purpose.
The first viz is based on train data, the second on validation, and the third on the same validation data but with randomly changed target values (it's an exaggeration :) but it serves our purpose of showing what happens when a train/validation split is not correct).

[screenshot: train, validation, and shuffled-validation visualisations]

@tlapusan
Collaborator

Another useful thing would be to compare the visualisations for train and validation in parallel and check for differences. Ideally the node value distributions/ranges should be the same, but sometimes they are not.
[screenshot: train and validation visualisations side by side]

@tlapusan
Collaborator

tlapusan commented Apr 1, 2023

@parrt any thoughts on this? :)

@parrt
Owner

parrt commented Apr 1, 2023

Sorry for the delay. I'm 100% focused on some machine learning stuff at work haha.

Ok, I think I understand the purpose now. You want a mechanism to visualize how the tree structure interprets the validation set in a large sense, instead of where we run a single test instance down the tree now. In other words, the tree structure does not change but the distributions in the decision nodes and the leaf nodes do, according to the information in the validation set. Do I have that correct?

@tlapusan
Collaborator

tlapusan commented Apr 1, 2023

Indeed, it's for the entire validation set, and the tree structure (decision split nodes learned during training) doesn't change. In other words, it's how the tree sees/interprets/predicts a new dataset (which is different from training).

It would be a pretty powerful feature for the library, I think. No other library allows this, from what I know :)

@parrt
Owner

parrt commented Apr 2, 2023

I'm a bit nervous about the feature, even though I see the utility in understanding a large validation set. Would this require a lot of changes or add complexity to the code base?

@tlapusan
Collaborator

tlapusan commented Apr 3, 2023

It should be a minimal change in the code. For sklearn there is this PR: 648c24b, which as we can see changes just a few lines of code.
For the other libraries it should be the same, but I have to double-check.

Still, I propose to do this change for sklearn first and see the community feedback.
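For readers without the PR at hand, the sklearn side is conceptually just "run the new data down the fitted tree and recount the classes at each node", which sklearn's decision_path already supports. The sketch below is not the PR's code, only an illustration of why the change can be a few lines:

```python
# Conceptual illustration (not the PR itself): recount class membership at
# every node of a fitted sklearn tree using new data.
import numpy as np

def node_class_counts(clf, X, y):
    paths = clf.decision_path(X)     # CSR matrix, shape (n_samples, n_nodes)
    counts = np.zeros((clf.tree_.node_count, len(clf.classes_)), dtype=int)
    class_idx = np.searchsorted(clf.classes_, y)
    for i in range(len(y)):
        visited = paths.indices[paths.indptr[i]:paths.indptr[i + 1]]
        counts[visited, class_idx[i]] += 1
    return counts  # row 0 is the root: the class distribution of all of X
```

This also hints at where class_weight complicates things: the train-time statistics stored inside the tree are weighted counts (and their exact representation varies across sklearn versions), so raw recounts from new data are not directly comparable without applying the same weights.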

@parrt
Owner

parrt commented Apr 3, 2023

Still, I propose to do this change for sklearn first and see the community feedback

Agreed, let's give it a try and let people report back.

@thomsentner
Contributor

thomsentner commented Apr 6, 2023

@tlapusan tried working with your PR already and it seems to work great! Just running into one problem still: if a non-leaf node contains only one class in the test set, the plot of that node will fail. dtreeviz tries to plot both classes in all non-leaf nodes, but cannot find any data for the opposite class in this case. For me it occurs for the node just above these leaves:
[screenshot of the failing node]
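For reference, a contrived sketch that provokes the same failure mode: feed the fitted tree a dataset containing a single class, so every node (non-leaf nodes included) sees only one class. Iris stands in for the real data; #284, mentioned below, is the fix.

```python
# Contrived repro sketch (iris stands in for the real data): a one-class
# dataset makes every node, non-leaf nodes included, single-class.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import dtreeviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

mask = iris.target == 0  # keep only one class
viz_model = dtreeviz.model(clf, X_train=iris.data[mask],
                           y_train=iris.target[mask],
                           feature_names=iris.feature_names,
                           target_name="species",
                           class_names=list(iris.target_names))
viz_model.view()  # failed on one-class non-leaf nodes before the fix (#284)
```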

@tlapusan
Collaborator

tlapusan commented Apr 6, 2023

@thomsentner nice that you had time to check it! Thanks.

The issue you mentioned also happened to me, and we just merged a PR for it into master a few days ago: #284

@parrt we have #282 for this issue. If we can merge it, @thomsentner should have the mentioned issue solved.

@parrt
Owner

parrt commented Apr 6, 2023

@tlapusan merged! thanks :)

@parrt parrt added this to the 2.2.1 milestone Apr 6, 2023
@parrt parrt added the enhancement New feature or request label Apr 6, 2023
@parrt
Owner

parrt commented Apr 6, 2023

Resolved by #282

@parrt parrt closed this as completed Apr 6, 2023
@mepland
Collaborator

mepland commented Apr 11, 2023

Sorry, I've been busy getting up to speed in my new position, but I'm very glad this was implemented - a great addition to dtreeviz!

@parrt
Owner

parrt commented Apr 13, 2023

no prob! I'm super busy too!
