Pie charts show the data that the classifier was trained on, not new data #269
It would be better to use the same dataset for both training and dtreeviz visualisations. Internally, dtreeviz uses both the tree metadata and the dataset passed as a parameter. By the way, what ML library are you using?

Sklearn. Okay, that makes sense. I have a highly skewed dataset, so I'm training my model on upsampled data, but I then want to visualise the decision tree using the fitted classifier on the original data. Is there a way to do this?
Based on the current dtreeviz implementation, I think no :( But it's good that you raised this issue; we could take it into consideration as a possible next feature.

That's great, thank you!
I'm running into a similar problem. I trained a decision tree on my train set, but I would like to visualize its performance on the test set. When I use the test set as input for dtreeviz.model, the plots are incorrect: they show a weird mix of the data I gave it and the data the model was trained on. Since a train/test split is a very common procedure, is there no workaround for this?
hi @thomsentner, I put some limitations here: #269 (comment). By the way, what library are you using?
Thanks. I'm using sklearn as well.
Looking into the code, the change is supposed to be small, but things get a little complicated when the class_weight parameter is used during model training.
@thomsentner thanks for the PR, I just noticed it while creating my own PR for this issue :d I will take a look at yours as well ;)
I'm not sure how I feel about this. The point of this library is to visualize how a decision tree carves up feature space and makes decisions based upon the training data. The only role for testing data is to see how a specific test case would run down the tree, right? How would you show a decision tree for data that was not part of the construction of that tree? To me that means you simply train a new tree on the testing data and show that. Sorry if I am misunderstanding.
@parrt for me it would be to visualize the validation dataset and, as such, the true real-world performance I can expect from the tree, not so much to run any specific test case down it. Looking at training samples will give me a very biased view of what will be happening.
@HannahAlexander your feedback would also help :)
@parrt the plan is to use the tree structure/metadata learned from the training set and make the plots based on another dataset, such as the validation set. As we know, an important step in any ML project is a good train/validation (and even test) split, which should reflect the production data. Interpreting the tree based only on the training dataset doesn't mean the model will perform the same in production. For example, let's say we have 92% accuracy on train and 80% on validation (or even 99%). The question is: why? Interpreting the tree structure (learned from train) and making visualisations based on validation data should help answer that.
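The idea can be sketched with plain sklearn, independently of dtreeviz: the tree structure learned from the training set stays fixed, and the per-leaf class distributions are recomputed by routing validation samples down the fitted tree. The toy dataset and the `leaf_class_counts` helper below are illustrative, not part of any library API.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real train/validation split.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

def leaf_class_counts(model, X, y):
    """Class counts per leaf; model.apply() gives the leaf id of each sample."""
    counts = {}
    for leaf, label in zip(model.apply(X), y):
        counts.setdefault(leaf, Counter())[label] += 1
    return counts

# Same tree structure, two different datasets: these are the distributions
# a "visualize on validation data" feature would draw in the node charts.
train_counts = leaf_class_counts(clf, X_tr, y_tr)
val_counts = leaf_class_counts(clf, X_val, y_val)
```

Comparing `train_counts` and `val_counts` leaf by leaf shows exactly where the tree's behaviour on unseen data diverges from what it saw during training.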
Here are some visualisations which could help to understand the purpose.

@parrt any thoughts on this? :)
Sorry for the delay. I'm 100% focused on some machine learning stuff at work haha. Ok, I think I understand the purpose now. You want a mechanism to visualize how the tree structure interprets the validation set in a broad sense, instead of running a single test instance down the tree as we do now. In other words, the tree structure does not change, but the distributions in the decision nodes and the leaf nodes do, according to the information in the validation set. Do I have that correct?
Indeed, it's for the entire validation set, and the tree structure (decision split nodes learned during training) doesn't change. In other words, it's how the tree sees/interprets/predicts on a new dataset (different from training). It would be a pretty powerful feature for the library, I think. No other one allows this, from what I know :)
I'm a bit nervous about the feature, even though I see the utility in understanding a large validation set. Would this require a lot of changes or add complexity to the code base?
It should be a minimal change in the code. For sklearn, it's this PR: 648c24b, which, as we can see, is just a few lines of code. Still, I propose to make this change for sklearn first and then see the community feedback.
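The class_weight complication mentioned earlier can be seen directly in sklearn's tree internals (toy data; the weights and numbers are illustrative): with class weights, each node stores both a raw sample count and a weighted one, and visualisation code has to be careful about which statistic it reads.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
n0, n1 = np.bincount(y)

# Weight class 1 twice as heavily as class 0.
clf = DecisionTreeClassifier(max_depth=2, class_weight={0: 1.0, 1: 2.0},
                             random_state=0).fit(X, y)

# n_node_samples counts raw samples; weighted_n_node_samples applies the
# class weights. At the root: raw == n0 + n1, weighted == 1*n0 + 2*n1.
raw = clf.tree_.n_node_samples[0]
weighted = clf.tree_.weighted_n_node_samples[0]
```

So a naive "count the new samples per node" approach matches the tree's stored statistics only when no class weighting was used during fitting.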
Agreed, let's give it a try and let people report back.
@tlapusan I already tried working with your PR and it seems to work great! I'm just running into one remaining problem: if a non-leaf node contains only one class in the test set, the plot of that node will fail. dtreeviz tries to plot both classes in all non-leaf nodes, but cannot find any data for the opposite class in this case. For me it occurs for the node just above these leaves:
@thomsentner nice that you had time to check it! Thanks. The issue you mentioned happened to me as well, and we merged a PR for it into master a few days ago: #284. @parrt we have #282 for this issue. If we can merge it, the problem @thomsentner mentioned should be solved.
@tlapusan merged! thanks :) |
Resolved by #282 |
Sorry I've been busy getting up to speed in my new position, but I'm very glad this was implemented - a great addition to dtreeviz! |
no prob! I'm super busy too! |
We fit a model using a DecisionTreeClassifier on upsampled data. We then wanted to visualise this model using the original data. In the plot, all the numbered labels are correct, but the ratios in the final pie charts reflect the upsampled data.
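For reference, a minimal sketch of the reported setup (toy data and parameters invented for illustration): the fitted tree's node statistics come from whatever data it was fit on, which is why charts drawn purely from tree metadata show the upsampled ratios rather than the original ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Skewed toy data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)

# Upsample the minority class to the size of the majority class.
X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, n_samples=len(y_maj),
                              random_state=0)
X_up = np.vstack([X_maj, X_min_up])
y_up = np.concatenate([y_maj, y_min_up])

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_up, y_up)

# The fitted tree's root holds the upsampled sample count, not the
# original one: node statistics always reflect the data used for fitting.
root_samples = clf.tree_.n_node_samples[0]
```

Here `root_samples` equals the size of the upsampled set, not the original 400 rows, mirroring the behaviour described in the report.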