Combine special characters into single token #462
Conversation
…nce, unhiding uneven sentence length errors
I ran the tests locally and they pass, of course. When using the dashboard, however, I still got some 'Could not create tensor from given input list' errors. For instance:

These work fine though:

I also tried some of these by substituting the current text in the unit test with the texts above; there they seem to be processed just fine. Could you take another look at the behavior in the dashboard?
Thanks for checking the dashboard! I'm looking at it now. I see the dashboard actually uses the spaCy tokenizer directly instead of the tokenizer class we defined in DIANNA. I'll try to fix that.
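As a rough illustration of the token-level approach this PR takes (all names below are hypothetical and stand in for DIANNA's actual implementation), merging runs of adjacent special-character tokens into a single token could be sketched like this:

```python
# Hypothetical sketch: collapse runs of adjacent special-character-only
# tokens into one token. SPECIAL and merge_special_tokens are assumptions,
# not DIANNA's real API.
SPECIAL = set("!?.,:;")

def merge_special_tokens(tokens):
    """Collapse runs of adjacent special-character-only tokens."""
    merged = []
    for tok in tokens:
        if merged and tok and set(tok) <= SPECIAL and set(merged[-1]) <= SPECIAL:
            merged[-1] += tok  # extend the previous special-character token
        else:
            merged.append(tok)
    return merged

print(merge_special_tokens(["Hello", "!", "!", "?", "world"]))
# ['Hello', '!!?', 'world']
```

This keeps the token count stable for texts like `Hello!!?`, which is what the masking step needs to avoid uneven sentence lengths.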
…utive in the raw input, instead of in the token list
All those cases should pass now (but please check if you have time), except for
The special-character fix doesn't handle this, because spaCy for some reason doesn't create a new token starting at the first I'm not sure if/how we can fix this. For now I would say that the behaviour is already much improved, and hopefully the cases that still break don't really occur in the wild anyway.
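The raw-input variant of the fix (combining special characters when they are consecutive in the raw input, per the commit above) can be sketched with a regex that locates runs of consecutive special characters before tokenization. The character set and function name here are assumptions, not the actual implementation:

```python
import re

# Assumed character set; the real fix may cover a different set.
SPECIAL_RUN = re.compile(r"[!?.,:;]{2,}")

def special_runs(text):
    """Return (offset, run) pairs for runs of consecutive special characters."""
    return [(m.start(), m.group()) for m in SPECIAL_RUN.finditer(text)]

print(special_runs("Wait... what?! Really?"))
# [(4, '...'), (12, '?!')]
```

Operating on the raw input means the offsets can be fed back to the tokenizer, rather than patching the token list after the fact.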
Hmm, weird error in the CI; it seems unrelated to the PR: https://github.com/dianna-ai/dianna/actions/runs/4184062075/jobs/7249224208
Great improvement indeed.
We talked about adding a list of extra tests for all the strange cases that we cover now. Those would still be great.
We should also create an issue for the remaining cases that fail, even if they are very unlikely cases.
I don't know why importing onnx in Python 3.9 under macOS gives problems. It happens outside this branch as well.
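The extra tests for the strange cases mentioned above could be collected as a small parametrized table of inputs and expected tokens. This is a standalone, hypothetical sketch; `simple_tokenize` only stands in for the real tokenizer to show the test shape:

```python
import re

def simple_tokenize(text):
    # Words, or runs of non-word/non-space characters, as single tokens;
    # a stand-in for the real tokenizer, used only to show the test shape.
    return re.findall(r"\w+|[^\w\s]+", text)

# (input, expected tokens) pairs covering the "strange" cases
CASES = [
    ("Hello!!", ["Hello", "!!"]),
    ("Wait... what?!", ["Wait", "...", "what", "?!"]),
    ("a-b", ["a", "-", "b"]),
]

for text, expected in CASES:
    assert simple_tokenize(text) == expected, (text, simple_tokenize(text))
print("all cases pass")
```

In the repo itself these pairs would naturally live in a `pytest.mark.parametrize` over the real tokenizer, so each failing case shows up as its own test.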
It also has a conflict.
After merging main into this branch and solving a single conflict, which seemed straightforward, LIME is producing explanations that omit the special characters. I will look into this later.
This original change was done in commit 0afdbc2, but I don't understand why. It doesn't seem to fix anything. Right now it was breaking my tests because special characters were missing in the explanation object. The failing test was: tests/test_lime.py::LimeOnText::test_lime_text_special_chars
There is still something wrong with running the notebooks. We should check whether it's an actual bug; if so, we should write tests so that the regular tests fail as well. Right now it looks like everything is fine except for the notebooks (which is probably not the case).
The linting is failing on the import order of the time-series visualization module. This has nothing to do with this PR and has already been fixed in main (but after the actions for this PR were triggered).
It looks good! Go ahead and merge.
Fixes #437. See that issue for more details on why the tokenizer was updated as a fix.