Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine special characters into single token #462

Merged
merged 21 commits into from
Mar 6, 2023
Merged

Conversation

loostrum
Copy link
Member

@loostrum loostrum commented Feb 2, 2023

Fixes #437 . See that issue for more details on why the tokenizer was updated as a fix.

@cwmeijer
Copy link
Contributor

I ran the tests locally and they pass of course.

Only when using the dashboard, I still got some 'Could not create tensor from given input list' errors. For instance:
let'''s watch this together
review with !!!?
review with! ?
review with???!
such a bad movie "!?\'" particularly strange one, as it is from your (passing) unit test

These work fine though:
review with !!!
review with!?
review with???

I also tried some of these by substituting the current text in the unit test with the texts above. There they seem to be processed just fine.

Could you take another look at the behavior in the dashboard?

@loostrum
Copy link
Member Author

Thanks for checking the dashboard! I'm looking at it now, I see the dashboard actually uses the Spacy tokenizer directly instead of the tokenizer class we defined in DIANNA. I'll try to fix that.

@loostrum
Copy link
Member Author

All those cases should pass now (but please check if you have time), except for let'''s watch this together.
That is a particularly tricky one, because the special characters are in the middle of a word. This is how Spacy tokenizes that plus a version where the middle ' is masked:

In [9]: spacy("let'''s")
Out[9]: ['let', "''", "'s"]

In [10]: spacy("let'UNKNOWNWORD's")
Out[10]: ["let'UNKNOWNWORD", "'s"]

The special character fix doesn't handle this, because Spacy for some reason doesn't create a new token starting at the first ' in the second case.

I'm not sure if/how we can fix this. For now I would say that the behaviour is already much improved and hopefully the still breaking cases don't really occur in the wild anyway.

@loostrum
Copy link
Member Author

Hmm weird error in the CI, seems unrelated to the PR: https://github.com/dianna-ai/dianna/actions/runs/4184062075/jobs/7249224208

Copy link
Contributor

@cwmeijer cwmeijer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement indeed.
We talked about adding a list of extra tests of all strange cases that we cover now. Those would be great still.
We should also create an issue for the remaining cases that fail, even if they are very unlikely cases.

@cwmeijer
Copy link
Contributor

I don't know why importing onnx in python 3.9 under macos gives problems. It happens outside this branch as well.

@elboyran
Copy link
Contributor

elboyran commented Feb 21, 2023

It also has a conflict.
Earlier I've merged an approved PR from Leon ( #463 ) and then the error started, but don't know how to fix it.

@cwmeijer
Copy link
Contributor

After merging main into this and solving a single conflict, which seemed straight forward, lime is giving explanations omitting the special characters. I will look into this later.

This original change was done in commit 0afdbc2
But I don't understand why. It doesn't seem to fix anything. Right now it was
breaking my tests because special characters were missing in the explanation
object. The failing test was:
tests/test_lime.py::LimeOnText::test_lime_text_special_chars
@cwmeijer
Copy link
Contributor

cwmeijer commented Mar 1, 2023

There is still something wrong with running the notebooks. We should see if it's an actual bug. If so, we should write tests so the actual tests are failing as well. Right now it looks like everything is just fine except for the notebooks (which is probably not the case).

Case lowering is needed for the movie review model only, so we moved it into the specific notebook.
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@cwmeijer cwmeijer requested a review from cpranav93 March 6, 2023 11:07
@cwmeijer
Copy link
Contributor

cwmeijer commented Mar 6, 2023

The linting is failing on import order of time series visualization. This has nothing to do with this PR and has already been fixed in main (but after triggering the actions for this PR).

Copy link
Contributor

@cpranav93 cpranav93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good! Go ahead and merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lime explanation crashes sometimes - special character related
4 participants