Combine special characters into single token #462

loostrum · 2023-02-02T11:45:32Z

Fixes #437. See that issue for more details on why the fix involves updating the tokenizer.

cwmeijer · 2023-02-15T12:19:18Z

I ran the tests locally and they pass, of course.

However, when using the dashboard, I still got some 'Could not create tensor from given input list' errors. For instance:
let'''s watch this together
review with !!!?
review with! ?
review with???!
such a bad movie "!?\'" (a particularly strange one, as it is from your passing unit test)

These work fine though:
review with !!!
review with!?
review with???

I also tried some of these by substituting the current text in the unit test with the texts above. There they seem to be processed just fine.

Could you take another look at the behavior in the dashboard?

loostrum · 2023-02-15T12:25:51Z

Thanks for checking the dashboard! I'm looking at it now. The dashboard actually uses the Spacy tokenizer directly instead of the tokenizer class we defined in DIANNA. I'll try to fix that.

loostrum · 2023-02-15T13:03:02Z

All those cases should pass now (but please check if you have time), except for let'''s watch this together.
That is a particularly tricky one, because the special characters are in the middle of a word. This is how Spacy tokenizes that input, followed by a version where the middle ' is masked:

In [9]: spacy("let'''s")
Out[9]: ['let', "''", "'s"]

In [10]: spacy("let'UNKNOWNWORD's")
Out[10]: ["let'UNKNOWNWORD", "'s"]

The special character fix doesn't handle this, because Spacy for some reason doesn't create a new token starting at the first ' in the second case.

I'm not sure if or how we can fix this. For now I would say that the behaviour is already much improved, and hopefully the cases that still break don't really occur in the wild anyway.
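
To make the intended behaviour concrete, here is a minimal sketch in plain Python. It is not the DIANNA tokenizer and does not use Spacy; the pattern and the `tokenize` name are assumptions chosen for illustration. It treats every run of special characters that is consecutive in the raw input as a single token:

```python
import re

# Assumed pattern (not taken from DIANNA): a token is either a run of word
# characters or a run of consecutive non-word, non-space characters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Split text into words and runs of consecutive special characters."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("review with !!!?"))  # ['review', 'with', '!!!?']
print(tokenize("review with! ?"))    # ['review', 'with', '!', '?']
print(tokenize("let'''s watch this together"))
# ['let', "'''", 's', 'watch', 'this', 'together']
```

Because the pattern works on the raw string, "! ?" stays two tokens while "!!!?" becomes one. Unlike Spacy, this sketch also splits inside a word such as let'''s, so it says nothing about how Spacy handles that case.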

loostrum · 2023-02-15T13:05:27Z

Hmm weird error in the CI, seems unrelated to the PR: https://github.com/dianna-ai/dianna/actions/runs/4184062075/jobs/7249224208

cwmeijer

Great improvement indeed.
We talked about adding a list of extra tests for all the strange cases that we cover now. Those would still be great to have.
We should also create an issue for the remaining cases that fail, even if they are very unlikely cases.
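
As a rough idea of what such regression tests could check, the sketch below asserts that masking any single token never changes the number of tokens, since uneven lengths are what lead to the tensor error. Everything here is an assumption for illustration: the stand-in `tokenize` is a simple regex rather than Spacy, and masked sentences are rebuilt by joining tokens with spaces, which may differ from what DIANNA does.

```python
import re

MASK = "UNKNOWNWORD"
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Stand-in tokenizer (assumption): words or runs of special characters."""
    return TOKEN_PATTERN.findall(text)


def masked_variants(tokens):
    """Yield one space-joined sentence per token, with that token masked."""
    for i in range(len(tokens)):
        yield " ".join(tokens[:i] + [MASK] + tokens[i + 1:])


def token_count_is_stable(text, tokenizer=tokenize):
    """True if every masked variant has as many tokens as the original."""
    tokens = tokenizer(text)
    return all(len(tokenizer(v)) == len(tokens) for v in masked_variants(tokens))


# Inputs reported in this thread.
CASES = [
    "let'''s watch this together",
    "review with !!!?",
    "review with! ?",
    "review with???!",
    "review with !!!",
    "review with!?",
    "review with???",
]

for case in CASES:
    assert token_count_is_stable(case), case
```

Passing a different `tokenizer` callable runs the same check against a real tokenizer. With a context-sensitive tokenizer the check can fail, which is the kind of case worth recording in a follow-up issue.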

cwmeijer · 2023-02-21T10:47:31Z

I don't know why importing onnx in Python 3.9 under macOS gives problems. It happens outside this branch as well.

elboyran · 2023-02-21T11:46:27Z

It also has a conflict.
Earlier I merged an approved PR from Leon (#463) and then the error started, but I don't know how to fix it.

cwmeijer · 2023-02-23T10:47:21Z

After merging main into this and solving a single conflict, which seemed straightforward, lime is giving explanations that omit the special characters. I will look into this later.

This original change was made in commit 0afdbc2, but I don't understand why. It doesn't seem to fix anything. It was breaking my tests because special characters were missing in the explanation object. The failing test was: tests/test_lime.py::LimeOnText::test_lime_text_special_chars

cwmeijer · 2023-03-01T15:57:23Z

There is still something wrong with running the notebooks. We should check whether it's an actual bug. If so, we should write tests that reproduce it, so that the test suite fails as well. Right now it looks like everything is fine except for the notebooks, which is probably not the case.

Case lowering is needed for the movie review model only, so we moved it into the specific notebook.
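
A minimal sketch of what model-specific case lowering can look like when it lives next to the model instead of in the library. The names `with_lowercasing`, `toy_model`, and `VOCAB` are hypothetical and only illustrate the idea; they are not taken from the notebook.

```python
def with_lowercasing(run_model):
    """Wrap a model runner so every input sentence is lowercased first."""
    def runner(sentences):
        return run_model([sentence.lower() for sentence in sentences])
    return runner


# Hypothetical stand-in for a model with a lowercase-only vocabulary:
# it counts how many words of each sentence are in the vocabulary.
VOCAB = {"such", "a", "bad", "movie"}


def toy_model(sentences):
    return [sum(word in VOCAB for word in s.split()) for s in sentences]


runner = with_lowercasing(toy_model)

print(toy_model(["Such A Bad Movie"]))  # [0]
print(runner(["Such A Bad Movie"]))     # [4]
```

With the wrapper applied at the call site, the library can pass text through with its original case, and only the model that needs lowercase input pays for it.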

review-notebook-app · 2023-03-06T11:06:35Z

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

cwmeijer · 2023-03-06T11:38:37Z

The linting is failing on the import order in the time series visualization. This has nothing to do with this PR and has already been fixed in main (but only after the actions for this PR were triggered).

cpranav93

It looks good! Go ahead and merge.

loostrum added 7 commits February 2, 2023 12:17

Update movie reviews model runnner in tests to run all sentences at once, unhiding uneven sentence length errors

a810bc3

Add test for text with special chars, failing for now

e9888f0

Change default tokenizer to treat consecutive special characters as a single token

4ab77d1

Add proper values to lime special character test

25b36c3

Fix linter warning

aed8a18

Fix import order

afea2a7

Remove debug print

90be27d

elboyran requested a review from cwmeijer February 14, 2023 13:13

loostrum added 2 commits February 15, 2023 13:41

use DIANNA custom tokenizer in dashboard

780ab7f

Only combine special chars into one token if they are actually consecutive in the raw input, instead of in the token list

e033686

cwmeijer requested changes Feb 21, 2023

cwmeijer added 4 commits February 22, 2023 09:36

refactor lime text tests to reduce duplication

31a1758

refactor rise and lime text tests to reduce duplication

c0905df

add regression tests (refs #437)

c6786e2

refactor duplicate code in lime test

c00ff94

cwmeijer approved these changes Feb 23, 2023

cwmeijer added 2 commits February 23, 2023 11:15

Merge branch 'main' into 437-fix-special-chars

f7e5ec2

fix linter issues

f802362

cwmeijer added 4 commits February 23, 2023 14:23

improve assertion message

326e1c9

remove unused import and make some tests static functions

214fb7e

remove use of 3.7 incompatible python features

8604310

remove all case lowering from dianna (refs #437, #448)

e238787

Case lowering is needed for the movie review model only, so we moved it into the specific notebook.

cwmeijer requested a review from cpranav93 March 6, 2023 11:07

change expected visualization outcome to contain original case (refs #448)

1d0b68d

cpranav93 approved these changes Mar 6, 2023

cwmeijer merged commit b28ccb0 into main Mar 6, 2023

cwmeijer deleted the 437-fix-special-chars branch March 6, 2023 19:31

laurasootes mentioned this pull request Apr 6, 2023

Lime text fails in case of multiple sentences separated by punctuation #531

Closed
