Fix the special character problem #585

stefsmeets · 2023-05-09T13:33:37Z

This PR attempts to fix the 'special character problem'.

I have identified this as a failing string:

Hello, to the world!

Oddly enough, this works:

Hello, world!

I updated the fix in SpacyTokenizer so that it fixes the sentence before sending it to the spacy tokenizer, instead of trying to make sense of the tokens after tokenization. The new fix uses regexes to add spaces around every instance of UNKWORDZ where they are missing. Hopefully, it is also a bit more future-proof when we come across new issues.

The implementation differs slightly in that it does not treat multiple consecutive special characters as a single token. This is in line with the tokenization expected from spacy.
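To illustrate the idea, here is a minimal sketch of such a pre-tokenization fix. It is not the code from this PR: the function name `pad_mask_tokens` and the two pattern names are hypothetical, and only the mask string `UNKWORDZ` comes from the description above. The patterns are compiled once, in the spirit of the "Use compiled regexes" todo item.

```python
import re

MASK = "UNKWORDZ"  # mask string named in the PR description

# Hypothetical helper patterns, compiled once at import time.
# A mask directly preceded by a non-whitespace character:
_MISSING_SPACE_BEFORE = re.compile(rf"(?<=\S){MASK}")
# A mask directly followed by a non-whitespace character:
_MISSING_SPACE_AFTER = re.compile(rf"{MASK}(?=\S)")


def pad_mask_tokens(sentence: str) -> str:
    """Insert a space on each side of every mask where one is missing."""
    sentence = _MISSING_SPACE_BEFORE.sub(f" {MASK}", sentence)
    sentence = _MISSING_SPACE_AFTER.sub(f"{MASK} ", sentence)
    return sentence


if __name__ == "__main__":
    print(pad_mask_tokens("Hello,UNKWORDZto the world!"))
    print(pad_mask_tokens("UNKWORDZUNKWORDZ world!"))
```

Because each mask is padded individually, adjacent masks end up separated by a space and are therefore tokenized as separate tokens rather than merged into one.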

I found the error message very confusing, so I also added a check to make sure that the number of tokens is consistent for all sentences.
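A check of this kind could look roughly like the sketch below. The function name `check_token_counts`, the exception type, and the message wording are assumptions made for illustration; they are not taken from the DIANNA code base.

```python
from typing import Sequence


def check_token_counts(tokenized_sentences: Sequence[Sequence[str]]) -> None:
    """Raise a descriptive error if the sentences differ in token count.

    Hypothetical helper: masked variants of one input sentence are expected
    to have the same number of tokens, so a mismatch points at the tokenizer.
    """
    counts = {len(tokens) for tokens in tokenized_sentences}
    if len(counts) > 1:
        raise ValueError(
            "Inconsistent number of tokens across sentences: "
            f"found lengths {sorted(counts)}."
        )


if __name__ == "__main__":
    check_token_counts([["Hello", ",", "world", "!"],
                        ["Hello", ",", "UNKWORDZ", "!"]])  # passes silently
```

Failing early with an explicit message makes the cause visible at the tokenization step, instead of surfacing later as an unrelated shape error.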

I also updated the tests to specifically test the tokenizer, rather than the full explain text pipeline.

Todo

Adjust tokenizer in dashboard
Use compiled regexes

stefsmeets · 2023-05-10T09:53:40Z

Hi @laurasootes Can you have a look at this PR and let me know what you think?

The Ubuntu/Python3.7 build is failing because of #587

stefsmeets · 2023-05-30T11:09:40Z

Hi @laurasootes did you have any time to look at this? Should we ask someone else to do the review?

laurasootes

Nice! Very clean fix.

I was still able to find some combination that would break the code, but that could not really be considered a sentence. The sentences that I found earlier do now all work.

elboyran · 2023-05-30T15:18:21Z

[heart] Elena Ranguelova reacted to your message:


stefsmeets added 8 commits May 9, 2023 15:31

Add failing test

a87e2b6

Add unit test for SpacyTokenizer.tokenize

5798052

Try regex based solution

7a2ec92

Add word-token-word regex

b9627e5

Update test values to be in line with what the first tokenizer returns

3408299

Add check and raise if token mismatch is detected

64e1aa8

Trim tests

b70907c

Tweak code

833c2a6

stefsmeets marked this pull request as ready for review May 10, 2023 09:26

stefsmeets changed the title ~~Investigate the special character problem~~ Fix the special character problem May 10, 2023

stefsmeets added 2 commits May 10, 2023 11:32

Fix build on python3.7/ubuntu

2e678d8

Revert, because 2.12 does not exist for 3.7

d23c5f5

stefsmeets requested a review from laurasootes May 10, 2023 09:53

laurasootes approved these changes May 30, 2023


stefsmeets merged commit c1a3c52 into main May 31, 2023

stefsmeets deleted the fix-special-characters branch May 31, 2023 11:16
