Fix the special character problem #585
Conversation
Hi @laurasootes, can you have a look at this PR and let me know what you think? The Ubuntu/Python 3.7 build is failing because of #587.
Hi @laurasootes, did you have any time to look at this? Should we ask someone else to do the review?
Nice! Very clean fix.
I was still able to find some combination that would break the code, but that could not really be considered a sentence. The sentences that I found earlier do now all work.
@laurasootes approved this pull request.
This PR attempts to fix the 'special character problem'.
I have identified this as a failing string:
Hello, to the world!
Oddly enough, this works:
Hello, world!
I updated the fix in SpacyTokenizer to repair the sentence before sending it to the spacy tokenizer, instead of trying to make sense of the tokens after tokenization. The new fix uses regexes to add spaces around all instances of UNKWORDZ where they are missing. Hopefully this is a bit more future-proof when we come across new issues. The implementation differs slightly in that it does not consider multiple special characters as a single token; this is in line with the tokenization expected from spacy.
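For reference, the regex approach can be sketched roughly like this (a minimal illustration of the idea, not the actual SpacyTokenizer code; the function name is hypothetical):

```python
import re

MASK = "UNKWORDZ"  # mask token used when occluding words


def add_spaces_around_mask(sentence: str) -> str:
    """Insert a space on either side of each mask token where one is missing.

    Sketch of the "fix the sentence before tokenization" idea:
    gluing the mask to punctuation or words confuses the tokenizer,
    so we separate it up front.
    """
    # Add a space before the mask when it is glued to a preceding character.
    sentence = re.sub(rf"(\S)({MASK})", r"\1 \2", sentence)
    # Add a space after the mask when it is glued to a following character.
    sentence = re.sub(rf"({MASK})(\S)", r"\1 \2", sentence)
    return sentence


print(add_spaces_around_mask("Hello,UNKWORDZ world!"))
```

A sentence that already has spaces around the mask passes through unchanged, so the transformation is safe to apply unconditionally.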
I found the error message very confusing, so I also added a check to make sure that the number of tokens is consistent for all sentences.
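The consistency check could look something like the following (a hedged sketch; the function name, signature, and error message are illustrative, not the actual code in the PR):

```python
def check_consistent_token_count(masked_sentences, tokenize, expected_length):
    """Raise a clear error when any masked sentence tokenizes to a
    different number of tokens than the original sentence.

    `tokenize` is any callable mapping a sentence to a list of tokens.
    """
    for sentence in masked_sentences:
        tokens = tokenize(sentence)
        if len(tokens) != expected_length:
            raise ValueError(
                f"Tokenization inconsistency: {sentence!r} produced "
                f"{len(tokens)} tokens, expected {expected_length}."
            )
```

Failing fast with the offending sentence in the message makes the problem much easier to diagnose than the downstream shape error.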
I also updated the tests to specifically test the tokenizer, rather than the full explain text pipeline.
Todo