Fix the special character problem #585

Merged
merged 10 commits into from
May 31, 2023
stefsmeets
Contributor

@stefsmeets stefsmeets commented May 9, 2023

This PR attempts to fix the 'special character problem'.

I have identified this as a failing string:

Hello, to the world!

Oddly enough, this works:

Hello, world!

I updated the fix in SpacyTokenizer so that the sentence is repaired before it is sent to the spacy tokenizer, instead of trying to make sense of the tokens after tokenization. The new fix uses regexes to add spaces around all instances of UNKWORDZ where they are missing. Hopefully, it is also a bit more future-proof when we come across new issues.

The implementation differs slightly in that it does not treat multiple consecutive special characters as a single token. This is in line with the tokenization expected from spacy.
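To illustrate the idea of repairing the sentence before tokenization, here is a minimal sketch. It is not the code from this PR: the function name `pad_mask_tokens` and the two pattern names are hypothetical, and the only detail taken from the description is the `UNKWORDZ` mask string. It assumes that the problem is a mask token glued to neighbouring text, and it inserts a single space wherever one is missing.

```python
import re

# Mask string named in the PR description.
MASK = "UNKWORDZ"

# Hypothetical pre-compiled patterns (the Todo list mentions compiled regexes).
# Both are zero-width: they match a position, not characters.
_SPACE_MISSING_BEFORE = re.compile(rf"(?<=\S)(?={re.escape(MASK)})")
_SPACE_MISSING_AFTER = re.compile(rf"(?<={re.escape(MASK)})(?=\S)")


def pad_mask_tokens(sentence: str) -> str:
    """Insert a space on each side of MASK where none is present.

    Illustrative helper only; not part of the library's API.
    """
    sentence = _SPACE_MISSING_BEFORE.sub(" ", sentence)
    return _SPACE_MISSING_AFTER.sub(" ", sentence)


if __name__ == "__main__":
    print(pad_mask_tokens("UNKWORDZ, UNKWORDZ the world!"))
    print(pad_mask_tokens("UNKWORDZUNKWORDZ"))
```

In this sketch two adjacent masks are split into two separate tokens rather than merged, which matches the behaviour described above of not treating multiple special characters as a single token.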

I found the error message very confusing, so I also added a check to make sure that the number of tokens is consistent for all sentences.
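A consistency check of this kind could look roughly like the sketch below. The function name `check_token_counts`, the error type, and the message wording are assumptions made for illustration; the PR only states that the number of tokens must be the same for all sentences.

```python
from typing import List, Sequence


def check_token_counts(tokenized_sentences: Sequence[List[str]]) -> None:
    """Raise a descriptive error if the sentences differ in token count.

    Hypothetical helper; the actual check in the PR may be implemented differently.
    """
    counts = sorted({len(tokens) for tokens in tokenized_sentences})
    if len(counts) > 1:
        raise ValueError(
            f"Inconsistent number of tokens across sentences: found counts {counts}. "
            "All masked sentences must tokenize to the same length."
        )


if __name__ == "__main__":
    # Consistent token counts: no error is raised.
    check_token_counts([["Hello", ",", "world", "!"], ["UNKWORDZ", ",", "world", "!"]])
```

Failing early with the observed counts makes the cause visible at the tokenizer, instead of surfacing later as an unrelated shape error.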

I also updated the tests to specifically test the tokenizer, rather than the full explain text pipeline.

Todo

  • Adjust tokenizer in dashboard
  • Use compiled regexes

@stefsmeets stefsmeets marked this pull request as ready for review May 10, 2023 09:26
@stefsmeets stefsmeets changed the title Investigate the special character problem Fix the special character problem May 10, 2023
@stefsmeets
Contributor Author

Hi @laurasootes Can you have a look at this PR and let me know what you think?

The Ubuntu/Python3.7 build is failing because of #587

@stefsmeets
Contributor Author

Hi @laurasootes, have you had time to look at this? Should we ask someone else to do the review?

Contributor

@laurasootes laurasootes left a comment

Nice! Very clean fix.

I was still able to find some combination that would break the code, but that could not really be considered a sentence. The sentences that I found earlier do now all work.

@elboyran
Contributor

elboyran commented May 30, 2023 via email

@stefsmeets stefsmeets merged commit c1a3c52 into main May 31, 2023
@stefsmeets stefsmeets deleted the fix-special-characters branch May 31, 2023 11:16

3 participants