SentencePiece does not work properly for the Sinhala language because the Zero Width Joiner is removed #629

Closed
sarubi opened this issue Feb 23, 2021 · 0 comments · Fixed by #630


sarubi commented Feb 23, 2021

As reported in "Empty tokens in output vocabulary" (#276), the Zero Width Joiner character is replaced by whitespace. As a result, text in languages that require the Zero Width Joiner is altered, which produces wrong output. This affects languages and scripts such as Sinhala, Devanagari, Kannada, and Malayalam [1].

Ideally, we shouldn't replace the Zero Width Joiner (U+200D) with whitespace or an empty string, since it indicates that two characters should be joined, and it has zero width itself (it is not whitespace). It also needs to be present in order to decode the segmentation back to the raw text successfully. We should keep these special characters as they are.
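To illustrate the problem described above, here is a minimal sketch that uses only the Python standard library and does not call SentencePiece itself. The function `replace_zwj_with_space` is a hypothetical stand-in for a normalization step that maps U+200D to a space; it is an assumption for illustration, not the library's actual code. The sample word is the Sinhala spelling of "sri", which contains a Zero Width Joiner.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Sinhala "sri": consonant + al-lakuna + ZWJ + consonant + vowel sign.
word = "\u0dc1\u0dca\u200d\u0dbb\u0dd3"


def replace_zwj_with_space(text: str) -> str:
    """Hypothetical stand-in for a normalizer that maps ZWJ to whitespace."""
    return text.replace(ZWJ, " ")


def describe(text: str) -> list:
    """Return the Unicode name of each code point, for inspection."""
    return [unicodedata.name(ch, "UNKNOWN") for ch in text]


altered = replace_zwj_with_space(word)

print(describe(word))        # code points of the original word
print(describe(altered))     # ZWJ has been replaced by SPACE
print(word == altered)       # the text no longer round-trips
print(len(word.split()), len(altered.split()))  # one word became two pieces
```

Because the joiner is gone, the original string cannot be recovered from the altered one, and whitespace-based splitting now sees two pieces where there was a single word.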

[1] https://en.wikipedia.org/wiki/Zero-width_joiner
