Tokenization doesn't preserve diacritics #40

Closed
caffeine96 opened this issue Jan 7, 2022 · 3 comments

@caffeine96

I was recently working with the IndicBERT SentencePiece tokenizer and noticed something I was curious about. When sentences are encoded, many diacritics are not encoded at all. For example, in Hindi, the sentences "मेंने उसकी गेंद दी।" and "मैने उसको गेंद दी।" get the same encoding, even though one has the genitive marker and the other the dative marker. I have seen this for both Gujarati and Hindi. I believe the diacritics are being ignored because, when the encodings are decoded, some diacritics are missing.

I would like to know why this happens and whether there is a workaround.

@GokulNC GokulNC transferred this issue from AI4Bharat/indicnlp_catalog Jan 8, 2022
@anoopkunchukuttan
Collaborator

Can you share the segmentation outputs for this example, as well as for the Gujarati example you shared over mail? Please share them as text, not as images.

@gowtham1997
Member

gowtham1997 commented Jan 8, 2022

import transformers

# Loading the tokenizer with default settings drops diacritics, so these two words tokenize identically:
# tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
# print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह"))  # returns True

# Pass keep_accents=True instead to preserve diacritics:
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह"))  # returns False

^ Use this snippet to initialize the tokenizer so that it preserves accents and diacritics.

This is explained in issue #26. There is also a note about it in our README, in case you missed it.

Please let us know if this works.
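For intuition on why the default setting can collapse different words into one, here is a minimal standalone sketch. It is not the tokenizer's actual code: it assumes that accent stripping amounts to NFKD normalization followed by dropping every character whose Unicode general category starts with "M" (combining marks), and the helper name `strip_marks` is hypothetical. Under that assumption, Devanagari vowel signs and the candrabindu are removed along with Latin accents.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical helper: drop all Unicode combining marks after NFKD.

    This only illustrates the assumed effect of accent stripping; it is
    not taken from the transformers or IndicBERT source.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    # Categories Mn, Mc and Me cover accents as well as Indic vowel signs.
    return "".join(c for c in decomposed if not unicodedata.category(c).startswith("M"))


# "यहाँ" written with explicit code points: YA, HA, vowel sign AA, candrabindu.
yahan = "\u092f\u0939\u093e\u0901"
yah = "\u092f\u0939"

# "उसकी" (genitive) and "उसको" (dative) differ only in the final vowel sign.
uski = "\u0909\u0938\u0915\u0940"
usko = "\u0909\u0938\u0915\u094b"

print(strip_marks(yahan) == yah)
print(strip_marks(uski) == strip_marks(usko))
```

If the assumption holds, both comparisons are true, which matches the identical encodings reported above. With `keep_accents=True` this stripping step is skipped, so the vowel signs reach the SentencePiece model intact.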

@caffeine96
Author

Thanks for pointing that out. That solves the issues with both Hindi and Gujarati.
