Tokenization issues #26

Closed
aryamanarora opened this issue Jun 3, 2021 · 2 comments

aryamanarora commented Jun 3, 2021

I'm running into a potentially troublesome issue with the tokenization for indic-bert. All vowel matras (diacritics) seem to be dropped during tokenization, which loses a lot of information about the word. Perhaps some sort of Unicode issue?

Minimal example (prints True) in which two very different words are tokenized identically:

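import transformers
# Load the indic-bert tokenizer with its default settings.
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
# Prints True: both words produce the same token sequence.
print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह"))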

bert-base-multilingual-cased does not have this issue.

Is this an issue on my end? I see the problem both on Colab and on my machine (Mac, Python 3.8.8), and @nitinvwaran sees it too. I also had to install sentencepiece to get the tokenizer to work, btw.
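
For context, here is a minimal sketch of how accent stripping could produce this collision. It assumes the tokenizer's default preprocessing removes every Unicode mark character, which this thread does not confirm; `strip_marks` is a hypothetical helper, not part of transformers.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical helper: NFKD-decompose, then drop all Unicode marks (Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )


# The two words from the example, written as escapes to make the code points explicit.
yahan = "\u092f\u0939\u093e\u0901"  # letter YA, letter HA, vowel sign AA, candrabindu
yah = "\u092f\u0939"                # letter YA, letter HA

# Show the general category of each code point in the longer word.
for ch in yahan:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Once the marks are removed, the two words are the same string.
print(strip_marks(yahan) == yah)  # True
```

If that assumption holds, the vowel sign and the candrabindu are discarded before the SentencePiece model ever sees the word, which would explain the identical tokenizations.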

aryamanarora commented Jun 3, 2021

Found a fix thanks to @pranavmaneriker.

import transformers
-tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
+tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert', keep_accents=True)
print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह")) # prints False

Would be nice to mention this in the README.
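
A small check along these lines might help catch the problem right after loading a tokenizer. This is only a sketch: `find_collisions` and the two stand-in tokenizers are hypothetical names used for illustration, and with transformers you would pass `tokenizer.tokenize` as the callable instead.

```python
import unicodedata
from typing import Callable, Iterable, List, Tuple


def find_collisions(
    tokenize: Callable[[str], List[str]],
    pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the pairs of distinct words that tokenize to identical sequences."""
    return [(a, b) for a, b in pairs if a != b and tokenize(a) == tokenize(b)]


# Stand-in tokenizers (character level, no model download), for illustration only.
def keeps_marks(word: str) -> List[str]:
    return list(word)


def drops_marks(word: str) -> List[str]:
    return [c for c in word if not unicodedata.category(c).startswith("M")]


# The word pair from this issue, written as escapes.
pairs = [("\u092f\u0939\u093e\u0901", "\u092f\u0939")]

print(find_collisions(keeps_marks, pairs))  # empty list: the words stay distinct
print(find_collisions(drops_marks, pairs))  # the pair is reported as a collision
```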

gowtham1997 (Member) commented

Sorry for the late reply.
Added this to the README and referenced your issue.

Thanks
