
[QUESTION] About the Tokenizer #2

Open
loretoparisi opened this issue Feb 5, 2019 · 2 comments

@loretoparisi

For a romanization project I'm working on, I'm using polyglot-tokenizer with good results for most of the Indian languages. Are you aware of it? My question is whether NLTK is better at tokenization.
Thank you.

@adamshamsudeen
Owner

Thanks for the suggestion. I tried using it, and there are no significant improvements for Malayalam. Maybe it works well for the other languages mentioned.
[screenshot: tokenizer comparison, 2019-02-06]

@loretoparisi
Author

loretoparisi commented Feb 6, 2019

Thank you for testing it! I was aware of NLTK, but in the end I preferred that one because of its extended Indian-language support. Currently I'm looking at the Byte Pair Encoding approach to get rid of a language-specific model and build a cross-lingual model; I saw you are working on the same. In my case I also do Indian language classification (the source is Wikipedia as well, for Indian-script languages), and most of the problems were actually due to the tokenizer rather than the classifier itself. Hopefully BPE will give better results for a language-agnostic approach!
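For anyone landing here, the BPE idea mentioned above can be illustrated with a minimal sketch: starting from characters, repeatedly merge the most frequent adjacent symbol pair. This is a toy illustration of the general technique (not the tokenizer used by either project here); the corpus and function names are made up for the example.

```python
# Toy Byte Pair Encoding sketch: learn merges from a word-frequency dict.
# This is an illustration of the BPE idea, not any project's actual code.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the space-separated vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the pair into a single symbol."""
    # Lookaround anchors keep the match on whole symbols, not substrings.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    new_symbol = "".join(pair)
    return {pattern.sub(new_symbol, word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} corpus."""
    # Start from individual characters, with an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# With this toy corpus, "low" becomes a single symbol after a few merges.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=5)
```

Because the merges operate on raw symbols, nothing in the procedure is tied to a particular language or script, which is what makes it attractive for a cross-lingual setup.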
