
[QUESTION] About the Tokenizer #2

Open
loretoparisi opened this issue Feb 5, 2019 · 2 comments

@loretoparisi

For a romanization project I'm working on, I'm using polyglot-tokenizer with good results for most of the Indian languages. Are you aware of it? My question is whether NLTK is better at tokenization.
Thank you.

@adamshamsudeen
Owner

Thanks for the suggestion. I tried using it, and there are no significant improvements for Malayalam. Maybe it works well for the other languages mentioned.
[screenshot: tokenizer comparison, 2019-02-06]

@loretoparisi
Author

loretoparisi commented Feb 6, 2019

Thank you for testing it! I was aware of NLTK, but in the end I preferred that one because of its extended Indian-language support. Currently I'm looking at the Byte Pair Encoding approach to get rid of a language-specific model and build a cross-lingual model; I saw you are working on the same. In my case I also do Indian language classification (the source is Wikipedia as well, for Indian-script languages), and most of the problems were actually due to the tokenizer rather than the classifier itself. Hopefully BPE will give better results for a language-agnostic approach!
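For anyone landing here, the BPE idea mentioned above can be illustrated with a minimal sketch: starting from characters, repeatedly merge the most frequent adjacent symbol pair. This is a toy illustration of the general technique (not the tokenizer used by either project here); the corpus and function names are made up for the example.

```python
# Toy Byte Pair Encoding sketch: learn merges from a word-frequency dict.
# This is an illustration of the BPE idea, not any project's actual code.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the space-separated vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the pair into a single symbol."""
    # Lookaround anchors keep the match on whole symbols, not substrings.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    new_symbol = "".join(pair)
    return {pattern.sub(new_symbol, word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} corpus."""
    # Start from individual characters, with an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# With this toy corpus, "low" becomes a single symbol after a few merges.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=5)
```

Because the merges operate on raw symbols, nothing in the procedure is tied to a particular language or script, which is what makes it attractive for a cross-lingual setup.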
