Tokenization

Mika Hämäläinen edited this page Mar 26, 2022 · 5 revisions

Tokenization is an important NLP task that can be performed at the word or sentence level. UralicNLP provides functionality for tokenizing text and can handle abbreviations in all languages supported by the Universal Dependencies project.

Full tokenization

To tokenize a text, all you need to do is run the following:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.tokenize(text)
>> [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]

This returns a list of sentences, each of which is a list of word tokens.
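The nested output can be processed with plain Python. A minimal sketch, using the example output shown above, that flattens the sentences into a single token list and counts tokens per sentence:

```python
# Example output from tokenizer.tokenize(), as shown above
sentences = [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]

# Flatten the sentence-level structure into one list of tokens
tokens = [token for sentence in sentences for token in sentence]
print(tokens)
# ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

# Count the number of tokens in each sentence
lengths = [len(sentence) for sentence in sentences]
print(lengths)  # [4, 6]
```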

Sentence tokenization

It is also possible to tokenize text at the sentence level:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.sentences(text)
>> ['My dog ran.', 'Then a cat showed up!']

This returns a list of sentences.

Word tokenization

One can also get a list of words without sentence boundaries:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.words(text)
>> ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

This returns a list of words.
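The flat word list is convenient for simple corpus statistics. A small sketch, using the example output shown above, that builds a case-insensitive frequency count while skipping punctuation tokens:

```python
from collections import Counter

# Example output from tokenizer.words(), as shown above
words = ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

# Case-insensitive frequency count, ignoring punctuation tokens
counts = Counter(w.lower() for w in words if w.isalpha())
print(counts['dog'])  # 1
print(sum(counts.values()))  # 8 alphabetic tokens in total
```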

Tokenize Arabic

UralicNLP has a special method that tokenizes and lemmatizes Arabic text. The input and output formats are the same as for the full tokenizer.

from uralicNLP import tokenizer
tokenizer.tokenize_arabic("ومن الناس من يقول آمنا بالله وباليوم الآخر وما هم بمؤمنين")
>> [['وَ', 'مَنّ', 'الناس', 'مَنّ', 'قال', 'آمنا', 'بِ', 'الله', 'وَ', 'بـ', 'يوم', 'ال', 'آخر', 'وَ', 'ما', 'هم', 'بِ', 'مؤمن']]
# Web browsers may show this list in an inverted order; the first element is وَ

The method relies on the Arabic FST, which needs to be downloaded first by running:

python3 -m uralicNLP.download -l ara