uralicNLP.string_processing

The uralicNLP.string_processing module provides the following functions:

iso_to_name

Returns the English name of the language identified by an ISO code.

  from uralicNLP import string_processing
  string_processing.iso_to_name("kpv")
  >> Komi-Zyrian

char_split

Splits a word into characters more reliably than Python's own " ".join(s) idiom. It tries to keep each diacritic attached to the character it belongs to instead of separating the two. Compare the outputs in the following example:

  from uralicNLP import string_processing
  s = 'h̭ɛ̮ŋkkɐᴅ'
  " ".join(s)
  >> h ̭ ɛ ̮ ŋ k k ɐ ᴅ
  string_processing.char_split(s)
  >> ['h̭', 'ɛ̮', 'ŋ', 'k', 'k', 'ɐ', 'ᴅ']

In short, it takes a string and returns a list of its characters, with diacritics kept on their base characters.
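The idea of keeping diacritics with their base character can be sketched with the standard library alone. This is not uralicNLP's implementation, which may handle more cases; the function name split_keeping_diacritics is hypothetical, and the sketch assumes that every diacritic is a Unicode combining mark with a non-zero combining class.

```python
import unicodedata

def split_keeping_diacritics(text):
    """Hypothetical helper: split text into characters, attaching each
    combining mark to the base character that precedes it."""
    chunks = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if chunks and unicodedata.combining(ch) != 0:
            chunks[-1] += ch  # glue the mark onto the previous character
        else:
            chunks.append(ch)  # start a new character
    return chunks

# 'h' + COMBINING CIRCUMFLEX ACCENT BELOW, 'ɛ' + COMBINING BREVE BELOW, ...
word = "h\u032d\u025b\u032e\u014bkk\u0250\u1d05"
parts = split_keeping_diacritics(word)
print(len(word), len(parts))
```

A plain list(word) treats each of the nine code points as its own item, while the sketch groups them into seven characters. For real data, string_processing.char_split is the better choice.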

filter_arabic

Returns the parts of the text that are written in Arabic. The parameters are:

text The text to process
keep_vowels=True Whether to keep the vowel diacritics; set to False to remove them
combine_by="" The string used to join the Arabic text fragments; could be set to a space

  from uralicNLP import string_processing
  a = "تحميل PDF"
  string_processing.filter_arabic(a)
  >> تحميل
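To illustrate how the three parameters might interact, here is a rough standard-library sketch. It is not the library's code: the function name extract_arabic is hypothetical, and the sketch assumes that "Arabic" means the Unicode block U+0600 to U+06FF and that the vowel diacritics are the marks U+064B to U+0652. The actual filter_arabic may draw these boundaries differently.

```python
import re

# Assumption: Arabic text is any run of code points in the basic Arabic block.
# Note that this block also contains Arabic punctuation and digits.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+")
# Assumption: vowel diacritics are fathatan (U+064B) through sukun (U+0652).
VOWEL_MARKS = re.compile(r"[\u064B-\u0652]")

def extract_arabic(text, keep_vowels=True, combine_by=""):
    """Hypothetical helper: return the Arabic-script fragments of text,
    joined by combine_by."""
    fragments = ARABIC_RUN.findall(text)
    if not keep_vowels:
        # Strip the diacritics and drop fragments that become empty.
        fragments = [VOWEL_MARKS.sub("", f) for f in fragments]
        fragments = [f for f in fragments if f]
    return combine_by.join(fragments)

print(extract_arabic("تحميل PDF"))
print(extract_arabic("كِتَاب book", keep_vowels=False))
```

With the default combine_by="", fragments from different parts of the input run together, so passing combine_by=" " is usually preferable when the input mixes several Arabic words with other scripts.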

UralicNLP is an open-source Python library by Mika Hämäläinen
