Morphological analysis for Christian Urmi (North-Eastern Neo-Aramaic)

timarkh/uniparser-grammar-urmi

Urmi morphological analyzer

This is a rule-based morphological analyzer for Christian Urmi (Afro-Asiatic > North-Eastern Neo-Aramaic). It is based on a formalized description of Urmi morphology and uses uniparser-morph for parsing. It performs full morphological analysis of Urmi words (lemmatization, POS tagging, grammatical tagging). The text to be analyzed should be written in the Latin-based alphabet (the Assyrian New Alphabet).

How to use

Python package

The analyzer is available as a Python package. If you want to analyze Urmi texts in Python, install the module:

pip3 install uniparser-urmi

Import the module and create an instance of the UrmiAnalyzer class. Set mode='strict' if you are going to process text in the standard Assyrian New Alphabet, or mode='nodiacritics' if you expect some words to lack diacritics (e.g. a plain t where the standard orthography has a t with a diacritic). After that, you can parse single tokens or lists of tokens with analyze_words(), or parse a frequency list with analyze_wordlist(). Here is a simple example:

from uniparser_urmi import UrmiAnalyzer
a = UrmiAnalyzer(mode='strict')

analyses = a.analyze_words('вajjannux')
# The parser is initialized before first use, so expect
# some delay here (usually several seconds)

# You will get a list of Wordform objects
# The analysis attributes are stored in their properties
# as string values, e.g.:
for ana in analyses:
    print(ana.wf, ana.lemma, ana.gramm)

# You can also pass lists (even nested lists) and specify
# output format ('xml', 'json' or 'conll')
# If you pass a list, you will get a list of analyses
# with the same structure
analyses = a.analyze_words([['вajjannux'], ['Ptixli', 'tarra', 'd', 'xə', 'вetə', '.']],
                           format='xml')
analyses = a.analyze_words([['вajjannux'], ['Ptixli', 'tarra', 'd', 'xə', 'вetə', '.']],
                           format='conll')
analyses = a.analyze_words(['вajjannux', [['вəxtə'], ['Ptixli', 'tarra', 'd', 'xə', 'вetə', '.']]],
                           format='json')

Refer to the uniparser-morph documentation for the full list of options.
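Since analyze_words() takes tokens rather than running text, you will usually need to tokenize your input first. The following is a minimal sketch of one way to do that; split_into_sentences is a hypothetical helper and not part of uniparser_urmi, and its rules (words are runs of letters or digits, sentences end at '.', '!' or '?') are simplifying assumptions that may not suit every text:

```python
import re
import unicodedata

# Hypothetical helper, NOT part of uniparser_urmi.
# Assumption: a word is a run of letters/digits; any other
# non-space character (e.g. punctuation) is a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def split_into_sentences(text):
    """Return a list of sentences, each one a list of token strings."""
    # Compose base letter + combining mark pairs where a precomposed
    # character exists; uncomposed marks would otherwise split a word.
    text = unicodedata.normalize("NFC", text)
    sentences, current = [], []
    for token in TOKEN_RE.findall(text):
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


sentences = split_into_sentences("Ptixli tarra d xə вetə.")
print(sentences)
# [['Ptixli', 'tarra', 'd', 'xə', 'вetə', '.']]

# With the package installed, the nested list can then be analyzed:
# analyses = a.analyze_words(sentences)
```

The nested structure mirrors the list-of-lists input shown above, so the analyses come back grouped by sentence.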

If you want to quickly check an analysis for one particular word, you can also use the command-line interface. Here is an example for the word вajjannux:

python3 -m uniparser_urmi вajjannux

Word lists

Alternatively, you can use a preprocessed word list. The wordlists directory contains a list of words from a 622-thousand-word Christian Urmi corpus (wordlist.csv) with 63,000 unique tokens, a list of analyzed tokens (wordlist_analyzed.txt; each line contains all possible analyses for one word in XML format), and a list of tokens the parser could not analyze (wordlist_unanalyzed.txt). The recall of the analyzer on the corpus texts is about 76%.
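If you want to estimate a similar figure for your own frequency list, the sketch below computes the share of corpus tokens that received at least one analysis. It rests on assumptions: the frequency list is taken to be tab-separated "word, count" rows (check the actual layout of wordlist.csv before using it), the function names are hypothetical, the sample counts are made up for illustration, and this may not be exactly how the figure above was computed:

```python
import csv
import io


def read_freq_list(handle, delimiter="\t"):
    """Read (word, frequency) pairs; rows without a numeric count are skipped.

    Assumption: two columns, word then count. Adjust to the real file layout.
    """
    rows = []
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) >= 2 and row[1].strip().isdigit():
            rows.append((row[0], int(row[1])))
    return rows


def token_coverage(freq_rows, unanalyzed):
    """Share of tokens (weighted by frequency) whose word form was analyzed."""
    total = covered = 0
    for word, freq in freq_rows:
        total += freq
        if word not in unanalyzed:
            covered += freq
    return covered / total if total else 0.0


# Toy data: the counts and the form 'xyz' are invented for illustration
sample = io.StringIO("tarra\t6\nвetə\t3\nxyz\t1\n")
rows = read_freq_list(sample)
print(token_coverage(rows, unanalyzed={"xyz"}))
# 0.9
```

For real data, open the frequency list with encoding='utf-8' and build the unanalyzed set from the lines of wordlist_unanalyzed.txt.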

Description format

The grammar is described in the uniparser-morph format and consists of a description of the inflection (paradigms.txt) and a grammatical dictionary (lexemes.txt). The dictionary contains descriptions of individual lexemes. Each entry specifies the lexeme's stem, its part-of-speech tag and other grammatical information, its consonant root, its inflectional type (paradigm), and English and/or Russian translations. See more about the format in the uniparser-morph documentation.
