A Python 3 package for extracting syntactic complexity measures from CoNLL-U annotations.

eaklykova/syntaxcomp

syntaxcomp

syntaxcomp calculates syntactic complexity measures from morphosyntactically annotated texts in CoNLL-U format. It also supports sentence segmentation (T-unit and clause extraction) and NP extraction.

Disclaimer: correct results are guaranteed only for texts annotated with UDPipe 2.12. Note that syntaxcomp relies heavily on CoNLL-U Parser (the conllu package).

Installation

pip install syntaxcomp

Usage Example

>>> from syntaxcomp.complexity import SentenceComplexity, TextComplexity

>>> example = """
# udpipe_model = english-ewt-ud-2.12-230717
# sent_id = 1
# text = This is a text containing two sentences.
1	This	this	PRON	DT	Number=Sing|PronType=Dem	4	nsubj	_	_
2	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	4	cop	_	_
3	a	a	DET	DT	Definite=Ind|PronType=Art	4	det	_	_
4	text	text	NOUN	NN	Number=Sing	0	root	_	_
5	containing	contain	VERB	VBG	VerbForm=Ger	4	acl	_	_
6	two	two	NUM	CD	NumForm=Word|NumType=Card	7	nummod	_	_
7	sentences	sentence	NOUN	NNS	Number=Plur	5	obj	_	SpaceAfter=No
8	.	.	PUNCT	.	_	4	punct	_	_

# sent_id = 2
# text = This is the second sentence.
1	This	this	PRON	DT	Number=Sing|PronType=Dem	5	nsubj	_	_
2	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	5	cop	_	_
3	the	the	DET	DT	Definite=Def|PronType=Art	5	det	_	_
4	second	second	ADJ	JJ	Degree=Pos|NumType=Ord	5	amod	_	_
5	sentence	sentence	NOUN	NN	Number=Sing	0	root	_	SpaceAfter=No
6	.	.	PUNCT	.	_	5	punct	_	SpaceAfter=No
"""

>>> tc = TextComplexity(example)
>>> tc.info()
Number of Sentences: 2
Number of Words: 12
Number of Clauses: 3
Number of T-Units: 2
Mean Sentence Length: 6.0
Mean Clause Length: 4.0
Mean T-Unit Length: 6.0
Mean Number of Clauses per Sentence: 1.5
Mean Number of Clauses per T-Unit: 1.5
Mean Tree Depth: 3
Median Tree Depth: 3.0
Minimum Tree Depth: 2
Maximum Tree Depth: 4
Mean Dependency Distance: 2.42
Node-to-Terminal-Node Ratio: 1.5
Average Levenshtein Distance between POS: 3
Average Levenshtein Distance between deprel: 4
Average NP Length: 1.8
Complex NP Ratio: 0.6
Number of Combined Clauses: 1
Number of Coordinate Clauses: 0
Number of Subordinate Clauses: 1
Coordinate to Combined Clause Ratio: 0.0
Subordinate to Combined Clause Ratio: 1.0
Coordinate to Subordinate Clause Ratio: 0.0
Coordinate Clause to Sentence Ratio: 0.0
Subordinate Clause to Sentence Ratio: 0.5
Percentage of root Clauses: 67.0%
Percentage of acl Clauses: 33.0%
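
To make two of the reported measures more concrete, here is a minimal standard-library sketch that derives tree depth and mean dependency distance from the ID, UPOS and HEAD columns of the example above. It is not syntaxcomp's implementation. The conventions used (punctuation is skipped, depth is counted in nodes from the root, and the root token contributes its distance to HEAD 0) are assumptions, chosen because they reproduce the values shown above. The helper names `tree_depth` and `dependency_distances` are hypothetical.

```python
from statistics import mean

# (ID, UPOS, HEAD) triples copied from the two example sentences above.
SENT_1 = [(1, "PRON", 4), (2, "AUX", 4), (3, "DET", 4), (4, "NOUN", 0),
          (5, "VERB", 4), (6, "NUM", 7), (7, "NOUN", 5), (8, "PUNCT", 4)]
SENT_2 = [(1, "PRON", 5), (2, "AUX", 5), (3, "DET", 5), (4, "ADJ", 5),
          (5, "NOUN", 0), (6, "PUNCT", 5)]


def tree_depth(sentence):
    """Number of nodes on the longest root-to-leaf path (assumed convention)."""
    heads = {tok_id: head for tok_id, _, head in sentence}

    def depth(tok_id):
        # Follow HEAD links upwards until the artificial root (0) is reached.
        steps = 0
        while tok_id != 0:
            tok_id = heads[tok_id]
            steps += 1
        return steps

    # Punctuation tokens are ignored when looking for the deepest node.
    return max(depth(tok_id) for tok_id, upos, _ in sentence if upos != "PUNCT")


def dependency_distances(sentence):
    """|ID - HEAD| for every non-punctuation token (root measured against 0)."""
    return [abs(tok_id - head) for tok_id, upos, head in sentence if upos != "PUNCT"]


depths = [tree_depth(s) for s in (SENT_1, SENT_2)]
distances = dependency_distances(SENT_1) + dependency_distances(SENT_2)

print(depths)                     # [4, 2]
print(mean(depths))               # 3
print(round(mean(distances), 2))  # 2.42
```

Other definitions of dependency distance leave out the root token, so values computed by other tools may differ from these.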

Alternatively, you can directly pass the result of conllu.parse as input:

>>> from conllu import parse
>>> anno = parse(example)
>>> tc = TextComplexity(anno)

For SentenceComplexity, conllu.models.TokenList is currently the only accepted input:

>>> sc = SentenceComplexity(anno[0])
>>> sc.info()
Number of Words: 7
Number of Clauses: 2
Clauses: ['This is a text', 'containing two sentences']
Number of T-Units: 1
T-Units: ['This is a text containing two sentences']
Number of NPs: 3
NPs: ['This', 'a text', 'two sentences']
Tree Depth: 4
Mean Dependency Distance: 2
POS Chain: ['PRON', 'AUX', 'DET', 'NOUN', 'VERB', 'NUM', 'NOUN']
deprel Chain: ['nsubj', 'cop', 'det', 'root', 'acl', 'nummod', 'obj']
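
The POS and deprel chains are what the Levenshtein-based text measures compare. The sketch below is a plain dynamic-programming edit distance over token sequences, written with the standard library only. It illustrates the idea and is not taken from syntaxcomp. The chains for the second sentence are read off the example annotation, and how syntaxcomp pairs sentences in longer texts (adjacent pairs or all pairs) is not stated here, so the sketch compares just the two example sentences. The function name `levenshtein` is hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences with unit insert/delete/substitute cost."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete item_a
                               current[j - 1] + 1,       # insert item_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Chains of the first sentence, as printed by sc.info() above.
pos_1 = ['PRON', 'AUX', 'DET', 'NOUN', 'VERB', 'NUM', 'NOUN']
deprel_1 = ['nsubj', 'cop', 'det', 'root', 'acl', 'nummod', 'obj']

# Chains of the second sentence, read off the example annotation (punctuation dropped).
pos_2 = ['PRON', 'AUX', 'DET', 'ADJ', 'NOUN']
deprel_2 = ['nsubj', 'cop', 'det', 'amod', 'root']

print(levenshtein(pos_1, pos_2))        # 3
print(levenshtein(deprel_1, deprel_2))  # 4
```

With only two sentences there is a single pair to compare, so these distances coincide with the averages reported by tc.info().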

To display the text and the dependency tree, pass verbose=True (for TextComplexity, only the text will be printed):

>>> SentenceComplexity(anno[0], verbose=True)
This is a text containing two sentences.
(deprel:root) form:text lemma:text upos:NOUN [4]
    (deprel:nsubj) form:This lemma:this upos:PRON [1]
    (deprel:cop) form:is lemma:be upos:AUX [2]
    (deprel:det) form:a lemma:a upos:DET [3]
    (deprel:acl) form:containing lemma:contain upos:VERB [5]
        (deprel:obj) form:sentences lemma:sentence upos:NOUN [7]
            (deprel:nummod) form:two lemma:two upos:NUM [6]
    (deprel:punct) form:. lemma:. upos:PUNCT [8]
