TreebankAnalytics

What is it?

TreebankAnalytics is an NLP tool that helps you compute valuable information on graphbanks, i.e. deep syntactic corpora such as those released for the SemEval 2014 Task 8 shared task (Broad-coverage semantic dependency parsing) or DeepSequoia, a graph-based deep syntax treebank for French.

It can be used for:
Analyzing a treebank to extract interesting properties (percentage of cycles, non-planar graphs, number of semantically void nodes, ...).
Evaluating a system output against a gold reference and extracting different kinds of statistics:
- Classic precision/recall/F-score metrics
- Breakdown by labels (with the possibility to exclude or keep certain labels)
- Breakdown by sentence bins (with the possibility to set the bin size, etc.)
- Classic precision/recall/F-score metrics with some labels left out
- Classic precision/recall/F-score metrics on certain labels only
- ...
Converting an input format into a different output format.

How to install

TreebankAnalytics is developed with Python 3, so you need a recent version of Python 3 (3.2 or above). To install it, clone this repository and run the following command (sudo may be necessary):

python3 setup.py install

The setup.py script will take care of dependencies for you (PyYAML only for now).

Formats

TreebankAnalytics supports the following formats:

deepsequoia format (called sequoia in the software), which is used to annotate DeepSequoia.
sdp format: the one used during the SemEval 2014 shared task.
sagae format: the one used in the DAGParser adapted from Sagae and Tsujii (2008). This format extends the CoNLL format and encodes multiple governors by repeating the token line with a different head id and label.
Standard CoNLL-X format (since sequoia and sagae are both backward compatible with it).
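
To illustrate the multi-governor encoding of the sagae format, here is a minimal Python sketch. It is not the tool's actual reader: it assumes tab-separated fields in the usual CoNLL-X column order (HEAD in column 7, DEPREL in column 8), and the function name read_multi_head_sentence, the sample sentence and its labels are hypothetical.

```python
from collections import OrderedDict


def read_multi_head_sentence(lines):
    """Group repeated token lines into one entry per token.

    Assumed column order (CoNLL-X): ID, FORM, LEMMA, CPOSTAG, POSTAG,
    FEATS, HEAD, DEPREL. A token with several governors is assumed to
    appear on several lines, once per (head, label) pair.
    """
    tokens = OrderedDict()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines (sentence separators)
        cols = line.split("\t")
        token_id, form = int(cols[0]), cols[1]
        head, label = int(cols[6]), cols[7]
        entry = tokens.setdefault(token_id, {"form": form, "heads": []})
        entry["heads"].append((head, label))
    return tokens


# Hypothetical sentence: "John" is governed by both "wants" and "sleep".
sample = [
    "1\tJohn\t_\t_\t_\t_\t2\tsubj",
    "1\tJohn\t_\t_\t_\t_\t4\tsubj",
    "2\twants\t_\t_\t_\t_\t0\troot",
    "3\tto\t_\t_\t_\t_\t4\tmark",
    "4\tsleep\t_\t_\t_\t_\t2\txcomp",
]

tokens = read_multi_head_sentence(sample)
for token_id, entry in tokens.items():
    print(token_id, entry["form"], entry["heads"])
```

A token that appears on a single line simply ends up with one (head, label) pair, which is why a plain CoNLL-X file can be read the same way.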

My format is not supported.

You can add your own format through a simple API. TBA

Analyzers

TreebankAnalytics is shipped with several kinds of analyzers:

VoidAnalyzer which analyzes the number of semantically empty tokens (i.e. tokens with no incoming or outgoing edges) in a treebank.
CrossingEdgesAnalyzer which analyzes the number of crossing edges in a treebank.
NonPlanarAnalyzer which analyzes the number of non planar graphs in a treebank.
CyclesAnalyzer which analyzes the number of cycles, graphs and DAGs in a treebank.
LabelsAnalyzer which analyzes the labels distribution in a treebank.
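
The properties measured by the first four analyzers can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: the function names are hypothetical, a graph is assumed to be a list of (head, dependent) pairs over token positions 1..n, two edges are assumed to cross when their spans strictly interleave, and a graph is assumed to be non-planar as soon as it contains one such crossing.

```python
def void_tokens(num_tokens, edges):
    """Return the positions of tokens with no incoming and no outgoing edge."""
    connected = set()
    for head, dep in edges:
        connected.add(head)
        connected.add(dep)
    return [t for t in range(1, num_tokens + 1) if t not in connected]


def count_crossing_pairs(edges):
    """Count pairs of edges whose spans strictly interleave (a < c < b < d)."""
    spans = sorted((min(h, d), max(h, d)) for h, d in edges)
    crossings = 0
    for i, (a, b) in enumerate(spans):
        # Spans are sorted, so every later span (c, d) satisfies a <= c.
        for c, d in spans[i + 1:]:
            if a < c < b < d:
                crossings += 1
    return crossings


def has_cycle(num_tokens, edges):
    """Return True if the directed graph contains at least one cycle."""
    successors = {t: [] for t in range(num_tokens + 1)}
    for head, dep in edges:
        successors[head].append(dep)
    white, grey, black = 0, 1, 2  # unvisited, on the current path, finished
    color = {t: white for t in successors}

    def visit(node):
        color[node] = grey
        for nxt in successors[node]:
            if color[nxt] == grey:
                return True  # back edge: nxt is on the current path
            if color[nxt] == white and visit(nxt):
                return True
        color[node] = black
        return False

    return any(color[t] == white and visit(t) for t in successors)


# Hypothetical 6-token graph: token 1 has two governors, token 6 is isolated.
edges = [(2, 1), (4, 1), (2, 4), (3, 5)]
print(void_tokens(6, edges))
print(count_crossing_pairs(edges))
print(has_cycle(6, edges))
```

Under these assumptions, a graph without any directed cycle is a DAG, and a treebank-level analyzer would simply aggregate such per-sentence results.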

Using analyzers

Analyzers are used through the analyze command (see TreebankAnalytics analyze -h for more details). They can be customized with a configuration file in YAML format.

In this config file, you need to specify which analyzers you'd like to use:

Analyzers:
    - VoidAnalyzer
    - NonPlanarAnalyzer

This will use two different analyzers (VoidAnalyzer, NonPlanarAnalyzer).

Scorers

TreebankAnalytics is shipped with several kinds of scorers:

AllScorer which gives the labeled precision (LP), recall (LR) and F-score (LF) as well as the unlabeled precision (UP), recall (UR) and F-score (UF).
LabelsScorer which gives the LP/LR/LF and UP/UR/UF for each label type.
FilteredScorer which gives the global LP/LR/LF and UP/UR/UF for certain labels only (or for all labels except those you specify).
SentenceBinsScorer which gives the LP/LR/LF and UP/UR/UF grouped by sentence bins.
EdgeLengthBinsScorer which is the same as SentenceBinsScorer but for edge length (undirected distance between head and dependent).
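
As a reminder of what these metrics measure, the sketch below computes labeled and unlabeled precision, recall and F-score in Python. It is a minimal illustration under stated assumptions, not the tool's scoring code: edges are assumed to be (head, dependent, label) triples, the names prf and unlabeled are hypothetical, and the gold and system graphs are made up.

```python
def prf(gold, system):
    """Return (precision, recall, F-score) in percent for two sets of edges."""
    correct = len(gold & system)
    precision = 100.0 * correct / len(system) if system else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def unlabeled(edges):
    """Drop the label so that only the (head, dependent) attachment counts."""
    return {(head, dep) for head, dep, _label in edges}


# Hypothetical gold and system edges for one sentence.
gold = {(2, 1, "nsubj"), (0, 2, "root"), (2, 3, "obj"), (3, 4, "nmod")}
system = {(2, 1, "nsubj"), (0, 2, "root"), (2, 3, "nmod"), (2, 4, "nmod")}

lp, lr, lf = prf(gold, system)
up, ur, uf = prf(unlabeled(gold), unlabeled(system))
print("LP={} LR={} LF={} UP={} UR={} UF={}".format(lp, lr, lf, up, ur, uf))
```

Here two of the four system edges match the gold edges exactly, and a third matches once labels are ignored, so the unlabeled scores are higher than the labeled ones.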

Using scorers

Scorers are used through the eval command (see TreebankAnalytics eval -h for more details). They can be customized with a configuration file in YAML format.

In this config file, you need to specify which scorers you'd like to use:

Scorers:
    - AllScorer
    - LabelsScorer
    - FilteredScorer

This will use three different scorers (AllScorer, LabelsScorer, FilteredScorer). You can also customize every single scorer. See the description of scorers' options below.

AllScorer

No customization available.

SentenceBinsScorer & EdgeLengthBinsScorer

Available options:

binStart (type: integer): starting point of the first bin (default = 1).
binStop (type: integer): end point of the last bin. All sentences (or edges, for EdgeLengthBinsScorer) above this threshold are aggregated in a single group (default = 100).
binStep (type: integer): size of each bin (default = 10).

The default options give bins like this:

1-10
11-20
21-30
...
91-100
100+
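
The way the three options interact can be sketched in Python as follows. This is an assumption about the binning logic, not code taken from the tool: the function names bin_label and edge_length are hypothetical, and rejecting values below binStart is a choice made for this sketch.

```python
def bin_label(value, bin_start=1, bin_stop=100, bin_step=10):
    """Return the label of the bin that a sentence or edge length falls into."""
    if value > bin_stop:
        return "{}+".format(bin_stop)  # everything above binStop is grouped
    if value < bin_start:
        raise ValueError("value is below binStart")
    index = (value - bin_start) // bin_step
    low = bin_start + index * bin_step
    high = min(low + bin_step - 1, bin_stop)
    return "{}-{}".format(low, high)


def edge_length(head, dep):
    """Undirected distance between head and dependent positions."""
    return abs(head - dep)


for length in (1, 10, 11, 25, 250):
    print(length, bin_label(length))
print(bin_label(edge_length(7, 2)))
```

With the default options this reproduces bins such as 1-10, 11-20 and 21-30, and puts every value above binStop into the 100+ group.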

LabelsScorer

Available options:

filteredLabels (type: list): list of labels (default = []).
keep (type: boolean): whether to keep the filteredLabels (default = true).

If keep is true, the scorer only shows the filteredLabels. If keep is false, the scorer shows all labels except the filtered ones.

Example:

LabelsScorer:
    filteredLabels:
        - nsubj
        - nmod
    keep: true

This gives:

Label	NumberInGold	LP	LR	LF	UP	UR	UF
nsubj	8000	90	90	90	92	93	92.5
nmod	6000	..	..	..	..	..	..

FilteredScorer

Available options:

filteredLabels (type: list): list of labels (default = []).
keep (type: boolean): whether to keep the filteredLabels (default = false).

If keep is true, the scorer only computes the LP/LR/LF and UP/UR/UF for these labels. If keep is false, the scorer computes the scores for all labels except the filtered ones.
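
The effect of filteredLabels and keep can be sketched as a simple filter in Python. This is a hypothetical illustration rather than the tool's code: the function name filter_edges is an assumption, as is the idea that the same filter is applied to the gold and system edges before the scores are computed. How the tool treats an empty filteredLabels list when keep is true is not specified here; the sketch simply returns no edges in that case.

```python
def filter_edges(edges, filtered_labels, keep):
    """Select (head, dependent, label) edges according to filteredLabels/keep."""
    labels = set(filtered_labels)
    if keep:
        # keep = true: only edges whose label is listed survive
        return {edge for edge in edges if edge[2] in labels}
    # keep = false: edges whose label is listed are left out
    return {edge for edge in edges if edge[2] not in labels}


# Hypothetical edges for one sentence.
edges = {(2, 1, "nsubj"), (0, 2, "root"), (2, 3, "obj"), (3, 4, "nmod")}

kept = filter_edges(edges, ["nsubj", "nmod"], keep=True)
excluded = filter_edges(edges, ["nsubj", "nmod"], keep=False)
print(sorted(kept))
print(sorted(excluded))
```

With keep set to false and an empty list, which matches the documented defaults of FilteredScorer, every edge is retained.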

Default config

You always need to provide a config file (there is no default). A standard config file would look like this:

Scorers:
    - AllScorer

General options

Some options may be specified for both Analyzers and Scorers:

General:
    showNameScorers: true | false
    showNameAnalyzers: true | false

The first option (showNameScorers) controls whether the name of each scorer is printed before its results. The second option (showNameAnalyzers) does the same for analyzers.

Example (showNameScorers is set to true):

AllScorer
LP    LR    LF    UP     UR    UF
90    85    87    90     85    87

Example (showNameScorers is set to false):

LP    LR    LF    UP     UR    UF
90    85    87    90     85    87

This option is useful when you want to output a CSV file that can be used directly by pgfplots, for example.

Converters

Using converters

You can convert from one format to another, by specifying the input format and the output format. See TreebankAnalytics convert -h for more details.

How to cite

If you're using the software, please cite the following work:

Corentin Ribeyre, Méthodes d’Analyse Supervisée pour l’Interface Syntaxe-Sémantique, PhD Thesis, Université Paris 7 Diderot, 2016.

@phdthesis{ribeyre:tel-01323245,
  title       = {{M{\'e}thodes d'Analyse Supervis{\'e}e pour l'Interface Syntaxe-S{\'e}mantique}},
  author      = {Ribeyre, Corentin},
  url         = {https://hal.archives-ouvertes.fr/tel-01323245},
  school      = {{Universit{\'e} Paris Diderot}},
  year        = {2016},
  month       = Jan,
  keywords    = {Deep syntax ; Syntax-semantic interface ; Graph Isomorphism Problem ; Transition-based parsing ; Graph parsing ; Interface syntaxe-s{\'e}mantique ; Syntaxe profonde ; Analyse supervis{\'e}e de graphes ; Analyse par transitions ; Isomorphisme de graphes},
}
