CHILDES-SRL

A corpus of semantic role labels auto-generated for 5M words of American-English child-directed speech.

Purpose

The purpose of this repository is to:

  • host the CHILDES-SRL corpus and the code used to generate it, and
  • suggest recipes for training BERT on CHILDES-SRL for classifying token spans into semantic role arguments.

Inspiration and code for a BERT-based semantic role labeler come from the AllenNLP toolkit. An SRL demo can be found here.

The code is for research purposes only.

Data

There are 2 manually annotated ("human-based") datasets, named after the year of their release:

  • data/pre_processed/human-based-2018_srl.txt
  • data/pre_processed/human-based-2008_srl.txt

The former (2018) is an extended version of the latter (2008) and also includes SRL annotation for prepositions.

This repository also contains SRL labels generated by an automatic SRL tagger applied to a custom corpus of approximately 5M words of American-English child-directed language. The plain utterances are in data/pre_processed/childes-20191206_mlm.txt, and the file containing both the utterances and their SRL annotation is data/pre_processed/childes-20191206_srl.txt.
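
For a quick look at the corpus, the utterance file can be inspected with a few lines of Python. This is a minimal sketch that assumes one whitespace-tokenized utterance per line; the actual file layout may differ:

```python
from pathlib import Path

# Minimal sketch: inspect the child-directed utterances used for MLM training.
# Assumes one whitespace-tokenized utterance per line; the real layout may differ.
corpus_path = Path("data/pre_processed/childes-20191206_mlm.txt")
utterances = [line.split() for line in corpus_path.read_text().splitlines() if line.strip()]

num_words = sum(len(u) for u in utterances)
print(f"{len(utterances):,} utterances, {num_words:,} words")  # expect roughly 5M words
```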

History

  • 2008: The BabySRL project started as a collaboration between Cynthia Fisher, Dan Roth, Michael Connor and Yael Gertner, whose published work is available here.

  • 2016: The most recent work prior to this project can be found here.

  • 2019: Under the supervision of Cynthia Fisher at the Department of Psychology at UIUC, explorations into the ability of BERT to perform SRL tagging began. In particular, we experimented with joint training on SRL and MLM, a procedure similar to what is proposed in https://arxiv.org/pdf/1901.11504.pdf (a sketch of the idea follows this list).

  • 2020 (Summer): Having found little benefit from jointly training BERT on SRL and MLM using CHILDES data, a new line of research into the grammatical capabilities of RoBERTa began. Development moved here.
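
The joint-objective idea referenced above can be sketched as a shared encoder feeding two task heads, with the MLM and SRL cross-entropy losses summed at each step. The model below is a toy stand-in, not the training code used in this project; all sizes and names are placeholders:

```python
import torch
import torch.nn as nn

class JointSrlMlm(nn.Module):
    """Toy stand-in for BERT: a shared encoder with an MLM head and an SRL tagging head."""

    def __init__(self, vocab_size=1000, num_srl_labels=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(hidden, vocab_size)      # predicts identities of masked tokens
        self.srl_head = nn.Linear(hidden, num_srl_labels)  # predicts a BIO role tag per token

    def forward(self, token_ids):                  # token_ids: (batch, seq)
        x = self.embed(token_ids).transpose(0, 1)  # encoder expects (seq, batch, hidden)
        h = self.encoder(x).transpose(0, 1)
        return self.mlm_head(h), self.srl_head(h)

model = JointSrlMlm()
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks positions with no target
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One joint step on a random toy batch; real batches would come from the corpus files above.
token_ids = torch.randint(0, 1000, (8, 20))
mlm_targets = torch.full((8, 20), -100, dtype=torch.long)
mlm_targets[:, 5] = token_ids[:, 5]          # score MLM at one "masked" position only;
                                             # a real setup would also replace the input with [MASK]
srl_targets = torch.randint(0, 32, (8, 20))  # one gold role tag per token

mlm_logits, srl_logits = model(token_ids)
loss = (loss_fn(mlm_logits.flatten(0, 1), mlm_targets.flatten())
        + loss_fn(srl_logits.flatten(0, 1), srl_targets.flatten()))
loss.backward()
optimizer.step()
```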

Generating the CHILDES-SRL corpus

To annotate 5M words of child-directed speech using a semantic role tagger trained with AllenNLP, execute data_tools/make_srl_training_data_from_model.py.
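
Internally, this relies on a pretrained AllenNLP SRL model (the allennlp and allennlp-models packages). A minimal sketch of tagging a single utterance; the model archive URL is a placeholder and should match the AllenNLP release you use:

```python
from allennlp.predictors.predictor import Predictor

# Load a pretrained BERT-based SRL model published by AllenNLP.
# The archive URL below is a placeholder; pick the one matching your AllenNLP version.
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/structured-prediction-srl-bert.2020.12.15.tar.gz"
)

result = predictor.predict(sentence="the dog chased the ball")
for verb in result["verbs"]:
    # Each detected verb comes with one BIO tag per token, e.g.
    # ['B-ARG0', 'I-ARG0', 'B-V', 'B-ARG1', 'I-ARG1']
    print(verb["verb"], verb["tags"])
```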

To generate a corpus of human-annotated semantic role labels for a small section of CHILDES, execute data_tools/make_srl_training_data_from_human.py.

Quality of auto-generated tags

How well does the AllenNLP SRL tagger perform on the human-annotated 2008 CHILDES SRL data? Below are per-label F1 scores comparing its output to that of trained human annotators.

      ARG-A1 f1= 0.00
      ARG-A4 f1= 0.00
     ARG-LOC f1= 0.00
        ARG0 f1= 0.95
        ARG1 f1= 0.93
        ARG2 f1= 0.79
        ARG3 f1= 0.44
        ARG4 f1= 0.80
    ARGM-ADV f1= 0.70
    ARGM-CAU f1= 0.84
    ARGM-COM f1= 0.00
    ARGM-DIR f1= 0.48
    ARGM-DIS f1= 0.68
    ARGM-EXT f1= 0.38
    ARGM-GOL f1= 0.00
    ARGM-LOC f1= 0.68
    ARGM-MNR f1= 0.68
    ARGM-MOD f1= 0.78
    ARGM-NEG f1= 0.99
    ARGM-PNC f1= 0.03
    ARGM-PPR f1= 0.00
    ARGM-PRD f1= 0.15
    ARGM-PRP f1= 0.39
    ARGM-RCL f1= 0.00
    ARGM-REC f1= 0.00
    ARGM-TMP f1= 0.84
      ARGRG1 f1= 0.00
      R-ARG0 f1= 0.00
      R-ARG1 f1= 0.00
  R-ARGM-CAU f1= 0.00
  R-ARGM-LOC f1= 0.00
  R-ARGM-TMP f1= 0.00
     overall f1= 0.88
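
For reference, per-label F1 scores of this kind can be computed from gold and predicted tag sequences. The sketch below counts matches at the token level after stripping the BIO prefix; the evaluation behind the numbers above may instead score whole spans:

```python
from collections import Counter

def per_label_f1(gold_seqs, pred_seqs):
    """Token-level F1 per role label, ignoring the BIO prefix."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_seqs, pred_seqs):
        for g, p in zip(gold, pred):
            g = g.split("-", 1)[-1]  # "B-ARG0" -> "ARG0"; "O" stays "O"
            p = p.split("-", 1)[-1]
            if p != "O":
                (tp if p == g else fp)[p] += 1
            if g != "O" and p != g:
                fn[g] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        scores[label] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

gold = [["B-ARG0", "B-V", "B-ARG1", "O"]]
pred = [["B-ARG0", "B-V", "O", "O"]]
print(per_label_f1(gold, pred))  # {'ARG0': 1.0, 'V': 1.0, 'ARG1': 0.0}
```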

Compatibility

Tested on Ubuntu 16.04, Python 3.6, and torch==1.2.0
