Releases: amir-zeldes/gum

V10.1.0 - corrections and minor updates

16 May 20:45
9df08e9

This is a corrected version of GUM series 10 (no additional documents since V10.0.0)

  • Added ExtPos to multiword fixed expressions
  • Revised Cxn annotations to follow latest UCxn standard for construction annotation
  • Content-identical with UD v2.14

V10.0.0 - added court, essay, letter and podcast genres

15 Feb 20:19
e7491c8

This is the first release of GUM series 10, with 16 genres in total.

  • Four new growing genres:
    • court - courtroom transcripts
    • essay - argumentative essays
    • letter - personal and professional correspondence on paper (not e-mails)
    • podcast - podcasts on various topics
  • Many corrections to all annotation layers

Note on document names compared to V9:

  • With the addition of the court genre, one conversation from GUM V9 which is actually from courtroom proceedings has been moved to the new court genre (GUM_conversation_court -> GUM_court_carpet)
  • To compensate for the removed conversation, an additional conversation has been added in V10: GUM_conversation_toys

V9.2.0 - RST++, MSeg and CxG

10 Nov 16:29
3b0ab7d

This is the final release of the GUM 9.X series, which is the basis for the contents of the equivalent Universal Dependencies release v2.13. New in this version:

  • Enhanced Rhetorical Structure Theory annotations using RST++:
    • Additional, tree-breaking secondary discourse relations
    • Annotation of connectives and many other signaling devices for discourse relations
  • Morphological segmentation based on Unimorph in the MSeg annotation (e.g. un-break-able)
  • Construction Grammar annotation of constructions in the Cxn annotation
  • A second human-written summary for each document in the test set
  • Numerous corrections and consistency improvements bringing this corpus closer to the English Web Treebank (EWT)
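
As a rough illustration of how the MSeg example above could be consumed, the sketch below splits a hyphen-delimited segmentation such as un-break-able into morphemes. It assumes the segmentation is stored as an `MSeg=...` entry in a pipe-separated CoNLL-U MISC field; that placement and the helper names `parse_misc` and `morphemes` are assumptions for illustration, not something these notes specify.

```python
def parse_misc(misc: str) -> dict[str, str]:
    """Parse a pipe-separated MISC field into a key/value dict."""
    if not misc or misc == "_":
        return {}
    result = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        result[key] = value
    return result


def morphemes(misc: str) -> list[str]:
    """Return the morphemes of an assumed MSeg entry, or [] if absent."""
    seg = parse_misc(misc).get("MSeg")
    # Assumption: morphemes are delimited by plain hyphens, as in un-break-able
    return seg.split("-") if seg else []


print(morphemes("SpaceAfter=No|MSeg=un-break-able"))
```

A word that itself contains a hyphen would need extra handling, which this sketch does not attempt.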

V9.1.0 - Numerous corrections

05 May 16:42
b153503
  • Numerous corrections to all layers
  • Consistency improved with other LDC and UD English corpora
    • Added xpos tag GW for goeswith handling as in EWT
    • MWT fixed for "let's"
    • Label consistency with EWT for assigning iobj without obj
    • Many RST corrections for the DISRPT shared task
  • Data in this version matches the UD v2.12 release

V9.0.0 - new data, summaries and entity salience

02 Feb 18:55
5f724df
  • 20 documents added including more conversational data (total tokens: 203,879)
  • Abstractive summaries for each document in metadata
  • Annotations for most salient entities in each document
  • Foreign language tags identify individual source languages
  • New process for reconstructing Reddit text data in top-level folders (see README.md)
  • Many corrections to all annotation layers

V8.1.0 - final version of GUM series 8

06 Jan 16:55
aa6621a
  • Added centering theory annotations (ranked cf, cb, sentence transition types)
  • Numerous corrections
  • Final version of GUM V8.X ahead of V9 release

V8.0.0 - new data and new RST relations

31 Jan 22:16
ed1d2e9
  • 25 documents added including more conversational data (total tokens: 180,849)
  • New RST discourse relations, now covering 32 labels in a two-level hierarchy, as discourse constituent and dependency trees
  • More consistent UD syntax, including a new obl:agent relation for passive agents
  • New Wikidata identifiers for wikification layer (including nested and pronominal mentions; see #97)
  • Many corrections to all annotation layers

V7.3.0 - HYPH tokens, RST depth, 6-way infstat, pred/disc coref, MIN spans and XML in deps

05 Nov 14:44
c108b9b

Stable version 7.3.0, corresponds to UD version 2.9. Same 168 documents as in 7.2.0 but substantial changes to some annotations and tokenization, leading to more total tokens (152,308).

Changes:

  • tokenization now follows EWT and recent LDC corpora in separating hyphenated compounds (e.g. "data-driven" is three tokens)
  • new xpos/extended PTB tag for such tokens: HYPH
  • added RST depth to discourse relations in .conllu and .rsd files, allowing deterministic conversion of discourse dependencies to fully hierarchical RST constituent trees
  • added # newpar comments to conllu files expressing potentially nested block elements, such as paragraphs, headings or bulleted lists
  • added a MISC annotation XML to .conllu files expressing all other XML markup in the corpus
  • shortened entity bracket format in .conllu files to consolidate with Coref UD data / Universal Anaphora initiative
  • removed accessible-generic information status annotations for countries and absolute date expressions
  • added information status categories closer to SFB632 guidelines, including in conllu files. Now a six-way distinction: giv:act, giv:inact, acc:inf, acc:com, acc:aggr and new
  • added pred and disc coref edge types for indefinite predication and discourse deixis respectively
  • added MIN spans and coreference type to entity annotations in .conllu files
  • many corrections and additional validations
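
To make the tokenization change above concrete, here is a minimal sketch that splits a hyphenated compound so that each hyphen becomes its own token, turning "data-driven" into three tokens. It is not the corpus's actual tokenizer: the function name `split_hyphenated` is hypothetical, and real tokenization rules will have exceptions this ignores.

```python
import re


def split_hyphenated(token: str) -> list[str]:
    """Split a token on hyphens, keeping each hyphen as a separate token."""
    # The capturing group makes re.split return the hyphens as well
    parts = re.split(r"(-)", token)
    # Drop empty strings produced by leading or trailing hyphens
    return [part for part in parts if part]


print(split_hyphenated("data-driven"))
```

Under the scheme described in the list, the hyphen token in the middle is the one that would receive the new HYPH xpos tag.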

V7.2.0 - OntoGUM coreference version and corrections

09 Aug 15:39
65c7794
  • Added separate OntoGUM version of coreference annotations following the OntoNotes scheme, in addition to the more comprehensive GUM coreference annotations
  • Numerous corrections

V7.1.0 - enhanced dependencies, consistency overhaul and more

05 May 19:14
4525197

(Note: this version contains the content-identical superset of annotations producing UD_English-GUM in Universal Dependencies V2.8)

  • Massive round of consistency corrections and harmonization with English Web Treebank, PTB and OntoNotes
  • Added enhanced dependencies
  • More error validations
  • Added multiword tokens to CoNLL-U format (caution: token IDs like 1-2 now in use!)
  • Added reconstructed ellipsis tokens to CoNLL-U format (caution: token IDs like 8.1 now in use!)
  • Added metadata to CoNLL-U files
  • Better escape characters in Wikification
  • ANNIS conversion support for null nodes to accommodate ellipsis tokens
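
Because IDs like 1-2 and 8.1 now appear in the first CoNLL-U column, code that assumes every ID is a plain integer will fail. The sketch below shows one way to tell the three ID shapes apart and to keep only ordinary word lines. The function names `classify_token_id` and `syntactic_words` are hypothetical helpers for illustration, not part of any GUM tooling.

```python
import re

WORD_ID = re.compile(r"^[0-9]+$")          # e.g. 7
RANGE_ID = re.compile(r"^[0-9]+-[0-9]+$")  # multiword token, e.g. 1-2
EMPTY_ID = re.compile(r"^[0-9]+\.[0-9]+$") # ellipsis (empty) node, e.g. 8.1


def classify_token_id(token_id: str) -> str:
    """Return 'word', 'multiword' or 'empty' for a CoNLL-U ID value."""
    if WORD_ID.match(token_id):
        return "word"
    if RANGE_ID.match(token_id):
        return "multiword"
    if EMPTY_ID.match(token_id):
        return "empty"
    raise ValueError(f"unrecognized token ID: {token_id!r}")


def syntactic_words(conllu_lines):
    """Yield the column lists of ordinary word lines only."""
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment/metadata lines
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if classify_token_id(columns[0]) == "word":
            yield columns
```

Whether to skip range and empty-node lines or process them separately depends on the task; basic dependency trees only use the plain integer IDs, while enhanced dependencies can refer to empty nodes.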