Releases · amir-zeldes/gum

16 May 20:45

amir-zeldes

V10.1.0

9df08e9

V10.1.0 - corrections and minor updates (Latest)

This is a corrected version of GUM series 10 (no additional documents since V10.0.0).

Added ExtPos to fixed multiword expressions
Revised Cxn annotations to follow the latest UCxn standard for construction annotation
Content-identical with UD v2.14


15 Feb 20:19

amir-zeldes

V10.0.0

e7491c8

V10.0.0 - added court, essay, letter and podcast genres

This is the first release of GUM series 10, with 16 genres in total.

Four new growing genres:
- court - courtroom transcripts
- essay - argumentative essays
- letter - personal and professional correspondence on paper (not e-mails)
- podcast - podcasts on various topics
Many corrections to all annotation layers

Note on document names compared to V9:

With the addition of the court genre, one conversation from GUM V9, which is actually from courtroom proceedings, has been moved to the new court genre (GUM_conversation_court -> GUM_court_carpet).
To compensate for the removed conversation, an additional conversation has been added in V10: GUM_conversation_toys.
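
For scripts that refer to documents by name, the rename above can be handled with a small lookup. The following is a minimal sketch in Python, not part of the release: the `RENAMED_V9_TO_V10` table and both helper functions are hypothetical names, only the single rename stated above is covered, and the `GUM_<genre>_<title>` naming pattern is an assumption read off the document names mentioned in this note.

```python
# Hypothetical lookup table: V9 document name -> V10 document name.
# Only the one rename stated in the release note is listed.
RENAMED_V9_TO_V10 = {
    "GUM_conversation_court": "GUM_court_carpet",
}


def v10_name(doc_name: str) -> str:
    """Return the V10 name for a V9 document name (unchanged if not renamed)."""
    return RENAMED_V9_TO_V10.get(doc_name, doc_name)


def genre_of(doc_name: str) -> str:
    """Extract the genre, assuming the GUM_<genre>_<title> naming pattern."""
    return doc_name.split("_")[1]
```

With this table, `genre_of(v10_name("GUM_conversation_court"))` yields `court`, while names that were not renamed pass through unchanged.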


10 Nov 16:29

amir-zeldes

V9.2.0

3b0ab7d

V9.2.0 - RST++, MSeg and CxG

This is the final release of the GUM 9.X series, which is the basis for the contents of the equivalent Universal Dependencies release v2.13. New in this version:

Enhanced Rhetorical Structure Theory annotations using RST++:
- Additional, tree-breaking secondary discourse relations
- Annotation of connectives and many other signaling devices for discourse relations
Morphological segmentation based on UniMorph in the MSeg annotation (e.g. un-break-able)
Construction Grammar annotation of constructions in the Cxn annotation
A second human-written summary for each document in the test set
Numerous corrections and consistency improvements bringing this corpus and the English Web Treebank (EWT) closer together
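
As an illustration of how the hyphen-delimited MSeg value might be consumed, here is a minimal Python sketch. It assumes the segmentation is stored in the CoNLL-U MISC column as `MSeg=un-break-able`; only the segmented form itself comes from the note above, and the function names are hypothetical. The split is naive and does not try to handle words that contain a literal hyphen.

```python
def parse_misc(misc: str) -> dict:
    """Parse a CoNLL-U MISC field ('_' or 'Key=Value|Key=Value') into a dict."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def morphs(misc: str) -> list:
    """Return the morphs from an MSeg value, or [] if none is present.

    Assumption: the key is spelled 'MSeg' and morphs are separated by '-'.
    """
    seg = parse_misc(misc).get("MSeg")
    return seg.split("-") if seg else []
```

For example, `morphs("MSeg=un-break-able")` gives the three morphs `un`, `break` and `able`.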


05 May 16:42

amir-zeldes

V9.1.0

b153503

V9.1.0 - Numerous corrections

Numerous corrections to all layers
Improved consistency with other LDC and UD English corpora:
- Added xpos tag GW for goeswith handling as in EWT
- MWT fixed for "let's"
- Label consistency with EWT for assigning iobj without obj
- Many RST corrections for the DISRPT shared task
Data in this version corresponds to the UD v2.12 release


02 Feb 18:55

amir-zeldes

V9.0.0

5f724df

V9.0.0 - new data, summaries and entity salience

20 documents added including more conversational data (total tokens: 203,879)
Abstractive summaries for each document in metadata
Annotations for most salient entities in each document
Foreign language tags identify individual source languages
New process for reconstructing Reddit text data in top-level folders (see README.md)
Many corrections to all annotation layers


06 Jan 16:55

amir-zeldes

V8.1.0

aa6621a

V8.1.0 - final version of GUM series 8

Added centering theory annotations (ranked cf, cb, sentence transition types)
Numerous corrections
Final version of GUM V8.X ahead of V9 release


31 Jan 22:16

amir-zeldes

V8.0.0

ed1d2e9

V8.0.0 - new data and new RST relations

25 documents added including more conversational data (total tokens: 180,849)
New RST discourse relations, now covering 32 labels in a two-level hierarchy, as discourse constituent and dependency trees
More consistent UD syntax, including a new obl:agent relation for passive agents
New Wikidata identifiers for wikification layer (including nested and pronominal mentions; see #97)
Many corrections to all annotation layers


05 Nov 14:44

amir-zeldes

V7.3.0

c108b9b

V7.3.0 - HYPH tokens, RST depth, 6-way infstat, pred/disc coref, MIN spans and XML in deps

Stable version 7.3.0, corresponds to UD version 2.9. Same 168 documents as in 7.2.0 but substantial changes to some annotations and tokenization, leading to more total tokens (152,308).

Changes:

tokenization now follows EWT and recent LDC corpora in separating hyphenated compounds (e.g. "data-driven" is three tokens)
new xpos/extended PTB tag for such tokens: HYPH
added RST depth to discourse relations in .conllu and .rsd files, allowing deterministic conversion of discourse dependencies to fully hierarchical RST constituent trees
added # newpar comments to .conllu files expressing potentially nested block elements, such as paragraphs, headings or bulleted lists
added a MISC annotation XML to .conllu files expressing all other XML markup in the corpus
shortened entity bracket format in .conllu files to consolidate with Coref UD data / Universal Anaphora initiative
removed accessible-generic information status annotations for countries and absolute date expressions
added information status categories closer to SFB632 guidelines, including in .conllu files. Now a six-way distinction: giv:act, giv:inact, acc:inf, acc:com, acc:aggr and new
added pred and disc coref edge types for indefinite predication and discourse deixis respectively
added MIN spans and coreference type to entity annotations in .conllu files
many corrections and additional validations
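
To make the tokenization change concrete, the sketch below splits a hyphenated compound so that each hyphen becomes its own token and marks hyphen tokens with the HYPH tag. This is a rough illustration in Python under simplifying assumptions, not the corpus's actual tokenizer: the function names are hypothetical, and any exceptions in the real guidelines are not modeled.

```python
import re


def split_hyphenated(word: str) -> list:
    """Split on hyphens, keeping each hyphen as a separate token."""
    # The capturing group makes re.split keep the delimiters;
    # empty strings (e.g. from a leading hyphen) are dropped.
    return [tok for tok in re.split(r"(-)", word) if tok]


def hyph_tags(tokens: list) -> list:
    """Pair each token with 'HYPH' if it is a hyphen, else None."""
    return [(tok, "HYPH" if tok == "-" else None) for tok in tokens]
```

Calling `split_hyphenated("data-driven")` returns three tokens (`data`, `-`, `driven`), which matches the example given in the change list.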


09 Aug 15:39

amir-zeldes

V7.2.0

65c7794

V7.2.0 - OntoGUM coreference version and corrections

Added separate OntoGUM version of coreference annotations following the OntoNotes scheme, in addition to the more comprehensive GUM coreference annotations
Numerous corrections


05 May 19:14

amir-zeldes

V7.1.0

4525197

V7.1.0 - enhanced dependencies, consistency overhaul and more

(Note: this version contains the content-identical superset of annotations producing UD_English-GUM in Universal Dependencies V2.8)

Massive round of consistency corrections and harmonization with English Web Treebank, PTB and OntoNotes
Added enhanced dependencies
More error validations
Added multiword tokens to CoNLL-U format (caution: token IDs like 1-2 now in use!)
Added reconstructed ellipsis tokens to CoNLL-U format (caution: token IDs like 8.1 now in use!)
Added metadata to CoNLL-U files
Better character escaping in the wikification layer
ANNIS conversion support for null nodes to accommodate ellipsis tokens
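
Because token IDs such as 1-2 and 8.1 now appear in the ID column, code that assumed plain integer IDs may need adjusting. The following Python sketch shows one way to tell the three ID shapes apart and keep only ordinary words. The function names and the sample rows are invented for illustration and are not taken from the corpus; the ID shapes themselves (integer, range, decimal) follow the CoNLL-U format.

```python
import re


def id_kind(token_id: str) -> str:
    """Classify a CoNLL-U ID column value."""
    if re.fullmatch(r"[0-9]+", token_id):
        return "word"
    if re.fullmatch(r"[0-9]+-[0-9]+", token_id):
        return "multiword"  # range line such as 1-2
    if re.fullmatch(r"[0-9]+\.[0-9]+", token_id):
        return "empty"      # reconstructed ellipsis token such as 8.1
    raise ValueError(f"unrecognized CoNLL-U ID: {token_id!r}")


def basic_words(conllu_lines):
    """Return (ID, FORM) pairs for ordinary words only.

    Comment lines, blank lines, multiword ranges and empty nodes are skipped.
    """
    words = []
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if id_kind(cols[0]) == "word":
            words.append((int(cols[0]), cols[1]))
    return words


# Invented sample rows (unused columns left as "_"), not copied from the corpus.
SAMPLE = [
    "# text = Let's go",
    "\t".join(["1-2", "Let's"] + ["_"] * 8),
    "\t".join(["1", "Let"] + ["_"] * 8),
    "\t".join(["2", "'s"] + ["_"] * 8),
    "\t".join(["3", "go"] + ["_"] * 8),
]
```

Running `basic_words(SAMPLE)` skips the `1-2` range line and keeps the three word rows; a naive `int(cols[0])` on every row would instead fail on the range and decimal IDs.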

