HIPE-2020 dataset


Original dataset
Document type	newspapers (mid-19th to mid-20th century)
Languages	English, French, German
Annotation guidelines
Annotation tool	INCEpTION
Original format and tagging scheme	`.tsv, IOB`
Annotations	NERC, EL (towards Wikidata, dump of 2019.11.13)
Version (used in HIPE-2022)	`v1.4`
Related publication	Overview of CLEF-HIPE-2020, Extended Overview of CLEF-HIPE-2020
License

Coarse-grained tagset	Fine-grained tagset	Nesting applies	Linking applies
pers	pers.ind	yes	yes
	pers.coll	yes	yes
	pers.ind.articleauthor	yes	yes
org	org.adm	yes	yes
	org.ent	yes	yes
	org.ent.pressagency	yes	yes
prod	prod.media	yes	yes
	prod.doctr	yes	yes
time	time.date.abs	yes	yes
loc	loc.adm.town	yes	yes
	loc.adm.reg	yes	yes
	loc.adm.nat	yes	yes
	loc.adm.sup	yes	yes
	loc.phys.geo	yes	yes
	loc.phys.hydro	yes	yes
	loc.phys.astro	yes	yes
	loc.oro	yes	yes
	loc.fac	yes	yes
	loc.add.phys	yes	yes
	loc.add.elec	yes	yes
	loc.unk	yes	yes
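
In the table above, each fine-grained tag starts with its coarse-grained type, followed by dot-separated subtypes. The sketch below derives a coarse label from a fine one on that basis. The helper name `coarse_of` is hypothetical and not part of any HIPE tooling, and the optional `B-`/`I-` prefix handling assumes the IOB scheme mentioned in the dataset description.

```python
def coarse_of(label: str) -> str:
    """Map a fine-grained label to its coarse-grained type.

    Assumption: the coarse type is the first dot-separated component,
    as in the tagset table (e.g. 'loc.adm.town' -> 'loc').
    An IOB prefix ('B-' or 'I-') is preserved if present.
    """
    if label == "O":
        return label
    prefix, sep, tag = label.partition("-")
    if sep and prefix in ("B", "I"):
        return f"{prefix}-{tag.split('.', 1)[0]}"
    return label.split(".", 1)[0]


# A few tags taken from the table above.
FINE_TAGS = ["pers.ind.articleauthor", "org.ent.pressagency", "prod.media",
             "time.date.abs", "loc.adm.town"]
coarse_by_fine = {tag: coarse_of(tag) for tag in FINE_TAGS}
```

This can be handy when comparing NERC-Fine predictions against NERC-Coarse gold labels, provided the files at hand follow the same naming pattern.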

The hipe2020 dataset can be used for:

Tasks: NERC-Coarse, NERC-Fine, NEL.
Challenges: Multilingual Newspaper Coarse, Multilingual Newspaper Fine, Global Adaptation Coarse.

Annotation guidelines: mostly compatible with the letemps and newseye datasets.
Documents: hipe2020 documents correspond to newspaper articles.
Train set: for this dataset, there is no training set, only a dev set that is representative of the test set in terms of newspapers and periods.
Sentence splitting: performed automatically on OCRed text using pySBD (results are not perfect).
Metonymic sense: literal and metonymic annotations are in separate columns.
Known glitches:
- some negative offsets in Partial are wrong/off
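
As an illustration of the notes above, here is a minimal sketch of reading IOB-tagged `.tsv` content and extracting literal and metonymic entity spans from their separate columns. The sample text is made up for this example. The column names (`TOKEN`, `NE-COARSE-LIT`, `NE-COARSE-METO`) and the convention that tab-free lines starting with `#` are comments are assumptions that should be checked against the header of the actual files. The helpers `read_columns` and `iob_spans` are hypothetical.

```python
import csv

# Made-up sample; column names are assumptions, not taken from the dataset.
SAMPLE = (
    "TOKEN\tNE-COARSE-LIT\tNE-COARSE-METO\n"
    "# illustrative comment line\n"
    "Washington\tB-loc\tB-org\n"
    "denied\tO\tO\n"
    "the\tO\tO\n"
    "report\tO\tO\n"
    "from\tO\tO\n"
    "New\tB-loc\tO\n"
    "York\tI-loc\tO\n"
    ".\tO\tO\n"
)


def read_columns(text):
    """Parse tab-separated rows into dicts keyed by the header row.

    Blank lines and tab-free lines starting with '#' are skipped.
    """
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not (ln.startswith("#") and "\t" not in ln)]
    # QUOTE_NONE keeps quotation-mark tokens from being treated as CSV quotes.
    return list(csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))


def iob_spans(tokens, labels):
    """Group IOB labels into (type, start, end_exclusive, text) spans."""
    spans, start, etype = [], None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing span
        inside = label.startswith("I-") and etype == label[2:]
        if start is not None and not inside:
            spans.append((etype, start, i, " ".join(tokens[start:i])))
            start, etype = None, None
        # Open a span on 'B-', or leniently on an 'I-' with no open span.
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, etype = i, label[2:]
    return spans


rows = read_columns(SAMPLE)
tokens = [r["TOKEN"] for r in rows]
literal = iob_spans(tokens, [r["NE-COARSE-LIT"] for r in rows])
metonymic = iob_spans(tokens, [r["NE-COARSE-METO"] for r in rows])
```

Reading the header row instead of hard-coding column positions keeps the sketch usable if the files contain more columns than the three assumed here.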

HIPE-2022 v1.0 release notes
