gold-standard-toxicity

Gold Standard for Toxicity and Incivility Project
Annotated Data in Spanish for Toxicity and Insults in Digital Social Networks

Overview

This repository contains data sets and materials for a gold standard elaboration on toxicity and incivility in the digital sphere based on human coding to benchmark algorithmic classification tasks with transformers and LLMs. The labelling progress is indicated in the coverage badge above.

We are labelling two samples of novel datasets of political digital interactions on Twitter (rebranded as X). The first set comprises almost 5 million data points from three Latin American protest events: (a) protests against the coronavirus and judicial reform measures in Argentina during August 2020; (b) protests against education budget cuts in Brazil in May 2019; and (c) the social outburst in Chile stemming from protests against the underground fare hike in October 2019. We are focusing on interactions in Spanish to elaborate a gold standard for digital interactions in this language, therefore, we prioritise Argentinian and Chilean data. The second set contains more than 31 million messages and more than 9 million interactions between 2010 and 2022, covering the election of members of the first Constitutional Convention in Chile, the drafting process and the referendum in which the proposal was rejected.

This project is generously funded by the OpenAI Academic Programme, 2024 FAE-UDP Research Grant, and partially by the St Hilda’s College Muriel Wise Fund at the University of Oxford. The Training Data Lab research group also logistically supports this project.

Coding Process

Annotation Task

Our samples are balanced considering a probability score from one of our main classification tasks using BERT family models already applied to the full corpora. This baseline classification implied more than 5,400 hours of computing.

The protest data was completely classified locally on a Raspberry Pi 5, while 41% of the batches of Convention data were classified there, and the rest were run on the cloud. This allowed us to reduce the carbon footprint of our computational tasks by around 39%.

Once we finish the labelling process, we will benchmark human coding with this classification and additional transformers and LLMs in the framework of different ongoing works.

Protest Events Sample

Case	Toxicity	Batch	Sample	Oversample	Oversampled	Labelbox	Priority
Argentinan protests	0.00 to 0.20	ARG_Q1	100	20	0	✅	2
Argentinan protests	0.21 to 0.40	ARG_Q2	100	20	1	✅	3
Argentinan protests	0.41 to 0.60	ARG_Q3	100	20	3	✅	4
Argentinan protests	0.61 to 0.80	ARG_Q4	100	20	0	✅	3
Argentinan protests	0.81 to 1.00	ARG_Q5	100	20	3	✅	2
Chilean protests	0.00 to 0.20	CHL_Q1	100	20	1	✅	1
Chilean protests	0.21 to 0.40	CHL_Q2	100	20	0	✅	2
Chilean protests	0.41 to 0.60	CHL_Q3	100	20	2	✅	3
Chilean protests	0.61 to 0.80	CHL_Q4	100	20	1	✅	2
Chilean protests	0.81 to 1.00	CHL_Q5	100	20	3	✅	1
		Total	1,000	200	14

Chilean Constitutional Convention Sample

Case	Toxicity	Batch	Sample	Oversample	Oversampled	Labelbox	Priority
Constitutional Convention	0.00 to 0.20	CON_Q1	200	40	1	✅	3
Constitutional Convention	0.21 to 0.40	CON_Q2	200	40	0	✅	4
Constitutional Convention	0.41 to 0.60	CON_Q3	200	40	0	✅	5
Constitutional Convention	0.61 to 0.80	CON_Q4	200	40	0	✅	4
Constitutional Convention	0.81 to 1.00	CON_Q5	200	40	0	✅	3
		Total	1,000	200	1

Annotation Performance

Coder	Labels	Avg Time per Label	Avg Agreement	Approval
i3i43	375	7s	0.804	0.950
wg9ps	632	18s	0.571	0.868
rj8e0	2,401	6s	0.694	0.839
h1arz	717	8s	0.683	0.870
v1z6y	838	9s	0.644	0.822

Preservation

These data sets are stored with version control on this GitHub repository. Furthermore, a Digital Object Identifier is provided by Zenodo.

Storage and Backup

This GitHub repository has controlled access with Two-Factor Authentication 2FA with physical USB security devices and/or mobile applications. Both issue one-time passwords to generate a cryptographic authentication FIDO2 and U2F.

Moreover, the repository is backed up locally on a Directed Attached Storage HDD NVMe enclosure that supports RAID 1 for two ATA-150 disks of 8TB mirroring.

License

The content of this project itself is licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0), and the underlying code used to format and display that content is licensed under a GNU GPLv3 license.

The above implies that the data sets may be shared, reused, and adapted as long as appropriate acknowledgement is given. In addition, the code may be shared, reused, and adapted as long as the source is disclosed, the changes are stated, and the same GNU GPLv3 license is used.

Contribute

Contributions are entirely welcome. You just need to open an issue with your comment or idea.

For more substantial contributions, please fork this repository and make changes. Pull requests are also welcome.

Please read our code of conduct first. Minor contributions will be acknowledged, and significant ones will be considered in our contributor roles taxonomy.

Citation

González-Bustamante, B., & Rivera, S. (2024). Annotated Data in Spanish for Toxicity and Insults in Digital Social Networks. Dataset, pre-release version 0.3.3 -- Purple Butterfly. Leiden University, Universidad Diego Portales, University of California Irvine and Training Data Lab. https://zenodo.org/doi/10.5281/zenodo.12574288.

Authors

Bastián González-Bustamante
b.a.gonzalez.bustamante@fgga.leidenuniv.nl
ORCID iD 0000-0003-1510-6820
https://bgonzalezbustamante.com

Sebastián Rivera
sebastian.rivera@uci.edu
ORCID iD 0000-0003-2642-2546
https://sebastianrivera.cl

CRediT - Contributor Roles Taxonomy

Bastián González-Bustamante (ORCID iD 0000-0003-1510-6820): Conceptualisation, data curation, formal analysis, funding acquisition, investigation, methodology, project administration, resources, software, supervision, and validation.

Sebastián Rivera (ORCID iD 0000-0003-2642-2546): Conceptualisation, data curation, resources, supervision, and validation.

Carla Cisternas (ORCID iD 0000-0001-7948-6194): Investigation, resources and supervision.

Jaquelin Morillo (ORCID iD 0000-0002-2870-2691): Investigation and resources.

Diego Aguilar (ORCID iD 0000-0003-4531-5922): Investigation and resources.

Sofía Carrerá-Martínez (ORCID iD 0009-0001-1596-115X): Investigation and resources.

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
badges		badges
code		code
data/raw		data/raw
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE-CC.md		LICENSE-CC.md
LICENSE-GPL.md		LICENSE-GPL.md
README.md		README.md
STATUS.md		STATUS.md
gold-standard-toxicity.Rproj		gold-standard-toxicity.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Repository files navigation

gold-standard-toxicity

Overview

Coding Process

Annotation Task

Protest Events Sample

Chilean Constitutional Convention Sample

Annotation Performance

Preservation

Storage and Backup

License

Contribute

Citation

Authors

CRediT - Contributor Roles Taxonomy

About

Licenses found

Releases

Contributors 6

Languages

License

Licenses found

training-datalab/gold-standard-toxicity

Folders and files

Latest commit

History

Repository files navigation

gold-standard-toxicity

Overview

Coding Process

Annotation Task

Protest Events Sample

Chilean Constitutional Convention Sample

Annotation Performance

Preservation

Storage and Backup

License

Contribute

Citation

Authors

CRediT - Contributor Roles Taxonomy

About

Topics

Resources

License

Licenses found

Code of conduct

Stars

Watchers

Forks

Releases

Contributors 6

Languages