Surrey Institute for People-centred AI

S3D: A Weakly Supervised Sarcasm Dataset

License: CC BY-SA 4.0

This repository accompanies our paper 'Utilizing Weak Supervision to Create S3D: A Sarcasm Annotated Dataset', submitted to the EMNLP NLP+CSS 2022 workshop. It includes our SAD dataset along with versions 1 and 2 of our S3D dataset. Both of these Twitter datasets can be used to train sarcasm detection models.

Datasets

SAD - We provide the Tweet IDs and sarcasm labels of 2340 manually annotated tweets, which were collected by following the #sarcasm hashtag. Available on HuggingFace

S3D-v1 - We provide the Tweet IDs of 100,000 tweets along with their labels, which were predicted by a BERTweet model fine-tuned on our 'Combined dataset', a corpus of over a million tweets and Reddit comments labelled for sarcasm in previous works. Available on HuggingFace

S3D-v2 - We provide the Tweet IDs of 100,000 tweets along with their labels, which were predicted by an ensemble of our three 'best' fine-tuned sarcasm detection models. Available on HuggingFace
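This README does not say how the S3D-v2 ensemble combines the predictions of its three models. As an illustration only, the sketch below assumes a simple majority vote over binary labels (1 = sarcastic, 0 = not sarcastic, which is also an assumption). The function name `majority_vote` is hypothetical and is not taken from our notebooks.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model labels into one label per tweet.

    predictions[m][i] is the label model m assigned to tweet i.
    Assumption: labels are binary and the ensemble uses a plain majority vote.
    """
    if not predictions:
        return []
    n_tweets = len(predictions[0])
    if any(len(p) != n_tweets for p in predictions):
        raise ValueError("every model must label the same number of tweets")
    # zip(*predictions) groups the votes for each tweet across models.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy predictions from three models over four tweets (made-up values).
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
print(majority_vote([model_a, model_b, model_c]))  # [1, 1, 1, 0]
```

With three models and two classes a tie cannot occur, so no tie-breaking rule is needed in this sketch.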

Experiments

We provide a notebook to show the labelling process of our datasets. You can reproduce the experiments to create S3D-v1 and S3D-v2 via our Python notebooks, which use HuggingFace to load the relevant models and label the dataset.
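The labelling step can be sketched roughly as follows, assuming the `transformers` text-classification pipeline. This is a minimal illustration rather than the notebook code: the Hub namespace in `MODEL_ID` is a placeholder, and `load_classifier` and `label_tweets` are hypothetical helper names.

```python
from typing import Callable, Dict, List

# Placeholder Hub ID: the model name comes from the table in this README,
# but the "your-namespace" prefix is an assumption and must be replaced.
MODEL_ID = "your-namespace/BERTweet-base-finetuned-SARC-combined-DS"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so label_tweets can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def label_tweets(texts: List[str], classifier: Callable) -> List[Dict]:
    """Run the classifier over tweet texts and pair each text with its top label."""
    results = classifier(texts)  # one {"label": ..., "score": ...} dict per text
    return [
        {"text": text, "label": result["label"], "score": result["score"]}
        for text, result in zip(texts, results)
    ]
```

A typical call would be `label_tweets(["Oh great, another Monday."], load_classifier())`. The label strings returned depend on the configuration of the model that is loaded.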

Models

| Models | Fine-tuned Models | Description |
| --- | --- | --- |
| BERTweet | BERTweet-base-finetuned-SARC-combined-DS | BERTweet model fine-tuned on our combined dataset |
| BERTweet | BERTweet-base-finetuned-SARC-DS | BERTweet model fine-tuned on the SARC dataset |
| RoBERTa-large | roberta-large-finetuned-SARC-combined-DS | RoBERTa-large model fine-tuned on our combined dataset |

Maintainer(s)

Jordan Painter
Diptesh Kanojia