CAPP Dataset: Context-Aware Polite Paraphrase Dataset

ACL 2024 Presentation

Description

This repository describes the Context-Aware Polite Paraphrase (CAPP) dataset, a dialogue-style corpus of rude utterances paired with polite paraphrases. Some samples also include prior turns from the dialogue as additional context. The accompanying paper, "Demonstrations Are All You Need: Advancing Offensive Content Paraphrasing using In-Context Learning", was accepted to ACL 2024 Findings. We also provide the paraphrases generated by the different methods described in the paper.

Dataset

The train split has 7939 samples and the test split has 1120 samples. Every sample has a rude utterance and a polite paraphrase. About 55% of the train samples and 53% of the test samples also include prior dialogue turns as additional context.
The following notebook loads and describes the training and test data splits: notebook
For samples with multiple dialogue turns, the turns are separated by the [SEP] token. For example:
```
How long have we been here? [SEP] It's been 2 days and 7 hours. [SEP] What will happen to us?
```
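A minimal sketch of recovering the individual turns from such a string is shown below. It only assumes the `[SEP]` convention described above. The helper name `split_turns` is hypothetical, and because the field names and file layout of the dataset are not described here, the function works on a plain string.

```python
SEP = "[SEP]"  # separator token described in this README


def split_turns(text: str) -> list[str]:
    """Split a [SEP]-joined dialogue string into a list of turns.

    Hypothetical helper: surrounding whitespace is stripped from each turn
    and empty pieces are dropped.
    """
    return [turn.strip() for turn in text.split(SEP) if turn.strip()]


example = (
    "How long have we been here? [SEP] "
    "It's been 2 days and 7 hours. [SEP] "
    "What will happen to us?"
)
turns = split_turns(example)
for index, turn in enumerate(turns, start=1):
    print(index, turn)
```

A string without any separator comes back as a single-element list, so the same helper can be applied to samples with and without prior context.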

Generated Paraphrases

We provide the inoffensive paraphrases generated by the models listed below for each dataset explored in the paper.

text-davinci-003
gpt-3.5-turbo-instruct
gpt-3.5-turbo-0613
gpt-3.5-turbo-1106
Vicuna-13b

Citation

@article{som2023demonstrations,
  title={Demonstrations Are All You Need: Advancing Offensive Content Paraphrasing using In-Context Learning},
  author={Som, Anirudh and Sikka, Karan and Gent, Helen and Divakaran, Ajay and Kathol, Andreas and Vergyri, Dimitra},
  journal={arXiv preprint arXiv:2310.10707},
  year={2023}
}

Contact

Please report any issues via GitHub Issues.

For any questions, please contact: Anirudh Som (anirudh.som@sri.com)

Acknowledgement

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001122C0032. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views or policies of DARPA, the Department of Defense or the U.S. Government.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
