This repository contains the replication code for Divide and Conquer: From Complexity to Simplicity for Lay Summarization. The work describes our approach to LaySumm (the 1st Computational Linguistics Lay Summary Challenge Shared Task). The task is to automatically generate non-technical summaries of scholarly text for a lay audience.
1 Data - includes all data for the model
1.1 Input-Data - includes original full-text & abstract files for all documents
1.2 Sections-DataFrame - includes a CSV file containing the text of all documents, section-wise
1.3 Input-wMVC - includes input documents for the wMVC model
1.4 Input-BART - includes input documents for the BART model
1.5 Section-wise-summaries - includes summaries for all sections (output of the BART model)
1.6 Merged-final - includes the final merged summaries
2 Utilities - includes utility Python scripts
2.1 prepare_data.py - prepares the section-wise preprocessed folders (in Input-wMVC)
                      to be used as input data for the wMVC model
2.2 preprocess_data.py - preprocesses the input document text
2.3 merge_summaries.py - merges the section-wise summaries (reading input from the
                         Section-wise-summaries folder and saving the final
                         summaries in the Merged-final folder)
3 BART - includes code for generating abstractive summaries using the off-the-shelf BART model, along with the code to fine-tune BART for the task
4 wMVC - includes code for generating extractive summaries using wMVC model
5 evaluation - includes the evaluation script used in the competition
6 requirements.txt
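For orientation, here is a toy sketch of what the section-wise merge step performed by merge_summaries.py could look like. The file-naming convention (<doc_id>_<section>.txt) and the alphabetical section ordering are assumptions for illustration only, not the repository's actual behavior:

```python
from pathlib import Path

def merge_summaries(section_dir, out_dir):
    """Merge per-section summary files into one summary file per document.

    Sketch only: assumes (hypothetically) files named <doc_id>_<section>.txt;
    the real merge_summaries.py may name, order, and join sections differently.
    """
    section_dir, out_dir = Path(section_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Group section files by document id (the part before the first "_").
    docs = {}
    for f in sorted(section_dir.glob("*.txt")):  # alphabetical = assumed order
        docs.setdefault(f.stem.split("_")[0], []).append(f)
    # Concatenate each document's section summaries into a single file.
    for doc_id, files in docs.items():
        merged = "\n".join(f.read_text().strip() for f in files)
        (out_dir / f"{doc_id}.txt").write_text(merged)
```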
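wMVC refers to the weighted minimum vertex cover formulation the paper uses for extractive summarization. As a rough illustration only (not the repository's implementation), a greedy heuristic for weighted vertex cover can be sketched as below; treating sentences as nodes and similar sentence pairs as edges is an assumed graph construction:

```python
def greedy_wmvc(weights, edges):
    """Greedy weighted vertex cover heuristic (illustrative sketch only).

    weights: positive weight per node (e.g. one per sentence).
    edges:   iterable of (u, v) node pairs (e.g. similar sentence pairs).
    Repeatedly picks the node covering the most uncovered edges per unit
    weight until every edge is covered; returns the chosen node indices.
    """
    uncovered = set(edges)
    cover = []
    while uncovered:
        def score(v):
            degree = sum(1 for e in uncovered if v in e)
            return degree / weights[v] if degree else 0.0
        best = max(range(len(weights)), key=score)
        cover.append(best)
        uncovered = {e for e in uncovered if best not in e}
    return sorted(cover)
```

For example, with node weights [1.0, 10.0, 1.0] and edges [(0, 1), (1, 2)], the heuristic prefers the two cheap endpoints over the expensive middle node. See the paper for the actual formulation and sentence-weighting scheme.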
- Clone the repository and move into it:
git clone https://github.com/anuragjoshi3519/laysumm20
cd laysumm20
- Create a virtual environment and install the dependencies:
pip3 install virtualenv
virtualenv -p /usr/bin/python3 env
source env/bin/activate
pip3 install -r requirements.txt
python3 -c "from nltk import download; download(['punkt', 'stopwords'])"
- Add test documents (full texts & abstracts for every document) in Data/Input-Data (first remove the default sample_ABSTRACT.txt and sample_FULLTEXT.txt files)
- Generate summaries for the test documents:
python3 generateLaysumm.py
The generated summaries can be found in the Data/Merged-final folder.
If you find this work useful, please cite it as:
@inproceedings{chaturvedi-etal-2020-divide,
title = "Divide and Conquer: From Complexity to Simplicity for Lay Summarization",
author = "Chaturvedi, Rochana and
Saachi and
Dhani, Jaspreet Singh and
Joshi, Anurag and
Khanna, Ankush and
Tomar, Neha and
Duari, Swagata and
Khurana, Alka and
Bhatnagar, Vasudha",
booktitle = "Proceedings of the First Workshop on Scholarly Document Processing",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.sdp-1.40/",
doi = "10.18653/v1/2020.sdp-1.40",
pages = "344--355"
}