
Extractive Text Summarization using BERT

Table of Contents
  1. About The Project
  2. Getting Started
  3. Running the tests
  4. Contributors
  5. Reference

About The Project

Extractive summarization is the process of picking sentences directly from the source text to form the summary, aided by scoring functions and clustering algorithms that help choose the most suitable sentences. We use BERT (Bidirectional Encoder Representations from Transformers) to produce sentence embeddings, cluster those embeddings with K-means, and introduce a dynamic method to decide how many sentences to pick from the clusters. On top of that, the study aims to produce higher-quality summaries by incorporating reference resolution and by dynamically sizing each summary to suit its input text. The goal is to provide students with a summarization service that helps them understand the content of long lecture videos, which is vital during revision.

This repository contains an extractive summarization tool that uses BERT to produce sentence embeddings, which are clustered with a K-Means model to build the summary. It is evaluated on the CNN/DailyMail dataset.
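
As a rough illustration of this pipeline, the sketch below embeds each sentence with bert-base-uncased (mean-pooling the token vectors) and picks the sentence closest to each K-Means centroid. The model name, pooling strategy, and fixed cluster count are assumptions for illustration only; the repository's own code, including the dynamic sentence-count selection, may differ.

# Minimal sketch of the embed-and-cluster pipeline (assumes bert-base-uncased
# and mean-pooled token embeddings; the repository's own code may differ).
import numpy as np
import torch
from transformers import BertModel, BertTokenizer
from sklearn.cluster import KMeans

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentences):
    """Mean-pool the last hidden state of each sentence."""
    vectors = []
    with torch.no_grad():
        for sent in sentences:
            inputs = tokenizer(sent, return_tensors="pt", truncation=True, max_length=128)
            hidden = model(**inputs).last_hidden_state  # shape (1, tokens, 768)
            vectors.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(vectors)

def summarize(sentences, n_clusters=3):
    """Pick the sentence closest to each cluster centroid, in document order."""
    embeddings = embed(sentences)
    km = KMeans(n_clusters=n_clusters, random_state=0).fit(embeddings)
    chosen = set()
    for centroid in km.cluster_centers_:
        chosen.add(int(np.argmin(np.linalg.norm(embeddings - centroid, axis=1))))
    return " ".join(sentences[i] for i in sorted(chosen))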

Publication

(back to top)

Built With

(back to top)

Getting Started

To set up the project, first download the CNN/DailyMail dataset.

Prerequisites

Install the necessary packages and the spaCy English model

pip install spacy
pip install transformers
pip install neuralcoref

python -m spacy download en_core_web_md
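
For context, the sketch below shows how reference resolution can be applied with neuralcoref before summarization, which is why it appears in the prerequisites. Note that neuralcoref is generally only compatible with spaCy 2.x; the example sentence is hypothetical, and the repository's own preprocessing may apply the step differently.

# Illustration of coreference resolution with neuralcoref (requires spaCy 2.x);
# the repository may apply this step differently before clustering.
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_md")
neuralcoref.add_to_pipe(nlp)

doc = nlp("The professor explained BERT. She said it encodes sentences bidirectionally.")
print(doc._.coref_resolved)  # text with pronouns replaced by their antecedents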

(back to top)

How to use

Pre-process the dataset

python3 preprocess.py

This will produce cnn_dataset.pkl.
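
To verify the output, cnn_dataset.pkl can be loaded back with pickle, as sketched below. The exact structure of the file is defined by preprocess.py; a list-like container of preprocessed documents is assumed here for illustration.

# Quick sanity check of the preprocessed output; the exact structure of
# cnn_dataset.pkl is defined by preprocess.py (a list of records is assumed).
import pickle

with open("cnn_dataset.pkl", "rb") as f:
    dataset = pickle.load(f)

print(type(dataset), len(dataset))  # inspect what preprocess.py produced
print(dataset[0])                   # first preprocessed document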

To produce the summary of any document, set the document in summarize.py

python3 summarize.py create_summary

To produce the summary of a specific CNN document, specify the document number (0-90000) in summarize.py

python3 summarize.py create_summary_cnn_single

To collect within-cluster sum of squares (WCSS), between-cluster sum of squares (BCSS), and summary length data

python3 summarize.py collect_data
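
For reference, both statistics can be derived from a fitted K-Means model as sketched below: scikit-learn's inertia_ gives the WCSS, and the total sum of squares around the global mean splits into WCSS plus BCSS. collect_data gathers these values through the repository's own code, which may differ.

# How WCSS and BCSS relate for a fitted K-Means model (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def cluster_statistics(embeddings, n_clusters):
    km = KMeans(n_clusters=n_clusters, random_state=0).fit(embeddings)
    wcss = km.inertia_  # within-cluster sum of squares
    # Total sum of squares around the global mean decomposes into WCSS + BCSS.
    tss = float(((embeddings - embeddings.mean(axis=0)) ** 2).sum())
    bcss = tss - wcss
    return wcss, bcss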

To train a linear regression model relating within-cluster sum of squares to summary length

python3 summarize.py train_model
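
As an illustration of this step, the sketch below fits an ordinary least-squares model mapping WCSS to summary length. The sample values are hypothetical, and train_model may use a different formulation or features.

# Sketch of fitting summary length from WCSS with ordinary least squares
# (train_model may differ; the values below are hypothetical).
import numpy as np
from sklearn.linear_model import LinearRegression

wcss = np.array([[12.4], [30.1], [55.7], [80.2]])  # hypothetical collected WCSS values
summary_lengths = np.array([3, 4, 6, 8])           # hypothetical summary sentence counts

model = LinearRegression().fit(wcss, summary_lengths)
print(model.predict([[45.0]]))  # estimated summary length for a new document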

To view the histogram of the within-cluster sum of squares values

python3 histogram_wcss
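
A histogram of the collected WCSS values can also be plotted directly with matplotlib, as sketched below with hypothetical values; histogram_wcss produces the actual plot used in the project.

# Sketch of plotting the WCSS distribution with matplotlib
# (the values below are hypothetical placeholders).
import matplotlib.pyplot as plt

wcss_values = [12.4, 30.1, 55.7, 80.2]  # hypothetical collected WCSS values
plt.hist(wcss_values, bins=20)
plt.xlabel("Within-cluster sum of squares")
plt.ylabel("Number of documents")
plt.show()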

(back to top)

Running the tests

Set the lower and upper limits of the CNN dataset range in summarize.py to run the test

python3 summarize.py test_cnn

The result is the average ROUGE-1, ROUGE-2, and ROUGE-L scores across the specified range of documents.
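
For reference, per-document ROUGE-1/2/L scores can be computed as sketched below using the rouge-score package; this package choice and the example strings are assumptions, and test_cnn may rely on a different ROUGE implementation.

# Illustration of computing ROUGE-1/2/L for one document with rouge-score
# (test_cnn may use a different ROUGE implementation).
from rouge_score import rouge_scorer

reference_summary = "The professor summarized the lecture."           # gold highlights (placeholder)
generated_summary = "The lecture was summarized by the professor."    # model output (placeholder)

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure, scores["rougeL"].fmeasure)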

(back to top)

Contributors

(back to top)

Reference

Paper: https://arxiv.org/abs/1906.04165

(back to top)
