
BERT for long sentence classification

BERT cannot process tokenized text sequences longer than 512 word pieces; longer sequences have to be truncated.

For a corpus like 20 Newsgroups this is a problem, because it contains many long examples.
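For reference, this is how the truncation looks with the Hugging Face transformers tokenizer (one of the dependencies below); a minimal sketch, where the bert-base-uncased checkpoint is an assumption, not necessarily what the project scripts use:

from transformers import BertTokenizer

# Load a standard BERT tokenizer (bert-base-uncased is assumed here).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A stand-in for a long 20 Newsgroups post that exceeds 512 word pieces.
long_text = " ".join(["newsgroup post text"] * 400)

# Without truncation the encoded sequence exceeds BERT's 512-token limit.
print(len(tokenizer.encode(long_text, add_special_tokens=True)))

# The usual workaround: truncate to 512 word pieces, dropping everything after them.
encoded = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
print(encoded["input_ids"].shape)  # (1, 512)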

Recurrence over BERT (RoBERT)

In this project, we implemented the approach proposed in the article Hierarchical Transformers for Long Document Classification.

RoBERT can process tokenized text sequences of any length (see the sketch after this list):

  1. Splits the text sequence into segments of N tokens.
  2. Tokenizes all the segments.
  3. Processes all the segments with BERT.
  4. The representation obtained with BERT for each segment is placed sequentially in a tensor.
  5. This tensor is processed by an LSTM.
  6. The representation obtained at the last time step of the LSTM is used for classification.
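A minimal PyTorch sketch of those six steps follows; the class name, segment length, and LSTM size are illustrative assumptions, not the project's actual code:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class RoBERT(nn.Module):
    # Recurrence over BERT: encode fixed-size segments with BERT, then run
    # an LSTM over the sequence of segment representations.
    def __init__(self, num_classes, segment_len=200, lstm_hidden=128):
        super().__init__()
        self.segment_len = segment_len
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden, batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, num_classes)

    def forward(self, text):
        # Steps 1-2: tokenize and split into segments of at most segment_len word pieces.
        ids = self.tokenizer.encode(text, add_special_tokens=False)
        segments = [ids[i:i + self.segment_len] for i in range(0, len(ids), self.segment_len)]

        # Steps 3-4: encode every segment with BERT and place the segment
        # representations sequentially in one tensor of shape (1, num_segments, hidden).
        reps = []
        for seg in segments:
            seg_ids = torch.tensor([self.tokenizer.build_inputs_with_special_tokens(seg)])
            out = self.bert(input_ids=seg_ids)
            reps.append(out.last_hidden_state[:, 0, :])  # [CLS] position as the segment vector
        seg_tensor = torch.stack(reps, dim=1)

        # Steps 5-6: run the LSTM over the segment sequence and classify from
        # the representation of its last time step.
        lstm_out, _ = self.lstm(seg_tensor)
        return self.classifier(lstm_out[:, -1, :])

# Illustrative usage: 20 classes for 20 Newsgroups.
model = RoBERT(num_classes=20)
logits = model("a very long newsgroup post ...")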

Dependencies

  • Python 3.7
  • We need the following packages (using pip):
pip install pandas
pip install cleantext
pip install scikit-learn
pip install torch
pip install transformers
pip install matplotlib
pip install mlxtend
pip install seaborn
pip install Unidecode
pip install nltk

Usage

The two commands below use the argument True to download the 20 Newsgroups corpus (this is only necessary for the first execution of each script).

The first script uses BERT for sequence classification (BERTSC), and therefore truncates the sentences.

./launch-experiments-20newsgroups.sh True

The second script uses RoBERT.

./launch-hierarchical-experiments-20newsgroups.sh True
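Since scikit-learn is among the dependencies, the corpus download presumably relies on something like sklearn.datasets.fetch_20newsgroups; a minimal sketch of that step (the actual scripts may load the data differently):

from sklearn.datasets import fetch_20newsgroups

# Downloads the corpus on the first call and caches it locally.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
print(len(train.data), "training documents,", len(train.target_names), "classes")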

Results

The accuracy reported in the reference paper on the 20 Newsgroups corpus using RoBERT on the full dataset is 84 %. In our case, due to hardware limitations, we could not fit the whole segmented corpus on the GPU, so we experimented with a reduced version where the maximum number of tokens allowed per example was 512 or 1024.

The table below shows the results obtained. For the same maximum length, BERTSC performs better than RoBERT. Note, however, that we wrote our own implementation of RoBERT, which does not follow the same optimization approach as the paper; in any case, the LSTM appears to degrade performance slightly.

MAX. LENGTH   MODEL    ACCURACY
1024          RoBERT   77 %
512           BERTSC   79 %
512           RoBERT   75 %
