In this project, we train and compare several Transformer-based models on the task of sentiment analysis. In a nutshell, the models learn to classify free-text reviews as positive or negative.
We use Yelp reviews for this task. The original Yelp Open Dataset is available here.
To reduce training time and avoid relying on expensive, often unavailable hardware, we extract 25,000 records per star rating (125,000 reviews in total) and split them into training, validation and test sets for development and final model evaluation. This is still too much for ordinary CPUs to process in a reasonable time. Fortunately, Google Colab offers performant GPUs (up to 16 GB of GPU RAM), which is sufficient for training the models in an acceptable timeframe.
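The subsampling and splitting strategy described above can be sketched in plain Python as follows. This is illustrative only: the function and parameter names below are hypothetical, and the actual logic lives in `utils.dataset_utils.py`.

```python
import random

# Illustrative sketch of per-star subsampling followed by a train/val/test split.
# Names and split fractions are hypothetical, not the project's actual values.
def subsample_and_split(reviews, per_star=25_000, val_frac=0.1, test_frac=0.1, seed=42):
    """reviews: list of dicts with at least 'stars' and 'text' keys."""
    rng = random.Random(seed)
    sample = []
    for stars in (1, 2, 3, 4, 5):
        matching = [r for r in reviews if r["stars"] == stars]
        # Take at most `per_star` records for this star rating.
        sample.extend(rng.sample(matching, min(per_star, len(matching))))
    rng.shuffle(sample)
    n_val = int(len(sample) * val_frac)
    n_test = int(len(sample) * test_frac)
    return sample[n_val + n_test:], sample[:n_val], sample[n_val:n_val + n_test]

# Toy demonstration with 10 fake reviews per star rating:
fake = [{"stars": s, "text": f"review {i}"} for s in range(1, 6) for i in range(10)]
train, val, test = subsample_and_split(fake, per_star=10, val_frac=0.2, test_frac=0.2)
print(len(train), len(val), len(test))  # 30 10 10
```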
Basic exploratory data analysis of the subsample is available in `notebooks/yelp_eda.ipynb`.
- `data` - contains the data we're using for development and testing purposes. Since both the initial dataset and the subsample are quite large, this folder will need to be created locally. We only need `yelp_academic_dataset_review.json.zip` from the original dataset to be stored in `data`. A subsample of the dataset can be generated by running `utils.dataset_utils.py`.
- `models` - model, training, evaluation and prediction definitions.
- `notebooks` - auxiliary notebooks, such as data exploration and an example of the model usage.
- `utils` - helpers and supplementary methods, such as subsampling the original dataset and preprocessing text for subsequent transformer model training.
- `.gitignore` - lists files and folders ignored by git.
- `main.py` - default root of the project, not used at the moment.
- `README.md` - the doc you're reading :)
- `requirements.txt` - project dependencies. Execute `pip install -r requirements.txt` in a console to install the additional packages needed to run the project.
Transformer-based models expect fixed-length sequences of token IDs as inputs. Thus, the first step is to transform the textual data into sequences of token IDs.
`TextPreprocessor` specifies a text preprocessing and transformation routine as a single class. When instantiating the class, one should consider which transformer model will be used, because two principal parameters (`tokenizer` and `vocab_file`) of the `TextPreprocessor` class have to be consistent with the chosen model.
- `tokenizer` - must be an instance of `PreTrainedTokenizer`. If no value is provided, `BertTokenizer` is used by default.
- `vocab_file` - the vocabulary used by the tokenizer; must be a string. Refer to HuggingFace's documentation for a full list of vocabularies (the "Shortcut name" column) and associated model architectures. If none is provided, `'bert-base-cased'` is used by default.
The main working part of the `TextPreprocessor` class is the `preprocess` method. It can take the following two parameters:

- `texts` - a list of strings to be processed and transformed.
- `fit` - a boolean telling whether to find the max length of the sequences for padding/truncation. It should normally be `True` only when feeding a training corpus.
The method sequentially performs the following preprocessing steps:

- Tokenization using the `PreTrainedTokenizer` instance provided. The tokenizer internally performs four actions:
  - Tokenizes the input strings.
  - Prepends the `[CLS]` token.
  - Appends the `[SEP]` token.
  - Maps tokens to their IDs.
- Padding or truncating the sequences of token IDs to the same length.
- Attention mask generation: `1` denotes tokens extracted from the text and `0` specifies padding tokens.
- Conversion of the input matrices of token IDs and attention masks to PyTorch tensors.
Usage example:

```python
from utils.text_preprocessing import TextPreprocessor

train_texts = ['One morning, when Gregor Samsa woke from troubled dreams, he',
               'found himself transformed in his bed into a horrible vermin. He',
               'lay on his armour-like back, and if he lifted his head a little',
               'he could see his brown belly, slightly domed and divided by',
               'arches into stiff sections.']
test_texts = ['The bedding was hardly able to cover',
              'it and seemed ready to slide off any moment. His many legs,',
              'pitifully thin compared with the size of the rest of him, waved',
              'about helplessly as he looked.']

prep = TextPreprocessor()
train_seqs, train_masks = prep.preprocess(train_texts, fit=True)
test_seqs, test_masks = prep.preprocess(test_texts)
```
We make use of HuggingFace's `transformers` library (PyTorch implementation), which provides general-purpose architectures for NLP tasks with a constantly growing number of pre-trained models.
The central point of the project is the `TransformersGeneric` class. It serves as an abstraction that encapsulates model definition, training and evaluation. Its main purpose is to hide the complexity associated with the training process and provide only a high-level API for classification.
The constructor of the class can take a few parameters; the first three are the crucial ones:

- `num_classes` - the number of classes, 2 in this example.
- `transformers_model` - an instance of the `PreTrainedModel` class. `BertForSequenceClassification` is used if none is provided.
- `model_name` - the name of the model whose pre-trained parameters will be used. The full list can be found here. By default, `'bert-base-cased'` is used if the parameter is not indicated.
For an example of its usage, refer to `notebooks/yelp_reviews_sentiment_analysis.ipynb`.
We have experimented with ALBERT Base, DistilRoBERTa Base, RoBERTa Base, BERT Base Cased and DistilBERT Base Cased models for comparison. The following image shows training and validation losses, as well as accuracy and F1 score measured on the validation set.
Interestingly, judging by the training and validation losses, the distilled RoBERTa and BERT models fit the data more closely than their non-distilled counterparts. However, this does not translate into better performance on the validation set. On the other hand, both RoBERTa-based models outperform BERT and ALBERT. Thus, even though the RoBERTa pretraining approach differs only seemingly marginally from BERT's, it still yields a more robust, better-performing model in terms of both validation accuracy and F1 score in this particular setting.
To check how well the models generalize to the unseen data, we evaluate them on a hold-out test set. The following image presents the final accuracy (X-axis) and F1 score (Y-axis).
The leaders do not change here: the RoBERTa-based models score better than BERT and ALBERT on both metrics. Compared to the validation accuracy, the test accuracy stays roughly the same, slightly above 0.92. At the same time, the F1 score drops by approximately 0.03 (RoBERTa) and 0.02 (DistilRoBERTa), resulting in 0.896 and 0.899 respectively. A similar decrease in F1 is observed for the remaining three models.
An interactive dashboard with these results and run details is available on the Weights & Biases project page.