CPU: Intel 8-Core i7-11800
GPU: RTX 3060
cudnn-11.2-windows-x64-v8.1.0.77
cuda_11.2.1_461.09_win10
tensorflow-gpu==2.5.0
keras
transformers
matplotlib
pandas
scikit-learn
sentencepiece
Sentiment levels of movie comments can be scored by sentences or phrases. The purpose of this project is to explore the inference power of four transformer-based models which have been pre-trained on self-supervised learning tasks.
Specifically, the split phrases are classified into five sentiment levels by transfer learning with BERT, RoBERTa, DistilBERT, and XLNet.
See the Kaggle competition page for more details.
File "./Datasets/train.csv":
Although the competition preserves the original train/test split for benchmarking, the labels of the test set are not available. The original training dataset of 156,060 phrases was therefore divided into new training, validation, and testing sets in the proportion 8:1:1. The restructured training set, which still includes the validation portion at this stage, has 140,454 phrases (90% of the original).
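An 8:1:1 split of this kind can be produced with two calls to scikit-learn's `train_test_split`. The sketch below is only an illustration on a small synthetic frame. The column names `Phrase` and `Sentiment`, the random seed, and the use of stratification are assumptions and are not taken from the project code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ./Datasets/train.csv (column names are assumed).
df = pd.DataFrame({
    "Phrase": [f"phrase {i}" for i in range(1000)],
    "Sentiment": [i % 5 for i in range(1000)],
})

# One tenth of the data, as an absolute row count, for each held-out set.
n_holdout = len(df) // 10

# Step 1: hold out the test set; the remainder is train + validation.
train_val, test = train_test_split(
    df, test_size=n_holdout, stratify=df["Sentiment"], random_state=42
)

# Step 2: carve the validation set out of the remainder, giving 8:1:1 overall.
train, val = train_test_split(
    train_val, test_size=n_holdout, stratify=train_val["Sentiment"], random_state=42
)

print(len(train), len(val), len(test))
```

Passing the hold-out size as an integer row count avoids rounding surprises that a fractional `test_size` such as `1 / 9` could introduce.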
Sentiment labels and their phrase counts in the restructured training set:
0 - negative: 6,371
1 - somewhat negative: 24,545
2 - neutral: 71,624
3 - somewhat positive: 29,626
4 - positive: 8,288
This folder divides the project into data preprocessing, model, train, and test procedures, which are called from "main.py".
All functions in this folder have detailed comments.
- Split the data into train (with validation inside it) and test sets
- Extract useful features and labels
- Convert the model output labels to categorical form and save them in a new label column
- Split the training set into training and validation datasets
- Tokenizer function (in "train.py")
Return the training, validation and testing datasets.
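The categorical-label step can be pictured as one-hot encoding the five sentiment levels. This is a minimal sketch using NumPy and pandas. The helper name `add_categorical_label` and the column names `Sentiment` and `label` are hypothetical and may differ from the project's actual functions.

```python
import numpy as np
import pandas as pd

NUM_CLASSES = 5  # sentiment levels 0..4


def add_categorical_label(df, label_col="Sentiment", new_col="label"):
    """Return a copy of df with a one-hot vector stored in `new_col`.

    Hypothetical helper; column names are assumptions.
    """
    out = df.copy()
    # Row k of the identity matrix is the one-hot vector for class k.
    one_hot = np.eye(NUM_CLASSES, dtype="float32")[out[label_col].to_numpy()]
    out[new_col] = one_hot.tolist()
    return out


sample = pd.DataFrame(
    {"Phrase": ["a dull film", "a fine film"], "Sentiment": [1, 3]}
)
encoded = add_categorical_label(sample)
print(encoded["label"].tolist())
```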
Define model structures of BERT, RoBERTa, DistilBERT and XLNet.
Define the training pipelines with an adjustable optimizer, loss, and other hyperparameters.
Return the trained model and the training log history, which is used to analyze training convergence.
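As a rough sketch of how the model and training steps might fit together with the Hugging Face `transformers` TensorFlow classes: the checkpoint names, the learning rate, the epoch count, and the function names `build_classifier` and `train` below are all assumptions for illustration, not the project's confirmed settings.

```python
# Hugging Face checkpoint names: assumed choices, not confirmed by the project.
CHECKPOINTS = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "DistilBERT": "distilbert-base-uncased",
    "XLNet": "xlnet-base-cased",
}


def build_classifier(name, num_labels=5, learning_rate=2e-5):
    """Load a pre-trained encoder with a new classification head and compile it."""
    # Imported lazily so the mapping above can be inspected without TensorFlow.
    import tensorflow as tf
    from transformers import TFAutoModelForSequenceClassification

    model = TFAutoModelForSequenceClassification.from_pretrained(
        CHECKPOINTS[name], num_labels=num_labels
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        # The head returns raw logits; labels are assumed to be one-hot vectors.
        loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model


def train(model, train_ds, val_ds, epochs=3):
    """Fit on tf.data datasets; return the model and the Keras history dict."""
    history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)
    return model, history.history
```

The returned history dictionary holds the per-epoch loss and metric values, which is the kind of log that can be plotted to check convergence.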
This part evaluates model performance on the testing dataset and produces:
- Classification reports.
- Confusion matrix figures.
- Multi-class ROC figures.
- Global micro-average ROC figures.
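The evaluation outputs listed above can be computed with scikit-learn along the following lines. This sketch uses random synthetic labels and probabilities in place of real model predictions, so the numbers it prints are meaningless. The variable names are illustrative only, and plotting is left out.

```python
import numpy as np
from sklearn.metrics import auc, classification_report, confusion_matrix, roc_curve
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
classes = [0, 1, 2, 3, 4]

# Synthetic stand-ins for test labels and softmax outputs of a trained model.
y_true = rng.integers(0, 5, size=200)
y_prob = rng.dirichlet(np.ones(5), size=200)
y_pred = y_prob.argmax(axis=1)

print(classification_report(y_true, y_pred, labels=classes, zero_division=0))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=classes)

# Multi-class ROC: one one-vs-rest curve per class on binarized labels.
y_bin = label_binarize(y_true, classes=classes)
per_class_auc = {}
for k in classes:
    fpr, tpr, _ = roc_curve(y_bin[:, k], y_prob[:, k])
    per_class_auc[k] = auc(fpr, tpr)

# Micro-average ROC: pool every (sample, class) decision into one binary problem.
fpr_micro, tpr_micro, _ = roc_curve(y_bin.ravel(), y_prob.ravel())
micro_auc = auc(fpr_micro, tpr_micro)
```

The micro-average weights every individual decision equally, so on an imbalanced dataset like this one it is dominated by the frequent classes; the per-class curves show how the rarer classes behave.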
Folder "images" saves the training and testing figures, as well as the model structures.
Folder "model_logs" saves the training logs history.
Folder "model_trained" saves the model after training.
Folder "submission" saves the .csv file using the same format as the competition.
Note: This script runs all steps of the project by calling the functions in the "function" folder.
Note: This notebook contains more detailed work than "main.py", including the EDA process at the beginning. Open it in Jupyter to run it step by step.