bert-russian-sentiment-emotion

Introduction

The aim of this project is to fine-tune state-of-the-art transformer models to classify the emotions and sentiment of short input sentences in Russian. Since there were multiple BERT models and datasets, I used Hydra, a library for managing run configurations. WandB was used for experiment tracking.

Metrics and fine-tuned models

The fine-tuned models for each model-dataset combination, together with their metrics, are available on my Hugging Face profile.

Example usage:

from transformers import pipeline
model = pipeline(model="seara/rubert-tiny2-ru-go-emotions")
model("Привет, ты мне нравишься!")
# [{'label': 'love', 'score': 0.5955629944801331}]

List of all trained models:

Models and datasets description

For Russian, I found two models: the heavy and slow ruBERT, and the light and fast ruBERT-tiny2. The datasets I used for sentiment analysis (multi-class classification) were taken from Smetanin's review article. I chose the best-looking ones (those with at least 3 classes) and merged them into one russian-sentiment dataset. For emotion classification (multi-label classification), I used the CEDR dataset, the only one I found for Russian. Because it was the only option, I also translated the English GoEmotions dataset using the deep-translator Python library with the Google Translate engine. The translated dataset is called RuGoEmotions and is available on Hugging Face and GitHub.
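The difference between the two task types shows up in how model outputs are decoded. The following is a minimal sketch, assuming raw logits from a classification head: multi-class decoding picks the single most probable class via softmax, while multi-label decoding applies a sigmoid to each class independently. The function names, label names, and the 0.5 threshold are illustrative assumptions and are not taken from this project's code.

```python
import math


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Map a single logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multiclass(logits, labels):
    """Multi-class (sentiment): exactly one label, the most probable one."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]


def decode_multilabel(logits, labels, threshold=0.5):
    """Multi-label (emotions): every label whose probability passes the threshold."""
    return [label for label, x in zip(labels, logits) if sigmoid(x) >= threshold]


# Hypothetical logits and label names, for illustration only.
sentiment = decode_multiclass([0.1, 2.0, -1.0], ["neutral", "positive", "negative"])
emotions = decode_multilabel([1.2, -0.3, 0.8], ["joy", "sadness", "surprise"])
```

With multi-label decoding, a sentence can receive several emotions at once, or none at all if no score passes the threshold.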

Download links for all Russian sentiment datasets collected by Smetanin can be found in this repository.
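The GoEmotions translation step described above could look roughly like the sketch below. It is not the script used for RuGoEmotions: the helper names translate_texts and make_google_translate_fn are hypothetical, and the cache for repeated sentences is an assumption added for illustration. The only library call used is GoogleTranslator(source=..., target=...).translate from deep-translator, which needs network access.

```python
from typing import Callable, Dict, Iterable, List


def translate_texts(texts: Iterable[str], translate_fn: Callable[[str], str]) -> List[str]:
    """Translate texts one by one, reusing results for repeated inputs."""
    cache: Dict[str, str] = {}
    translated = []
    for text in texts:
        if text not in cache:
            cache[text] = translate_fn(text)
        translated.append(cache[text])
    return translated


def make_google_translate_fn() -> Callable[[str], str]:
    """Build an English-to-Russian translate function (requires network access)."""
    # Imported lazily so translate_texts works without deep-translator installed.
    from deep_translator import GoogleTranslator

    return GoogleTranslator(source="en", target="ru").translate
```

A call such as translate_texts(["I love this!"], make_google_translate_fn()) would return a list with one Russian sentence. Passing the translate function as an argument keeps the loop easy to test with a stand-in function.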

Run configuration

To start, run:

python main.py <command>=<arg>

If no options are provided, the default configuration located at conf/config.yaml will be used.

Commands and args

General:
- task
  - "train" to train model
  - "eval" to evaluate trained model

- log_wandb
  - "True" to enable WandB
  - "False" to disable WandB

- model
  - "rubert-base-cased" for ruBERT
  - "rubert-tiny2" for ruBERT-tiny2

- dataset
  - "cedr"
  - "ru-go-emotions"
  - "russian-sentiment"

Tokenizer:
- trainer.max_length - maximum length for tokenizer truncation

Dataloader:
- trainer.batch_size
- trainer.shuffle
- trainer.num_workers
- trainer.pin_memory
- trainer.drop_last

Optimizer:
- trainer.lr
- trainer.weight_decay
- trainer.num_epochs

python main.py --help might also be useful.

You do not need to provide every parameter; any that are missing are filled in from the defaults. Default parameters for each model-dataset combination are located in the conf folder.
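To illustrate how command-line overrides combine with defaults, here is a simplified sketch in plain Python. It is not Hydra's implementation: apply_overrides and parse_value are hypothetical helpers, and every default value shown is a placeholder except the default dataset (CEDR). Hydra likely also treats model and dataset as config groups (separate YAML files under conf), which this sketch does not model.

```python
import copy


def parse_value(raw):
    """Very simplified value parsing: booleans, ints, floats, else string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(defaults, overrides):
    """Return a copy of defaults with key=value overrides applied on top."""
    config = copy.deepcopy(defaults)
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")  # "trainer.lr" -> ["trainer"], "lr"
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


# Placeholder defaults, for illustration only.
defaults = {
    "task": "train",
    "model": "rubert-tiny2",
    "dataset": "cedr",
    "log_wandb": False,
    "trainer": {"lr": 1e-5, "batch_size": 64},
}

# Equivalent in spirit to: python main.py task=eval trainer.lr=3e-5
config = apply_overrides(defaults, ["task=eval", "trainer.lr=3e-5"])
```

Only task and trainer.lr change here; every other key keeps its default value.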

Examples

This will download the fine-tuned rubert-tiny2 model for the default dataset (CEDR) and display common metrics.

python main.py task="eval" model="rubert-tiny2"

You can explicitly specify the dataset:

python main.py task="eval" model="rubert-tiny2" dataset="ru-go-emotions"

Evaluation is performed on the test set, which was not used during training. The train/val/test split is 80%/10%/10%.
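An 80%/10%/10% split can be produced along these lines. This is a minimal sketch with assumptions: the function name split_dataset, the fixed seed, and the shuffle-then-slice approach are illustrative, and the project may split its data differently (for example with stratification).

```python
import random


def split_dataset(examples, seed=42, train_frac=0.8, val_frac=0.1):
    """Shuffle examples and slice them into train, validation and test parts."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder becomes the test set
    return train, val, test


train, val, test = split_dataset(range(1000))
```

Because the three slices never overlap, no test example can leak into training.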

Project structure

bert-russian-sentiment-emotion
├── conf                    - Hydra config files folder
│   ├── dataset             - dataset configs
│   ├── loss                - loss function configs
│   ├── model               - model configs
│   ├── optimizer           - optimizer configs
│   └── trainer             - trainer configs for each model-dataset combination
├── data                    - raw data folder
├── main.py                 - main execution file
├── models                  - Location of fine-tuned models
├── notebooks               - Jupyter Notebooks folder
│   ├── datasets            - analysis and visualization
│   └── error-analysis      - model errors analysis
├── requirements.txt        - Python requirements
├── src                     - source code
│   ├── data                - data download and preprocess functions
│   ├── model               - model creation functions
│   ├── trainer             - training, metrics and validation functions
│   └── utils               - some extra functions
└── strings                 - YAML strings for translating class names into Russian

Here is the list of the templates that inspired me:
