Pipeline to train a German language model and sentiment classifier

These are early commits based on Poleval2018, so the pipeline won't work well for the time being.

Our solution extends the work done by the FastAI team to train language models for English. We added Google's SentencePiece to tokenize German words.
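To illustrate why subword pieces help with German compounds, here is a toy sketch. It is not SentencePiece's algorithm (SentencePiece learns its vocabulary with a unigram language model or BPE and segments text with that model); this version simply applies greedy longest-match against a hand-written vocabulary. The function `segment` and the vocabulary `toy_vocab` are hypothetical and exist only for this example.

```python
def segment(word, vocab, boundary="\u2581"):
    """Greedy longest-match split of one word into pieces from vocab."""
    text = boundary + word  # SentencePiece marks word starts with U+2581
    pieces, start = [], 0
    while start < len(text):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                pieces.append(text[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            pieces.append(text[start])
            start += 1
    return pieces


# Hypothetical vocabulary, invented for this example.
toy_vocab = {"\u2581Sprach", "modell", "\u2581Stimmung", "s", "analyse"}
print(segment("Sprachmodell", toy_vocab))      # ['▁Sprach', 'modell']
print(segment("Stimmungsanalyse", toy_vocab))  # ['▁Stimmung', 's', 'analyse']
```

A compound that never appeared as a whole word can still be represented by pieces that did, which keeps the vocabulary small.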

Installation

The source code still needs cleaning up to minimise the amount of work needed to run it.

For now, the rough manual steps are:

Install fastai from our fork (python PATH)
Install sentencepiece from source code (PATH and python PATH)

Requirements

jq - apt install jq
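A quick way to confirm the manual steps worked is to check that the Python modules can be imported and that `jq` is on the PATH. The sketch below assumes the packages are importable as `fastai` and `sentencepiece`; the helper name `check_prerequisites` is made up for this example, and other command-line tools can be added to `tools` if the scripts turn out to need them.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("fastai", "sentencepiece"), tools=("jq",)):
    """Return a dict mapping each prerequisite name to True if it was found."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        status[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```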

Training

You should have the following structure:

.
├── data
│   ├── germeval2017
│   │   ├── dev_v1.4.tsv
│   │   ....
│   │   ├── train_v1.4.tsv
│   │   └── train_v1.4.xml
│   ├── recorded-tweets.zip
│   └── btw17
│       ├── ...
├── make_dataset
├── README.md
└── work  # this will be created by scripts
    ├── nouniq
    │   ├── models
    │   └── tmp
    └── up_low50k
        ├── models
        └── tmp

Workflow

To create the dataset:

cd make_dataset
WORK_DIR="../work"
CACHE_DIR="${WORK_DIR}/shared"
DICT_SIZE=30
./prepare-data.sh --work-dir "${WORK_DIR}/btw-nouniq${DICT_SIZE}k" --cache-dir "${CACHE_DIR}" --vocab-size "${DICT_SIZE}000" --model-name "sp" --most-low "False" --lower-case "False" --uniq "False"
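The shell variables above expand into the actual argument values: `DICT_SIZE=30` becomes a vocabulary size of 30000 and a work directory ending in `btw-nouniq30k`, which is presumably the same directory the training step refers to as `work/btw-nouniq30k` from the repository root. The sketch below mirrors those expansions; the function `dataset_settings` is hypothetical and not part of the repository.

```python
def dataset_settings(dict_size_k, work_root="../work", prefix="btw-nouniq"):
    """Mirror the shell expansions used in the prepare-data.sh call above."""
    return {
        # "${WORK_DIR}/btw-nouniq${DICT_SIZE}k"
        "work_dir": f"{work_root}/{prefix}{dict_size_k}k",
        # "${WORK_DIR}/shared"
        "cache_dir": f"{work_root}/shared",
        # "${DICT_SIZE}000" appends three zeros, i.e. multiplies by 1000
        "vocab_size": int(f"{dict_size_k}000"),
    }


print(dataset_settings(30))
# {'work_dir': '../work/btw-nouniq30k', 'cache_dir': '../work/shared', 'vocab_size': 30000}
```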

To start training the language model:

dir=work/btw-nouniq30k
BS=192
nl=4
cuda=0
python fastai_scripts/pretrain_lm.py --dir-path "${dir}" --cuda-id $cuda --cl 12 --bs "${BS}" --lr 0.01 --pretrain-id "nl-${nl}-small-minilr" --sentence-piece-model sp.model --nl "${nl}"

To see the perplexity of the model on a test set:

python fastai_scripts/infer.py --dir-path "${dir}" --cuda-id $cuda --bs 22 --pretrain-id "nl-${nl}-small-minilr" --sentence-piece-model sp.model --test_set tmp/val_ids.npy --correct_for_up=False --nl  "${nl}"
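For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows that standard definition; it assumes `infer.py` reports the same quantity and does not model what `--correct_for_up` changes. The function name `perplexity` is chosen for this example.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform = [-math.log(0.25)] * 10
print(round(perplexity(uniform), 6))  # 4.0
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.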

To fine-tune the language model:

BS=128
python ./fastai_scripts/finetune_lm.py --dir-path "${dir}" --pretrain-path "${dir}" --cuda-id $cuda \
    --cl 6 --pretrain-id "nl-${nl}-small-minilr" --lm-id "nl-${nl}-finetune" --bs $BS --lr 0.001 \
    --use_discriminative False --dropmult 0.5 --sentence-piece-model sp.model --sampled True --nl "${nl}"

BS=192
nl=4
cuda=0
python ./fastai_scripts/finetune_lm.py --dir-path "work/ge2017" --pretrain-path "work/btw-nouniq30k" --cuda-id $cuda \
    --cl 6 --pretrain-id "nl-${nl}-small-minilr" --lm-id "nl-${nl}-ge2017" --bs $BS --lr 0.001 \
    --use_discriminative False --dropmult 0.5 --sentence-piece-model sp.model --sampled True --nl "${nl}"

# Classifier training; the runs below differ in --use_discriminative, batch size and learning rate
python ./fastai_scripts/train_clas.py --dir-path="work/ge2017" --cuda-id=$cuda \
    --lm-id="nl-${nl}-ge2017-all" --clas-id="class-nl-${nl}-ge2017" \
    --bs=$BS --cl=5 --lr=0.01 --dropmult 0.5 --sentence-piece-model='sp.model' --nl 4 --use_discriminative False

BS=128
nl=4
cuda=1
python ./fastai_scripts/train_clas.py --dir-path="work/ge2017" --cuda-id=$cuda \
    --lm-id="nl-${nl}-ge2017-all" --clas-id="class-nl-${nl}-ge2017" \
    --bs=$BS --cl=5 --lr=0.01 --dropmult 0.5 --sentence-piece-model='sp.model' --nl 4 --use_discriminative True

python ./fastai_scripts/train_clas.py --dir-path="work/ge2017" --cuda-id=2 \
    --lm-id="nl-4-ge2017-all" --clas-id="class2-nl-4-ge2017" \
    --bs=40 --cl=5 --lr=0.001 --dropmult 0.5 --sentence-piece-model='sp.model' \
    --nl 4 --use_discriminative True

destdir=work/ge2017
BS=120
cuda=0
nl=4
python ./ulmfit/evaluate.py --dir-path="$destdir" --cuda-id=$cuda \
    --clas-id="class2-nl-${nl}-ge2017" --bs=$BS --nl "${nl}"
