elmo-japanese

TensorFlow implementation of the bidirectional language models (biLMs) used to compute ELMo representations, as described in "Deep contextualized word representations".

This codebase is based on bilm-tf and adapted to Japanese text.

This repository supports both training biLMs and using pre-trained models for prediction.

Installation

CPU

conda create -n elmo-jp python=3.6 anaconda
source activate elmo-jp
pip install tensorflow==1.10 h5py
git clone https://github.com/cl-tohoku/elmo-japanese.git

GPU

conda create -n elmo-jp python=3.6 anaconda
source activate elmo-jp
pip install tensorflow-gpu==1.10 h5py
git clone https://github.com/cl-tohoku/elmo-japanese.git

Getting started

Training ELMo

python src/run_train.py \
    --option_file data/config.json \
    --save_dir checkpoint \
    --word_file data/vocab.sample.jp.wakati.txt \
    --char_file data/vocab.sample.jp.space.txt \
    --train_prefix data/sample.jp.wakati.txt

Computing representations from the trained biLM

The following command computes ELMo representations for the text (sample.jp.wakati.txt) with the biLM saved in the checkpoint directory (save_dir) and writes them to elmo.hdf5.

python src/run_elmo.py \
    --option_file checkpoint/options.json \
    --weight_file checkpoint/weight.hdf5 \
    --word_file data/vocab.sample.jp.wakati.txt \
    --char_file data/vocab.sample.jp.space.txt \
    --data_file data/sample.jp.wakati.txt \
    --output_file elmo.hdf5

The following command prints information about elmo.hdf5, such as the number of sentences, words and dimensions.

python scripts/view_hdf5.py elmo.hdf5
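
To inspect the output from Python instead of through the script, a minimal sketch along the following lines may help. It assumes only that elmo.hdf5 is a regular HDF5 file readable with h5py (installed above); the internal layout of the file is not assumed, and the helper names `describe` and `list_datasets` are hypothetical.

```python
def describe(entries):
    """Format (name, shape) pairs as one summary line per dataset."""
    lines = []
    for name, shape in entries:
        dims = " x ".join(str(d) for d in shape) or "scalar"
        lines.append("{}: {}".format(name, dims))
    return lines


def list_datasets(path):
    """Collect (name, shape) for every dataset in an HDF5 file."""
    import h5py  # imported lazily so describe() works without h5py

    entries = []

    def visit(name, obj):
        # visititems walks groups and datasets; keep datasets only
        if isinstance(obj, h5py.Dataset):
            entries.append((name, tuple(obj.shape)))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return entries
```

Calling `describe(list_datasets("elmo.hdf5"))` would then give one line per stored dataset with its dimensions.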

Computing sentence representations

Save sentence-level ELMo representations

python src/run_elmo.py \
    --option_file checkpoint/options.json \
    --weight_file checkpoint/weight.hdf5 \
    --word_file data/vocab.sample.jp.wakati.txt \
    --char_file data/vocab.sample.jp.space.txt \
    --data_file data/sample.jp.wakati.txt \
    --output_file elmo.hdf5 \
    --sent_vec
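
A sentence-level vector is commonly obtained by pooling the token-level vectors of a sentence. The sketch below shows mean pooling in plain Python. Whether `--sent_vec` uses exactly this pooling is an assumption, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("cannot pool an empty sentence")
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vec in token_vectors:
        if len(vec) != dim:
            raise ValueError("token vectors must share one dimension")
        for i, value in enumerate(vec):
            totals[i] += value
    return [t / len(token_vectors) for t in totals]
```

Mean pooling yields a vector of the same dimension regardless of sentence length, which is what makes sentences directly comparable.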

View sentence similarities

python scripts/view_sent_sim.py \
    --data data/sample.jp.wakati.txt \
    --elmo elmo.hdf5
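
Sentence similarity over such vectors is often measured with cosine similarity. The following is a minimal sketch under that assumption; the metric actually used by `view_sent_sim.py` is not stated here, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query_index, vectors):
    """Rank the other sentences by similarity to vectors[query_index]."""
    query = vectors[query_index]
    scored = [
        (cosine_similarity(query, vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scored, reverse=True)
```

`most_similar` returns (score, index) pairs with the closest sentence first, so the index can be used to look up the matching line in the text file.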

Training ELMo on a new corpus

Making a token vocab file

python scripts/make_vocab_file.py \
    --input_fn data/sample.jp.wakati.txt \
    --output_fn data/vocab.sample.jp.wakati.txt
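
As an illustration of what a token vocabulary contains, the sketch below counts whitespace-separated tokens and lists them by descending frequency. Placing the special tokens `<S>`, `</S>` and `<UNK>` first follows the bilm-tf vocabulary convention. That `make_vocab_file.py` works this way is an assumption, and `build_vocab` and its `min_count` parameter are hypothetical.

```python
from collections import Counter

SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]


def build_vocab(lines, min_count=1):
    """Return special tokens followed by tokens sorted by descending count."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Sort by count (high to low), then by token for a stable order
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    tokens = [tok for tok, n in ranked if n >= min_count]
    return SPECIAL_TOKENS + tokens
```

Writing the returned list one token per line would give a file in the same general shape as the vocab files used above.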

Making a character vocab file

python scripts/space_split.py \
    --input_fn data/sample.jp.wakati.txt \
    --output_fn data/sample.jp.space.txt

python scripts/make_vocab.py \
    --input_fn data/sample.jp.space.txt \
    --output_fn data/vocab.sample.jp.space.txt
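
The character vocabulary is built from a copy of the corpus in which every character is its own space-separated token. A minimal sketch of that splitting step follows, assuming `space_split.py` drops the original word boundaries and puts a space between characters; the actual script may treat boundaries differently, and `split_into_characters` is a hypothetical name.

```python
def split_into_characters(line):
    """Turn a whitespace-tokenized line into space-separated characters."""
    chars = [ch for ch in line if not ch.isspace()]
    return " ".join(chars)
```

Under this assumption, a line such as `猫 が 好き` becomes `猫 が 好 き`.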

Training ELMo

python src/run_train.py \
    --train_prefix data/sample.jp.wakati.txt \
    --word_file data/vocab.sample.jp.wakati.txt \
    --char_file data/vocab.sample.jp.space.txt \
    --config_file data/config.json \
    --save_dir checkpoint

Retraining the trained ELMo

python src/run_train.py \
    --train_prefix data/sample.jp.wakati.txt \
    --word_file data/vocab.sample.jp.wakati.txt \
    --char_file data/vocab.sample.jp.space.txt \
    --save_dir checkpoint \
    --restart

Computing token representations from the ELMo

python src/run_elmo.py \
    --test_prefix data/sample.jp.wakati.txt \
    --word_file data/vocab.sample.jp.wakati.txt \
    --char_file data/vocab.sample.jp.space.txt \
    --save_dir checkpoint

Using the ELMo trained on Wikipedia

Download: checkpoint, vocab tokens, vocab characters

Computing sentence representations

python src/run_elmo.py \
    --option_file data/checkpoint_wiki-wakati-cleaned_token-10_epoch-10/options.json \
    --weight_file data/checkpoint_wiki-wakati-cleaned_token-10_epoch-10/weight.hdf5 \
    --word_file data/vocab.token.wiki_wakati.cleaned.min-10.txt \
    --char_file data/vocab.char.wiki_wakati.cleaned.min-0.txt \
    --data_file data/sample.jp.wakati.txt \
    --output_file elmo.hdf5 \
    --sent_vec

Retraining the pre-trained ELMo on your corpus

python src/run_train.py \
    --train_prefix PATH_TO_YOUR_CORPUS \
    --word_file data/vocab.token.wiki_wakati.cleaned.min-10.txt \
    --char_file data/vocab.char.wiki_wakati.cleaned.min-0.txt \
    --save_dir checkpoint_wiki-wakati-cleaned_token-10_epoch-10 \
    --restart

Checking performance in text classification

Making a dataset for text classification

cd data
./make_data.sh
cd ..

Computing sentence representations

python src/run_elmo.py \
    --option_file data/checkpoint_wiki-wakati-cleaned_token-10_epoch-10/options.json \
    --weight_file data/checkpoint_wiki-wakati-cleaned_token-10_epoch-10/weight.hdf5 \
    --word_file data/vocab.token.wiki_wakati.cleaned.min-10.txt \
    --char_file data/vocab.char.wiki_wakati.cleaned.min-0.txt \
    --data_file data/dataset.wakati.txt \
    --output_file elmo.hdf5 \
    --sent_vec

Predicting nearest neighbors

python src/knn.py \
    --data data/dataset.wakati-label.txt \
    --elmo elmo.hdf5
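
Nearest-neighbor evaluation can be sketched as leave-one-out 1-NN classification: each sentence is assigned the label of its most similar other sentence, and accuracy is the fraction of correct assignments. Cosine similarity, k = 1 and the function names are assumptions here; `src/knn.py` may differ.

```python
import math


def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leave_one_out_accuracy(vectors, labels):
    """1-NN accuracy where each item is classified by its nearest other item."""
    if len(vectors) != len(labels) or len(vectors) < 2:
        raise ValueError("need at least two labelled vectors")
    correct = 0
    for i, query in enumerate(vectors):
        best_score, best_label = -2.0, None  # cosine is always >= -1
        for j, other in enumerate(vectors):
            if i == j:
                continue  # never match an item with itself
            score = _cosine(query, other)
            if score > best_score:
                best_score, best_label = score, labels[j]
        correct += best_label == labels[i]
    return correct / len(vectors)
```

A high accuracy suggests that sentences with the same label lie close together in the representation space.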

LICENCE

MIT Licence
