What is it

A version of the Juman++ morphological analyzer that uses the Unidic dictionary.

How to use it

There are no pre-trained models yet, so you need to train your own.

Prerequisites

For running:

Unix-like environment
C++14-compatible compiler
CMake 3.1 or later

For training additionally:

Python 3 or later
MeCab
Offline version of BCCWJ
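
A quick way to see which command-line prerequisites are already installed is sketched below. The helper name `check_tools` is hypothetical (it is not part of this repository), and the sketch only checks that commands are on the PATH; it does not verify versions, the C++ compiler, or the BCCWJ data.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tools used for building and for training, per the lists above.
check_tools cmake make python3 mecab || echo "install the missing tools first" >&2
```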

Compiling Programs

git clone git@github.com:eiennohito/jumanpp-unidic.git --recursive
cd jumanpp-unidic
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j

You will need the jumanpp/src/core/tool/jumanpp_tool binary to build a model and the src/jumanpp-unidic-simple binary to run the analysis.

Training a Model

We will update the BCCWJ tags to the modern Unidic, create training data for Juman++, and train a model.

You will need about 25 GB of RAM to train a model using BCCWJ.

All binary paths below are relative to the build directory.

First, let's prepare the training corpus data.

Download Unidic 2.3.0 from the official page.
Add the following lines to the Unidic dicrc:

node-format-unidic22q = "%M","%f[0]","%f[1]","%f[2]","%f[3]","%f[4]","%f[5]","%f[6]","%f[7]","%f[8]","%f[9]","%f[10]","%f[11]","%f[12]","%f[13]","%f[14]","%f[15]","%f[16]","%f[17]","%f[18]","%f[19]","%f[20]","%f[21]","%f[22]","%f[23]","%f[24]","%f[25]","%f[26]","%f[27]","%f[28]"\n
unk-format-unidic22q = "%M","%f[0]","%f[1]","%f[2]","%f[3]","%f[4]","%f[5]"\n
bos-format-unidic22q =
eos-format-unidic22q = EOS\n
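
Because the node format line is long and easy to mistype, it can also be generated. The following is a minimal sketch, assuming a POSIX shell; the function name `print_unidic22q_formats` and the `UNIDIC_DIR` variable are hypothetical. It prints the same four lines as above, with feature fields `%f[0]` through `%f[28]`.

```shell
# Print the four unidic22q format lines for dicrc.
# The node format is "%M" followed by feature fields %f[0] through %f[28].
print_unidic22q_formats() {
    node='"%M"'
    i=0
    while [ "$i" -le 28 ]; do
        node="$node,\"%f[$i]\""
        i=$((i + 1))
    done
    # The format '%s\\n\n' emits a literal backslash-n (which MeCab expects)
    # followed by a real newline.
    printf '%s\\n\n' "node-format-unidic22q = $node"
    # With '%s\n', the single-quoted arguments are printed verbatim.
    printf '%s\n' 'unk-format-unidic22q = "%M","%f[0]","%f[1]","%f[2]","%f[3]","%f[4]","%f[5]"\n'
    printf '%s\n' 'bos-format-unidic22q ='
    printf '%s\n' 'eos-format-unidic22q = EOS\n'
}

# Possible use (UNIDIC_DIR is a placeholder for the unpacked Unidic directory):
# grep -q 'node-format-unidic22q' "$UNIDIC_DIR/dicrc" || print_unidic22q_formats >> "$UNIDIC_DIR/dicrc"
```

The guarded append in the final comment avoids adding the lines twice if the command is re-run.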

Convert BCCWJ to the input format of MeCab's constrained analysis mode:

python3 ../scripts/bccwj2mecab_input.py <path to BCCWJ>/CORE/SUW/core_SUW.txt > core.mecab.in

"Modernize" BCCWJ with newer Unidic using MeCab constrained analysis mode:

mecab -d <unidic-2.3.0> -Ounidic22q -p core.mecab.in > core.mecab.out

Convert modernized BCCWJ to the Juman++ training input:

python3 ../scripts/fixup_mecab.py core.mecab.out

After these steps, the files core.mecab.out.full-tdata and core.mecab.out.part-tdata should appear next to core.mecab.out. Now let's create a seed model (without parameters).
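
Before building the seed model, it may be worth confirming that both training files were actually produced. A minimal sketch follows; the helper name `check_tdata` is hypothetical, and it only tests that the files exist, not that their contents are valid.

```shell
# Succeed only if both training files derived from the given MeCab output exist.
check_tdata() {
    base=$1
    for suffix in full-tdata part-tdata; do
        if [ ! -f "$base.$suffix" ]; then
            echo "missing: $base.$suffix" >&2
            return 1
        fi
    done
    return 0
}

# Possible use from the build directory:
# check_tdata core.mecab.out && echo "training data is in place"
```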

Create a raw analysis dictionary for Juman++ by concatenating scripts/header.csv and lex.csv from Unidic.
Compile the analysis dictionary:

jumanpp/src/core/tool/jumanpp_tool index \
    --spec ../src/unidic-2.3.0-simple.spec \
    --dict-file <concatenated dictionary> \
    --output unidic.seed
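
The file passed to --dict-file above is the result of the concatenation step. That step could look like the sketch below, where the function name `make_raw_dict` and the output name `unidic.raw.csv` are hypothetical; the header must come first so that it precedes the lexicon entries.

```shell
# Concatenate the header and the Unidic lexicon into one raw dictionary file.
make_raw_dict() {
    header=$1   # e.g. ../scripts/header.csv
    lexicon=$2  # e.g. lex.csv from the Unidic directory
    output=$3   # e.g. unidic.raw.csv (hypothetical name)
    cat "$header" "$lexicon" > "$output"
}

# Possible use from the build directory:
# make_raw_dict ../scripts/header.csv <unidic-2.3.0>/lex.csv unidic.raw.csv
```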

And finally, let's train a model:

../scripts/train.sh jumanpp/src/core/tool/jumanpp_tool unidic.seed \
    core.mecab.out.full-tdata core.mecab.out.part-tdata \
    unidic.model

You can now run analysis with Juman++:

src/jumanpp-unidic-simple unidic.model

Embedding an RNN

Coming soon...
