
Japanese Grammatical Error Correction - JGEC

JGEC follows the approach described in the paper GECToR – Grammatical Error Correction: Tag, Not Rewrite, adapted for Japanese. This project's code is based on the official implementation, gector.

Model Architecture

The model consists of a bert-base-japanese encoder and two linear classification heads: one for edit labels and one for error detection.

The labels head predicts a specific edit transformation ($KEEP, $DELETE, $APPEND_x, etc.), while the detect head predicts whether each token is CORRECT or INCORRECT. The outputs of the two heads are combined to make a prediction, and the predicted transformations are then applied to the errorful input sentence to obtain a corrected sentence.
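To make the tag-then-edit step concrete, here is a minimal sketch of applying GECToR-style edit tags to a token sequence. The helper below is illustrative only and is not the repository's actual decoding code; the tag names follow the GECToR paper.

```python
def apply_edits(tokens, tags):
    """Apply one pass of per-token edit tags to produce a corrected token list."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(token)
        elif tag == "$DELETE":
            continue  # drop the token entirely
        elif tag.startswith("$APPEND_"):
            # Keep the token, then append the word encoded in the tag suffix.
            out.append(token)
            out.append(tag[len("$APPEND_"):])
        elif tag.startswith("$REPLACE_"):
            # Substitute the token with the word encoded in the tag suffix.
            out.append(tag[len("$REPLACE_"):])
    return out

print(apply_edits(["私", "は", "は", "学生"],
                  ["$KEEP", "$KEEP", "$DELETE", "$APPEND_です"]))
# ['私', 'は', '学生', 'です']
```

The duplicated particle は is deleted and です is appended, turning the errorful tokens into a grammatical sentence in a single pass.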

Furthermore, in some cases one pass of predicted transformations is not sufficient to transform the errorful sentence into the target sentence. Therefore, the process is repeated on the result of the previous pass until the model predicts that the sentence no longer contains incorrect tokens.
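The iterative refinement loop can be sketched as follows. This is a simplified illustration, not the repository's inference code: `predict_and_edit` stands in for one full tag-and-apply pass of the real model, and the toy editor below merely fixes one duplicated particle per pass to show why multiple passes can be needed.

```python
def correct_iteratively(sentence, predict_and_edit, max_passes=5):
    """Repeat tagging + editing until the output stops changing or a pass limit is hit."""
    for _ in range(max_passes):
        corrected = predict_and_edit(sentence)
        if corrected == sentence:  # model proposes no further edits
            return corrected
        sentence = corrected
    return sentence

# Toy stand-in for the model: removes one duplicated は particle per pass.
def toy_editor(s):
    return s.replace("はは", "は", 1)

print(correct_iteratively("私ははは学生です", toy_editor))
# 私は学生です  (two passes needed, then a no-change pass terminates the loop)
```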

Inference using iterative sequence-tagging (https://www.grammarly.com/blog/engineering/gec-tag-not-rewrite/)

Datasets

Synthetically Generated Error Corpus

The JaWiki, Lang8, BSD, PheMT, jpn-eng, and jp_address corpora are used to synthetically generate errorful sentences, with a method similar to Awasthi et al. 2019 but adjusted for Japanese. The details of the implementation can be found in the preprocessing code in this repository.
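As a rough illustration of this style of synthetic error generation (this is not the repository's preprocessing code, just a sketch in the spirit of Awasthi et al. 2019), clean sentences can be corrupted by randomly deleting, duplicating, or swapping characters:

```python
import random

def corrupt(sentence, p=0.1, seed=None):
    """Inject character-level noise: deletion, duplication, or adjacent swap."""
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        r = rng.random()
        if r < p:                              # delete this character
            i += 1
        elif r < 2 * p:                        # duplicate this character
            out += [chars[i], chars[i]]
            i += 1
        elif r < 3 * p and i + 1 < len(chars):  # swap with the next character
            out += [chars[i + 1], chars[i]]
            i += 2
        else:                                  # keep unchanged
            out.append(chars[i])
            i += 1
    return "".join(out)

print(corrupt("一緒にコーヒーを飲みながら、話しました。", seed=0))
```

The corrupted output becomes the model's input, and the original clean sentence serves as the target, yielding unlimited parallel training pairs.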

Training

Install the requirements:

pip install -r requirements.txt

The model was trained in Colab with GPUs on each corpus. Combine and preprocess the data, then run the training script (default hyperparameters are used if unspecified):

python ./utils/combine.py
python ./utils/preprocess.py
bash train.sh

Demo

from module import JGEC

obj = JGEC()
source_sents = ["そして10時くらいに、喫茶店でレーシャルとジョノサンとベルに会いました",
                "一緒にコーヒーを飲みながら、話しました。"]

res = obj(source_sents)

print("Results:", res)
# Results: ['そして10時くらいに、喫茶店でレーシャルとジョノサンとベルに会いました', 
#         '一緒にコーヒーを飲みながら、話しました。']

Inference

Trained weights can be downloaded here; they were trained on all of the datasets mentioned above.

Extract model.zip to the ./utils/data/model directory. You should have the following folder structure:

JGEC/
  utils/
    data/
      model/
        checkpoint
        model_checkpoint.data-00000-of-00001
        model_checkpoint.index

After downloading and extracting the weights, the demo app can be run with the command

python main.py

You may need to run pip install flask if Flask is not already installed.

Evaluation

The model can be evaluated with evaluate.py on a corpus of parallel sentences. The evaluation corpus used was the TMU Evaluation Corpus for Japanese Learners (TEC-JL), and the metric is GLEU score.
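For reference, GLEU can be computed with NLTK's implementation; evaluate.py in this repository may differ in its exact setup. Japanese is scored at the character level here purely for simplicity (a real pipeline would tokenize with a morphological analyzer such as MeCab):

```python
from nltk.translate.gleu_score import sentence_gleu

# Character-level reference and hypothesis for a single sentence pair.
reference = list("一緒にコーヒーを飲みながら、話しました。")
hypothesis = list("一緒にコーヒーを飲みながら、話しました。")

score = sentence_gleu([reference], hypothesis)
print(f"GLEU: {score:.3f}")  # 1.000 for an exact match
```

A corpus-level score averages over all sentence pairs in the evaluation set; higher is better, with 1.0 meaning the hypothesis matches a reference exactly.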

TEC-JL Results

Method                      GLEU
Chollampatt and Ng, 2018    0.739
JGEC                        0.860

Credit

jonnyli1125