LEAD

PyTorch implementation of the EMNLP 2022 Findings long paper "Learning from the Dictionary: Heterogeneous Knowledge Guided Fine-tuning for Chinese Spell Checking".

Requirements

python >= 3.9
torch == 1.11.0
transformers == 4.14.1
hanlp == 2.1.0b27
pypinyin == 0.46.0
einops == 0.4.1
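
The pins above can be checked against the current environment before training. The following is a minimal sketch, not part of the repository: the helper name `check_requirements` is hypothetical, and it assumes each package is installed under the distribution name listed above. Local build suffixes such as `+cu113` on torch are ignored when comparing.

```python
from importlib import metadata

# Pinned versions copied from the requirements list above.
PINNED = {
    "torch": "1.11.0",
    "transformers": "4.14.1",
    "hanlp": "2.1.0b27",
    "pypinyin": "0.46.0",
    "einops": "0.4.1",
}


def check_requirements(pinned=PINNED, get_version=metadata.version):
    """Return a list of readable problems; the list is empty if every pin matches."""
    problems = []
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (want {wanted})")
            continue
        # Drop a local build suffix such as "+cu113" before comparing.
        if found.split("+")[0] != wanted:
            problems.append(f"{name}: found {found}, want {wanted}")
    return problems


if __name__ == "__main__":
    import sys

    if sys.version_info < (3, 9):
        print("python: need >= 3.9")
    for line in check_requirements():
        print(line)
```

An empty output means the environment matches the listed versions.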

Prepare Data and Pretrained Model

The raw data comes from the following sources:
- SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html
- SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html
- SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
- Wang271K: https://github.com/wdimmy/Automatic-Corpus-Generation

Alternatively, you can download the already-processed data directly from ReaLiSe.

Put the processed data files in the resources/data directory.
Download the pretrained Chinese RoBERTa model, chinese-roberta-wwm-ext, from Hugging Face, and put all of its files in the resources/pretrained_onlybert directory.
Download the glyph-enhanced pretrained model from GCC and put the model files in the resources/glyph directory.
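
Before training, it can help to confirm that the three resource directories exist and are not empty. This is a rough sketch under the assumption that the directories sit under the repository root exactly as named in this section; the helper `missing_resources` is hypothetical and does not check which specific files are present.

```python
from pathlib import Path

# Directory names taken from the setup steps in this section.
EXPECTED_DIRS = (
    "resources/data",
    "resources/pretrained_onlybert",
    "resources/glyph",
)


def missing_resources(root="."):
    """Return the expected directories that are absent or empty under root."""
    root = Path(root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        # A directory counts as missing if it does not exist or holds no entries.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing
```

Calling `missing_resources()` from the repository root returns an empty list once all three directories are populated.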

Run the Code

Run run.sh or directly execute the following command:

python main.py --config config/bert_phonics_dictionary_strokes.yaml

The --config argument can point to any other file in the config directory, and the configuration files themselves can be edited as needed.
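
To train with several configurations in turn, the command above can be wrapped in a small launcher. This is only a sketch: the helpers `list_configs`, `build_command`, and `run_all` are hypothetical, and it assumes every configuration file in the config directory uses the `.yaml` extension, as the one named above does.

```python
import subprocess
import sys
from pathlib import Path


def list_configs(config_dir="config"):
    """Return the .yaml files in config_dir, sorted by path."""
    return sorted(Path(config_dir).glob("*.yaml"))


def build_command(config_path):
    """Build the same command shown above, using the current Python interpreter."""
    return [sys.executable, "main.py", "--config", str(config_path)]


def run_all(config_dir="config"):
    """Run main.py once per configuration file; stop at the first failure."""
    for cfg in list_configs(config_dir):
        subprocess.run(build_command(cfg), check=True)
```

Calling `run_all()` from the repository root starts one training run per configuration file, sequentially.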

After training, run test.sh to evaluate the model. Make sure the checkpoint parameter points to the trained checkpoint you want to test.

Trained checkpoints for SIGHAN datasets can be found here.

Citation

@inproceedings{li2022learning,
  title={Learning from the Dictionary: Heterogeneous Knowledge Guided Fine-tuning for Chinese Spell Checking},
  author={Li, Yinghui and Ma, Shirong and Zhou, Qingyu and Li, Zhongli and Yangning, Li and Huang, Shulin and Liu, Ruiyang and Li, Chao and Cao, Yunbo and Zheng, Haitao},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2022},
  year={2022}
}
