Chinese Word Segmentation

The goal of this project is to train a Bidirectional LSTM model that segments a Chinese sentence into its individual words.
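Word segmentation is commonly framed as per-character tagging, so that a sequence model such as a Bi-LSTM can predict one label per character. The sketch below illustrates that framing with the BIES scheme (Begin, Inside, End, Single). The scheme itself and the function names `words_to_bies` and `bies_to_words` are assumptions made for illustration; this README does not state which labels the project's code actually uses.

```python
def words_to_bies(words):
    """Map a list of words to one B/I/E/S tag per character (assumed scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first char is B, last char is E, everything between is I
            tags.append("B" + "I" * (len(word) - 2) + "E")
    return "".join(tags)


def bies_to_words(chars, tags):
    """Rebuild the word list from raw characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without a closing tag
        words.append(current)
    return words


if __name__ == "__main__":
    gold_words = "我 喜欢 自然语言处理".split()
    tags = words_to_bies(gold_words)
    print(tags)
    print(bies_to_words("".join(gold_words), tags))
```

Under this framing, the model only needs to output a tag string of the same length as the unsegmented input, and the segmented sentence is recovered by a deterministic decoding step.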

The training data is the concatenation of four datasets: AS (Traditional Chinese), CITYU (Traditional Chinese), MSR (Simplified Chinese) and PKU (Simplified Chinese).

Training was run on a Google Compute Engine instance with a Tesla K80 GPU.

Instructions

Generate dictionary

python preprocess.py [resources_path] [sentence_size]

Train

python train.py [resources_path] [sentence_size]

Predict

python train.py [input_path] [output_path] [resources_path]

Score

python train.py [prediction_file] [gold_file]
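As a rough illustration of what comparing a prediction file against a gold file could involve, the sketch below computes per-character tag accuracy. It assumes that both files hold one tag sequence per line and that corresponding lines have equal length; the file format, the metric, and the name `tag_accuracy` are hypothetical and are not taken from the project's scoring script.

```python
def tag_accuracy(predicted_lines, gold_lines):
    """Fraction of characters whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for pred, gold in zip(predicted_lines, gold_lines):
        pred, gold = pred.strip(), gold.strip()
        if len(pred) != len(gold):
            raise ValueError("prediction and gold lines differ in length")
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total if total else 0.0


if __name__ == "__main__":
    # In practice the two lists would come from reading the two files line by line.
    predicted = ["SBE", "BIE"]
    gold = ["SBE", "BEE"]
    print(tag_accuracy(predicted, gold))
```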
