wordcept

A Python toolkit for machine learning on Chinese words.

Word Segmentation Tool: `dartfrog.py`

To train the word segmentation tool on a corpus of segmented text, run:

`dartfrog.py --fit TRAIN-DATA-FILE`

To process raw text and produce segmented text, run:

`dartfrog.py --transform INPUT-FILE OUTPUT-FILE`

| SIGHAN Bakeoff 2005 dataset | F1 | Recall | OOV Recall |
| --- | --- | --- | --- |
| AS | 0.928 | 0.935 | 0.390 |
| CityU | 0.911 | 0.927 | 0.388 |
| MSRA | 0.946 | 0.963 | 0.205 |
| PKU | 0.924 | 0.932 | 0.499 |
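
For readers unfamiliar with the metrics above, here is a minimal sketch of how word-level F1, recall, and OOV (out-of-vocabulary) recall are commonly computed for segmentation output. It is not taken from `dartfrog.py` or from the official Bakeoff scoring script. The function names `word_spans` and `evaluate`, the whitespace-delimited input format, and the toy sentences are all assumptions made for illustration.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character spans they cover."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def evaluate(gold_lines, pred_lines, vocabulary):
    """Score predicted segmentations against gold ones (hypothetical helper).

    Both inputs are lists of strings with words separated by whitespace
    (an assumed format). `vocabulary` is the set of words seen in training;
    gold words outside it count as OOV.
    """
    gold_total = pred_total = correct = 0
    oov_total = oov_correct = 0
    for gold_line, pred_line in zip(gold_lines, pred_lines):
        gold_words = gold_line.split()
        gold_spans = word_spans(gold_words)
        pred_spans = word_spans(pred_line.split())
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
        # A predicted word is correct only if both of its boundaries match.
        correct += len(gold_spans & pred_spans)
        start = 0
        for word in gold_words:
            end = start + len(word)
            if word not in vocabulary:
                oov_total += 1
                if (start, end) in pred_spans:
                    oov_correct += 1
            start = end
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    oov_recall = oov_correct / oov_total if oov_total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "oov_recall": oov_recall}


# Toy example: the gold word "喜欢" is outside the vocabulary and is split wrongly.
scores = evaluate(["我 喜欢 北京"], ["我 喜 欢 北京"], vocabulary={"我", "北京"})
print(scores)
```

Comparing character spans rather than word strings means a word is counted as correct only when it appears at the same position in both segmentations, so repeated words in a sentence are not matched to the wrong occurrence.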
