This repository has been archived by the owner on Nov 11, 2020. It is now read-only.

A Chinese word segmentation system, mainly based on HMM and Maximum Matching, with a local website built with Flask as the UI.


izackwu/ChineseWordSegmentationSystem


ChineseWordSegmentationSystem

A Chinese word segmentation system, mainly based on HMM.

The component of UI is implemented with Flask as a local website.

How to start?

  • Fork this repository or download all the files.
  • Run pip install -r requirements.txt to install all the dependencies needed.
  • Run python "FlaskUI/FlaskUI.py" runserver to start the website locally.
  • Visit http://127.0.0.1:5000 in your browser.

Quite easy, isn't it?

Some possible issues?

  • For Windows users, if you fail to run pip install -r requirements.txt:

    • Make sure that you have installed pip properly and added its directory to the PATH environment variable.
    • If your system's default encoding is not UTF-8, for example, GBK, do as follows:
      1. Edit the file Python36\Lib\site-packages\pip\compat\__init__.py.
      2. Replace return s.decode('utf-8') (line 75) with return s.decode('gbk').
  • The website doesn't look the same as in the screenshots?

    • Make sure that your computer has access to the Internet, as the website needs to load CSS from CDN.
  • Others?

    • Open a new issue and it will be responded to as soon as possible.

How does it work?

  • Train the HMM model with the data in TrainingSet and obtain three matrices:
    • InitStatus
    • TransProbMatrix
    • EmitProbMatrix
  • When segmenting words, load these matrices and cut the whole text into sentences to process.
  • Use the Viterbi algorithm to find the most probable state of every character in a sentence.
  • Segment the sentence according to these states.
  • As for the Flask UI, it uses Flask-Bootstrap, Flask-WTF and Flask-Script to build a local website.
  • UPDATE: The latest version combines HMM with the Maximum Matching algorithm.
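The decoding step above can be sketched as follows. This is a generic Viterbi implementation over the four-state B/M/E/S tagging scheme commonly used for HMM segmentation, not the exact code from this repository; the probability tables are hypothetical stand-ins for the trained InitStatus, TransProbMatrix and EmitProbMatrix (all assumed to hold log probabilities):

```python
import math

STATES = "BMES"  # Begin, Middle, End, Single: per-character word-position tags

def viterbi(sentence, init, trans, emit):
    """Return the most probable state sequence for `sentence`.

    init[s]      -- log P(first state = s)            (InitStatus)
    trans[s][t]  -- log P(next state = t | state = s) (TransProbMatrix)
    emit[s][ch]  -- log P(char = ch | state = s)      (EmitProbMatrix)
    Unseen characters and transitions get a small floor probability.
    """
    FLOOR = math.log(1e-12)
    V = [{s: init.get(s, FLOOR) + emit[s].get(sentence[0], FLOOR) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in sentence[1:]:
        V.append({})
        new_path = {}
        for t in STATES:
            # Best previous state for landing in state t on this character.
            prob, best = max(
                (V[-2][s] + trans[s].get(t, FLOOR) + emit[t].get(ch, FLOOR), s)
                for s in STATES
            )
            V[-1][t] = prob
            new_path[t] = path[best] + [t]
        path = new_path
    best_final = max(STATES, key=lambda s: V[-1][s])
    return path[best_final]

def cut(sentence, states):
    """Split the sentence after every E (word end) or S (single-char word)."""
    words, start = [], 0
    for i, s in enumerate(states):
        if s in "ES":
            words.append(sentence[start:i + 1])
            start = i + 1
    if start < len(sentence):  # dangling B/M at the end
        words.append(sentence[start:])
    return words
```

Given a sentence, `cut(sentence, viterbi(sentence, init, trans, emit))` yields the segmented words.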

How well does it work?

  • Well, I have to admit, the segmentation accuracy is not entirely satisfactory.
  • F1 Score:
    • pku_test: 0.763 --> 0.829 (the latest version)
    • msr_test: 0.793 --> 0.889 (the latest version)
  • For comparison, the F1 Score of Jieba (a well-known Python Chinese word segmentation module):
    • pku_test: 0.818
    • msr_test: 0.815
  • For more information, see Result/
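The F1 scores above can be computed as the standard word-level measure used in segmentation evaluation: a predicted word counts as correct only if both of its character boundaries match the gold segmentation. A minimal sketch (the function names here are illustrative, not from this repository):

```python
def to_spans(words):
    """Convert a word list to a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def f1_score(gold_words, pred_words):
    """Word-level F1: a word is correct iff both boundaries match gold."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```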

What does it look like?

Screenshots: index pages (index-0 to index-2), sentence pages (sentence-0, sentence-1), help page (help-0), settings page (settings-0) and copyright page (copyright-0).

Reference

While working on this project, I referred to quite a few articles on the Internet as well as some books. Some of them are listed as follows:

With my sincere gratitude!
