This repository has been archived by the owner on Nov 11, 2020. It is now read-only.

A Chinese word segmentation system, mainly based on HMM and Maximum Matching, with a local website built with Flask as the UI.


izackwu/ChineseWordSegmentationSystem


ChineseWordSegmentationSystem

A Chinese word segmentation system, mainly based on HMM.

The component of UI is implemented with Flask as a local website.

How to start?

  • Fork this repository or download all the files.
  • Run pip install -r requirements.txt to install all the dependencies needed.
  • Run python "FlaskUI/FlaskUI.py" runserver to start the website locally.
  • Visit http://127.0.0.1:5000 in your browser.

Quite easy, isn't it?

Some possible issues?

  • For Windows users, if you fail to run pip install -r requirements.txt:

    • Make sure that you have installed pip properly and added its directory to the PATH environment variable.
    • If your system's default encoding is not UTF-8, for example, GBK, do as follows:
      1. Edit the file Python36\Lib\site-packages\pip\compat\__init__.py.
      2. Replace return s.decode('utf-8') (line 75) with return s.decode('gbk').
  • The website doesn't look the same as in the screenshots?

    • Make sure that your computer has access to the Internet, as the website needs to load CSS from CDN.
  • Others?

    • Open a new issue and it will be responded to as soon as possible.

How does it work?

  • Train the HMM model with the data in TrainingSet and obtain three matrices:
    • InitStatus
    • TransProbMatrix
    • EmitProbMatrix
  • When segmenting words, load these matrices and cut the whole text into sentences to process.
  • Use the Viterbi algorithm to find the most probable state of every character in a sentence.
  • Segment the sentence according to these states.
  • As for the Flask UI, it uses Flask-Bootstrap, Flask-WTF and Flask-Script to build a local website.
  • UPDATE: The latest version combines HMM with the Maximum Matching algorithm.
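The decoding step above can be sketched as follows. This is a generic Viterbi implementation over the four-state B/M/E/S tagging scheme commonly used for HMM segmentation, not the exact code from this repository; the probability tables are hypothetical stand-ins for the trained InitStatus, TransProbMatrix and EmitProbMatrix (all assumed to hold log probabilities):

```python
import math

STATES = "BMES"  # Begin, Middle, End, Single: per-character word-position tags

def viterbi(sentence, init, trans, emit):
    """Return the most probable state sequence for `sentence`.

    init[s]      -- log P(first state = s)            (InitStatus)
    trans[s][t]  -- log P(next state = t | state = s) (TransProbMatrix)
    emit[s][ch]  -- log P(char = ch | state = s)      (EmitProbMatrix)
    Unseen characters and transitions get a small floor probability.
    """
    FLOOR = math.log(1e-12)
    V = [{s: init.get(s, FLOOR) + emit[s].get(sentence[0], FLOOR) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in sentence[1:]:
        V.append({})
        new_path = {}
        for t in STATES:
            # Best previous state for landing in state t on this character.
            prob, best = max(
                (V[-2][s] + trans[s].get(t, FLOOR) + emit[t].get(ch, FLOOR), s)
                for s in STATES
            )
            V[-1][t] = prob
            new_path[t] = path[best] + [t]
        path = new_path
    best_final = max(STATES, key=lambda s: V[-1][s])
    return path[best_final]

def cut(sentence, states):
    """Split the sentence after every E (word end) or S (single-char word)."""
    words, start = [], 0
    for i, s in enumerate(states):
        if s in "ES":
            words.append(sentence[start:i + 1])
            start = i + 1
    if start < len(sentence):  # dangling B/M at the end
        words.append(sentence[start:])
    return words
```

Given a sentence, `cut(sentence, viterbi(sentence, init, trans, emit))` yields the segmented words.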

How well does it work?

  • Well, I have to admit, the segmentation accuracy is not entirely satisfactory.
  • F1 Score:
    • pku_test: 0.763 --> 0.829 (the latest version)
    • msr_test: 0.793 --> 0.889 (the latest version)
  • For comparison, the F1 Score of Jieba (a well-known Python Chinese word segmentation module):
    • pku_test: 0.818
    • msr_test: 0.815
  • For more information, see Result/
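The F1 scores above can be computed as the standard word-level measure used in segmentation evaluation: a predicted word counts as correct only if both of its character boundaries match the gold segmentation. A minimal sketch (the function names here are illustrative, not from this repository):

```python
def to_spans(words):
    """Convert a word list to a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def f1_score(gold_words, pred_words):
    """Word-level F1: a word is correct iff both boundaries match gold."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```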

What does it look like?

Screenshots: index pages (index-0 to index-2), sentence pages (sentence-0, sentence-1), help page (help-0), settings page (settings-0) and copyright page (copyright-0).

Reference

While working on this project, I referred to quite a few articles on the Internet as well as some books. Some of them are listed as follows:

With my sincere gratitude!
