Entropy-based Chinese word segmentation algorithm that requires no corpus

AiningWang/Chinese-Words-Segmentation

Chinese Words Segmentation

Dataset

  • 36,497 Chinese news articles in total, collected from the Internet

Purpose

Unlike English, Chinese text has no spaces between words. This project aims to implement Chinese word segmentation without a dictionary.

  • develop a Chinese word segmentation algorithm based on entropy
  • find new Chinese words (words that are not in the corpus)

Method

Chinese Words Segmentation

Internal Aggregation

  • see WordInfo.calculateAggregation()
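
The README does not spell out the formula behind WordInfo.calculateAggregation(). One common way to score internal aggregation is to compare the probability of a candidate with the product of the probabilities of its two parts, and keep the weakest split. The sketch below illustrates that general idea only; it is not taken from the repository. The names count_ngrams and aggregation are hypothetical, and the code is written for Python 3.

```python
from collections import Counter


def count_ngrams(text, max_len):
    """Count every substring of length 1..max_len in text."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def aggregation(word, counts, total_chars):
    """Return the smallest P(word) / (P(left) * P(right)) over all two-way splits.

    Assumption: probabilities are estimated as count / total_chars.
    A high value suggests the characters occur together more often than chance.
    """
    if len(word) < 2:
        raise ValueError("aggregation needs a candidate of at least two characters")
    p_word = counts[word] / total_chars
    ratios = []
    for k in range(1, len(word)):
        p_left = counts[word[:k]] / total_chars
        p_right = counts[word[k:]] / total_chars
        ratios.append(p_word / (p_left * p_right))
    return min(ratios)


text = "abab"
counts = count_ngrams(text, max_len=2)
score_ab = aggregation("ab", counts, len(text))
score_ba = aggregation("ba", counts, len(text))
```

Taking the minimum over splits means a candidate scores well only if every way of cutting it in two looks unlikely to be a chance pairing.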

External Collocation Richness

  • see WordInfo.calculateEntropy()
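
The README likewise leaves the details of WordInfo.calculateEntropy() to the source. A typical way to measure external collocation richness is the Shannon entropy of the characters that appear directly to the left and to the right of a candidate. The sketch below assumes that approach and assumes the smaller of the two entropies is used as the score; both are assumptions, and the names entropy and neighbor_entropy are hypothetical. It is written for Python 3.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy (natural log) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def neighbor_entropy(text, word):
    """Return min(left entropy, right entropy) of the characters adjacent to word."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        # Move one character forward so overlapping occurrences are also seen.
        start = text.find(word, start + 1)
    return min(entropy(left), entropy(right))


varied = neighbor_entropy("xabyzabw", "ab")   # neighbors differ on both sides
fixed = neighbor_entropy("xabyxabz", "ab")    # left neighbor is always "x"
```

A candidate that always appears next to the same character is probably part of a longer word, so its entropy on that side is close to zero.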

Word Frequency

If a candidate's frequency is too high, it may be a stop word.
If its frequency is too low, it may not be a word at all.
  • In this project, FREQ_MIN = 5 and FREQ_MAX = 10. Candidates whose frequency falls between FREQ_MIN and FREQ_MAX are treated as new-word candidates.
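
The frequency filter can be sketched as below. FREQ_MIN and FREQ_MAX take the values given in this README; the function name filter_by_frequency is hypothetical, and treating both bounds as inclusive is an assumption, since the README does not say how the boundary values are handled.

```python
FREQ_MIN = 5
FREQ_MAX = 10


def filter_by_frequency(counts, freq_min=FREQ_MIN, freq_max=FREQ_MAX):
    """Keep candidates whose count lies in [freq_min, freq_max] (inclusive, by assumption)."""
    return {word: c for word, c in counts.items() if freq_min <= c <= freq_max}


example_counts = {"a": 3, "b": 5, "c": 7, "d": 10, "e": 11}
candidates = filter_by_frequency(example_counts)
```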

Usage

  • Set INPUT_FILE in wordSegment.py to your input file
  • Requires Python 2
