Chinese Words Segmentation

Dataset

total 36497 chinese news, which is collected from the Internet

Purpose

Different from English, there are no space between Chinese words. This project aims to implement Chinese word segmentation without dictionary.

develop Chinese word segmentation algorithm based on entropy
find new Chinese words (which is not in corpus)

Method

Chinese Words Segmentation

Internal Aggregation

see WordInfo.calculateAggregation()

External Collocation Richness

see WordInfo.calculateEntropy()

Word Frequency

If the word frequency is too high, it might be stop words.

If the word frequency is too low, it may not be as word.

In this project, I set FREQ_MIN = 5, FREQ_MAX = 10. The word between FREQ_MIN and FREQ_MAX can be new words candidates

Usage

Change INPUT_FILE in wordSegment.py
python 2

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
LICENSE		LICENSE
MathTool.py		MathTool.py
README.md		README.md
stopwords.txt		stopwords.txt
wordSegment.py		wordSegment.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Chinese Words Segmentation

Dataset

Purpose

Method

Chinese Words Segmentation

Internal Aggregation

External Collocation Richness

Word Frequency

If the word frequency is too high, it might be stop words.

If the word frequency is too low, it may not be as word.

Usage

About

Releases

Packages

Languages

License

AiningWang/Chinese-Words-Segmentation

Folders and files

Latest commit

History

Repository files navigation

Chinese Words Segmentation

Dataset

Purpose

Method

Chinese Words Segmentation

Internal Aggregation

External Collocation Richness

Word Frequency

If the word frequency is too high, it might be stop words.

If the word frequency is too low, it may not be as word.

Usage

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages