wordseg

wordseg is a Python package of word segmentation models.

Installation

wordseg is available through pip:

pip install wordseg

To install wordseg from the GitHub source:

git clone https://github.com/jacksonllee/wordseg.git
cd wordseg
pip install -e ".[dev]"

Usage

wordseg implements each word segmentation model as a Python class. A model instance provides the following methods, in the style of the scikit-learn API:

fit: Train the model with segmented sentences.
predict: Predict the segmented sentences from unsegmented sentences.

The implemented model classes are as follows:

RandomSegmenter: Each potential word boundary is independently predicted to be an actual boundary with a given probability. No training is required.
LongestStringMatching: This model constructs predicted words by moving from left to right along an unsegmented sentence and finding the longest matching words, constrained by a maximum word length parameter.
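To make the random baseline concrete, here is a minimal standalone sketch of the idea in plain Python. It is not wordseg's own implementation: the function name `random_segment` and the parameters `prob` and `rng` are hypothetical and chosen only for illustration.

```python
import random


def random_segment(sentence, prob=0.5, rng=None):
    """Split `sentence` into words by placing a boundary between each pair of
    adjacent characters independently with probability `prob`.

    Illustrative sketch only; names and defaults are assumptions,
    not the wordseg API.
    """
    if rng is None:
        rng = random.Random()
    if not sentence:
        return []
    words = []
    current = sentence[0]
    for char in sentence[1:]:
        # rng.random() is in [0.0, 1.0), so prob=0 never splits
        # and prob=1 always splits.
        if rng.random() < prob:
            words.append(current)
            current = char
        else:
            current += char
    words.append(current)
    return words


# A seeded generator makes the random segmentation reproducible.
print(random_segment("thisisasentence", prob=0.3, rng=random.Random(0)))
```

Whatever boundaries are drawn, joining the predicted words always gives back the original unsegmented sentence, which is a handy sanity check for any segmenter.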

Sample code snippet:

from wordseg import LongestStringMatching

# Initialize a model.
model = LongestStringMatching(max_word_length=4)

# Train the model.
# `fit` takes an iterable of segmented sentences (each one a tuple or list of strings).
model.fit(
  [
    ("this", "is", "a", "sentence"),
    ("that", "is", "not", "a", "sentence"),
  ]
)

# Make some predictions; `predict` gives a generator, which is materialized by list() here.
list(model.predict(["thatisadog", "thisisnotacat"]))
# [['that', 'is', 'a', 'd', 'o', 'g'], ['this', 'is', 'not', 'a', 'c', 'a', 't']]
# We can't get 'dog' and 'cat' because they aren't in the training data.
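The predictions above can be reproduced with a small standalone sketch of greedy longest matching. This is an illustration under stated assumptions, not wordseg's actual code: the helper names `build_vocabulary` and `longest_match_segment` are hypothetical, and the sketch assumes that an unmatched character simply becomes a one-character word.

```python
def build_vocabulary(segmented_sentences, max_word_length):
    """Collect the training words that are no longer than `max_word_length`."""
    return {
        word
        for sentence in segmented_sentences
        for word in sentence
        if len(word) <= max_word_length
    }


def longest_match_segment(sentence, vocabulary, max_word_length):
    """Greedy left-to-right segmentation (illustrative sketch, not the wordseg API).

    At each position, try the longest candidate first and shrink it until it is
    a known word; a single character is accepted as a fallback.
    """
    if max_word_length < 1:
        raise ValueError("max_word_length must be at least 1")
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_length, len(sentence) - i)
        for length in range(longest, 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words


training = [
    ("this", "is", "a", "sentence"),
    ("that", "is", "not", "a", "sentence"),
]
vocabulary = build_vocabulary(training, max_word_length=4)
print(longest_match_segment("thatisadog", vocabulary, max_word_length=4))
# ['that', 'is', 'a', 'd', 'o', 'g']
```

Because the search never looks beyond `max_word_length` characters, a training word longer than that limit (here, 'sentence') can never be predicted in this sketch.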

License

MIT License. Please see LICENSE.txt.

Changelog

Please see CHANGELOG.md.

Contributing

Please see CONTRIBUTING.md.

Citation

Lee, Jackson L. 2023. wordseg: Word segmentation models in Python. https://doi.org/10.5281/zenodo.4077433

@software{leengrams,
  author       = {Jackson L. Lee},
  title        = {wordseg: Word segmentation models in Python},
  year         = 2023,
  doi          = {10.5281/zenodo.4077433},
  url          = {https://doi.org/10.5281/zenodo.4077433}
}
