wordseg is a Python package of word segmentation models.
wordseg is available through pip:
pip install wordseg
To install wordseg from the GitHub source:
git clone https://github.com/jacksonllee/wordseg.git
cd wordseg
pip install -e ".[dev]"
wordseg implements each word segmentation model as a Python class.
A model instance has the following methods
(following the scikit-learn-style API for machine learning):
fit: Train the model with segmented sentences.
predict: Predict the segmented sentences from unsegmented sentences.
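To make the fit/predict contract concrete, here is a minimal sketch of what such an interface could look like. The names `Segmenter` and `CharacterSegmenter` are hypothetical and are not part of wordseg; the sketch only assumes what is stated above, namely that `fit` consumes segmented sentences and `predict` produces segmented sentences from unsegmented strings.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List, Sequence


class Segmenter(ABC):
    """Hypothetical interface for illustration; not wordseg's actual base class."""

    @abstractmethod
    def fit(self, sents: Iterable[Sequence[str]]) -> None:
        """Train on segmented sentences, each a sequence of words."""

    @abstractmethod
    def predict(self, sent_strs: Iterable[str]) -> Iterator[List[str]]:
        """Yield one list of predicted words per unsegmented sentence."""


class CharacterSegmenter(Segmenter):
    """Toy model that treats every character as a word, so `fit` learns nothing."""

    def fit(self, sents):
        pass  # nothing to learn for this toy model

    def predict(self, sent_strs):
        for sent in sent_strs:
            yield list(sent)


model = CharacterSegmenter()
model.fit([("this", "is", "a", "sentence")])
print(list(model.predict(["ab", "cde"])))
```

Because `predict` is written as a generator here, callers materialize the results with list() when they need them all at once.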
The implemented model classes are as follows:
RandomSegmenter: Segmentation is predicted at random at each potential word boundary independently, with a given probability. No training is required.
LongestStringMatching: This model constructs predicted words by moving from left to right along an unsegmented sentence and finding the longest matching words, constrained by a maximum word length parameter.
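The two strategies described above can be sketched in plain Python. This is an illustrative sketch and not wordseg's implementation: the function names, the `prob` and `rng` parameters, the use of a plain set as the vocabulary, and the fallback to a single character when nothing matches are all assumptions made for the example.

```python
import random


def random_segment(sentence, prob=0.5, rng=None):
    """Place a boundary between adjacent characters independently with probability `prob`."""
    rng = rng or random.Random()
    if not sentence:
        return []
    words = [sentence[0]]
    for char in sentence[1:]:
        if rng.random() < prob:
            words.append(char)  # boundary: start a new word
        else:
            words[-1] += char  # no boundary: extend the current word
    return words


def longest_match_segment(sentence, vocabulary, max_word_length):
    """Greedy left-to-right segmentation, taking the longest known word at each position."""
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_length, len(sentence) - i)
        # Try candidates from the longest allowed length down to length 2.
        for length in range(longest, 1, -1):
            candidate = sentence[i:i + length]
            if candidate in vocabulary:
                break
        else:
            # Assumed fallback: no known word starts here, so emit one character.
            candidate = sentence[i]
        words.append(candidate)
        i += len(candidate)
    return words


vocabulary = {"this", "is", "a", "sentence", "that", "not"}
print(longest_match_segment("thatisadog", vocabulary, max_word_length=4))
# ['that', 'is', 'a', 'd', 'o', 'g']
```

In this sketch, a word longer than `max_word_length` (such as "sentence" with a limit of 4) can never be matched, which illustrates how the maximum word length constrains the predictions.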
Sample code snippet:
from wordseg import LongestStringMatching
# Initialize a model.
model = LongestStringMatching(max_word_length=4)
# Train the model.
# `fit` takes an iterable of segmented sentences (a tuple or list of strings).
model.fit(
    [
        ("this", "is", "a", "sentence"),
        ("that", "is", "not", "a", "sentence"),
    ]
)
# Make some predictions; `predict` gives a generator, which is materialized by list() here.
list(model.predict(["thatisadog", "thisisnotacat"]))
# [['that', 'is', 'a', 'd', 'o', 'g'], ['this', 'is', 'not', 'a', 'c', 'a', 't']]
# We can't get 'dog' and 'cat' because they aren't in the training data.
MIT License. Please see LICENSE.txt.
For the release history, please see CHANGELOG.md.
For contribution guidelines, please see CONTRIBUTING.md.
Lee, Jackson L. 2023. wordseg: Word segmentation models in Python. https://doi.org/10.5281/zenodo.4077433
@software{leengrams,
  author = {Jackson L. Lee},
  title = {wordseg: Word segmentation models in Python},
  year = 2023,
  doi = {10.5281/zenodo.4077433},
  url = {https://doi.org/10.5281/zenodo.4077433}
}