All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Part-of-speech tagging:
  - Added the function `pos_tag` that takes a segmented sentence or phrase and returns its part-of-speech tags.
  - Added the function `hkcancor_to_ud` that maps a part-of-speech tag from the original HKCanCor annotated data to one of the tags from the Universal Dependencies v2 tagset.
- Word segmentation:
  - Improved segmentation quality by revising the underlying wordlist data.
- Fixed the issue of not opening text files with UTF-8 encoding (a possible issue on Windows).
- Word segmentation:
  - Segmentation is customizable for the following:
    - Maximum word length
    - A user-supplied list of words to allow as words
    - A user-supplied list of words to disallow as words
  - The default segmentation model has been improved with the rime-cantonese data (CC BY 4.0 license).
- Characters-to-Jyutping conversion:
  - The conversion returns results in a word-segmented form.
  - The conversion model has been improved with the rime-cantonese data (CC BY 4.0 license).
- Added the following functions; they are equivalent to their (now deprecated) `x2y` counterparts:
  - `characters_to_jyutping`
  - `jyutping_to_tipa`
  - `jyutping_to_yale`
- Added support for Python 3.9.
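The three segmentation options listed above (maximum word length, allowed words, disallowed words) can be illustrated with a small standalone sketch. It assumes a greedy longest-match strategy over a plain wordlist; the function name `sketch_segment` and its parameters are hypothetical and chosen for illustration, and this is not the package's own implementation.

```python
def sketch_segment(text, wordlist, max_word_length=5, allow=(), disallow=()):
    """Greedy longest-match segmentation (illustrative sketch only)."""
    # Effective vocabulary: base wordlist plus allowed words, minus disallowed ones.
    vocab = (set(wordlist) | set(allow)) - set(disallow)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        longest = min(max_word_length, len(text) - i)
        for length in range(longest, 0, -1):
            candidate = text[i : i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words


# Hand-written toy wordlist, not data shipped with any package.
toy_wordlist = {"廣東話", "容易"}
print(sketch_segment("廣東話好容易學", toy_wordlist))
# ['廣東話', '好', '容易', '學']
```

Passing `allow={"好容易"}` makes that string a single word, while `disallow={"廣東話"}` forces it to be split into single characters, which is the kind of control the options above describe.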
- `jyutping_to_yale`: The default value of the keyword argument `as_list` has been changed from `False` to `True`, so that this function is now more in line with the other "jyutping_to_X" functions for returning a list.
- `characters_to_jyutping`: The returned value is now a list of segmented words, where each is a 2-tuple of (Cantonese characters, Jyutping). Previously, it was a list of Jyutping strings for the individual Cantonese characters.
- Switched documentation to the readthedocs theme and numpydoc docstring style.
- Improved CircleCI builds with orbs.
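For code that relied on the earlier per-character return shape of `characters_to_jyutping`, a small adapter along the following lines may help. This is a sketch under stated assumptions: the helper name `flatten_to_characters` is hypothetical, the sample data is hand-written rather than actual library output, and the Jyutping of a word is assumed to be its syllables concatenated, each ending in a tone digit from 1 to 6.

```python
import re

# One Jyutping syllable: lowercase letters followed by a tone digit 1-6.
_SYLLABLE = re.compile(r"[a-z]+[1-6]")


def flatten_to_characters(segmented):
    """Convert [(word, jyutping), ...] into a per-character Jyutping list."""
    result = []
    for characters, jyutping in segmented:
        syllables = _SYLLABLE.findall(jyutping) if jyutping else []
        if len(syllables) == len(characters):
            result.extend(syllables)
        else:
            # Missing or misaligned romanization: one placeholder per character.
            result.extend([None] * len(characters))
    return result


# Hand-written sample in the word-segmented shape described above.
new_style = [("香港", "hoeng1gong2"), ("人", "jan4")]
old_style = flatten_to_characters(new_style)
print(old_style)
# ['hoeng1', 'gong2', 'jan4']
```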
- The following `x2y` functions have been deprecated in favor of their equivalents named in the form of `x_to_y`:
  - `characters2jyutping`
  - `jyutping2tipa`
  - `jyutping2yale`
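A common way to keep an old name working while steering callers to the new one is a thin wrapper that emits a `DeprecationWarning`. The sketch below shows that general pattern only; `x_to_y` and `x2y` are placeholder names with a stand-in body, and nothing here is taken from the package's source.

```python
import warnings


def x_to_y(value):
    # Stand-in body for illustration; a real conversion would go here.
    return value.upper()


def x2y(value):
    """Deprecated alias that forwards to x_to_y."""
    warnings.warn(
        "x2y is deprecated; use x_to_y instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return x_to_y(value)
```

Because Python hides `DeprecationWarning` in many contexts by default, callers who want to find remaining uses of the old names can run with `python -W default` or call `warnings.simplefilter("default")`.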
- Turned on HTTPS for the pycantonese.org domain.
- Switched the `wordseg` dependency to a PyPI source instead of a GitHub direct link.
- Added the `characters2jyutping()` function for converting Cantonese characters to Jyutping romanization.
- Added the `segment()` function for word segmentation.
- Added support for Python 3.7 and 3.8.
- Dropped support for Python 3.4 and 3.5 (supporting 3.6, 3.7, and 3.8 now).
- Added 104 stop words.
- Exposed the `exclude` parameter in various reader methods for excluding specific participants. This parameter was implemented in pylangacq v0.10.0.
- Allowed "n" to be a syllabic nasal.
- Fixed corpus reader not picking up the characters.
- PyCantonese now requires Python 3.4 or above.
- Adopted the CHAT corpus format, piggybacking on PyLangAcq
- Converted HKCanCor into the CHAT format
- Switched to transparent function names (cf. issue #10): `parse_jyutping()`, `jyutping2yale()`, `jyutping2tipa()`
- Bug fixes: issues #6, #7, #8, #9
- Fixed the Jyutping-Yale conversion issue with "yu"
- Added `number_of_words()` and `number_of_characters()` for corpus access
- Forced all part-of-speech tags (both in searches and internal to corpus objects) in caps, in line with the NLTK convention
- Overall code restructuring
- Only Python 3.x is supported from this point onwards
- Used generators instead of lists for corpus access methods
- Added the part-of-speech search criterion
- Added Jyutping-to-Yale conversion
- Added Jyutping-to-TIPA conversion
- Disabled the function for reading a custom corpus dataset (it will come back)
- Fixed corpus access path issues
- The Hong Kong Cantonese Corpus is included in the package.
- A general-purpose `search()` function is defined, replacing the element-specific search functions from version 0.1.
- Basic functions available, including:
  - Parsing Jyutping romanization
  - Reading a tagged corpus data folder
  - Searching by a given element (onset/initial, nucleus, coda, final, character)
  - Searching by a character plus a range
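To make the elements named above (onset, nucleus, coda) concrete, here is a minimal sketch of splitting a single Jyutping syllable into onset, nucleus, coda, and tone. The function name `split_syllable` is hypothetical, the sound inventory in the patterns is a simplified assumption, and this is not the package's parser.

```python
import re

# Simplified Jyutping syllable shape: optional onset, vowel nucleus,
# optional coda, then a tone digit 1-6.
_FULL = re.compile(
    r"^(ng|gw|kw|[bpmfdtnlgkhwzcsj])?"  # optional onset
    r"(aa|yu|oe|eo|[aeiou])"            # vowel nucleus
    r"(ng|[ptkmniu])?"                  # optional coda
    r"([1-6])$"                         # tone
)
# Syllabic nasals stand alone as the nucleus, e.g. "m4" or "ng5".
_SYLLABIC_NASAL = re.compile(r"^(m|ng)([1-6])$")


def split_syllable(syllable):
    """Return (onset, nucleus, coda, tone) for one Jyutping syllable."""
    match = _FULL.match(syllable)
    if match:
        # Unmatched optional groups become empty strings.
        return match.groups(default="")
    match = _SYLLABIC_NASAL.match(syllable)
    if match:
        nucleus, tone = match.groups()
        return ("", nucleus, "", tone)
    raise ValueError(f"not a recognized Jyutping syllable: {syllable!r}")


print(split_syllable("gwong2"))
# ('gw', 'o', 'ng', '2')
```

With syllables broken down this way, searching by a given element reduces to comparing one field of the tuple, for example keeping only syllables whose coda is `"ng"`.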