v0.1.97

@taku910 taku910 released this 06 Aug 16:03
· 167 commits to master since this release

Major changes

  • Migrated the C++ version from C++11 to C++17.
  • Migrated the CI environment from Travis CI to GitHub Actions.
  • Started using cibuildwheel to build PyPI wheel packages.

New features

  • [ALL] Support differential privacy while training. https://aclanthology.org/2022.findings-acl.171.pdf
  • [ALL] Introduced APIs that return an ImmutableSentencePieceText struct, which holds the token string, id, and UTF-8 byte offsets of each piece at once. The new API is available from both C++ and Python.
  • [ALL] Allow tab ‘\t’ to be included in user defined symbols.
  • [ALL] Added an NFKD normalization rule. The NFKD rule is provided as a TSV file.
  • [ALL] Added an option to emit the unknown symbol instead of the raw symbol.
  • [Python]: Batch encode/decode requests now run in parallel on native threads.
  • [Python]: Supports passing a custom log stream during training.
  • [Python]: Adds a module-level version variable: spm.__version__
  • [Python]: Builds a wheel package as a Mac universal binary.

Bug fixes & minor changes

  • Uses the efficient encoding algorithm by default. Removed the functionality to switch the Viterbi tokenization algorithm.
  • Made the output of Encode and the 1-best result from NBestEncode the same.
  • Use std::string_view as much as possible.
  • [Python] Removed the pip packages for the ppc64le and s390x architectures, as cibuildwheel doesn’t support them.