Notes on versioning:

The project follows semantic versioning 2.0.0. The API covers the following symbols:

C++
- onmt::BPELearner
- onmt::BPE
- onmt::SPMLearner
- onmt::SentencePiece
- onmt::SpaceTokenizer
- onmt::Tokenizer
- onmt::Vocab
- onmt::unicode::*

Python
- pyonmttok.BPELearner
- pyonmttok.SentencePieceLearner
- pyonmttok.SentencePieceTokenizer
- pyonmttok.Tokenizer
- pyonmttok.Vocab
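
As a rough illustration of how the MAJOR.MINOR.PATCH numbers used in the headings below can be compared, here is a minimal sketch in plain Python. The helpers `parse_version` and `bump_kind` are hypothetical names introduced for this example and are not part of the project; the sketch also assumes versions without pre-release or build metadata, which semantic versioning otherwise allows.

```python
import re
from typing import Tuple

# Accepts "1.37.1" or "v1.37.1"; pre-release and build suffixes are
# deliberately not handled in this sketch.
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse a version string into a (major, minor, patch) tuple."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def bump_kind(old: str, new: str) -> str:
    """Return which component changed between two increasing versions."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new!r} is not newer than {old!r}")
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    return "patch"
```

Comparing the parsed tuples rather than the raw strings matters because components are ordered numerically, so 1.10.0 sorts after 1.9.0.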

[Unreleased]

New features

Fixes and improvements

v1.37.1 (2023-03-01)

Fixes and improvements

Consider escaped characters as single characters in BPE
Ignore undefined scripts when resolving inherited or common scripts

v1.37.0 (2023-02-28)

New features

Add tokenization option allow_isolated_marks to allow combining marks to appear isolated in the tokenization output in specific conditions

Fixes and improvements

Fix infinite loop when the text contains an invalid Unicode character
Fix segmentation fault when the BPELearner does not find any pairs of characters in the tokenized data
[Python] Update ICU to 72.1

v1.36.0 (2023-01-11)

New features

[Python] Add argument vocabulary in the Tokenizer constructor to set the vocabulary with a list of tokens instead of using a file
[Python] Add function pyonmttok.is_valid_language to check if a language code is valid and can be passed to the Tokenizer constructor

v1.35.0 (2022-12-06)

New features

[Python] Add pickling support to pyonmttok.Vocab

Fixes and improvements

Update pybind11 to 2.10.1
Update cibuildwheel to 2.11.2

v1.34.0 (2022-09-13)

Changes

[Python] Wheels are now built under manylinux2014 and require pip >= 19.3 for installation

New features

[Python] Build wheels for Python 3.11

Fixes and improvements

Improve error handling when reading token frequencies in the vocabulary file
[Python] Fix possible crash when pyonmttok is imported before torch
[Python] Update ICU to 71.1
[C++] Fix static compilation with -DBUILD_SHARED_LIBS=OFF
[C++] Fix CMake warning when compiling the tests

v1.33.0 (2022-08-29)

New features

[Python] Build ARM64 wheels for macOS

Fixes and improvements

[CLI] Fix error when the option --segment_alphabet is not set
Fix SentencePiece build warning when compiling with Clang

v1.32.0 (2022-07-25)

New features

Add property pyonmttok.Vocab.counters to retrieve the number of occurrences of each token

Fixes and improvements

Update pybind11 to 2.10.0
Update cxxopts to 3.0.0

v1.31.0 (2022-03-07)

New features

Add utilities to build and use vocabularies:
- pyonmttok.Vocab
- pyonmttok.build_vocab_from_tokens
- pyonmttok.build_vocab_from_lines
Define the method Tokenizer.__call__ to simplify the tokenizer usage when additional features are unused:

tokens = tokenizer(text)

Fixes and improvements

Update pybind11 to 2.9.1

v1.30.1 (2022-01-25)

Fixes and improvements

Fix deprecated language codes in ICU that are incorrectly considered as invalid (e.g. "tl" for Tagalog)

v1.30.0 (2021-11-29)

New features

[Python] Build wheels for AArch64 Linux

Fixes and improvements

[Python] Update ICU to 70.1

v1.29.0 (2021-10-08)

Changes

[Python] Drop support for Python 3.5

New features

[Python] Build wheels for Python 3.10
[Python] Add tokenization method Tokenizer.tokenize_batch

v1.28.1 (2021-09-30)

Fixes and improvements

Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence (version 1.27.0 contained a partial fix for this bug)

v1.28.0 (2021-09-17)

Changes

[C++] Remove the SpaceTokenizer class that is not meant to be public and can be confused with the "space" tokenization mode

New features

Build Python wheels for Windows
Add option tokens_delimiter to configure how tokens are delimited in tokenized files (default is a space)
Expose option with_separators in Python and CLI to include whitespace characters in the tokenized output
[Python] Add package version information in pyonmttok.__version__

Fixes and improvements

Fix detokenization when option with_separators is enabled

v1.27.0 (2021-08-30)

Changes

Linux Python wheels are now compiled with manylinux2010 and require pip >= 19.0 for installation
macOS Python wheels now require macOS >= 10.14

Fixes and improvements

Fix casing resolution when some letters do not have case information
Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence
Improve error message when setting invalid segment_alphabet or lang options
Update SentencePiece to 0.1.96
[Python] Improve declaration of functions and classes for better type hints and checks
[Python] Update ICU to 69.1

v1.26.4 (2021-06-25)

Fixes and improvements

Fix regression introduced in the last version for preserved tokens that are not segmented by BPE

v1.26.3 (2021-06-24)

Fixes and improvements

Fix another divergence with the SentencePiece output when there is only one subword and the spacer is detached

v1.26.2 (2021-06-08)

Fixes and improvements

Fix a divergence with the SentencePiece output when the spacer is detached from the word

v1.26.1 (2021-05-31)

Fixes and improvements

Fix application of the BPE vocabulary when using preserve_segmented_tokens and a subword appears without joiner in the vocabulary
Fix compilation with ICU versions older than 60

v1.26.0 (2021-04-19)

New features

Add lang tokenization option to apply language-specific case mappings

Fixes and improvements

Use ICU to convert strings to Unicode values instead of a custom implementation

v1.25.0 (2021-03-15)

New features

Add training flag in tokenization methods to disable subword regularization during inference
[Python] Implement __len__ method in the Token class

Fixes and improvements

Raise an error when enabling case_markup with incompatible tokenization modes "space" and "none"
[Python] Improve parallelization when Tokenizer.tokenize is called from multiple Python threads (the Python GIL is now released)
[Python] Clean up some manual Python <-> C++ type conversions

v1.24.0 (2021-02-16)

New features

Add verbose flag in file tokenization APIs to log progress every 100,000 lines
[Python] Add options property to Tokenizer instances
[Python] Add class pyonmttok.SentencePieceTokenizer to help create a tokenizer compatible with SentencePiece

Fixes and improvements

Fix deserialization into Token objects that was sometimes incorrect
Fix Windows compilation
Fix Google Test integration that was sometimes installed as part of make install
[Python] Update pybind11 to 2.6.2
[Python] Update ICU to 66.1
[Python] Compile ICU with optimization flags

v1.23.0 (2020-12-30)

Changes

Drop Python 2 support

New features

Publish Python wheels for macOS

Fixes and improvements

Improve performance in all tokenization modes (up to 2x faster)
Fix missing space escaping within protected sequences in "none" and "space" tokenization modes
Fix a regression introduced in 1.20 where segment_alphabet_* options behave differently on characters that appear in multiple Unicode scripts (e.g. some Japanese characters can belong to both Hiragana and Katakana scripts and should not trigger a segmentation)
Fix a regression introduced in 1.21 where a joiner is incorrectly placed when using preserve_segmented_tokens and the word is segmented by both a segment_* option and BPE
Fix incorrect tokenization when using support_prior_joiners and some joiners are within protected sequences

v1.22.2 (2020-11-12)

Fixes and improvements

Do not require "none" tokenization mode for SentencePiece vocabulary restriction

v1.22.1 (2020-10-30)

Fixes and improvements

Fix error when enabling vocabulary restriction with SentencePiece and spacer_annotate is not explicitly set
Fix backward compatibility with Kangxi and Kanbun scripts (see segment_alphabet option)

v1.22.0 (2020-10-29)

Changes

[C++] Subword model caching is no longer supported and should be handled by the client. The subword encoder instance can now be passed as a std::shared_ptr to make it outlive the Tokenizer instance.

New features

Add set_random_seed function to make subword regularization reproducible
[Python] Support serialization of Token instances
[C++] Add Options structure to configure tokenization options (Flags can still be used for backward compatibility)

Fixes and improvements

Fix BPE vocabulary restriction when using joiner_new, spacer_annotate, or spacer_new (the previous implementation always assumed joiner_annotate was used)
[Python] Fix spacer argument name in Token constructor
[C++] Fix ambiguous subword encoder ownership by using a std::shared_ptr

v1.21.0 (2020-10-22)

New features

Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)
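
To make the file layout concrete, here is a minimal sketch of reading such a vocabulary in plain Python. It is not the library's parser: the function name `read_vocab` is hypothetical, and the sketch assumes one token per line, optionally followed by a tab and an integer frequency.

```python
from typing import Dict, Iterable, Optional


def read_vocab(lines: Iterable[str]) -> Dict[str, Optional[int]]:
    """Map each token to its frequency, or to None when no count is given."""
    vocab: Dict[str, Optional[int]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # Skip empty lines.
        # Split on the first tab only: token on the left, count on the right.
        token, sep, count = line.partition("\t")
        if not sep:
            vocab[token] = None
            continue
        try:
            vocab[token] = int(count)
        except ValueError:
            raise ValueError(
                f"invalid frequency for token {token!r}: {count!r}"
            ) from None
    return vocab
```

For example, `read_vocab(["▁the\t1204", "ing"])` maps the first token to its count and the second to `None`, since that line carries no frequency.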

Fixes and improvements

Fix BPE vocabulary restriction when words have a leading or trailing joiner
Raise an error when using a multi-character joiner and support_prior_joiners
[Python] Implement __hash__ method of pyonmttok.Token objects to be consistent with the __eq__ implementation
[Python] Declare pyonmttok.Tokenizer arguments (except mode) as keyword-only
[Python] Improve compatibility with Python 3.9

v1.20.0 (2020-09-24)

Changes

The following changes affect users compiling the project from source. They ensure users get the best performance and all features by default:
- ICU is now required to improve performance and Unicode support
- SentencePiece is now integrated as a Git submodule and linked statically to the project
- Boost is no longer required; the project now uses cxxopts, which is integrated as a Git submodule
- The project is compiled in Release mode by default
- Tests are no longer compiled by default (use -DBUILD_TESTS=ON to compile the tests)

New features

Accept any Unicode script aliases in the segment_alphabet option
Update SentencePiece to 0.1.92
[Python] Improve the capabilities of the Token class:
- Implement the __repr__ method
- Allow setting all attributes in the constructor
- Add a copy constructor
[Python] Add a copy constructor for the Tokenizer class

Fixes and improvements

[Python] Accept None value for segment_alphabet argument

v1.19.0 (2020-09-02)

New features

Add BPE dropout (Provilkov et al. 2019)
[Python] Introduce the "Token API": a set of methods that manipulate Token objects instead of serialized strings
[Python] Add unicode_ranges argument to the detokenize_with_ranges method to return ranges over Unicode characters instead of bytes

Fixes and improvements

Include "Half-width kana" in Katakana script detection

v1.18.5 (2020-07-07)

Fixes and improvements

Fix possible crash when applying a case-insensitive BPE model on Unicode characters

v1.18.4 (2020-05-22)

Fixes and improvements

Fix segmentation fault on cli/tokenize exit
Ignore empty tokens during detokenization
When writing to a file, avoid flushing the output stream on each line
Update cli/CMakeLists.txt to mark Boost.ProgramOptions as required

v1.18.3 (2020-03-09)

Fixes and improvements

Strip token annotations when calling SubwordLearner.ingest_token

v1.18.2 (2020-02-17)

Fixes and improvements

Speed and memory improvements for BPE learning

v1.18.1 (2020-01-16)

Fixes and improvements

[Python] Fix memory leak when deleting Tokenizer object

v1.18.0 (2020-01-06)

New features

Include is_placeholder function in the Python API
Add ingest_token method to learner objects to allow external tokenization

v1.17.2 (2019-12-06)

Fixes and improvements

Fix joiner annotation when SentencePiece returns isolated spacers
Apply preserve_segmented_tokens in "none" tokenization mode
Performance improvements when using case_feature or case_markup
Add missing --no_substitution flag on the command line client

v1.17.1 (2019-11-28)

Fixes and improvements

Fix missing case features for isolated joiners or spacers

v1.17.0 (2019-11-13)

New features

Flag soft_case_regions to minimize the number of uppercase regions when using case_markup

Fixes and improvements

Fix mismatch between subword learning and encoding when using case_feature
[C++] Fix missing default value for new argument of constructor SPMLearner

v1.16.1 (2019-10-21)

Fixes and improvements

Fix invalid SentencePiece training file when generated with SentencePieceLearner.ingest (newlines were missing)
Correctly ignore placeholders when using SentencePieceLearner without a tokenizer

v1.16.0 (2019-10-07)

New features

Support keeping the vocabulary generated by SentencePiece with the keep_vocab argument
[C++] Add intermediate method to annotate tokens before detokenization

Fixes and improvements

Improve detection of file read/write errors
[Python] Lower the risk of ABI incompatibilities with other pybind11 extensions

v1.15.7 (2019-09-20)

Fixes and improvements

Do not apply case modifiers on placeholder tokens

v1.15.6 (2019-09-16)

Fixes and improvements

Fix placeholder tokenization when followed by a combining mark

v1.15.5 (2019-09-16)

Fixes and improvements

[Python] Downgrade pybind11 to fix segmentation fault when importing after non-compliant Python wheels

v1.15.4 (2019-09-14)

Fixes and improvements

[Python] Fix possible runtime error on program exit when using SentencePieceLearner

v1.15.3 (2019-09-13)

Fixes and improvements

Fix possible memory issues when run in multiple threads with ICU

v1.15.2 (2019-09-11)

Fixes and improvements

[Python] Improve error checking in file-based functions

v1.15.1 (2019-09-05)

Fixes and improvements

Fix regression in space tokenization: characters inside placeholders were incorrectly normalized

v1.15.0 (2019-09-05)

New features

support_prior_joiners flag to support tokenizing a pre-tokenized input

Fixes and improvements

Fix case markup when joiners or spacers are individual tokens

v1.14.1 (2019-08-07)

Fixes and improvements

Improve error checking

v1.14.0 (2019-07-19)

New features

[C++] Method to detokenize from AnnotatedTokens

Fixes and improvements

[Python] Release the GIL in time-consuming functions (e.g. file tokenization, subword learning, etc.)
Performance improvements

v1.13.0 (2019-06-12)

New features

[Python] File-based tokenization and detokenization APIs
Support tokenizing files with multiple threads

Fixes and improvements

Respect "NoSubstitution" flag for combining marks applied on spaces

v1.12.1 (2019-05-27)

Fixes and improvements

Fix Python package

v1.12.0 (2019-05-27)

New features

Python API for subword learning (BPE and SentencePiece)
C++ tokenization method to get the intermediate token representation

Fixes and improvements

Replace Boost.Python by pybind11 for the Python wrapper
Fix verbose flag for SentencePiece training
Check and raise possible errors during SentencePiece training

v1.11.0 (2019-02-05)

New features

Support copy operators on the Python client
Support returning token locations in detokenized text

Fixes and improvements

Hide SentencePiece dependency in public headers

v1.10.6 (2019-01-15)

Fixes and improvements

Update SentencePiece to 0.1.8 in the Python package
Allow naming positional arguments in the Python API

v1.10.5 (2019-01-03)

Fixes and improvements

Stricter handling of combining marks (fixes #57 and #58)

v1.10.4 (2018-12-18)

Fixes and improvements

Harden detokenization against invalid combinations of case markup

v1.10.3 (2018-11-05)

Fixes and improvements

Fix case markup for one-letter words

v1.10.2 (2018-10-18)

Fixes and improvements

Fix compilation errors when SentencePiece is not installed
Fix DLL builds using Visual Studio
Handle rare cases where SentencePiece returns 0 pieces

v1.10.1 (2018-10-08)

Fixes and improvements

Fix regression for SentencePiece: spacer annotation was not automatically enabled in tokenization mode "none"

v1.10.0 (2018-10-05)

New features

CaseMarkup flag to inject case information as new tokens

Fixes and improvements

Do not break compilation for users with old SentencePiece versions

v1.9.0 (2018-09-25)

New features

Vocabulary restriction for SentencePiece encoding

Fixes and improvements

Improve Tokenizer constructor for subword configuration

v1.8.4 (2018-09-24)

Fixes and improvements

Expose base methods in Tokenizer class
Small performance improvements for standard use cases

v1.8.3 (2018-09-18)

Fixes and improvements

Fix count of Arabic characters in the map of detected alphabets

v1.8.2 (2018-09-10)

Fixes and improvements

Minor fix to CMakeLists.txt for SentencePiece compilation

v1.8.1 (2018-09-07)

Fixes and improvements

Support training SentencePiece as a subtokenizer

v1.8.0 (2018-09-07)

New features

Add learning interface for SentencePiece

v1.7.0 (2018-09-04)

New features

Add integrated subword learning with initial support for BPE

Fixes and improvements

Preserve placeholders as independent tokens for all modes

v1.6.2 (2018-08-29)

New features

Support SentencePiece sampling API

Fixes and improvements

Additional +30% speedup for BPE tokenization
Fix BPE not respecting PreserveSegmentedTokens (#30)

v1.6.1 (2018-07-31)

Fixes and improvements

Fix Python package

v1.6.0 (2018-07-30)

New features

PreserveSegmentedTokens flag to not attach joiners or spacers to tokens segmented by any Segment* flags

Fixes and improvements

Do not rebuild bpe_vocab if already loaded (e.g. when CacheModel is set)

v1.5.3 (2018-07-13)

Fixes and improvements

Fix PreservePlaceholders with JoinerAnnotate that possibly modified other tokens

v1.5.2 (2018-07-12)

Fixes and improvements

Fix support of BPE models v0.2 trained with learn_bpe.py

v1.5.1 (2018-07-12)

Fixes and improvements

Do not escape spaces in placeholders value if NoSubstitution is enabled

v1.5.0 (2018-07-03)

New features

Support apply_bpe.py 0.3 mode

Fixes and improvements

Up to 3x faster tokenization and detokenization

v1.4.0 (2018-06-13)

New features

New character level tokenization mode Char
Flag SpacerNew to make spacers independent tokens

Fixes and improvements

Replace spacer tokens by substitutes when found in the input text
Do not enable spacers by default when SentencePiece is used as a subtokenizer

v1.3.0 (2018-04-07)

New features

New tokenization mode None that simply forwards the input text
Support SentencePiece, as a tokenizer or sub-tokenizer
Flag PreservePlaceholders to not mark placeholders with joiners or spacers

Fixes and improvements

Revisit Python compilation to support wheels building

v1.2.0 (2018-03-28)

New features

Add API to retrieve discovered alphabet during tokenization
Flag to convert joiners to spacers

Fixes and improvements

Add install target for the Python bindings library

v1.1.1 (2018-01-23)

Fixes and improvements

Make Alphabet.h public

v1.1.0 (2018-01-22)

New features

Python bindings
Tokenization flag to disable special characters substitution

Fixes and improvements

Fix incorrect behavior when --segment_alphabet is not set by the client
Fix alphabet identification
Fix segmentation fault when tokenizing an empty string on spaces

v1.0.0 (2017-12-11)

Breaking changes

New Tokenizer constructor requiring bit flags

New features

Support BPE modes from learn_bpe.lua
Case-insensitive BPE models
Space tokenization mode
Alphabet segmentation
Do not tokenize blocks encapsulated by ｟ and ｠
segment_numbers flag to split numbers into digits
segment_case flag to split words on case changes
segment_alphabet_change flag to split on alphabet change
cache_bpe_model flag to cache BPE models for future instances

Fixes and improvements

Fix SpaceTokenizer crash with leading or trailing spaces
Fix incorrect tokenization around the tab character (#5)
Fix incorrect joiner between numeric and punctuation characters

v0.2.0 (2017-03-08)

New features

Add CMake install rule
Add API option to include separators
Add static library compilation support

Fixes and improvements

Rename library to libOpenNMTTokenizer
Make word features optional in tokenizer API
Make unicode headers private

v0.1.0 (2017-02-14)

Initial release.
