This repository has been archived by the owner on Apr 23, 2024. It is now read-only.

[WIP] fast wordpiece tokenization #105

Open · wants to merge 11 commits into base: master
3 changes: 2 additions & 1 deletion .gitignore
@@ -57,7 +57,8 @@ coverage.xml
*.txt
*.yttm
artifacts/
stress
bpe_stress
wordpiece_stress

# Translations
*.mo
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright (c) 2019 VK.com
Copyright (c) 2019-2023 VK.com

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
10 changes: 5 additions & 5 deletions MANIFEST.in
@@ -1,13 +1,13 @@
include youtokentome/cpp/utils.h
include youtokentome/cpp/bpe.h
include youtokentome/cpp/utf8.h
include youtokentome/cpp/wordpiece.h
include youtokentome/cpp/yttm.pyx
include youtokentome/cpp/third_party/flat_hash_map.h
include youtokentome/cpp/third_party/LICENSE
include youtokentome/cpp/third_party/flat_hash_map/flat_hash_map.h
include youtokentome/cpp/third_party/flat_hash_map/LICENSE
include youtokentome/cpp/third_party/thread_pool/thread_pool.h
include youtokentome/cpp/third_party/thread_pool/LICENSE
include LICENSE
include README.md
include requirements.txt
include yttm_cli.py



137 changes: 76 additions & 61 deletions README.md
@@ -6,20 +6,21 @@

# YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)].
Our implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE)
and [SentencePiece](https://github.com/google/sentencepiece). In some test cases, it is 60 times faster.
Check out our [benchmark](benchmark.md) results.
YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently contains the fastest implementations of:
- Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)], [benchmark results](benchmark_bpe.md);
- WordPiece [[Song et al.](https://arxiv.org/abs/2012.15524)], [benchmark results](benchmark_wordpiece.md).

Key advantages:

* Multithreading for training and tokenization
* The algorithm has `O(N)` complexity, where `N` is the length of training data
* Highly efficient implementation in C++
* Python wrapper and command-line interface

Extra features:
* BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))
## BPE implementation

Algorithm properties:
* Time complexity is `O(N)`, where `N` is the length of training data
* Supports BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))

As in the algorithm from the original paper, ours does not consider tokens
that cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.
@@ -28,15 +29,21 @@
For example, the phrase `Blazingly fast tokenization!` can be tokenized into

`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`
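
The meta symbol is what makes this reversible. Below is a minimal sketch of the idea as a plain string manipulation; it is for illustration only and is not how the library implements or exposes decoding:

```python
# Joining the subwords and mapping "▁" back to spaces recovers the original text.
subwords = ['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']
text = "".join(subwords).replace("▁", " ").strip()
print(text)  # Blazingly fast tokenization!
```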

## WordPiece implementation

Algorithm properties:
* Currently supports tokenization only; training is not yet supported
* Time complexity is `O(Nm^2)`, where `N` is the length of the tokenized data and `m` is the maximum word length in the vocabulary (a minimal sketch of the underlying greedy matching is shown below)
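
Below is a minimal illustrative sketch of greedy longest-match-first (MaxMatch) WordPiece tokenization, the inference scheme the complexity bound above refers to. The vocabulary, the `##` continuation prefix and the `[UNK]` token follow the common BERT convention and are assumptions for the example, not part of this package's API:

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate subword first and shrink until one is found.
        # Scanning all end positions for every start position is the source of
        # the quadratic-in-word-length term in the O(Nm^2) bound.
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no subword matches: the whole word becomes [UNK]
        tokens.append(match)
        start = end
    return tokens

vocab = {"token", "##ization", "fast", "##er"}
print(wordpiece_encode("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_encode("faster", vocab))        # ['fast', '##er']
```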

## Installation

```bash
pip install youtokentome
```
## Python interface

## Python BPE interface

### Example
Let's start with a self-contained example.

```python
import random
@@ -67,11 +74,28 @@
bpe = yttm.BPE(model=model_path)
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))
```

### Methods
Class `youtokentome.BPE` has the following methods:

#### constructor

```python
youtokentome.BPE(model, n_threads=-1)
```

Class constructor. Loads the trained model.

* `model`: string, path to the trained model
* `n_threads`: int, number of parallel threads to use.
If set to -1, all available threads will be used.
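
For example (the model path is a placeholder):

```python
import youtokentome as yttm

# Load a previously trained BPE model and use 4 worker threads for encoding.
bpe = yttm.BPE(model="bpe.model", n_threads=4)
```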

 
### Training model

#### train

```python
youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
```
Trains a BPE model and saves it to a file.

@@ -92,22 +116,6 @@
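
For illustration, a minimal training call might look like this (the file paths and vocabulary size are placeholders):

```python
import youtokentome as yttm

# Learn a 5000-token BPE vocabulary from a plain-text corpus and save it to disk.
yttm.BPE.train(data="train.txt", model="bpe.model", vocab_size=5000, coverage=1.0)
```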

 

### Model loading

```python
youtokentome.BPE(model, n_threads=-1)
```

Class constructor. Loads the trained model.

* `model`: string, path to the trained model
* `n_threads`: int, number of parallel threads used to run.
If equal to -1, then the maximum number of threads available will be used.

 

### Methods
Class `youtokentome.BPE` has the following methods:
#### encode
```python
encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)
```
@@ -185,16 +193,23 @@
Convert each id to subword and concatenate with space symbol.


**Returns:** List of strings.
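
A short usage sketch tying `encode` and `decode` together; it assumes the `bpe` object loaded in the example above, and the printed ids are purely illustrative:

```python
ids = bpe.encode(["Blazingly fast tokenization!"], output_type=yttm.OutputType.ID)
print(ids)              # e.g. [[1327, 84, 819, ...]] -- actual ids depend on the trained model
print(bpe.decode(ids))  # ['Blazingly fast tokenization!']

# BPE-dropout: with dropout_prob > 0 the segmentation is sampled, so repeated
# calls may return different subword sequences for the same sentence.
print(bpe.encode(["Blazingly fast tokenization!"],
                 output_type=yttm.OutputType.SUBWORD, dropout_prob=0.1))
```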

## Command line interface

### Example
## Python WordPiece interface

```bash
$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
```
### Example

TODO

### Methods
Class `youtokentome.WordPiece` has the following methods:

#### constructor

#### encode

#### decode

## Command line interface

### Supported commands

@@ -209,16 +224,16 @@
```
Options:
--help Show this message and exit.

Commands:
bpe Train BPE model.
decode Decode ids to text.
encode Encode text to ids or subwords.
vocab Print list of learned subwords.
bpe-train Train BPE model.
bpe-decode Decode ids to text.
bpe-encode Encode text to ids or subwords.
bpe-vocab Print list of learned subwords.
```

Command `bpe` allows you to train a Byte Pair Encoding model based on a text file.

```
$ yttm bpe --help
$ yttm bpe-train --help

Usage: yttm bpe [OPTIONS]

@@ -237,18 +252,31 @@
Options:
--help Show this message and exit.
```

Convert ids back to text. Use `stdin` for input and `stdout` for output.

```
$ yttm bpe-decode --help

Usage: yttm decode [OPTIONS]

Decode ids to text.

Options:
--model PATH Path to file with learned model. [required]
--ignore_ids List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
--help Show this message and exit.
```

Apply BPE encoding to a corpus of sentences. Use `stdin` for input and `stdout` for output.

By default, encoding works in parallel using `n_threads` threads. The number of threads is limited to
8 (see [benchmark](benchmark.md#number-of-threads)).
8 (see [benchmark](benchmark_bpe.md#number-of-threads)).

With the `--stream` option, `--n_threads` will be ignored and all sentences will be processed one by one.
Each sentence will be tokenized and written to `stdout` before the next sentence is read.


```
$ yttm encode --help
$ yttm bpe-encode --help

Usage: yttm encode [OPTIONS]

```
@@ -269,7 +297,7 @@
Print vocabulary. This can be useful for understanding the model.

```
$ yttm vocab --help
$ yttm bpe-vocab --help

Usage: yttm vocab [OPTIONS]

@@ -281,24 +309,11 @@
Options:
--help Show this message and exit.
```

Convert ids back to text. Use `stdin` for input and `stdout` for output.

```
$ yttm decode --help

Usage: yttm decode [OPTIONS]
```
### Examples

Decode ids to text.
TODO: wordpiece

Options:
--model PATH Path to file with learned model. [required]
--ignore_ids List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
--help Show this message and exit.
```bash
$ yttm bpe-train --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm bpe-encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
```







12 changes: 8 additions & 4 deletions benchmark.md → benchmark_bpe.md
@@ -1,7 +1,11 @@
## Speed tests
## BPE Speed tests

`YouTokenToMe` will be compared with [Hugging Face](https://github.com/huggingface/tokenizers), [SentencePiece](https://github.com/google/sentencepiece/)
and [fastBPE](https://github.com/glample/fastBPE). These three algorithms are considered to be fast.
`YouTokenToMe` will be compared with:
* [Hugging Face](https://github.com/huggingface/tokenizers)
* [SentencePiece](https://github.com/google/sentencepiece/)
* [fastBPE](https://github.com/glample/fastBPE)

These algorithms are considered to be fast.

Data from [Wikipedia](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/) was used to evaluate algorithm speed. In a similar way to `enwik8` and `enwik9`, the experiments were run on the first `10^8` and `10^9` bytes of the datasets for English, Russian, Chinese and Japanese.

@@ -11,7 +15,7 @@
In this benchmark, `YouTokenToMe` used 4 threads for training and tokenization.
doesn't support multithreading for **BPE** at all. `fastBPE` doesn't support multithreading for training.
For tokenization, it also used 4 threads.

Source code for benchmark can be found [here](tests/speed_test/speed_test.py).
Source code for benchmark can be found [here](tests/speed_test/bpe.py).
The results of the experiments are below. The time is measured in seconds.

All experiments were run on the following machine:
38 changes: 38 additions & 0 deletions benchmark_wordpiece.md
@@ -0,0 +1,38 @@
## WordPiece Speed tests

`YouTokenToMe` will be compared with:
* [Hugging Face](https://github.com/huggingface/tokenizers)
* [Keras](https://github.com/keras-team/keras-nlp)
* [Tensorflow](https://github.com/tensorflow/text)
* [Torch](https://github.com/pytorch/text)

These algorithms are considered to be fast.

Data from [Wikipedia](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/) was used to evaluate algorithm speed. In a similar way to `enwik8` and `enwik9`, the experiments were run on the first `10^8` and `10^9` bytes of the datasets for English, Russian, Chinese and Japanese.

Used vocabulary: [bert-base-cased](https://huggingface.co/bert-base-cased).

In this benchmark, `YouTokenToMe` used 4 threads for tokenization (WordPiece training is not yet supported).

Source code for benchmark can be found [here](tests/speed_test/wordpiece.py).
The results of the experiments are below. The time is measured in seconds.

All experiments were run on the following machine: TODO

### Tokenization 100MB
TODO: TABLE

### Tokenization 1GB
TODO: TABLE

`YouTokenToMe` performed really well in this benchmark. This is especially noticeable for languages with large alphabets.

## Number of threads

The table below shows the dependence of performance on the number of threads for `YouTokenToMe`.

### Tokenization 1GB
TODO: TABLE


TODO: CONCLUSION ON THREADS
15 changes: 10 additions & 5 deletions requirements.txt
@@ -1,5 +1,10 @@
setuptools>=32.0.0
Click>=7.0
pytest==4.3.1
tabulate==0.8.5
Cython==0.29.14
atomicwrites==1.4.1
attrs==22.2.0
click==8.1.3
Cython==0.29.34
more-itertools==9.1.0
pluggy==1.0.0
py==1.11.0
pytest==7.2.1
six==1.16.0
tabulate==0.9.0
3 changes: 2 additions & 1 deletion setup.py
@@ -12,6 +12,7 @@
"youtokentome/cpp/bpe.cpp",
"youtokentome/cpp/utils.cpp",
"youtokentome/cpp/utf8.cpp",
"youtokentome/cpp/wordpiece.cpp"
],
extra_compile_args=["-std=c++11", "-pthread", "-O3"],
language="c++",
@@ -35,7 +36,7 @@
python_requires=">=3.5.0",
install_requires=["Click>=7.0"],
entry_points={"console_scripts": ["yttm = youtokentome.yttm_cli:main"]},
author="Ivan Belonogov",
author="VKCOM",
license="MIT",
classifiers=[
"License :: OSI Approved :: MIT License",
18 changes: 13 additions & 5 deletions tests/speed_test/Dockerfile
@@ -8,8 +8,11 @@
RUN apt-get update && apt-get install -y --no-install-recommends \
cmake \
make \
g++ \
wget && \
pip3 install tabulate youtokentome tokenizers
wget \
bzip2 \
perl && \
pip3 install -r requirements.txt && \
pip3 install youtokentome

WORKDIR /repos

@@ -26,8 +29,13 @@
RUN g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

WORKDIR /workspace

COPY ./speed_test.py ./speed_test.py
RUN cp /repos/fastBPE/fast /workspace/fastBPE
RUN wget -O bert-base-cased.txt https://huggingface.co/bert-base-cased/resolve/main/vocab.txt

# CMD ["python", "speed_test.py", "--langs", "en", "ru", "zh", "ja", "--corpus_size", "100", "--vocab_size", "30000"]
CMD ["python", "speed_test.py", "--langs", "ru", "--corpus_size", "10", "--vocab_size", "30000"]
COPY ./bpe.py ./bpe.py
COPY ./wordpiece.py ./wordpiece.py

# use comma to separate langs, e.g.: "--langs", "en", "ru", "zh", "ja"
CMD ["python", "bpe.py", "--langs", "ru", "--corpus_size", "10", "--vocab_size", "30000"]

CMD ["python", "wordpiece.py", "--langs", "ru", "--corpus_size", "10", "--vocab", "bert-base-cased.txt"]