This repository has been archived by the owner on Apr 23, 2024. It is now read-only.

[WIP] fast wordpiece tokenization #105

Open · wants to merge 11 commits into base: master
3 changes: 2 additions & 1 deletion .gitignore
@@ -57,7 +57,8 @@ coverage.xml
*.txt
*.yttm
artifacts/
stress
bpe_stress
wordpiece_stress

# Translations
*.mo
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright (c) 2019 VK.com
Copyright (c) 2019-2023 VK.com

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
10 changes: 5 additions & 5 deletions MANIFEST.in
@@ -1,13 +1,13 @@
include youtokentome/cpp/utils.h
include youtokentome/cpp/bpe.h
include youtokentome/cpp/utf8.h
include youtokentome/cpp/wordpiece.h
include youtokentome/cpp/yttm.pyx
include youtokentome/cpp/third_party/flat_hash_map.h
include youtokentome/cpp/third_party/LICENSE
include youtokentome/cpp/third_party/flat_hash_map/flat_hash_map.h
include youtokentome/cpp/third_party/flat_hash_map/LICENSE
include youtokentome/cpp/third_party/thread_pool/thread_pool.h
include youtokentome/cpp/third_party/thread_pool/LICENSE
include LICENSE
include README.md
include requirements.txt
include yttm_cli.py



137 changes: 76 additions & 61 deletions README.md
@@ -6,20 +6,21 @@

# YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)].
Our implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE)
and [SentencePiece](https://github.com/google/sentencepiece). In some test cases, it is 60 times faster.
Check out our [benchmark](benchmark.md) results.
YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently contains the fastest implementations of:
- Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)], [benchmark results](benchmark_bpe.md);
- WordPiece [[Song et al.](https://arxiv.org/abs/2012.15524)], [benchmark results](benchmark_wordpiece.md).

Key advantages:

* Multithreading for training and tokenization
* The algorithm has `O(N)` complexity, where `N` is the length of training data
* Highly efficient implementation in C++
* Python wrapper and command-line interface

Extra features:
* BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))
## BPE implementation

Algorithm properties:
* Time complexity is `O(N)`, where `N` is the length of training data
* Supports BPE-dropout (as described in [Provilkov et al, 2019](https://arxiv.org/abs/1910.13267))

As in the algorithm from the original paper, ours does not consider tokens
that cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.
@@ -28,15 +29,21 @@
For example, the phrase `Blazingly fast tokenization!` can be tokenized into

`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`
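
The meta symbol is what makes this reversible. Below is a minimal sketch of the idea as a plain string manipulation; it is for illustration only and is not how the library implements or exposes decoding:

```python
# Joining the subwords and mapping "▁" back to spaces recovers the original text.
subwords = ['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']
text = "".join(subwords).replace("▁", " ").strip()
print(text)  # Blazingly fast tokenization!
```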

## WordPiece implementation

Algorithm properties:
* Currently supports tokenization only; training is not yet supported
* Time complexity is `O(Nm^2)`, where `N` is the length of the tokenized data and `m` is the maximum word length in the vocabulary (a minimal sketch of the underlying greedy matching is shown below)
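
Below is a minimal illustrative sketch of greedy longest-match-first (MaxMatch) WordPiece tokenization, the inference scheme the complexity bound above refers to. The vocabulary, the `##` continuation prefix and the `[UNK]` token follow the common BERT convention and are assumptions for the example, not part of this package's API:

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate subword first and shrink until one is found.
        # Scanning all end positions for every start position is the source of
        # the quadratic-in-word-length term in the O(Nm^2) bound.
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no subword matches: the whole word becomes [UNK]
        tokens.append(match)
        start = end
    return tokens

vocab = {"token", "##ization", "fast", "##er"}
print(wordpiece_encode("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_encode("faster", vocab))        # ['fast', '##er']
```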

## Installation

```bash
pip install youtokentome
```
## Python interface

## Python BPE interface

### Example
Let's start with a self-contained example.

```python
import random
@@ -67,11 +74,28 @@
bpe = yttm.BPE(model=model_path)
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))
```

### Methods
Class `youtokentome.BPE` has the following methods:

#### constructor

```python
youtokentome.BPE(model, n_threads=-1)
```

Class constructor. Loads the trained model.

* `model`: string, path to the trained model
* `n_threads`: int, number of parallel threads to use.
If set to -1, all available threads will be used.
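
For example (the model path is a placeholder):

```python
import youtokentome as yttm

# Load a previously trained BPE model and use 4 worker threads for encoding.
bpe = yttm.BPE(model="bpe.model", n_threads=4)
```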

 
### Training model

#### train

```python
youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
```
Trains a BPE model and saves it to a file.

@@ -92,22 +116,6 @@
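
For illustration, a minimal training call might look like this (the file paths and vocabulary size are placeholders):

```python
import youtokentome as yttm

# Learn a 5000-token BPE vocabulary from a plain-text corpus and save it to disk.
yttm.BPE.train(data="train.txt", model="bpe.model", vocab_size=5000, coverage=1.0)
```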

 

### Model loading

```python
youtokentome.BPE(model, n_threads=-1)
```

Class constructor. Loads the trained model.

* `model`: string, path to the trained model
* `n_threads`: int, number of parallel threads used to run.
If equal to -1, then the maximum number of threads available will be used.

 

### Methods
Class `youtokentome.BPE` has the following methods:
#### encode
```python
encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)
```
@@ -185,16 +193,23 @@
Convert each id to subword and concatenate with space symbol.


**Returns:** List of strings.
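
A short usage sketch tying `encode` and `decode` together; it assumes the `bpe` object loaded in the example above, and the printed ids are purely illustrative:

```python
ids = bpe.encode(["Blazingly fast tokenization!"], output_type=yttm.OutputType.ID)
print(ids)              # e.g. [[1327, 84, 819, ...]] -- actual ids depend on the trained model
print(bpe.decode(ids))  # ['Blazingly fast tokenization!']

# BPE-dropout: with dropout_prob > 0 the segmentation is sampled, so repeated
# calls may return different subword sequences for the same sentence.
print(bpe.encode(["Blazingly fast tokenization!"],
                 output_type=yttm.OutputType.SUBWORD, dropout_prob=0.1))
```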

## Command line interface

### Example
## Python WordPiece interface

```bash
$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
```
### Example

TODO

### Methods
Class `youtokentome.WordPiece` has the following methods:

#### constructor

#### encode

#### decode

## Command line interface

### Supported commands

@@ -209,16 +224,16 @@
```
Options:
--help Show this message and exit.

Commands:
bpe Train BPE model.
decode Decode ids to text.
encode Encode text to ids or subwords.
vocab Print list of learned subwords.
bpe-train Train BPE model.
bpe-decode Decode ids to text.
bpe-encode Encode text to ids or subwords.
bpe-vocab Print list of learned subwords.
```

Command `bpe` allows you to train a Byte Pair Encoding model based on a text file.

```
$ yttm bpe --help
$ yttm bpe-train --help

Usage: yttm bpe [OPTIONS]

@@ -237,18 +252,31 @@
Options:
--help Show this message and exit.
```

Convert ids back to text. Use `stdin` for input and `stdout` for output.

```
$ yttm bpe-decode --help

Usage: yttm decode [OPTIONS]

Decode ids to text.

Options:
--model PATH Path to file with learned model. [required]
--ignore_ids List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
--help Show this message and exit.
```

Apply BPE encoding to a corpus of sentences. Use `stdin` for input and `stdout` for output.

By default, encoding works in parallel using `n_threads` threads. The number of threads is limited to
8 (see [benchmark](benchmark.md#number-of-threads)).
8 (see [benchmark](benchmark_bpe.md#number-of-threads)).

With the `--stream` option, `--n_threads` will be ignored and all sentences will be processed one by one.
Each sentence will be tokenized and written to `stdout` before the next sentence is read.


```
$ yttm encode --help
$ yttm bpe-encode --help

Usage: yttm encode [OPTIONS]

```
@@ -269,7 +297,7 @@
Print vocabulary. This can be useful for understanding the model.

```
$ yttm vocab --help
$ yttm bpe-vocab --help

Usage: yttm vocab [OPTIONS]

@@ -281,24 +309,11 @@
Options:
--help Show this message and exit.
```

Convert ids back to text. Use `stdin` for input and `stdout` for output.

```
$ yttm decode --help

Usage: yttm decode [OPTIONS]
```
### Examples

Decode ids to text.
TODO: wordpiece

Options:
--model PATH Path to file with learned model. [required]
--ignore_ids List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
--help Show this message and exit.
```bash
$ yttm bpe-train --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm bpe-encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
```







12 changes: 8 additions & 4 deletions benchmark.md → benchmark_bpe.md
@@ -1,7 +1,11 @@
## Speed tests
## BPE Speed tests

`YouTokenToMe` will be compared with [Hugging Face](https://github.com/huggingface/tokenizers), [SentencePiece](https://github.com/google/sentencepiece/)
and [fastBPE](https://github.com/glample/fastBPE). These three algorithms are considered to be fast.
`YouTokenToMe` will be compared with:
* [Hugging Face](https://github.com/huggingface/tokenizers)
* [SentencePiece](https://github.com/google/sentencepiece/)
* [fastBPE](https://github.com/glample/fastBPE)

These algorithms are considered to be fast.

Data from [Wikipedia](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/) was used to evaluate algorithm speed. In a similar way to `enwik8` and `enwik9`, the experiments were run on the first `10^8` and `10^9` bytes of the datasets for English, Russian, Chinese and Japanese.

@@ -11,7 +15,7 @@
In this benchmark, `YouTokenToMe` used 4 threads for training and tokenization.
doesn't support multithreading for **BPE** at all. `fastBPE` doesn't support multithreading for training.
For tokenization, it also used 4 threads.

Source code for benchmark can be found [here](tests/speed_test/speed_test.py).
Source code for benchmark can be found [here](tests/speed_test/bpe.py).
The results of the experiments are below. The time is measured in seconds.

All experiments were run on the following machine:
38 changes: 38 additions & 0 deletions benchmark_wordpiece.md
@@ -0,0 +1,38 @@
## WordPiece Speed tests

`YouTokenToMe` will be compared with:
* [Hugging Face](https://github.com/huggingface/tokenizers)
* [Keras](https://github.com/keras-team/keras-nlp)
* [Tensorflow](https://github.com/tensorflow/text)
* [Torch](https://github.com/pytorch/text)

These algorithms are considered to be fast.

Data from [Wikipedia](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/) was used to evaluate algorithm speed. In a similar way to `enwik8` and `enwik9`, the experiments were run on the first `10^8` and `10^9` bytes of the datasets for English, Russian, Chinese and Japanese.

Used vocabulary: [bert-base-cased](https://huggingface.co/bert-base-cased).

In this benchmark, `YouTokenToMe` used 4 threads for tokenization (WordPiece training is not yet supported).

Source code for benchmark can be found [here](tests/speed_test/wordpiece.py).
The results of the experiments are below. The time is measured in seconds.

All experiments were run on the following machine: TODO

### Tokenization 100MB
TODO: TABLE

### Tokenization 1GB
TODO: TABLE

`YouTokenToMe` performed really well in this benchmark. This is especially noticeable for languages with large alphabets.

## Number of threads

The table below shows the dependence of performance on the number of threads for `YouTokenToMe`.

### Tokenization 1GB
TODO: TABLE


TODO: CONCLUSION ON THREADS
15 changes: 10 additions & 5 deletions requirements.txt
@@ -1,5 +1,10 @@
setuptools>=32.0.0
Click>=7.0
pytest==4.3.1
tabulate==0.8.5
Cython==0.29.14
atomicwrites==1.4.1
attrs==22.2.0
click==8.1.3
Cython==0.29.34
more-itertools==9.1.0
pluggy==1.0.0
py==1.11.0
pytest==7.2.1
six==1.16.0
tabulate==0.9.0
3 changes: 2 additions & 1 deletion setup.py
@@ -12,6 +12,7 @@
"youtokentome/cpp/bpe.cpp",
"youtokentome/cpp/utils.cpp",
"youtokentome/cpp/utf8.cpp",
"youtokentome/cpp/wordpiece.cpp"
],
extra_compile_args=["-std=c++11", "-pthread", "-O3"],
language="c++",
@@ -35,7 +36,7 @@
python_requires=">=3.5.0",
install_requires=["Click>=7.0"],
entry_points={"console_scripts": ["yttm = youtokentome.yttm_cli:main"]},
author="Ivan Belonogov",
author="VKCOM",
license="MIT",
classifiers=[
"License :: OSI Approved :: MIT License",
18 changes: 13 additions & 5 deletions tests/speed_test/Dockerfile
@@ -8,8 +8,11 @@
RUN apt-get update && apt-get install -y --no-install-recommends \
cmake \
make \
g++ \
wget && \
pip3 install tabulate youtokentome tokenizers
wget \
bzip2 \
perl && \
pip3 install -r requirements.txt && \
pip3 install youtokentome

WORKDIR /repos

@@ -26,8 +29,13 @@
RUN g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

WORKDIR /workspace

COPY ./speed_test.py ./speed_test.py
RUN cp /repos/fastBPE/fast /workspace/fastBPE
RUN wget -O bert-base-cased.txt https://huggingface.co/bert-base-cased/resolve/main/vocab.txt

# CMD ["python", "speed_test.py", "--langs", "en", "ru", "zh", "ja", "--corpus_size", "100", "--vocab_size", "30000"]
CMD ["python", "speed_test.py", "--langs", "ru", "--corpus_size", "10", "--vocab_size", "30000"]
COPY ./bpe.py ./bpe.py
COPY ./wordpiece.py ./wordpiece.py

# use comma to separate langs, e.g.: "--langs", "en", "ru", "zh", "ja"
CMD ["python", "bpe.py", "--langs", "ru", "--corpus_size", "10", "--vocab_size", "30000"]

CMD ["python", "wordpiece.py", "--langs", "ru", "--corpus_size", "10", "--vocab", "bert-base-cased.txt"]