No word tokenizer under the hood? #87

slowwavesleep · 2021-05-17T13:58:53Z

Hi,

In the original BPE paper, as well as in the BPE-dropout paper, the authors apply word-based tokenization (namely the Moses tokenizer, among others) before the main algorithm. However, this project's README is somewhat vague about this detail. Do I understand correctly that the only word-based tokenization implemented is splitting on spaces, and nothing more?

What confuses me is this quote: "ours does not consider tokens that cross word boundaries". For some languages, splitting on spaces alone cannot keep tokens from crossing word boundaries. So my question is as follows: is there a more sophisticated word-based tokenizer under the hood after all?
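
To illustrate the distinction I'm asking about, here is a minimal sketch in plain Python. It is not this project's code: both function names are made up for illustration, and the regex is only a crude stand-in for a real word-level tokenizer such as Moses.

```python
import re


def split_on_spaces(text):
    # Whitespace-only pre-tokenization: punctuation stays attached to
    # the neighbouring word, so "Hello," is treated as a single unit.
    return text.split()


def split_words_and_punct(text):
    # Rough approximation of word-level tokenization (an assumption,
    # not Moses itself): runs of word characters become tokens, and
    # every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Hello, world! It's fine."
print(split_on_spaces(sentence))
# ['Hello,', 'world!', "It's", 'fine.']
print(split_words_and_punct(sentence))
# ['Hello', ',', 'world', '!', 'It', "'", 's', 'fine', '.']
```

With whitespace splitting alone, a BPE merge could produce a token like `d!` that spans a word and its punctuation, and for text written without spaces the whole sentence would be one "word".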
