This repository has been archived by the owner on Apr 23, 2024. It is now read-only.

No word tokenizer under the hood? #87

Open
slowwavesleep opened this issue May 17, 2021 · 0 comments

Comments

@slowwavesleep

Hi,

In the original BPE paper, as well as in the BPE-dropout paper, the authors apply word-level tokenization (namely the Moses tokenizer, among others) before running the main algorithm. However, this project's README is somewhat vague on this point. Do I understand correctly that the only word-level tokenization implemented is splitting on spaces, and nothing more?

What confuses me is this quote: "ours does not consider tokens that cross word boundaries". For some languages, word boundaries cannot be determined from spaces alone, so tokens that cross word boundaries cannot be avoided with space-based splitting. So my question is as follows: is there a more sophisticated word-level tokenizer under the hood after all?
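To make the distinction concrete, here is a minimal sketch of the two kinds of pre-tokenization I have in mind. It is not taken from this project's code: both function names are hypothetical, and the regex is only a rough stand-in for a Moses-style tokenizer, not the Moses rules themselves.

```python
import re


def whitespace_pretokenize(text):
    # Split on runs of whitespace only; punctuation stays attached to words.
    return text.split()


def punctuation_aware_pretokenize(text):
    # Hypothetical stand-in for a Moses-style tokenizer:
    # runs of word characters and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Hello, world!"
print(whitespace_pretokenize(sentence))         # ['Hello,', 'world!']
print(punctuation_aware_pretokenize(sentence))  # ['Hello', ',', 'world', '!']

# A sentence written without spaces stays one "word" under whitespace splitting,
# so BPE merges inside it can span several real words.
print(whitespace_pretokenize("今天天气很好"))      # ['今天天气很好']
```

If the project only does the first kind of split, then "word boundaries" in the quote would mean whitespace boundaries, which is what I would like to confirm.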
