Rewrite and Speed Up Tokenizer #54

Open

jonthegeek opened this issue Nov 2, 2020 · 1 comment

@jonthegeek (Collaborator):
As an RBERT user, I'd like the tokenizer to be as fast as it can be, so that I don't have to wait on this step any longer than absolutely necessary.

First thing to check: does keras::text_tokenizer() (and friends) do what we need? If so, we should be able to call save_text_tokenizer() when the model is downloaded for #51.
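For reference, a minimal sketch of the workflow being proposed, assuming the keras R package with a working Python backend; the corpus, vocabulary size, and file path are placeholders, not anything RBERT ships:

```r
library(keras)

corpus <- c("the quick brown fox", "jumps over the lazy dog")

# Build and fit a word-level tokenizer; 30522 is BERT's vocab size (assumed here).
tokenizer <- text_tokenizer(num_words = 30522)
fit_text_tokenizer(tokenizer, corpus)

# Encode text as integer id sequences.
texts_to_sequences(tokenizer, corpus)

# Persist alongside the downloaded model (per #51) and reload later.
save_text_tokenizer(tokenizer, "rbert-tokenizer")
tokenizer <- load_text_tokenizer("rbert-tokenizer")
```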

jonthegeek self-assigned this Nov 2, 2020

@jonthegeek (Collaborator, Author) commented Nov 2, 2020:

Oh, duh, no: keras::text_tokenizer doesn't easily handle WordPiece tokenization.

Check out wordpiece_encode() in https://github.com/bnosac/sentencepiece, though, to see whether it looks efficient.
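If it helps, a rough sketch of what that check might look like; the toy vocabulary and corpus are illustrative stand-ins, and the wordpiece_encode() argument names are assumptions based on that package's documentation:

```r
library(sentencepiece)

vocab  <- c("[UNK]", "un", "##aff", "##able", "the", "quick", "brown", "fox")
corpus <- rep("the quick brown fox is unaffable", 1000L)

# Subword-tokenize each string against the vocabulary; words the vocab
# doesn't cover should fall back to the unknown token.
tokens <- wordpiece_encode(corpus, vocabulary = vocab)
head(tokens, 1)

# Crude timing for the "does it look efficient?" question.
system.time(wordpiece_encode(corpus, vocabulary = vocab))
```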

jonthegeek removed their assignment Nov 7, 2023