Rewrite and Speed Up Tokenizer #54

Open

jonthegeek opened this issue Nov 2, 2020 · 1 comment

@jonthegeek (Collaborator):
As an RBERT user, I'd like the tokenizer to be as fast as it can be, so that I don't have to wait on this step any longer than absolutely necessary.

First thing to check: does keras::text_tokenizer() (and friends) do what we need? If so, we should be able to call save_text_tokenizer() when the model is downloaded for #51.
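For reference, a minimal sketch of the workflow being proposed, assuming the keras R package with a working Python backend; the corpus, vocabulary size, and file path are placeholders, not anything RBERT ships:

```r
library(keras)

corpus <- c("the quick brown fox", "jumps over the lazy dog")

# Build and fit a word-level tokenizer; 30522 is BERT's vocab size (assumed here).
tokenizer <- text_tokenizer(num_words = 30522)
fit_text_tokenizer(tokenizer, corpus)

# Encode text as integer id sequences.
texts_to_sequences(tokenizer, corpus)

# Persist alongside the downloaded model (per #51) and reload later.
save_text_tokenizer(tokenizer, "rbert-tokenizer")
tokenizer <- load_text_tokenizer("rbert-tokenizer")
```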

jonthegeek self-assigned this Nov 2, 2020

@jonthegeek (Collaborator, Author) commented Nov 2, 2020:

Oh, duh, no: keras::text_tokenizer doesn't easily handle WordPiece tokenization.

Check out wordpiece_encode() in https://github.com/bnosac/sentencepiece, though, to see whether it looks efficient.
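If it helps, a rough sketch of what that check might look like; the toy vocabulary and corpus are illustrative stand-ins, and the wordpiece_encode() argument names are assumptions based on that package's documentation:

```r
library(sentencepiece)

vocab  <- c("[UNK]", "un", "##aff", "##able", "the", "quick", "brown", "fox")
corpus <- rep("the quick brown fox is unaffable", 1000L)

# Subword-tokenize each string against the vocabulary; words the vocab
# doesn't cover should fall back to the unknown token.
tokens <- wordpiece_encode(corpus, vocabulary = vocab)
head(tokens, 1)

# Crude timing for the "does it look efficient?" question.
system.time(wordpiece_encode(corpus, vocabulary = vocab))
```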

jonthegeek removed their assignment Nov 7, 2023