Save tokenizer as part of model #51

jonthegeek · 2020-01-19T21:16:11Z

The tokenizer for a given model is deterministic (it depends only on the vocab file and whether the model is cased). Producing the tokenizer takes about 100x as long as loading a pre-processed tokenizer (about 4 s vs 40 ms for bert_base_uncased).

Save the tokenizer as part of the download process. If a model has a vocab but not a tokenizer, save a tokenizer once and then use it going forward (for backward compatibility with things that are already downloaded).
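A minimal sketch of the build-once-then-load idea described above, written in Python purely for illustration. It is not this package's code: the file names `vocab.txt` and `tokenizer.pkl`, the functions `build_tokenizer` and `load_tokenizer`, and the dict-based tokenizer are all hypothetical stand-ins.

```python
import pickle
from pathlib import Path


def build_tokenizer(vocab_path, cased):
    """Hypothetical stand-in for the slow step: turn a vocab file into a lookup."""
    with open(vocab_path, encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f]
    # The result depends only on the vocab file and the cased flag.
    return {"cased": cased, "vocab": {tok: idx for idx, tok in enumerate(tokens)}}


def load_tokenizer(model_dir, cased=False):
    """Load the saved tokenizer if present; otherwise build it once and save it."""
    model_dir = Path(model_dir)
    cache_path = model_dir / "tokenizer.pkl"  # assumed file name

    # Fast path: a tokenizer was already saved alongside the model.
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # Backward-compatible path: the model was downloaded with only a vocab,
    # so build the tokenizer now and save it for every later call.
    tokenizer = build_tokenizer(model_dir / "vocab.txt", cased)
    with open(cache_path, "wb") as f:
        pickle.dump(tokenizer, f)
    return tokenizer
```

With this shape, the download step can simply call `load_tokenizer` once after fetching the vocab, and models downloaded earlier pick up a saved tokenizer the first time they are used.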

jonthegeek · 2020-01-19T21:23:43Z

Note: I tested preprocessing the config JSON vs saving it as-is. Preprocessing saves only microseconds, so it probably isn't worth messing with. It wouldn't HURT, though, so I may apply the same fix to the config when I do the tokenizer.

jonthegeek self-assigned this Nov 2, 2020

jonthegeek mentioned this issue Nov 2, 2020

Rewrite and Speed Up Tokenizer #54

Open

jonthegeek removed their assignment Nov 7, 2023
