How to generate vocab.json and merges.txt for YTTM tokenizer? #66

nikhilno1 · 2020-03-08T16:01:38Z

I want to train a GPT-2 model with a new vocabulary. I am following the instructions given here: https://github.com/mgrankin/ru_transformers. The YTTM tokenizer outputs a yt.model file that contains the new vocab. However, the run_generation.py script requires vocab.json and merges.txt files. I can see the vocab with the command below:

yttm vocab --model yt.model

But I don't know how to convert it into the vocab.json and merges.txt format. Shouldn't this be a common problem?

ckoshka · 2021-07-26T23:58:00Z

This is also an issue for me.
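
For what it's worth, here is a minimal sketch of the vocab.json half only, under a few assumptions. It assumes the YouTokenToMe Python package is installed and that `BPE(model=...).vocab()` returns the subwords in id order. The helper names `write_vocab_json` and `export_from_model` are made up for illustration. It does not produce merges.txt: the subword list alone does not record the order in which pairs were merged. Also note that YTTM marks word boundaries with "▁" while the stock GPT-2 tokenizer is byte-level, so a file written this way may have the right shape (token to id) without being usable by that tokenizer as-is.

```python
import json
from pathlib import Path


def write_vocab_json(subwords, out_path):
    """Write a token -> id mapping, one id per list position.

    `subwords` is any list of strings where index i is the id of token i.
    """
    vocab = {}
    for idx, token in enumerate(subwords):
        if token in vocab:
            # A JSON object cannot hold the same key twice, so fail loudly.
            raise ValueError(f"duplicate subword {token!r} at id {idx}")
        vocab[token] = idx
    # ensure_ascii=False keeps symbols such as "▁" readable in the file.
    Path(out_path).write_text(
        json.dumps(vocab, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return vocab


def export_from_model(model_path, out_path):
    """Hypothetical wrapper: load a YTTM model and dump its subword list."""
    # Imported here so the helper above works without youtokentome installed.
    import youtokentome as yttm

    bpe = yttm.BPE(model=model_path)
    return write_vocab_json(bpe.vocab(), out_path)
```

Usage would be something like `export_from_model("yt.model", "vocab.json")`. Whether run_generation.py accepts the result is untested here.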
