This repository has been archived by the owner on Apr 23, 2024. It is now read-only.

How to generate vocab.json and merges.txt for YTTM tokenizer? #66

Open
nikhilno1 opened this issue Mar 8, 2020 · 1 comment


@nikhilno1

I want to train a GPT-2 model with a new vocabulary. I am following the instructions given here: https://github.com/mgrankin/ru_transformers. The YTTM tokenizer outputs a yt.model file that contains the new vocab. However, the run_generation.py script requires vocab.json and merges.txt files. I can see the vocab with the command below:

yttm vocab --model yt.model

But I don't know how to convert it into the vocab.json and merges.txt format. Shouldn't this be a common problem?
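For the vocab.json half, a rough sketch of what such a conversion might look like is below. It assumes the `youtokentome` Python package is installed and uses its `BPE(model=...)` loader and `vocab()` method, which returns the subwords in id order. The helper names (`build_vocab`, `write_vocab_json`, `load_yttm_subwords`) are made up for illustration. It does not produce merges.txt, since the merge rules would have to come from somewhere else, and I have not checked whether run_generation.py accepts the resulting file, because GPT-2's own tokenizer is byte-level while YTTM marks word boundaries with "▁".

```python
import json


def build_vocab(subwords):
    """Map each subword to its position in the list (token -> id)."""
    return {token: idx for idx, token in enumerate(subwords)}


def write_vocab_json(subwords, path="vocab.json"):
    """Write the token -> id mapping as UTF-8 JSON and return it."""
    vocab = build_vocab(subwords)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII subwords (e.g. Cyrillic, "▁") readable
        json.dump(vocab, f, ensure_ascii=False)
    return vocab


def load_yttm_subwords(model_path="yt.model"):
    """Load the subword list from a trained YTTM model (assumes youtokentome is installed)."""
    import youtokentome as yttm  # imported lazily so the helpers above work without it

    bpe = yttm.BPE(model=model_path)
    return bpe.vocab()  # the i-th string is the subword with id i
```

Usage would be something like `write_vocab_json(load_yttm_subwords("yt.model"))`. This only covers the id mapping; merges.txt is still an open question.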

@ckoshka

ckoshka commented Jul 26, 2021

This is also an issue for me.
