How to build encoder.json and dict.txt #1186

Closed
008karan opened this issue Sep 26, 2019 · 10 comments

@008karan

I am training RoBERTa on a different language. I found out how to build vocab.bpe using other BPE methods, but I am not able to figure out how to get dict.txt and encoder.json.
Please suggest how to do this.

@lematt1991
Contributor

What BPE are you using (sentencepiece, fastbpe, something else)? The encoder.json is specific to GPT-2's BPE. The dict.txt file will get created when you preprocess your data with fairseq-preprocess.
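
For reference, the dict.txt that fairseq-preprocess writes out is just a plain-text file with one "symbol count" pair per line, roughly like the illustrative snippet below (the tokens and counts are made up):

▁the 1234567
▁of 987654
▁and 876543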

@008karan
Author

008karan commented Sep 26, 2019

I am using sentencepiece BPE, @lematt1991. So can I copy-paste encoder.json directly?

@lematt1991
Contributor

You shouldn't need encoder.json at all. Follow these instructions, but skip the "Next encode it with the GPT-2 BPE" section and encode with your sentencepiece BPE instead. The rest should be the same.
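
In case it's useful, a sentencepiece BPE model can be trained from raw text with spm_train. This is only a rough sketch; the input file, model prefix, vocab size, and character coverage below are placeholders you would adjust for your own corpus and language:

spm_train \
    --input=wikitext-103-raw/wiki.train.raw \
    --model_prefix=sentencepiece.bpe \
    --model_type=bpe \
    --vocab_size=32000 \
    --character_coverage=0.9995

This produces sentencepiece.bpe.model (the <model_file> used with spm_encode below) and sentencepiece.bpe.vocab.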

@008karan
Author

But in the preprocessing step there is a --srcdict argument, and train.bpe, valid.bpe, and test.bpe files are needed, whereas I only got one model file and a vocab from sentencepiece BPE.

 fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

@lematt1991

@lematt1991
Contributor

The *.bpe files are the names of the BPE encoded files. You would do something like:

for SPLIT in train valid test; do \
    cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done

And then:

fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

By not specifying --srcdict, fairseq-preprocess will generate a dictionary (dict.txt) for you.
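
As a rough sanity check (paths follow the example above; exact file names may vary a bit between fairseq versions), the binarized data and the generated dictionary should land in --destdir:

ls data-bin/wikitext-103
# expect something like: dict.txt  preprocess.log  train.bin  train.idx  valid.bin  valid.idx  test.bin  test.idx
head -n 3 data-bin/wikitext-103/dict.txt
# first few entries of the generated dictionary; ids are assigned in file order after fairseq's built-in specials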

@lematt1991
Contributor

Does this solve your problem? If so, would you mind closing this issue? Thanks!

@lematt1991
Contributor

Closing due to inactivity

@GabboM

GabboM commented May 21, 2020

What if dict.txt is not present and I have multiple data-bins (data-bin1:data-bin2:data-bin3, etc.)? How can I create a general dict.txt that is valid for every data-bin, instead of creating a new one for each of them, which would also cause problems?
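
One way to do this, sketched below using only the flags already shown in this thread (all file and directory names are hypothetical): build a single dictionary from the concatenation of all corpora, then pass that dict.txt via --srcdict when binarizing each individual data-bin, so every data-bin shares the same vocabulary and token ids:

# 1. Build one shared dict.txt from the concatenation of all (already BPE-encoded) corpora
cat corpus1.bpe corpus2.bpe corpus3.bpe > all.bpe
fairseq-preprocess \
    --only-source \
    --trainpref all.bpe \
    --destdir data-bin-shared \
    --workers 60

# 2. Reuse the shared dictionary for each individual data-bin
for i in 1 2 3; do
    fairseq-preprocess \
        --only-source \
        --srcdict data-bin-shared/dict.txt \
        --trainpref corpus${i}.bpe \
        --destdir data-bin${i} \
        --workers 60
done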

@Java-2022

@lematt1991 wrote:

What BPE are you using (sentencepiece, fastbpe, something else)? The encoder.json is specific to GPT-2's BPE. The dict.txt file will get created when you preprocess your data with fairseq-preprocess.

@lematt1991, please guide me on how to create encoder.json using GPT-2 BPE for another language.

@jordiae

jordiae commented Dec 6, 2020

Quoting @lematt1991's earlier reply:

The *.bpe files are the names of the BPE encoded files. You would do something like:

for SPLIT in train valid test; do \
    cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done

And then:

fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

By not specifying --srcdict, fairseq-preprocess will generate a dictionary (dict.txt) for you.

Hi @lematt1991, in case you want to use a specific dictionary, you should create dict.txt by hand, right? AFAIK, fairseq-preprocess will generate the dictionary by looking at the unique tokens that appear in the training set, but if some of the tokens in your original dictionary don't appear in the train set (e.g., placeholder tokens you want to reserve, or special tokens you may use in future fine-tuning but not during pre-training), these will not be added to dict.txt, and there are no arguments to add them. In this case, would one have to manually append them to dict.txt with frequency 0?
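
For what it's worth, since dict.txt is just one "symbol count" pair per line, appending the reserved tokens manually could look like the sketch below (the token names and the data-bin path are hypothetical). Appending at the end keeps the ids of all existing tokens unchanged, which matters if the dictionary is later reused via --srcdict:

# Hypothetical reserved tokens that never occur in the training data; count 0, appended after all real tokens
for TOK in "<reserved0>" "<reserved1>" "<reserved2>"; do
    echo "${TOK} 0" >> data-bin/wikitext-103/dict.txt
done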
