How to build encoder.json and dict.txt #1186

Closed
008karan opened this issue Sep 26, 2019 · 10 comments

@008karan

I am training RoBERTa on a different language. I found out how to build vocab.bpe using other BPE methods, but I am not able to figure out how to get dict.txt and encoder.json.
Please suggest how to do this.

@lematt1991
Contributor

What BPE are you using (sentencepiece, fastbpe, something else)? The encoder.json is specific to GPT-2's BPE. The dict.txt file will get created when you preprocess your data with fairseq-preprocess.
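
For reference, the dict.txt that fairseq-preprocess writes out is just a plain-text file with one "symbol count" pair per line, roughly like the illustrative snippet below (the tokens and counts are made up):

▁the 1234567
▁of 987654
▁and 876543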

@008karan
Author

008karan commented Sep 26, 2019

I am using sentencepiece BPE, @lematt1991. So can I copy-paste encoder.json directly?

@lematt1991
Contributor

You shouldn't need encoder.json at all. Follow these instructions, but skip the "Next encode it with the GPT-2 BPE" section and encode with your sentencepiece BPE instead. The rest should be the same.
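
In case it's useful, a sentencepiece BPE model can be trained from raw text with spm_train. This is only a rough sketch; the input file, model prefix, vocab size, and character coverage below are placeholders you would adjust for your own corpus and language:

spm_train \
    --input=wikitext-103-raw/wiki.train.raw \
    --model_prefix=sentencepiece.bpe \
    --model_type=bpe \
    --vocab_size=32000 \
    --character_coverage=0.9995

This produces sentencepiece.bpe.model (the <model_file> used with spm_encode below) and sentencepiece.bpe.vocab.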

@008karan
Author

But in the preprocessing step there is a --srcdict argument, and train.bpe, valid.bpe, and test.bpe files are needed, whereas I only got one model file and a vocab from sentencepiece BPE.

 fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

@lematt1991

@lematt1991
Contributor

The *.bpe files are the names of the BPE encoded files. You would do something like:

for SPLIT in train valid test; do \
    cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done

And then:

fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

By not specifying --srcdict, fairseq-preprocess will generate a dictionary (dict.txt) for you.
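
As a rough sanity check (paths follow the example above; exact file names may vary a bit between fairseq versions), the binarized data and the generated dictionary should land in --destdir:

ls data-bin/wikitext-103
# expect something like: dict.txt  preprocess.log  train.bin  train.idx  valid.bin  valid.idx  test.bin  test.idx
head -n 3 data-bin/wikitext-103/dict.txt
# first few entries of the generated dictionary; ids are assigned in file order after fairseq's built-in specials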

@lematt1991
Contributor

Does this solve your problem? If so, would you mind closing this issue? Thanks!

@lematt1991
Contributor

Closing due to inactivity

@GabboM

GabboM commented May 21, 2020

What if dict.txt is not present and I have multiple data-bins (data-bin1:data-bin2:data-bin3, etc.)? How can I create a general dict.txt that is valid for every data-bin, instead of creating a new one for each of them, which would also cause problems?
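
One way to do this, sketched below using only the flags already shown in this thread (all file and directory names are hypothetical): build a single dictionary from the concatenation of all corpora, then pass that dict.txt via --srcdict when binarizing each individual data-bin, so every data-bin shares the same vocabulary and token ids:

# 1. Build one shared dict.txt from the concatenation of all (already BPE-encoded) corpora
cat corpus1.bpe corpus2.bpe corpus3.bpe > all.bpe
fairseq-preprocess \
    --only-source \
    --trainpref all.bpe \
    --destdir data-bin-shared \
    --workers 60

# 2. Reuse the shared dictionary for each individual data-bin
for i in 1 2 3; do
    fairseq-preprocess \
        --only-source \
        --srcdict data-bin-shared/dict.txt \
        --trainpref corpus${i}.bpe \
        --destdir data-bin${i} \
        --workers 60
done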

@Java-2022

@lematt1991 wrote:

What BPE are you using (sentencepiece, fastbpe, something else)? The encoder.json is specific to GPT-2's BPE. The dict.txt file will get created when you preprocess your data with fairseq-preprocess.

@lematt1991, please guide me on how to create encoder.json using GPT-2 BPE for another language.

@jordiae

jordiae commented Dec 6, 2020

Quoting @lematt1991's earlier reply:

The *.bpe files are the names of the BPE encoded files. You would do something like:

for SPLIT in train valid test; do \
    cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done

And then:

fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

By not specifying --srcdict, fairseq-preprocess will generate a dictionary (dict.txt) for you.

Hi @lematt1991, in case you want to use a specific dictionary, you should create dict.txt by hand, right? AFAIK, fairseq-preprocess will generate the dictionary by looking at the unique tokens that appear in the training set, but if some of the tokens in your original dictionary don't appear in the train set (e.g., placeholder tokens you want to reserve, or special tokens you may use in future fine-tuning but not during pre-training), these will not be added to dict.txt, and there are no arguments to add them. In this case, would one have to manually append them to dict.txt with frequency 0?
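
For what it's worth, since dict.txt is just one "symbol count" pair per line, appending the reserved tokens manually could look like the sketch below (the token names and the data-bin path are hypothetical). Appending at the end keeps the ids of all existing tokens unchanged, which matters if the dictionary is later reused via --srcdict:

# Hypothetical reserved tokens that never occur in the training data; count 0, appended after all real tokens
for TOK in "<reserved0>" "<reserved1>" "<reserved2>"; do
    echo "${TOK} 0" >> data-bin/wikitext-103/dict.txt
done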
