
Would you release a tutorial on how to generate the bpe.codes and dict.txt files? #7

Closed
RyanHuangNLP opened this issue Jun 2, 2020 · 3 comments


RyanHuangNLP commented Jun 2, 2020

Would you release a tutorial on how to generate the bpe.codes and dict.txt files, and describe the preprocessing pipeline for generating the pre-training data?

I want to train a BERTweet model for another language.

datquocnguyen (Collaborator) commented:

Regarding bpe.codes: see https://github.com/glample/fastBPE

Regarding dict.txt: see https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md

To generate dict.txt, you would have to REMOVE --srcdict gpt2_bpe/dict.txt \ from the data binarization step, i.e. use:

fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60
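
For reference, learning bpe.codes with fastBPE could look like the sketch below. The ./fast binary name and the compile command follow the fastBPE README; the vocabulary size of 64000 and the wikitext-103-raw path are illustrative assumptions, not values from this thread.

# compile fastBPE as described in its README
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

# learn BPE merge operations from the raw training text
# (64000 merges is an assumed vocabulary size)
./fast learnbpe 64000 wikitext-103-raw/wiki.train.raw > bpe.codes

The resulting bpe.codes is what fastBPE later uses to encode the raw splits before binarization.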


RyanHuangNLP commented Jun 2, 2020

mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

For the encoding step, is it necessary to replace the GPT-2 encoder with fastBPE? Could you give more details about the encoding steps?


datquocnguyen (Collaborator) commented:

Yes, you should use fastBPE. See: facebookresearch/fairseq#1186
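
To make the switch concrete, a hedged sketch of the encoding step with fastBPE in place of multiprocessing_bpe_encoder might look like this (file names mirror the snippet above; bpe.codes is the file produced by ./fast learnbpe):

for SPLIT in train valid test; do \
    ./fast applybpe wikitext-103-raw/wiki.${SPLIT}.bpe \
        wikitext-103-raw/wiki.${SPLIT}.raw bpe.codes; \
done

The BPE-encoded splits can then be binarized with the fairseq-preprocess command above (without --srcdict), which also writes dict.txt into the destination directory.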
