
Would you release a tutorial on how to generate the bpe.codes and dict.txt files? #7

Closed
RyanHuangNLP opened this issue Jun 2, 2020 · 3 comments


RyanHuangNLP commented Jun 2, 2020

Would you release a tutorial on how to generate the bpe.codes and dict.txt files, and describe the preprocessing pipeline for generating the pre-training data?

I want to train a BERTweet model for another language.

datquocnguyen (Collaborator) commented:

Regarding bpe.codes: see https://github.com/glample/fastBPE

Regarding dict.txt: see https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md

To generate dict.txt, you would have to REMOVE --srcdict gpt2_bpe/dict.txt \ from the data binarization step, i.e. use:

fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60
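
For reference, learning bpe.codes with fastBPE could look like the sketch below. The ./fast binary name and the compile command follow the fastBPE README; the vocabulary size of 64000 and the wikitext-103-raw path are illustrative assumptions, not values from this thread.

# compile fastBPE as described in its README
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

# learn BPE merge operations from the raw training text
# (64000 merges is an assumed vocabulary size)
./fast learnbpe 64000 wikitext-103-raw/wiki.train.raw > bpe.codes

The resulting bpe.codes is what fastBPE later uses to encode the raw splits before binarization.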


RyanHuangNLP commented Jun 2, 2020

mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

For the encoding step, is it necessary to replace the GPT-2 encoder with fastBPE? Could you give more details about the encoding steps?


datquocnguyen (Collaborator) commented:

Yes, you should use fastBPE. See: facebookresearch/fairseq#1186
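
To make the switch concrete, a hedged sketch of the encoding step with fastBPE in place of multiprocessing_bpe_encoder might look like this (file names mirror the snippet above; bpe.codes is the file produced by ./fast learnbpe):

for SPLIT in train valid test; do \
    ./fast applybpe wikitext-103-raw/wiki.${SPLIT}.bpe \
        wikitext-103-raw/wiki.${SPLIT}.raw bpe.codes; \
done

The BPE-encoded splits can then be binarized with the fairseq-preprocess command above (without --srcdict), which also writes dict.txt into the destination directory.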
