
Problem Getting Word Predictions from the RoBERTa Model I Trained for Turkish #1417

Closed
ceatlinar opened this issue Nov 23, 2019 · 9 comments

@ceatlinar commented Nov 23, 2019

Hi everyone, I followed this guide, with some differences as described in #1186, to pretrain a RoBERTa model for Turkish. Below are the detailed steps I took:
1 - Downloaded a Turkish corpus and created train.txt, valid.txt, and test.txt files from it.
2 - Trained a sentencepiece BPE model and used it to create train.bpe, valid.bpe, and test.bpe files with the following command (a rough sketch of the sentencepiece training itself is included after this list):

for SPLIT in train valid test; do \
    cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done

3 - Ran the following command (without specifying --srcdict), which created a dict.txt:

fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

4 - Followed the rest of the official guide for training.
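
For completeness, the sentencepiece BPE model from step 2 was trained with spm_train; a rough Python sketch is below, where the corpus path, vocab size, and character coverage are illustrative placeholders rather than the exact values used:

# Rough sketch of training the sentencepiece BPE model from step 2.
# The input path, vocab size, and character coverage are illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    '--input=wikitext-103-raw/wiki.train.raw '
    '--model_prefix=sentencepiece.bpe '
    '--model_type=bpe '
    '--vocab_size=32000 '
    '--character_coverage=0.9995'
)
# Produces sentencepiece.bpe.model (loaded by spm_encode above) and
# sentencepiece.bpe.vocab.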

The model training (with just a few epochs, to check that everything was correct) finished without errors. The problem is that when I try to get masked-word predictions, I get the following errors:

1 - When using the following command with --output_format=piece to create the BPE files:

for SPLIT in train valid test; do \
    cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done

I use this code snippet to get a prediction:

import torch 
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
assert isinstance(roberta.model, torch.nn.Module)
roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)

The error is:

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      3 roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
      4 assert isinstance(roberta.model, torch.nn.Module)
----> 5 roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)

3 frames
/content/drive/My Drive/fairseq-master/fairseq/data/encoders/gpt2_bpe_utils.py in <listcomp>(.0)
    112
    113     def decode(self, tokens):
--> 114         text = ''.join([self.decoder[token] for token in tokens])
    115         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
    116         return text

ValueError: invalid literal for int() with base 10: 'larıyla

2 - When using the following command with --output_format=id to create the BPE files:

for SPLIT in train valid test; do \
    cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=id > wikitext-103-raw/wiki.${SPLIT}.bpe
done

The problem is that I get English predictions, even though I pretrained the model on a Turkish corpus. The following are the predictions I get for a sample:

import torch 
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
assert isinstance(roberta.model, torch.nn.Module)
roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)

loading archive file checkpoints
loading archive file data-bin/wikitext-103
| dictionary: 31544 types
[('Hadi bir revealing yiyelim.', 0.0002395595656707883, ' revealing'),
('Hadi birlington yiyelim.', 0.0002280160115333274, 'lington'),
('Hadi bir light yiyelim.', 0.0001991547178477049, ' light')]

Any chance you could help? I'm asking because I followed the issue mentioned above, in which you gave the replies, @lematt1991.

@lematt1991 (Contributor)

The default BPE for RoBERTa is the GPT-2 BPE. Since you are using sentencepiece, you'll need to pass bpe='sentencepiece' to from_pretrained.

@ceatlinar (Author)

roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103', bpe='sentencepiece')
gives:

     19
     20     def __init__(self, args):
---> 21         vocab = file_utils.cached_path(args.sentencepiece_vocab)
     22         try:
     23             import sentencepiece as spm

AttributeError: 'Namespace' object has no attribute 'sentencepiece_vocab'

@lematt1991 (Contributor)

Does sentencepiece.bpe.model exist in your data-bin/wikitext-103 directory?

@ceatlinar (Author)

Do you mean the BPE model I trained to create the BPE files? It is not in the directory, but I will add it and try again.

@ceatlinar (Author) commented Nov 23, 2019

I did add it and ran training again. It picked up the last checkpoint, I assume, because training ended immediately. Yet I still got an error. I am using the Colab environment; I mounted the drive again and can see the model file.

@lematt1991 (Contributor)

I'm not sure I follow. You don't need to re-train with the sentencepiece model in the data directory; you need it in the data directory when you load the pre-trained model. Otherwise you're feeding in raw text and it has no way of knowing how to BPE-encode it. Just run:

roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103', bpe='sentencepiece')

once you've added your sentencepiece model to your data directory.

@ceatlinar (Author)

This is the exact code I ran. I added the sentencepiece model under data-bin/wikitext-103, but it still gives the 'Namespace' object has no attribute 'sentencepiece_vocab' error.

@lematt1991 (Contributor)

Oh sorry, actually it looks like the sentencepiece model is supposed to go in the checkpoints directory.
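
In other words, with the sentencepiece model placed next to the checkpoint, something like the following sketch should work (the file names and layout below are assumptions based on this thread, with the trained model saved as sentencepiece.bpe.model):

# Assumed layout (based on this thread):
#   checkpoints/checkpoint_best.pt
#   checkpoints/sentencepiece.bpe.model   <- the trained sentencepiece model
#   data-bin/wikitext-103/                <- binarized data and dict.txt
import torch
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    'checkpoints',
    'checkpoint_best.pt',
    'data-bin/wikitext-103',
    bpe='sentencepiece',
)
assert isinstance(roberta.model, torch.nn.Module)
roberta.eval()
print(roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3))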

@ceatlinar
Copy link
Author

Thank you so much, it works now. This was really helpful. I appreciate your help. Closing the issue since it is now solved.
