
Problem Getting Word Predictions from the RoBERTa Model I Trained for Turkish #1417

Closed
ceatlinar opened this issue Nov 23, 2019 · 9 comments

@ceatlinar commented Nov 23, 2019

Hi everyone, I followed this guide, with some differences as described in #1186, to pretrain a RoBERTa model for Turkish. Below are the detailed steps I took:
1 - Downloaded a Turkish corpus and created train.txt, valid.txt, and test.txt files from it.
2 - Trained a sentencepiece BPE model and used it to create train.bpe, valid.bpe, and test.bpe files with the following command (a rough sketch of the sentencepiece training itself is included after this list):

for SPLIT in train valid test; do \
    cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done

3 - Ran the following command (without specifying --srcdict), which created a dict.txt:

fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

4 - Followed the rest of the official guide for training.
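
For completeness, the sentencepiece BPE model from step 2 was trained with spm_train; a rough Python sketch is below, where the corpus path, vocab size, and character coverage are illustrative placeholders rather than the exact values used:

# Rough sketch of training the sentencepiece BPE model from step 2.
# The input path, vocab size, and character coverage are illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    '--input=wikitext-103-raw/wiki.train.raw '
    '--model_prefix=sentencepiece.bpe '
    '--model_type=bpe '
    '--vocab_size=32000 '
    '--character_coverage=0.9995'
)
# Produces sentencepiece.bpe.model (loaded by spm_encode above) and
# sentencepiece.bpe.vocab.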

The model training (with just a few epochs, to check that everything was correct) finished without errors. The problem is that when I try to get masked-word predictions, I get the following errors:

1 - When using the following command with --output_format=piece to create the BPE files:

for SPLIT in train valid test; do \
    cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done

I use this code snippet to get a prediction:

import torch 
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
assert isinstance(roberta.model, torch.nn.Module)
roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)

The error is:

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      3 roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
      4 assert isinstance(roberta.model, torch.nn.Module)
----> 5 roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)

3 frames
/content/drive/My Drive/fairseq-master/fairseq/data/encoders/gpt2_bpe_utils.py in <listcomp>(.0)
    112
    113     def decode(self, tokens):
--> 114         text = ''.join([self.decoder[token] for token in tokens])
    115         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
    116         return text

ValueError: invalid literal for int() with base 10: 'larıyla

2 - When using the following command with --output_format=id to create the BPE files:

for SPLIT in train valid test; do \
    cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=id > wikitext-103-raw/wiki.${SPLIT}.bpe
done

The problem is that I get English predictions, even though I pretrained the model on a Turkish corpus. The following are the predictions I get for a sample:

import torch 
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
assert isinstance(roberta.model, torch.nn.Module)
roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)

loading archive file checkpoints
loading archive file data-bin/wikitext-103
| dictionary: 31544 types
[('Hadi bir revealing yiyelim.', 0.0002395595656707883, ' revealing'),
('Hadi birlington yiyelim.', 0.0002280160115333274, 'lington'),
('Hadi bir light yiyelim.', 0.0001991547178477049, ' light')]

Any chance you could help? I'm asking because I followed the issue mentioned above, in which you gave the replies, @lematt1991.

@lematt1991 (Contributor)

The default BPE for RoBERTa is the GPT-2 BPE. Since you are using sentencepiece, you'll need to pass bpe='sentencepiece' to from_pretrained.

@ceatlinar (Author)

roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103', bpe='sentencepiece')
gives:

     19
     20     def __init__(self, args):
---> 21         vocab = file_utils.cached_path(args.sentencepiece_vocab)
     22         try:
     23             import sentencepiece as spm

AttributeError: 'Namespace' object has no attribute 'sentencepiece_vocab'

@lematt1991 (Contributor)

Does sentencepiece.bpe.model exist in your data-bin/wikitext-103 directory?

@ceatlinar (Author)

Do you mean the BPE model I trained to create the BPE files? It is not in the directory, but I will add it and try again.

@ceatlinar (Author) commented Nov 23, 2019

I did add it and ran training again. It picked up the last checkpoint, I assume, because training ended immediately. Yet I still got an error. I am using the Colab environment; I mounted the drive again and can see the model file.

@lematt1991 (Contributor)

I'm not sure I follow. You don't need to re-train with the sentencepiece model in the data directory; you need it in the data directory when you load the pre-trained model. Otherwise you're feeding in raw text and it has no way of knowing how to BPE-encode it. Just run:

roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103', bpe='sentencepiece')

once you've added your sentencepiece model to your data directory.

@ceatlinar (Author)

This is the exact code I ran. I added the sentencepiece model under data-bin/wikitext-103, but it still gives the 'Namespace' object has no attribute 'sentencepiece_vocab' error.

@lematt1991 (Contributor)

Oh sorry, actually it looks like the sentencepiece model is supposed to go in the checkpoints directory.
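
In other words, with the sentencepiece model placed next to the checkpoint, something like the following sketch should work (the file names and layout below are assumptions based on this thread, with the trained model saved as sentencepiece.bpe.model):

# Assumed layout (based on this thread):
#   checkpoints/checkpoint_best.pt
#   checkpoints/sentencepiece.bpe.model   <- the trained sentencepiece model
#   data-bin/wikitext-103/                <- binarized data and dict.txt
import torch
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    'checkpoints',
    'checkpoint_best.pt',
    'data-bin/wikitext-103',
    bpe='sentencepiece',
)
assert isinstance(roberta.model, torch.nn.Module)
roberta.eval()
print(roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3))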

@ceatlinar
Copy link
Author

Thank you so much, it works now. This was really helpful. I appreciate your help. Closing the issue since it is now solved.
