Problem at Getting Word Predictions from the Roberta Model I Trained for Turkish #1417
Comments
The default BPE for Roberta is the GPT-2 BPE. Since you are using sentencepiece, you'll need to specify
AttributeError: 'Namespace' object has no attribute 'sentencepiece_vocab' error
Does
Do you mean the BPE model I trained to create the bpe files? It is not in the directory, but I will add it and try again.
I added it and ran training again. I assume it picked up the last checkpoint, because training ended immediately. Yet I still got an error. I am using a Colab environment; I mounted the drive again and can see the model file.
I'm not sure I follow. You don't need to re-train with the sentencepiece model in the data directory, you need it in the data directory when you load the pre-trained model, otherwise, you're feeding in raw text and it has no way of knowing how to BPE encode it. Just run:
Once you've added your sentencepiece model to your data directory
This is the exact code I ran. I added the sentencepiece model under data-bin/wikitext-103, but it still gives the 'Namespace' object has no attribute 'sentencepiece_vocab' error.
Oh sorry, actually it looks like the sentencepiece model is supposed to go in the
Thank you so much, now it works. This was really really helpful. I appreciate your help. Closing the issue since now it is solved.
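For readers landing here later, a minimal sketch of the fix discussed in this thread, not the exact code that was posted. It assumes fairseq is installed and that `RobertaModel.from_pretrained` accepts `bpe="sentencepiece"`. The file name `sentencepiece.bpe.model` and the helper names `find_sentencepiece_model` and `load_roberta` are assumptions made for illustration. The first helper only reports which candidate directories contain the SentencePiece model file, so you can check before loading.

```python
from pathlib import Path


def find_sentencepiece_model(candidate_dirs, filename="sentencepiece.bpe.model"):
    """Return the directories in candidate_dirs that contain `filename`.

    `filename` is an assumed default; pass the actual name of your
    SentencePiece model file if it differs.
    """
    return [str(d) for d in candidate_dirs if (Path(d) / filename).is_file()]


def load_roberta(model_dir, checkpoint_file, data_dir):
    """Load a custom checkpoint with SentencePiece instead of the GPT-2 BPE.

    fairseq is imported lazily so the helper above works without it.
    """
    from fairseq.models.roberta import RobertaModel

    return RobertaModel.from_pretrained(
        model_dir,
        checkpoint_file=checkpoint_file,
        data_name_or_path=data_dir,
        bpe="sentencepiece",  # override the default GPT-2 BPE
    )
```

For example, `find_sentencepiece_model(["checkpoints", "data-bin/wikitext-103"])` shows which of the two directories mentioned in this thread currently holds the file. If the list does not include the directory fairseq reads from, the raw text has no SentencePiece model to encode it.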
Hi everyone, I followed this guide, with some differences as mentioned in #1186, to pretrain a Roberta model for Turkish. Below are the detailed steps I took:
1 - Downloaded a Turkish corpus and created train.txt, valid.txt and test.txt files from it.
2 - Trained a sentencepiece BPE model and used it to create train.bpe, valid.bpe and test.bpe files using the following command:
3 - Created a dict.txt using the following command, without specifying -srcdict:
4 - Followed the rest of the official guide for training.
The model training (just a few epochs, to check that everything is correct) finished without an error. The problem is that when I try to get masked word predictions, I run into the following issues:
1 - When using the following command with output_format = piece to create the bpe files
I use this code snippet to get predictions:
The error is:
2 - When using the following command with output_format = id to create the bpe files
The problem is that I get English predictions even though I pretrained my model on a Turkish corpus. These are the predictions I get for a sample:
loading archive file checkpoints
loading archive file data-bin/wikitext-103
| dictionary: 31544 types
[('Hadi bir revealing yiyelim.', 0.0002395595656707883, ' revealing'),
('Hadi birlington yiyelim.', 0.0002280160115333274, 'lington'),
('Hadi bir light yiyelim.', 0.0001991547178477049, ' light')]
Any chance you could help? I am asking because I followed the issue mentioned above, in which you replied, @lematt1991.
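One plausible reading of the English predictions, given that the default BPE for Roberta is the GPT-2 BPE: the model predicts an id, and that id is then decoded with a vocabulary other than the one used in training. The toy sketch below illustrates only that mechanism. Both vocabularies are made up for illustration and are far smaller than the real ones.

```python
# Toy stand-ins for a Turkish SentencePiece vocabulary and the GPT-2 BPE
# vocabulary. Every entry here is invented for illustration only.
turkish_vocab = ["<s>", "</s>", "▁Hadi", "▁bir", "▁şey", "▁yiyelim", "."]
english_vocab = ["<s>", "</s>", " light", " revealing", "lington", " the", "."]


def decode(ids, vocab):
    """Map each id to its token in `vocab` and join the tokens."""
    return "".join(vocab[i] for i in ids)


# Suppose the model predicts id 4 for the masked position.
predicted_ids = [4]

print(decode(predicted_ids, turkish_vocab))  # ▁şey
print(decode(predicted_ids, english_vocab))  # lington
```

The same id yields a Turkish piece under one vocabulary and an English fragment under the other, which matches the shape of the output above: well-formed Turkish context with an unrelated English token in the masked slot.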