Fine Tuning (Training beyond specified 't_total'...) #262
Thanks for your input and feedback! My answers to your questions:
Indeed, your results without fine-tuning are really poor, which makes me think it could be related to your own vocabulary and corpus structure. For instance, in our experiments we evaluated the raw model (i.e. with no fine-tuning) on a Finance / News corpus and got 30% EM and 48% F1 over 500 questions. Please note that when annotating the question-answer pairs, if each question has only one ground truth there will also be a huge gap between EM and F1; this gap is smaller for models evaluated on SQuAD-dev because each question there has 2 or 3 different ground truths. Moreover, it is still difficult to tell whether the problem with your dataset comes from the Reader or from the Retriever. I am currently working on a PR to facilitate the evaluation of the Reader, to help users better understand what is going wrong with the performance. But honestly, I think the Retriever plays a really big role in it, so during the last weeks I worked on some improvements for the Retriever.
Please note that it is very difficult to get scores higher than 81% EM and 88% F1, since we are using BERT-base, which achieved those results on SQuAD-dev: https://huggingface.co/pytorch-transformers/examples.html#squad . Moreover, those results are obtained without the Retrieval phase, i.e. the model is already given the true paragraph to predict on. In real life this is not possible: we have to find a list of paragraphs that likely contain the answer to the question and feed the model with them. We also recently included the option to obtain more than one answer by passing a positive integer to the corresponding arg.
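Since the EM / F1 gap comes up repeatedly in this thread, here is a minimal sketch of how SQuAD-style metrics are usually computed (my own illustration, not cdQA's actual evaluation code):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truths):
    # EM is all-or-nothing: 1.0 only if the normalized prediction equals
    # one of the normalized ground truths
    return float(any(normalize(prediction) == normalize(gt) for gt in ground_truths))

def f1(prediction, ground_truths):
    # F1 rewards partial token overlap, so it is usually higher than EM;
    # with several ground truths per question, the best score is kept
    def f1_single(pred, gt):
        pred_toks, gt_toks = normalize(pred).split(), normalize(gt).split()
        common = Counter(pred_toks) & Counter(gt_toks)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_toks)
        recall = num_same / len(gt_toks)
        return 2 * precision * recall / (precision + recall)
    return max(f1_single(prediction, gt) for gt in ground_truths)
```

With only one ground truth per question, any slightly-too-long span scores 0 on EM but still earns partial F1 credit, which is exactly the gap described above.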
I am not sure there are transfer-learning tips for the Reader that will hugely improve the performance of the Pipeline, as I think the bottleneck is the Retriever. But as a rule of thumb, if your dataset is small, you should always start from the model pre-trained on SQuAD (instead of directly fine-tuning the "raw" pre-trained BERT). Another idea would be to train it on other well-known datasets such as TriviaQA, NewsQA, etc. I would really like to train models on these datasets, evaluate them and maybe release them here in the package, but I do not have time available for this now. If someone is willing to help us with this, they are very much welcome 😃 .
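To make the Retriever-bottleneck point concrete: the Reader can only answer from the paragraphs the Retriever hands it, so retrieval misses cap the whole pipeline's score no matter how good the Reader is. A toy term-overlap retriever (purely illustrative, much simpler than cdQA's real TF-IDF/BM25 retrievers) shows the shape of that stage:

```python
# Toy retrieve-then-read sketch (hypothetical names, not cdQA's API):
# the retriever narrows the corpus to n_top paragraphs, and the Reader
# then only ever sees those n_top candidates.

def retrieve(question_terms, paragraphs, n_top=3):
    """Rank paragraphs by naive term overlap with the question and keep the n_top best."""
    def score(p):
        return len(set(p.lower().split()) & set(question_terms))
    return sorted(paragraphs, key=score, reverse=True)[:n_top]

paragraphs = [
    "the eiffel tower is located in paris france",
    "bert was pretrained on wikipedia and bookcorpus",
    "paris is the capital of france",
]
top = retrieve({"where", "is", "the", "eiffel", "tower"}, paragraphs, n_top=2)
```

If the paragraph holding the answer is not in `top`, no amount of Reader fine-tuning can recover it, which is why raising `n_top` (as discussed below) can lift F1.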
I have reasons to think that fine-tuning on SQuAD 2.0 will probably give worse answers than using SQuAD 1.1. To understand why, here is an example: say I ask a question and the Retriever selects 10 paragraphs to send to the Reader. Suppose the answer is present in one of the paragraphs and definitely not present in the other 9. If the Reader was trained on SQuAD 2.0, it will output "no answer" for those 9 paragraphs with high probability. This might lead the model to choose "no answer" in the end, even though there is an answer (in the one paragraph). In reality I would have to run tests to validate this assumption, but I do not have much time available at the moment and prefer to focus on areas I am more confident will improve the system. If someone is willing to do it... 😃
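The scenario above can be sketched numerically. Assuming, as a simplification, that the final answer is the highest-scoring span across paragraphs unless the accumulated no-answer evidence wins (real readers aggregate differently), nine confident abstentions can drown out one correct answer:

```python
# Hypothetical per-paragraph reader outputs: (best_span, span_score, no_answer_score).
# A SQuAD 2.0-style reader trained to abstain gives high no-answer scores
# on the 9 paragraphs that genuinely contain no answer.
predictions = [("", 0.05, 0.95)] * 9 + [("in 1889", 0.80, 0.20)]

def aggregate_max_answer(preds):
    """SQuAD 1.1-style: pick the single highest-scoring span, never abstain."""
    return max(preds, key=lambda p: p[1])[0]

def aggregate_with_null(preds):
    """SQuAD 2.0-style simplification: abstain when the average no-answer
    evidence outweighs the best span score. This is the failure mode
    described above."""
    best_answer, best_score, _ = max(preds, key=lambda p: p[1])
    null_score = sum(p[2] for p in preds) / len(preds)
    return best_answer if best_score > null_score else ""
```

Here the 1.1-style aggregation returns the correct span, while the 2.0-style one abstains even though one paragraph clearly contains the answer.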
Back when we trained our models (in March/April), we did not face this warning. Today I ran a small training script and, indeed, I got the warning. Our source code is based on Hugging Face's implementation; we will work to include these changes and fixes.
Hey @andrelmfarias, you made some interesting points there, thanks! :)
It seems to me that these are all the default settings at the moment (perhaps they were not when we were discussing this)?!? If not, how can I set the new retriever (BM25) as the one my pipeline should use? When I increased n_top from 3 to 20, I got a 4.5% increase in F1 score, so this seems to be a good idea (although in my case the improvement was not as impressive as in yours). :) BTW, do you have some implemented method in the reader that makes it possible to follow the loss as the model is training?
Yes, they are all the default settings now. At the time of our discussion, some weren't yet.
No, there isn't, but it could be a nice feature to add... We will try to work on that.
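In the meantime, a minimal running-loss tracker to wrap around a training loop might look like this (purely illustrative, not cdQA code):

```python
class LossLogger:
    """Keep an exponential moving average of the training loss and
    record/print it every `log_every` optimizer steps."""

    def __init__(self, log_every=100, beta=0.98):
        self.log_every = log_every
        self.beta = beta          # smoothing factor for the moving average
        self.avg = None
        self.step = 0
        self.history = []         # (step, smoothed_loss) pairs for plotting later

    def update(self, loss):
        self.step += 1
        # seed the average with the first loss, then smooth exponentially
        self.avg = loss if self.avg is None else self.beta * self.avg + (1 - self.beta) * loss
        if self.step % self.log_every == 0:
            self.history.append((self.step, self.avg))
            print(f"step {self.step}: smoothed loss = {self.avg:.4f}")
```

Calling `logger.update(loss.item())` once per batch inside the fine-tuning loop would give a rough view of whether training is still converging.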
@andrelmfarias, I just noticed something... I ran the evaluation script on my PC (where I have an older version of the repo/code), and it seems like the retriever used before BM25 works better... I ran the same model on the same ground-truth file both on my PC and in Colab (where BM25 is used), and the old retriever gives better results... Do you think the old retriever might work better than BM25 in some cases?!?
Yes, I think it's really data-dependent. The default parameters may also not be the best ones for your data. But in general, as we usually do in data science, you should fine-tune these parameters to find the best ones for your use case / data.
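A brute-force way to do that tuning, assuming you have some end-to-end evaluation function for your own data (the `evaluate_f1` callable, the grid values, and the parameter names below are all hypothetical, not cdQA defaults):

```python
from itertools import product

# Hypothetical parameter grid for a BM25-style retriever plus the n_top cutoff.
param_grid = {"k1": [0.5, 1.2, 2.0], "b": [0.25, 0.75], "n_top": [3, 10, 20]}

def grid_search(evaluate_f1, grid):
    """Try every parameter combination and return the best-scoring one.
    `evaluate_f1` stands in for whatever evaluation you run on your data."""
    keys = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate_f1(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For small grids this is cheap relative to fine-tuning the Reader, since only the retrieval/evaluation step is rerun.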
Thanks!
You are right! :)
So we can only evaluate the pipeline with cdQA and get the F1 score and the EM, but not follow the cost/loss/error? I thought verbose_logging for the reader might print some additional info, but for some reason I am getting an error when I set it to True...
Can you show the error here please? So that we can fix it.
Yes, here it is:

```
UnboundLocalError                         Traceback (most recent call last)
UnboundLocalError: local variable 'num_train_optimization_steps' referenced before assignment
```

BTW, can you please show me an example of how to use the old retriever in Colab instead of BM25? I am not sure how I can change that...
Thanks! It's due to a change I made this week, I will fix it.
When you initiate an instance of the QAPipeline object, you should pass the arg that selects the retriever type. Please also note that this is not the only change in the default parameters since our earlier discussion.
Thanks a lot!
Hi, I am doing some fine tuning and I am getting some strange results...
I have my own dataset and I am evaluating your CPU model (bert_qa_vCPU-sklearn.joblib), for which I get exact_match = 4.76 and f1 = 20.08.
I know that the result depends on many factors, e.g. the ground truth (file), the dataset, the way the QA pairs are created, the way the paragraphs in the csv files are organized, etc., but I still think this is pretty low...
By fine-tuning the model on my dataset (250 QA training pairs for a user manual), I managed to improve it to exact_match = 48.02 and f1 = 56.48.
My question is... do you have any idea why bert_qa_vCPU-sklearn.joblib performs so badly? Also, do you have some transfer-learning tips (it would be great if I could improve my fine-tuned model even more)? Have you maybe released a BERT CPU version fine-tuned on SQuAD v2.0? (I guess that one should perform better than the one fine-tuned on SQuAD v1.1.)
In addition to this, I have noticed that after a certain number of epochs, the following message appears:
```
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
```
I couldn't really understand the reason for this. Is something going wrong with the learning rate? Does it have something to do with the number of training examples / the number of epochs (some kind of miscalculation... or maybe the learning rate is the real cause of the problem)?
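For context on the warning: in pytorch-pretrained-bert's WarmupLinearSchedule, the learning-rate multiplier decays linearly to 0.0 at step `t_total` and is clamped there afterwards, so the warning means the optimizer stepped more times than `t_total` accounted for. A simplified sketch of the schedule (my reimplementation, not the library source) shows the behavior:

```python
def warmup_linear_multiplier(step, t_total, warmup=0.1):
    """Linear warmup to 1.0 over the first `warmup` fraction of training,
    then linear decay to 0.0 at step == t_total. Past t_total the
    multiplier is clamped at 0.0, which is when the 'Training beyond
    specified t_total' warning appears and learning effectively stops."""
    x = step / t_total
    if x < warmup:
        return x / warmup
    return max((1.0 - x) / (1.0 - warmup), 0.0)

# t_total must therefore cover the whole run, roughly:
#   t_total = (num_train_examples // batch_size) * num_epochs
# If it is computed for fewer epochs than are actually run, every step
# beyond t_total trains with learning rate 0.
steps_per_epoch, epochs = 50, 2
t_total = steps_per_epoch * epochs
```

So the warning is not a learning-rate miscalculation per se: the schedule is doing exactly what it was told, but it was told a `t_total` that is too small for the number of examples times epochs actually run.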
Please share your opinion, I think a lot of the other people that have tried fine tuning have faced similar issues... Thanks! :)