Fine Tuning (Training beyond specified 't_total'...) #262

Closed
BojanKovachki opened this issue Sep 20, 2019 · 10 comments

@BojanKovachki

Hi, I am doing some fine tuning and I am getting some strange results...

I have my own dataset and I am evaluating your CPU model (bert_qa_vCPU-sklearn.joblib) for which I get an exact_match = 4.761904761904762 and f1 = 20.07726593082685.

I know that the result depends on many factors, e.g. the ground truth (file), the dataset, the way the QA pairs are created, the way the paragraphs in the csv files are organized, etc., but I still think this is pretty low...

By fine-tuning the model on my dataset (250 QA training pairs for a user manual), I managed to improve it to exact_match = 48.01587301587302 and f1 = 56.48361691903947.

My question is... do you have any idea why the bert_qa_vCPU-sklearn.joblib performs so badly? Also, do you have some transfer learning tips (it would be great if I could improve my fine tuned model even more)? Have you maybe released a BERT CPU version fine tuned on SQuAD v2.0 (I guess this one should show better performance than the one fine tuned on SQuAD v1.1)?

In addition to this, I have noticed that after a certain number of epochs, the following message appears:
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.

I couldn't really understand the reason for this. Is something going wrong with the learning rate? Does it have something to do with the number of training examples / the number of epochs (some kind of miscalculation, or maybe the learning rate is the real cause of the problem)?

Please share your opinion; I think a lot of the other people who have tried fine tuning have faced similar issues... Thanks! :)

@andrelmfarias
Collaborator

andrelmfarias commented Sep 20, 2019

Hi @BojanKovachki

Thanks for your input and feedback! My answers to your questions:

I have my own dataset and I am evaluating your CPU model (bert_qa_vCPU-sklearn.joblib) for which I get an exact_match = 4.761904761904762 and f1 = 20.07726593082685.

I know that the result depends on many factors, e.g. the ground truth (file), the dataset, the way the QA pairs are created, the way the paragraphs in the csv files are organized, etc., but I still think this is pretty low...

By fine-tuning the model on my dataset (250 QA training pairs for a user manual), I managed to improve it to exact_match = 48.01587301587302 and f1 = 56.48361691903947.

My question is... do you have any idea why the bert_qa_vCPU-sklearn.joblib performs so badly?

Indeed, your results without fine-tuning are really poor, which makes me think that it could be related to your own vocabulary and corpus structure. For instance, in our experiments, we evaluated the raw model (i.e. with no fine-tuning) on a Finance / News dataset and got 30% EM and 48% F1 on 500 questions. Please also note that when annotating the question-answer pairs, if each question has only one ground truth, there will be a huge difference between EM and F1; this gap is smaller for models evaluated on SQuAD-dev, because there each question has 2 or 3 different ground truths.
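To make the EM / F1 gap concrete, here is a minimal sketch of SQuAD-style scoring (illustrative only, not cdQA's actual evaluation code): F1 gives partial credit for token overlap, and when several ground truths are available per question the score is the max over them.

```python
# Minimal sketch of SQuAD-style EM / F1 scoring (illustrative, not cdQA's code).
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    return float(normalize(prediction) == normalize(ground_truth))

def f1(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gt_tokens = normalize(ground_truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

# With 2-3 annotated answers per question (as on SQuAD-dev) the score is the max over
# them, which narrows the EM / F1 gap; with a single ground truth, EM suffers the most.
def best_score(metric, prediction, ground_truths):
    return max(metric(prediction, gt) for gt in ground_truths)
```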

Moreover, it is still difficult to tell whether the problem with your dataset is related to the Reader or to the Retriever. I am currently working on a PR to facilitate evaluation of the Reader on its own, in order to help users better understand what is going wrong with the performance. But honestly, I think the Retriever plays a really big role in it.

So, during the last weeks, I worked on some improvements for the Retriever:

Retrieve per paragraph instead of per document (#252) -> this showed to be a very good improvement on my dataset (c. +11%). However, I think people should test both and should vary the argument top_n.
Added a new type of Retriever (BM25), which also seems to improve the performance (see #241).
I just opened a PR including the option to rank the final answers following a weighted average of Retriever and Reader scores (#256). It showed an improvement of around 7.5% on the metrics on my dataset when using a weight factor of 0.35. I hope to merge it next week.

Please note that it's very difficult to get scores higher than 81% EM and 88% F1, since we are using BERT-base, which achieved those results on SQuAD-dev: https://huggingface.co/pytorch-transformers/examples.html#squad. Moreover, those results are obtained without the retrieval phase, i.e. the model already has the true paragraph to make a prediction on. In real life this is not possible, since we have to find a list of paragraphs that likely contain the answer to the question and feed them to the model.

We also recently included the option to obtain more than one answer by passing a positive integer to the n_predictions arg of the .predict() method. By receiving, let's say, the 3 most likely answers, we have a higher probability that one of them is the correct one. This can be useful for an information-retrieval app, for example.
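A quick usage sketch of that option (the query is made up, and the exact structure of the returned predictions may differ between cdQA versions, so treat the details as assumptions):

```python
# Hedged sketch: ask the pipeline for the 3 most likely answers instead of one.
# Assumes cdqa_pipeline is an already-fitted QAPipeline instance.
query = "How do I reset the device to factory settings?"  # made-up example query
predictions = cdqa_pipeline.predict(query, n_predictions=3)
for prediction in predictions:
    print(prediction)  # typically the answer span plus its source title / paragraph
```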

Also, do you have some transfer learning tips (it would be great if I could improve my fine tuned model even more)?

I am not sure there are transfer-learning tips for the Reader that will hugely improve the performance of the pipeline, as I think the performance bottleneck is the Retriever.

But as a rule of thumb, if your dataset is small, I think you should always start from the model pre-trained on SQuAD (instead of directly fine-tuning the "raw" version of pre-trained BERT). Another idea would be to train it on other well-known datasets such as TriviaQA, NewsQA, etc. I would really like to train models on these datasets, evaluate them and maybe release them here in the package, but I do not have time available for this now. If someone is willing to help us with this, they are very much welcome 😃.
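As a sketch of that rule of thumb (the reader joblib and the fit_reader call are the ones mentioned in this thread; the import path, the fit_retriever usage and all file paths are assumptions to check against your cdQA version):

```python
# Hedged sketch: start from the reader already fine-tuned on SQuAD 1.1 and
# continue fine-tuning it on your own annotated QA pairs (SQuAD-format JSON).
import pandas as pd
from ast import literal_eval
from cdqa.pipeline import QAPipeline  # import path may differ by cdQA version

# Placeholder corpus: a DataFrame with a 'paragraphs' column, as used by the retrievers.
df = pd.read_csv("./data/my_corpus.csv", converters={"paragraphs": literal_eval})

cdqa_pipeline = QAPipeline(reader="./models/bert_qa_vCPU-sklearn.joblib")
cdqa_pipeline.fit_retriever(df=df)                      # index your own paragraphs
cdqa_pipeline.fit_reader("./data/train/my_train.json")  # your annotated QA pairs
```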

Have you maybe released a BERT CPU version fine tuned on SQuAD v2.0 (I guess this one should show better performance than the one fine tuned on SQuAD v1.1)?

I have reasons to think that fine-tuning on SQuAD 2.0 will probably give worse answers than using SQuAD 1.1. To understand why, here is an example: let's say I ask a question and the Retriever selects 10 paragraphs to send to the Reader. Suppose the answer is present in one of the paragraphs and is surely not present in the other 9. If the Reader was trained on SQuAD 2.0, it will output "no answer" for those 9 paragraphs with high probability. This might lead the model to choose "no answer" in the end, even though there is an answer (in the one paragraph).

In reality, I would have to run tests to validate my assumption, but I do not have much time available at the moment and prefer to focus on other areas that I am more confident will improve the system. If someone is willing to do it... 😃

In addition to this, I have noticed that after a certain number of epochs, the following message appears:
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.

I couldn't really understand the reason for this. Is something going wrong with the learning rate? Does it have something to do with the number of training examples / the number of epochs (some kind of miscalculation, or maybe the learning rate is the real cause of the problem)?

Back when we trained our models (in March/April), we did not face this warning. Today I ran a small training script and, indeed, I got the warning too. Our source code is based on Hugging Face's run_squad.py example, and as their repo moves really fast we are not able to sync with it at the same pace. So I thought it might be related to changes in HF's repo.
I did a search on Google / Hugging Face issues and found this: huggingface/transformers#556
which was fixed by this PR: huggingface/transformers#604

We will work to include these changes and fixes in cdqa as soon as possible.
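For context, here is a rough, hedged sketch of the t_total bookkeeping in pytorch-pretrained-bert / run_squad.py-style training code (names are illustrative, except num_train_optimization_steps, which also shows up in cdqa's bertqa_sklearn.py later in this thread). The warning fires once the scheduler is stepped more times than the t_total it was given, for example if the step count is derived from a quantity that underestimates the real number of optimizer updates.

```python
# Hedged sketch (not cdqa's actual code): t_total must equal the true number of
# optimizer updates, otherwise the warmup-linear schedule clamps the LR multiplier
# to 0.0 once training runs past it and prints the warning quoted above.
from pytorch_pretrained_bert.optimization import BertAdam

# train_features, batch size, epochs, optimizer params etc. are placeholders that
# would come from the surrounding training script.
num_train_optimization_steps = (
    len(train_features) // train_batch_size // gradient_accumulation_steps
) * num_train_epochs  # count over what the training loop actually iterates

optimizer = BertAdam(
    optimizer_grouped_parameters,
    lr=learning_rate,
    warmup=warmup_proportion,              # fraction of steps used for LR warm-up
    t_total=num_train_optimization_steps,  # the 't_total' the warning refers to
)
```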

@BojanKovachki
Author

Hey @andrelmfarias, you made some interesting points there, thanks! :)

Retrieve per paragraph instead of per document (#252) -> this showed to be a very good improvement on my dataset (c. +11%). However, I think people should test both and should vary the argument top_n.
Added a new type of Retriever (BM25), which also seems to improve the performance (see #241).
I just opened a PR including the option to rank the final answers following a weighted average of Retriever and Reader scores (#256). It showed an improvement of around 7.5% on the metrics on my dataset when using a weight factor of 0.35. I hope to merge it next week.

It seems to me that these are all the default settings at the moment (perhaps they were not when we were discussing this)?!?

If not, how can I set the new retriever (BM25) as the one my pipeline should use?

When I increased top_n from 3 to 20, I got an increase of 4.5% in the F1 score, so this seems to be a good idea (although in my case the improvement was not as impressive as in yours). :)

BTW, do you have some method implemented in the reader that makes it possible to follow the loss as the model is training?

@andrelmfarias
Collaborator

It seems to me that these are all the default settings at the moment (perhaps they were not when we were discussing this)?!?

Yes, they are all the default settings now. At the time of our discussion, some weren't yet.

BTW, do you have some method implemented in the reader that makes it possible to follow the loss as the model is training?

No, there isn't, but it could be a nice feature to add... We will try to work on that

@BojanKovachki
Author

@andrelmfarias, I just noticed something...

I ran the evaluation script on my PC (where I have an older version of the repo/code), and it seems like the retriever used before BM25 works better...

I ran the same model on the same ground truth file on both my PC and in Colab (where BM25 is used) and I noticed that the old retriever offers better results...

Do you think the old retriever might work better than BM25 in some cases?!?

@andrelmfarias
Collaborator

Do you think the old retriever might work better than BM25 in some cases?!?

Yes, I think it's really data-dependent. Also, the default parameters (retrieve_by_doc=False, top_n=20, retriever_weight=0.35 and others...) might not be the best for your use case; they showed to be the best on the development set of SQuAD 1.1-open.

But in general, as we usually do in data science, you should tune these parameters to find the best ones for your use case / data.
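A hedged sketch of such a sweep (the evaluate_pipeline helper, the top_n keyword on QAPipeline and the file paths are assumptions about the cdQA API, so adapt them to the version you have installed):

```python
# Hypothetical parameter sweep -- the helper name, kwargs and paths are assumptions,
# not the confirmed cdQA API; adjust them to your installed version.
from cdqa.pipeline import QAPipeline
from cdqa.utils.evaluation import evaluate_pipeline  # assumed evaluation helper

best = None
for retriever in ["bm25", "tfidf"]:
    for top_n in [3, 10, 20]:
        pipeline = QAPipeline(reader="./models/bert_qa_vCPU-sklearn.joblib",
                              retriever=retriever, top_n=top_n)
        pipeline.fit_retriever(df=df)  # df: your corpus DataFrame (see earlier sketch)
        scores = evaluate_pipeline(pipeline, "./data/annotations.json")  # assumed to return EM / F1
        if best is None or scores["f1"] > best["f1"]:
            best = {"f1": scores["f1"], "retriever": retriever, "top_n": top_n}

print(best)
```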

@BojanKovachki
Author

Yes, I think it's really data-dependent. Also, the default parameters (retrieve_by_doc=False, top_n=20, retriever_weight=0.35 and others...) might not be the best for your use case; they showed to be the best on the development set of SQuAD 1.1-open.

Thanks!

But in general, as we usually do in data science, you should tune these parameters to find the best ones for your use case / data.

You are right! :)

No, there isn't, but it could be a nice feature to add... We will try to work on that

So we can only evaluate the pipeline with cdQA and get the F1 score and the EM, but not follow the cost/loss/error?

I thought verbose_logging for the reader might print some additional info, but for some reason I am getting an error when I set it to True...

@andrelmfarias
Collaborator

Can you show the error here please? So that we can fix it.

@BojanKovachki
Author

Yes, here it is:


UnboundLocalError                         Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 cdqa_pipeline.fit_reader('./drive/cdQA/data/train/train.json')

1 frames
/usr/local/lib/python3.6/dist-packages/cdqa/reader/bertqa_sklearn.py in fit(self, X, y)
1280 logger.info(" Num split examples = %d", len(train_features))
1281 logger.info(" Batch size = %d", self.train_batch_size)
-> 1282 logger.info(" Num steps = %d", num_train_optimization_steps)
1283 all_input_ids = torch.tensor(
1284 [f.input_ids for f in train_features], dtype=torch.long

UnboundLocalError: local variable 'num_train_optimization_steps' referenced before assignment

BTW, can you please show me an example of how to use the old retriever in Colab instead of BM25? I am not sure how I can change that...

@andrelmfarias
Collaborator

Thanks!

It's due to a change I made this week; I will fix it.
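For context, an UnboundLocalError like the one above typically means a local variable is only assigned on one code path; this is an illustrative pattern, not the actual bertqa_sklearn.py code:

```python
# Illustrative only -- not the actual cdqa code.
def fit(X, do_train=False):  # hypothetical flag
    if do_train:
        num_train_optimization_steps = 1000  # only bound on this branch
    # When the branch above is skipped, the next line raises:
    # UnboundLocalError: local variable 'num_train_optimization_steps'
    # referenced before assignment
    print("  Num steps = %d" % num_train_optimization_steps)
```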

BTW, can you please show me an example of how to use the old retriever in Colab instead of BM25? I am not sure how I can change that...

When you instantiate a QAPipeline object, you should pass the arg retriever="tfidf". You can refer to the class docstring for more information.

Please also note that this is not the only change in the default parameters; we didn't have a retriever_weight arg before either. If you want the same results as before, you should set it to 0.
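Putting those two points together, something along these lines should reproduce the old behaviour (retriever="tfidf" comes from the answer above; passing the weight at predict time as retriever_score_weight is an assumption, so check the QAPipeline and predict docstrings):

```python
# Hedged sketch: old-style setup -- TF-IDF retriever and no retriever weighting.
# The exact name/location of the weight argument is an assumption; check the docstrings.
from cdqa.pipeline import QAPipeline  # import path may differ by cdQA version

cdqa_pipeline = QAPipeline(reader="./models/bert_qa_vCPU-sklearn.joblib",
                           retriever="tfidf")
cdqa_pipeline.fit_retriever(df=df)  # df: your corpus DataFrame
prediction = cdqa_pipeline.predict(query, retriever_score_weight=0.0)
```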

@BojanKovachki
Author

Thanks a lot!
