
Predictions for certain paragraphs are inaccurate. What's wrong? #241

Closed
edanweis opened this issue Aug 26, 2019 · 25 comments


@edanweis

I have been testing cdQA on paragraphs that have been generated from CSV. I convert the structured data into English, then predict answers using BERT.

I've described the approach here: https://datascience.stackexchange.com/questions/58186/transform-data-into-english-then-predict-an-answer-using-bert

I combine 2 or 3 sentences into paragraphs, concatenate multiple paragraphs into one dataframe for the cdQA pipeline, and then query the dataset, but the results are often incorrect. An example paragraph:

According to our website, the Melbourne Convention Centre & South Wharf Precinct
project is located at 1 Convention Centre Pl, South Wharf VIC 3006, Australia. The
Melbourne Convention Centre & South Wharf Precinct project has won three awards. The project started in 2014 and was completed in 2016.

And query

how many awards has the Melbourne Convention Centre project won?
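
For reference, my setup looks roughly like the sketch below; the model path, the way I build the dataframe and the exact argument names of fit_retriever / predict are only illustrative and may differ between cdqa versions:

import pandas as pd
from cdqa.pipeline import QAPipeline  # cdqa.pipeline.cdqa_sklearn in older releases

# One row per project; 'paragraphs' holds the list of generated paragraphs
df = pd.DataFrame({
    'title': ['Melbourne Convention Centre & South Wharf Precinct'],
    'paragraphs': [[
        "According to our website, the Melbourne Convention Centre & South Wharf Precinct "
        "project is located at 1 Convention Centre Pl, South Wharf VIC 3006, Australia. "
        "The Melbourne Convention Centre & South Wharf Precinct project has won three awards. "
        "The project started in 2014 and was completed in 2016."
    ]]
})

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
cdqa_pipeline.fit_retriever(df)  # fit the TF-IDF retriever on the generated paragraphs

prediction = cdqa_pipeline.predict('how many awards has the Melbourne Convention Centre project won?')
print(prediction)  # roughly (answer, title, paragraph)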

Could this form of English writing be too dissimilar to the corpora and datasets on which BERT was pre-trained and fine-tuned? Can you suggest how I could improve results? Thanks.

@andrelmfarias
Collaborator

andrelmfarias commented Aug 27, 2019

Hi @edanweis,

Did you try to predict the answer to the query using the QAPipeline, or only the Reader (BertQA)?

@edanweis
Author

Hi @andrelmfarias, I'm using QAPipeline.

@andrelmfarias
Collaborator

Thanks,

As you are using the QAPipeline, it's hard to tell whether the problem is in the Retriever or the Reader... Can you please try to predict the answer to your question for that paragraph using BertQA (the Reader) alone?

If the answer is good, you should try to fine-tune the Retriever (changing hyperparameters such as stop_words, min_df, max_df, etc.).

@edanweis
Author

edanweis commented Sep 3, 2019

@andrelmfarias Can you please provide an example of how to use BertQA for prediction? Do I use the predict method on the QAPipeline, specifying an existing BertQA model as the reader?

@andrelmfarias
Collaborator

andrelmfarias commented Sep 3, 2019

@edanweis Indeed the steps for this are not obvious, I am really sorry about that. To test one query on one paragraph, please follow the steps below:

import pandas as pd
from sklearn.externals import joblib
from cdqa.utils.converters import df2squad
from cdqa.utils.download import download_model
from cdqa.reader.bertqa_sklearn import BertProcessor, BertQA

download_model(model='bert-squad_1.1', dir='./models')

paragraph = "According to our website, the Melbourne Convention Centre & South Wharf Precinct project is located at 1 Convention Centre Pl, South Wharf VIC 3006, Australia. The Melbourne Convention Centre & South Wharf Precinct project has won three awards. The project started in 2014 and was completed in 2016."
query = "how many awards has the Melbourne Convention Centre project won?"

# Create dataframe and convert it to squad-like json
df = pd.DataFrame({'title': 'Melbourne Convention', 'paragraphs': [[paragraph]]})
json_data = df2squad(df=df, squad_version='v1.1')

# Add question to json
json_data['data'][0]['paragraphs'][0]['qas'].append({"id":0, "question":query})

# Preprocess json
processor = BertProcessor(do_lower_case=True, is_training=False)
examples, features = processor.fit_transform(X=json_data['data'])

# Load model and predict
qa_model = joblib.load("./models/bert_qa_vCPU-sklearn.joblib")

qa_model.predict(X=(examples, features))

I ran it and the answer I got was correct: "three"

If you do not get the same answer with the cdqa pipeline, you have to improve (fine-tune) your retriever, as I explained above.
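
To run that comparison, you can reuse the df and query defined in the snippet above and pass them through the full pipeline; a rough sketch (the import path and argument names may differ slightly between cdqa versions):

from cdqa.pipeline import QAPipeline  # cdqa.pipeline.cdqa_sklearn in older releases

cdqa_pipeline = QAPipeline(reader="./models/bert_qa_vCPU-sklearn.joblib")
cdqa_pipeline.fit_retriever(df)   # fit the TF-IDF retriever on the single-paragraph df above
prediction = cdqa_pipeline.predict(query)
print(prediction)                 # roughly (answer, title, paragraph)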

@JimAva

JimAva commented Sep 4, 2019

Hi @andrelmfarias - I tried your sample code above to fine-tune the Reader on one of my domain paragraphs. I then ran it several times on different questions and got excellent results!

I'm still a little unclear; a couple of questions:

  1. Does this mean that, in my case, I only need to fine-tune the Reader?
  2. If yes, my sample dataset has about 12K paragraphs (a small sample); do I need to use the cdQA-annotator to create a few questions for a few thousand of the paragraphs?

I'd greatly appreciate the help. I'm trying to better understand the necessary steps.

Thank you.

@andrelmfarias
Collaborator

andrelmfarias commented Sep 4, 2019

Hi @JimAva,

The code above does not fine-tune the Reader; it is just a test of the Reader's predictions for your question on a given paragraph.

If your Reader is performing well, you don't need to fine-tune it on your custom dataset (you can do it though, if you want to improve it even more).

What you have to do is tune the Retriever. For that, you have to try different values of the hyperparameters that are specific to the Retriever when you create your QAPipeline object. You can take a look at them in the TfidfRetriever class:

class TfidfRetriever(BaseEstimator):

You might need to annotate some questions to evaluate its performance...
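
For illustration, something like the snippet below; the values here are arbitrary, and the Retriever-specific kwargs (TfidfVectorizer-style parameters plus top_n) are simply forwarded through the QAPipeline constructor:

from cdqa.pipeline import QAPipeline

# Retriever-specific hyperparameters are passed straight to the QAPipeline constructor
cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib',
                           min_df=1,
                           max_df=0.95,
                           stop_words='english',
                           top_n=10)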

@JimAva

JimAva commented Sep 4, 2019

Thank you @andrelmfarias for the quick response. Do you have any sample code for the Retriever hyperparameters tuning?

@JimAva

JimAva commented Sep 5, 2019

Thank you @andrelmfarias - please let me know when I can test this new feature. Appreciate the help.

@andrelmfarias
Collaborator

Thank you @andrelmfarias for the quick response. Do you have any sample code for the Retriever hyperparameters tuning?

I do not have a snippet for that right now (although I will surely be writing one in the future, because I will also need one), but I can give you a recipe:

1 - Annotate QA pairs using the cdqa-annotator

2 - Create instances of QAPipeline using different values for the Retriever hyperparameters when you initialize the QAPipeline object.

3 - Run evaluate_pipeline (signature: def evaluate_pipeline(cdqa_pipeline, annotated_json)) on your annotated dataset.

4 - Repeat for different hyperparameter values and choose the ones that give the best score (see the sketch below).

If you come up with a nice piece of code for this, you can also open a PR 😃 ! (Maybe a script in the examples folder or a hyperparameter tuning section in the readme.)
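
A minimal sketch of that recipe; the file paths are placeholders, and it assumes evaluate_pipeline is importable from cdqa.utils.evaluation, that it returns a dict with 'exact_match' and 'f1' keys, and that the retriever must be fitted before evaluation (none of these details are confirmed in this thread):

import itertools
from ast import literal_eval

import pandas as pd
from cdqa.pipeline import QAPipeline
from cdqa.utils.evaluation import evaluate_pipeline  # assumed module path

# Placeholders: your paragraph dataframe and the QA pairs annotated with cdqa-annotator
df = pd.read_csv('./data/my_paragraphs.csv', converters={'paragraphs': literal_eval})
annotated_json = './data/my_annotated_qa_pairs.json'

best_f1, best_params = -1.0, None
for min_df, max_df, top_n in itertools.product([1, 2], [0.85, 0.95, 1.0], [3, 10, 20]):
    cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib',
                               min_df=min_df, max_df=max_df, top_n=top_n)
    cdqa_pipeline.fit_retriever(df)
    metrics = evaluate_pipeline(cdqa_pipeline, annotated_json)  # assumed to return {'exact_match': ..., 'f1': ...}
    if metrics['f1'] > best_f1:
        best_f1 = metrics['f1']
        best_params = {'min_df': min_df, 'max_df': max_df, 'top_n': top_n}

print(best_f1, best_params)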

@andrelmfarias
Collaborator

Thank you @andrelmfarias - please let me know when I can test this new feature. Appreciate the help.

Sure! I will be announcing this new feature and others soon

@Toronto899


I tried to run the example in the TfidfRetriever class docstring, but I get the error "ModuleNotFoundError: No module named 'cdqa.retriever.tfidf_retriever_sklearn'". Does anyone know why? Thanks!

@andrelmfarias
Collaborator

andrelmfarias commented Sep 5, 2019

Hi @Toronto899, there is a typo in the example, actually!

We changed the name of the module and forgot to update the docstring. You should use cdqa.retriever.tfidf_sklearn instead.

Thanks for pointing that out.
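
In other words, the import in the docstring example becomes:

from cdqa.retriever.tfidf_sklearn import TfidfRetriever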

@Toronto899

Toronto899 commented Sep 5, 2019


Hi Andre,
I also tried the single-paragraph Reader code above on the paragraph below:
"The College shall make an annual contribution to the JESRF, to be made on or
before September 1 in each year, in an amount equal to $50.00 per full-time
member of the bargaining unit at the College, provided that where the amount of
the JESRF is equal to or exceeds an amount equal to $500.00 per full-time member
of the bargaining unit at the College, the obligation of the College to contribute to
the JESRF shall be suspended until the JESRF is again below that amount. In such
a case, the next annual contribution required by the College shall again be $50.00
per full-time member of the bargaining unit at the College or the amount required
to restore the JESRF to $500.00 per full-time member"

And query:
"how much The College shall make an annual contribution to the JESRF?"
The result ('$50.00') looks fine.
But when I load the whole document "https://opseu.org/sites/default/files/2017-2021_academic_collective_agreement_final_eng_signed_website.pdf" with the QAPipeline and ask the same query, I get a wrong answer ("the College shall pay the cost for the medical examination and/or documentation"). Where should I improve: the Retriever or the Reader? Any detailed steps? Thanks!

@JimAva

JimAva commented Sep 5, 2019

Hi @Toronto899 ,

It sounds like you have the same challenge that I'm facing. The Reader gave great results but the Retriever gave bad results (way off). I played around with the Retriever hyperparameters on my data but the results did not improve.

Andre pointed to this upcoming solution:

Add other types of Retriever - BM25 #246


@andrelmfarias
Collaborator

andrelmfarias commented Sep 13, 2019

@JimAva @edanweis ,

I just added new functionality: instead of retrieving by articles and then splitting them into paragraphs, you can now retrieve directly by paragraphs. You just have to instantiate the QAPipeline object passing the argument retrieve_by_doc=False.

I also advise you to increase the arg top_n from 3 (the default) to a larger number (20 or 30, for example; you can fine-tune it).

This new functionality showed large improvements when tested on an open version of SQuAD.

@JimAva

JimAva commented Sep 13, 2019

Hi @andrelmfarias - thank you for this new feature. I tried the new code with the object below and did not get good results. I tried 20, 30 and 50 for top_n and still did not get the correct paragraphs returned.

Update - after further investigation, I'm getting mixed results. Need to test some more...
Question - when I run the object below with verbose=True, is it possible to also show the paragraph text (or the prediction[2] value) in the output?
| rank | index | title |

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib',
                           max_df=1.0,
                           retrieve_by_doc=False,
                           top_n=20,
                           verbose=True)

@andrelmfarias
Collaborator

andrelmfarias commented Sep 16, 2019

@JimAva ,

I ran some tests varying several hyperparameters on an open version of SQuAD 1.1 and got the following results: 59% Exact Match and 65% F1-score. As you can see, although these are pretty good scores compared to other experiments (e.g. https://arxiv.org/abs/1902.01718), there are still a number of wrong answers.

In order to deal with this, we recently added an arg n_predictions to the method predict(). If you pass an integer to this arg (let's say 3), the predict method will return the 3 most probable answers from the 3 most probable documents.

In my opinion, the results will also depend a lot on the structure of your own dataset.

Question - when I run the below object with 'verbose=True', is it possible to also show the paragraph text (or prediction[2]) value in the output?

If verbose=True you get some logs with information about the computation the model is doing. I don't understand why you need the paragraph text in the output... you already have it in prediction[2].
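
For example, a quick sketch (it assumes a pipeline already fitted as above and the query variable from earlier; the exact shape of each returned prediction may differ between cdqa versions):

# Ask for the 3 most probable answers instead of a single one
predictions = cdqa_pipeline.predict(query, n_predictions=3)
for pred in predictions:
    # each prediction carries the answer together with its title and paragraph
    print(pred)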

@JimAva

JimAva commented Sep 19, 2019

Hi @andrelmfarias ,

I've been trying out the newly introduced retriever based on BM25 and it seems to be working great! Thank you for your efforts. Will provide more updates...

@andrelmfarias
Collaborator

andrelmfarias commented Sep 20, 2019

Thanks @JimAva!

I am currently working on a new PR that will also include the possibility to compute a final score as a weighted average of the Retriever score and the Reader score. In my experiments with SQuAD-open I found that we get better performance this way than by using only the Reader score to rank the final answer. For SQuAD-dev the best weights were 0.35 for the Retriever score and 0.65 for the Reader score. I invite you to play with this parameter as well (it will be an arg of the method QAPipeline.predict()) on your dataset once the PR is merged.

Also, you will have the possibility to get the n best predictions ranked by the pipeline. It might be something interesting to explore.
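
Usage will probably look something like the sketch below; note that the argument name shown here is only illustrative and may change before the PR is merged:

# Illustrative only: the argument name 'retriever_score_weight' is not final
prediction = cdqa_pipeline.predict(query, retriever_score_weight=0.35)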

@andrelmfarias
Collaborator

I am closing this issue now since the last updates seem to solve it.

@radsimu

radsimu commented Sep 24, 2019

In my case, I cannot blame the retriever. I suspect it is the way the QAPipeline is calling the reader. When calling the reader directly with a single paragraph and question, it works great.

I debugged the retriever component while using the pipeline to retrieve and predict (I fit the retriever on a small dataset, ~10 paragraphs in total, which contains the paragraph with the answer). The retriever successfully returns the paragraph with the answer as the top-most match. But then, when QAPipeline calls the reader, I get very different logits for the paragraph compared to what I get when calling the reader directly. I also noticed that the resulting logits vary from execution to execution when running with the pipeline.

I will investigate further

@radsimu

radsimu commented Sep 24, 2019

Finally I found the problem: readme.md says
cdqa_pipeline = QAPipeline(model='bert_qa_vCPU-sklearn.joblib')

should be
cdqa_pipeline = QAPipeline(reader='bert_qa_vCPU-sklearn.joblib')

@andrelmfarias
Collaborator

@radsimu thanks for pointing that out
