
Predictions for certain paragraphs are inaccurate. What's wrong? #241

Closed
edanweis opened this issue Aug 26, 2019 · 25 comments


@edanweis

I have been testing cdQA on paragraphs that have been generated from CSV. I convert the structured data into English, then predict answers using BERT.

I've described the approach here: https://datascience.stackexchange.com/questions/58186/transform-data-into-english-then-predict-an-answer-using-bert

I combine 2 or 3 sentences into paragraphs, concatenate multiple paragraphs into one dataframe for the cdQA pipeline, and then query the dataset, but the results are often incorrect. An example paragraph:

According to our website, the Melbourne Convention Centre & South Wharf Precinct
project is located at 1 Convention Centre Pl, South Wharf VIC 3006, Australia. The
Melbourne Convention Centre & South Wharf Precinct project has won three awards. The project started in 2014 and was completed in 2016.

And query

how many awards has the Melbourne Convention Centre project won?
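
For reference, my setup looks roughly like the sketch below; the model path, the way I build the dataframe and the exact argument names of fit_retriever / predict are only illustrative and may differ between cdqa versions:

import pandas as pd
from cdqa.pipeline import QAPipeline  # cdqa.pipeline.cdqa_sklearn in older releases

# One row per project; 'paragraphs' holds the list of generated paragraphs
df = pd.DataFrame({
    'title': ['Melbourne Convention Centre & South Wharf Precinct'],
    'paragraphs': [[
        "According to our website, the Melbourne Convention Centre & South Wharf Precinct "
        "project is located at 1 Convention Centre Pl, South Wharf VIC 3006, Australia. "
        "The Melbourne Convention Centre & South Wharf Precinct project has won three awards. "
        "The project started in 2014 and was completed in 2016."
    ]]
})

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib')
cdqa_pipeline.fit_retriever(df)  # fit the TF-IDF retriever on the generated paragraphs

prediction = cdqa_pipeline.predict('how many awards has the Melbourne Convention Centre project won?')
print(prediction)  # roughly (answer, title, paragraph)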

Could this form of English writing be too dissimilar to the corpora and datasets on which BERT was pre-trained and fine-tuned? Can you suggest how I could improve results? Thanks.

@andrelmfarias
Collaborator

andrelmfarias commented Aug 27, 2019

Hi @edanweis,

Did you try to predict the answer to the query using the QAPipeline, or only the Reader (BertQA)?

@edanweis
Author

Hi @andrelmfarias, I'm using QAPipeline.

@andrelmfarias
Collaborator

Thanks,

As you are using the QAPipeline, it's hard to tell whether the problem is in the Retriever or the Reader... Can you please try to predict the answer to your question for that paragraph using BertQA (the Reader) alone?

If the answer is good, you should try to fine-tune the Retriever (changing hyperparameters such as stop_words, min_df, max_df, etc.).

@edanweis
Author

edanweis commented Sep 3, 2019

@andrelmfarias Can you please provide an example of how to use BertQA for prediction? Do I use the predict method on the QAPipeline, specifying an existing BertQA model as the reader?

@andrelmfarias
Collaborator

andrelmfarias commented Sep 3, 2019

@edanweis Indeed the steps for this are not obvious, I am really sorry about that. To test one query on one paragraph, please follow the steps below:

import pandas as pd
from sklearn.externals import joblib
from cdqa.utils.converters import df2squad
from cdqa.utils.download import download_model
from cdqa.reader.bertqa_sklearn import BertProcessor, BertQA

download_model(model='bert-squad_1.1', dir='./models')

paragraph = "According to our website, the Melbourne Convention Centre & South Wharf Precinct project is located at 1 Convention Centre Pl, South Wharf VIC 3006, Australia. The Melbourne Convention Centre & South Wharf Precinct project has won three awards. The project started in 2014 and was completed in 2016."
query = "how many awards has the Melbourne Convention Centre project won?"

# Create dataframe and convert it to squad-like json
df = pd.DataFrame({'title': 'Melbourne Convention', 'paragraphs': [[paragraph]]})
json_data = df2squad(df=df, squad_version='v1.1')

# Add question to json
json_data['data'][0]['paragraphs'][0]['qas'].append({"id":0, "question":query})

# Preprocess json
processor = BertProcessor(do_lower_case=True, is_training=False)
examples, features = processor.fit_transform(X=json_data['data'])

# Load model and predict
qa_model = joblib.load("./models/bert_qa_vCPU-sklearn.joblib")

qa_model.predict(X=(examples, features))

I ran it and the answer I got was correct: "three"

If you do not get the same answer with the cdqa pipeline, you have to improve (fine-tune) your retriever, as I explained above.
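
To run that comparison, you can reuse the df and query defined in the snippet above and pass them through the full pipeline; a rough sketch (the import path and argument names may differ slightly between cdqa versions):

from cdqa.pipeline import QAPipeline  # cdqa.pipeline.cdqa_sklearn in older releases

cdqa_pipeline = QAPipeline(reader="./models/bert_qa_vCPU-sklearn.joblib")
cdqa_pipeline.fit_retriever(df)   # fit the TF-IDF retriever on the single-paragraph df above
prediction = cdqa_pipeline.predict(query)
print(prediction)                 # roughly (answer, title, paragraph)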

@JimAva

JimAva commented Sep 4, 2019

Hi @andrelmfarias - I tried your sample code above to fine-tune the Reader on one of my domain paragraphs. I then ran it several times on different questions and got excellent results!

I'm still a little unclear; a couple of questions:

  1. Does this mean that, in my case, I only need to fine-tune the Reader?
  2. If yes, my sample dataset has about 12K paragraphs (a small sample); do I need to use the cdQA-annotator to create a few questions for a few thousand of the paragraphs?

I'd greatly appreciate the help. I'm trying to better understand the necessary steps.

Thank you.

@andrelmfarias
Collaborator

andrelmfarias commented Sep 4, 2019

Hi @JimAva,

The code above does not fine-tune the Reader; it is just a test of the Reader's predictions for your question on a given paragraph.

If your Reader is performing well, you don't need to fine-tune it on your custom dataset (you can do it though, if you want to improve it even more).

What you have to do is tune the Retriever. For that, you have to try different values of the hyperparameters that are specific to the Retriever when you create your QAPipeline object. You can take a look at them in the TfidfRetriever class:

class TfidfRetriever(BaseEstimator):

You might need to annotate some questions to evaluate its performance...
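
For illustration, something like the snippet below; the values here are arbitrary, and the Retriever-specific kwargs (TfidfVectorizer-style parameters plus top_n) are simply forwarded through the QAPipeline constructor:

from cdqa.pipeline import QAPipeline

# Retriever-specific hyperparameters are passed straight to the QAPipeline constructor
cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib',
                           min_df=1,
                           max_df=0.95,
                           stop_words='english',
                           top_n=10)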

@JimAva

JimAva commented Sep 4, 2019

Thank you @andrelmfarias for the quick response. Do you have any sample code for the Retriever hyperparameters tuning?

@JimAva

JimAva commented Sep 5, 2019

Thank you @andrelmfarias - please let me know when I can test this new feature. Appreciate the help.

@andrelmfarias
Collaborator

Thank you @andrelmfarias for the quick response. Do you have any sample code for the Retriever hyperparameters tuning?

I do not have a snippet for that right now (although I will surely be writing one in the future, because I will also need one), but I can give you a recipe:

1 - Annotate QA pairs using the cdqa-annotator

2 - Create instances of QAPipeline using different values for the Retriever hyperparameters when you initialize the QAPipeline object.

3 - Run evaluate_pipeline (signature: def evaluate_pipeline(cdqa_pipeline, annotated_json)) on your annotated dataset.

4 - Repeat for different hyperparameter values and choose the ones that give the best score (see the sketch below).

If you come up with a nice piece of code for this, you can also open a PR 😃 ! (Maybe a script in the examples folder or a hyperparameter tuning section in the readme.)
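
A minimal sketch of that recipe; the file paths are placeholders, and it assumes evaluate_pipeline is importable from cdqa.utils.evaluation, that it returns a dict with 'exact_match' and 'f1' keys, and that the retriever must be fitted before evaluation (none of these details are confirmed in this thread):

import itertools
from ast import literal_eval

import pandas as pd
from cdqa.pipeline import QAPipeline
from cdqa.utils.evaluation import evaluate_pipeline  # assumed module path

# Placeholders: your paragraph dataframe and the QA pairs annotated with cdqa-annotator
df = pd.read_csv('./data/my_paragraphs.csv', converters={'paragraphs': literal_eval})
annotated_json = './data/my_annotated_qa_pairs.json'

best_f1, best_params = -1.0, None
for min_df, max_df, top_n in itertools.product([1, 2], [0.85, 0.95, 1.0], [3, 10, 20]):
    cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib',
                               min_df=min_df, max_df=max_df, top_n=top_n)
    cdqa_pipeline.fit_retriever(df)
    metrics = evaluate_pipeline(cdqa_pipeline, annotated_json)  # assumed to return {'exact_match': ..., 'f1': ...}
    if metrics['f1'] > best_f1:
        best_f1 = metrics['f1']
        best_params = {'min_df': min_df, 'max_df': max_df, 'top_n': top_n}

print(best_f1, best_params)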

@andrelmfarias
Collaborator

Thank you @andrelmfarias - please let me know when I can test this new feature. Appreciate the help.

Sure! I will be announcing this new feature and others soon

@Toronto899


I tried to run the example in the TfidfRetriever class docstring, but I get the error "ModuleNotFoundError: No module named 'cdqa.retriever.tfidf_retriever_sklearn'". Does anyone know why? Thanks!

@andrelmfarias
Collaborator

andrelmfarias commented Sep 5, 2019

Hi @Toronto899, there is a typo in the example, actually!

We changed the name of the module and forgot to update the docstring. You should use cdqa.retriever.tfidf_sklearn instead.

Thanks for pointing that out.
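
In other words, the import in the docstring example becomes:

from cdqa.retriever.tfidf_sklearn import TfidfRetriever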

@Toronto899

Toronto899 commented Sep 5, 2019


Hi Andre,
I also tried the single-paragraph Reader code above on the paragraph below:
"The College shall make an annual contribution to the JESRF, to be made on or
before September 1 in each year, in an amount equal to $50.00 per full-time
member of the bargaining unit at the College, provided that where the amount of
the JESRF is equal to or exceeds an amount equal to $500.00 per full-time member
of the bargaining unit at the College, the obligation of the College to contribute to
the JESRF shall be suspended until the JESRF is again below that amount. In such
a case, the next annual contribution required by the College shall again be $50.00
per full-time member of the bargaining unit at the College or the amount required
to restore the JESRF to $500.00 per full-time member"

And query:
"how much The College shall make an annual contribution to the JESRF?"
The result ('$50.00') looks fine.
But when I load the whole document "https://opseu.org/sites/default/files/2017-2021_academic_collective_agreement_final_eng_signed_website.pdf" with the QAPipeline and ask the same query, I get a wrong answer ("the College shall pay the cost for the medical examination and/or documentation"). Where should I improve: the Retriever or the Reader? Any detailed steps? Thanks!

@JimAva

JimAva commented Sep 5, 2019

Hi @Toronto899 ,

It sounds like you have the same challenge that I'm facing. The Reader gave great results but the Retriever gave bad results (way off). I played around with the Retriever hyperparameters on my data but the results did not improve.

Andre pointed to this upcoming solution:

Add other types of Retriever - BM25 #246


@andrelmfarias
Collaborator

andrelmfarias commented Sep 13, 2019

@JimAva @edanweis ,

I just added new functionality: instead of retrieving by articles and then splitting them into paragraphs, you can now retrieve directly by paragraphs. You just have to instantiate the QAPipeline object passing the argument retrieve_by_doc=False.

I also advise you to increase the arg top_n from 3 (the default) to a larger number (20 or 30, for example; you can fine-tune it).

This new functionality showed large improvements when tested on an open version of SQuAD.

@JimAva

JimAva commented Sep 13, 2019

Hi @andrelmfarias - thank you for this new feature. I tried the new code with the object below and did not get good results. I tried 20, 30 and 50 for top_n and still did not get the correct paragraphs returned.

Update - after further investigation, I'm getting mixed results. Need to test some more...
Question - when I run the object below with verbose=True, is it possible to also show the paragraph text (or the prediction[2] value) in the output?
| rank | index | title |

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib',
                           max_df=1.0,
                           retrieve_by_doc=False,
                           top_n=20,
                           verbose=True)

@andrelmfarias
Collaborator

andrelmfarias commented Sep 16, 2019

@JimAva ,

I ran some tests varying several hyperparameters on an open version of SQuAD 1.1 and got the following results: 59% Exact Match and 65% F1-score. As you can see, although these are pretty good scores compared to other experiments (e.g. https://arxiv.org/abs/1902.01718), there are still a number of wrong answers.

In order to deal with this, we recently added an arg n_predictions to the method predict(). If you pass an integer to this arg (let's say 3), the predict method will return the 3 most probable answers from the 3 most probable documents.

In my opinion, the results will also depend a lot on the structure of your own dataset.

Question - when I run the below object with 'verbose=True', is it possible to also show the paragraph text (or prediction[2]) value in the output?

If verbose=True you get some logs with information about the computation the model is doing. I don't understand why you need the paragraph text in the output... you already have it in prediction[2].
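
For example, a quick sketch (it assumes a pipeline already fitted as above and the query variable from earlier; the exact shape of each returned prediction may differ between cdqa versions):

# Ask for the 3 most probable answers instead of a single one
predictions = cdqa_pipeline.predict(query, n_predictions=3)
for pred in predictions:
    # each prediction carries the answer together with its title and paragraph
    print(pred)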

@JimAva

JimAva commented Sep 19, 2019

Hi @andrelmfarias ,

I've been trying out the newly introduced retriever based on BM25 and it seems to be working great! Thank you for your efforts. Will provide more updates...

@andrelmfarias
Collaborator

andrelmfarias commented Sep 20, 2019

Thanks @JimAva!

I am currently working on a new PR that will also include the possibility to compute a final score as a weighted average of the Retriever score and the Reader score. In my experiments with SQuAD-open I found that we get better performance this way than by using only the Reader score to rank the final answer. For SQuAD-dev the best weights were 0.35 for the Retriever score and 0.65 for the Reader score. I invite you to play with this parameter as well (it will be an arg of the method QAPipeline.predict()) on your dataset once the PR is merged.

Also, you will have the possibility to get the n best predictions ranked by the pipeline. It might be something interesting to explore.
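
Usage will probably look something like the sketch below; note that the argument name shown here is only illustrative and may change before the PR is merged:

# Illustrative only: the argument name 'retriever_score_weight' is not final
prediction = cdqa_pipeline.predict(query, retriever_score_weight=0.35)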

@andrelmfarias
Collaborator

I am closing this issue now since the last updates seem to solve it.

@radsimu

radsimu commented Sep 24, 2019

In my case, I cannot blame the retriever. I suspect it is the way the QAPipeline is calling the reader. When calling the reader directly with a single paragraph and question, it works great.

I debugged the retriever component while using the pipeline to retrieve and predict (I fit the retriever on a small dataset, ~10 paragraphs in total, which contains the paragraph with the answer). The retriever successfully returns the paragraph with the answer as the top-most match. But then, when QAPipeline calls the reader, I get very different logits for the paragraph compared to what I get when calling the reader directly. I also noticed that the resulting logits vary from execution to execution when running with the pipeline.

I will investigate further

@radsimu

radsimu commented Sep 24, 2019

Finally I found the problem: readme.md says
cdqa_pipeline = QAPipeline(model='bert_qa_vCPU-sklearn.joblib')

should be
cdqa_pipeline = QAPipeline(reader='bert_qa_vCPU-sklearn.joblib')

@andrelmfarias
Collaborator

@radsimu thanks for pointing that out
