Predictions for certain paragraphs are inaccurate. What's wrong? #241
Hi @edanweis, Did you try to predict on the query using the Reader (BertQA) alone?
Hi @andrelmfarias, I'm using the cdQA pipeline.
Thanks. As you are using the full pipeline, first check whether the Reader alone gives the right answer on the relevant paragraph. If the answer is good, you should try to fine-tune the Retriever (changing hyperparameters such as stop_words, min_df, max_df, etc.).
@andrelmfarias Can you please provide an example of how to use BertQA for prediction? Do I use the predict method on the QAPipeline, specifying an existing BertQA model as the reader?
@edanweis Indeed, the steps for this are not obvious; I am really sorry about that. To test one query on one paragraph, please follow the steps below:

```python
import pandas as pd
from sklearn.externals import joblib

from cdqa.utils.converters import df2squad
from cdqa.utils.download import download_model
from cdqa.reader.bertqa_sklearn import BertProcessor, BertQA

download_model(model='bert-squad_1.1', dir='./models')

paragraph = "According to our website, the Melbourne Convention Centre & South Wharf Precinct project is located at 1 Convention Centre Pl, South Wharf VIC 3006, Australia. The Melbourne Convention Centre & South Wharf Precinct project has won three awards. The project started in 2014 and was completed in 2016."
query = "how many awards has the Melbourne Convention Centre project won?"

# Create dataframe and convert it to squad-like json
df = pd.DataFrame({'title': 'Melbourne Convention', 'paragraphs': [[paragraph]]})
json_data = df2squad(df=df, squad_version='v1.1')

# Add question to json
json_data['data'][0]['paragraphs'][0]['qas'].append({"id": 0, "question": query})

# Preprocess json
processor = BertProcessor(do_lower_case=True, is_training=False)
examples, features = processor.fit_transform(X=json_data['data'])

# Load model and predict
qa_model = joblib.load("./models/bert_qa_vCPU-sklearn.joblib")
qa_model.predict(X=(examples, features))
```

I ran it and the answer I got was correct: "three". If you do not get the same answer with the cdQA pipeline, you have to improve (fine-tune) your Retriever, as I explained above.
Hi @andrelmfarias - I tried your sample code above to fine-tune the Reader on one of my domain paragraphs. I then ran it several times on different questions and got excellent results! I'm still a little unclear on a few points, though.
I'd greatly appreciate the help; I'm trying to better understand the necessary steps. Thank you.
Hi @JimAva, The code above is not a fine-tune of the Reader; it's just a test of the Reader's predictions on your question given a paragraph. If your Reader is performing well, you don't need to fine-tune it on your custom dataset (you can do it, though, if you want to improve it even more). What you have to do is tune the Retriever: for that, you have to try different hyperparameters that are specific to the Retriever when you create your pipeline or the retriever itself (see the TfidfRetriever class in cdQA/cdqa/retriever/tfidf_sklearn.py, line 8 at commit 300a8c2).
You might need to annotate some questions to evaluate its performance...
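For illustration only (not part of the original reply): a minimal sketch of instantiating the Retriever with the hyperparameters mentioned above, assuming the constructor parameters quoted later in this thread and that the import path matches cdqa/retriever/tfidf_sklearn.py.

```python
# Illustration only: constructor parameters are the ones quoted in this thread;
# the import path is assumed to match cdqa/retriever/tfidf_sklearn.py.
from cdqa.retriever.tfidf_sklearn import TfidfRetriever

retriever = TfidfRetriever(
    lowercase=True,
    stop_words='english',   # try None vs. 'english'
    ngram_range=(1, 2),     # try (1, 1), (1, 2) or (1, 3)
    max_df=0.85,            # drop terms appearing in more than 85% of documents
    min_df=2,               # drop terms appearing in fewer than 2 documents
    top_n=3,                # number of candidates handed to the Reader
)
```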
Thank you @andrelmfarias for the quick response. Do you have any sample code for Retriever hyperparameter tuning?
Thank you @andrelmfarias - please let me know when I can test this new feature. Appreciate the help.
I do not have a snippet for that right now (although I will surely be writing one in the future, because I will also need one), but I can give you a recipe:

1 - Annotate QA pairs using the cdqa-annotator
2 - Create instances of the Retriever (TfidfRetriever) with different hyperparameters
3 - Run the evaluation referenced there (line 115 at commit 300a8c2) on your annotated pairs
4 - Try different values of the hyperparameters and choose the ones that show the best score

If you come up with a nice piece of code for this script, you can also open a PR 😃 ! (Maybe a script in the examples folder, or something similar.)
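Not from the thread, just a rough sketch of how steps 2-4 could be scripted; `evaluate_retriever` is a hypothetical placeholder for the evaluation referenced above, and the import path is assumed to match cdqa/retriever/tfidf_sklearn.py.

```python
# Rough sketch of a hyperparameter sweep for the Retriever.
# evaluate_retriever() is a hypothetical placeholder: fit the retriever on your
# corpus and return, e.g., the fraction of annotated questions whose gold
# paragraph appears in its top_n results.
from itertools import product
from cdqa.retriever.tfidf_sklearn import TfidfRetriever  # import path assumed

annotated_pairs = []  # step 1: QA pairs exported from the cdqa-annotator

def evaluate_retriever(retriever, qa_pairs) -> float:
    # Placeholder: replace with a real evaluation on your annotated pairs.
    return 0.0

param_grid = {
    "ngram_range": [(1, 1), (1, 2)],
    "max_df": [0.85, 0.95],
    "min_df": [1, 2],
    "stop_words": [None, "english"],
}

best_score, best_params = -1.0, None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    retriever = TfidfRetriever(**params)  # constructor params quoted in this thread
    score = evaluate_retriever(retriever, annotated_pairs)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```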
Sure! I will be announcing this new feature and others soon.
I tried to run the example in that code ("class TfidfRetriever(BaseEstimator)"), but I get the error "ModuleNotFoundError: No module named 'cdqa.retriever.tfidf_retriever_sklearn'". Does anyone know why? Thanks!
Hi @Toronto899, There is a typo in the example actually! We changed the name of the module and forgot to update the docstring. You should use cdqa.retriever.tfidf_sklearn instead. Thanks for pointing that out.
Hi, Andre
Hi @Toronto899, It sounds like you have the same challenge that I'm facing. The Reader gave great results but the Retriever gave bad results (way off). I played around with the Retriever hyperparameters on my data but the results did not improve. Andre provided this upcoming solution:
Yes, I did try changing some of the parameters (below) but did not notice much improvement.

```python
lowercase=True,
preprocessor=None,
tokenizer=None,
stop_words='english',
token_pattern=r"(?u)\b\w\w+\b",
ngram_range=(1, 2),
max_df=0.85,
min_df=2,
vocabulary=None,
paragraphs=None,
top_n=3,
verbose=False):
```
I just added a new functionality: instead of retrieving by articles and then splitting them into paragraphs, you can now retrieve directly by paragraphs. You just have to instantiate a QAPipeline object passing the new argument for paragraph-level retrieval. I also advise you to increase the arg top_n of the Retriever. This new functionality showed large improvements when testing on an open version of SQuAD.
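A hedged sketch of how such a pipeline might be set up; the argument names retrieve_by_doc and top_n, the import path, and the fit_retriever/predict calls are assumptions used for illustration rather than confirmed API, so check the QAPipeline signature in your installed version.

```python
# Illustrative sketch only; argument names and method calls below are assumptions.
import pandas as pd
from cdqa.pipeline import QAPipeline  # import path assumed

df = pd.DataFrame({'title': ['Doc 1'],
                   'paragraphs': [['First paragraph.', 'Second paragraph.']]})

cdqa_pipeline = QAPipeline(
    reader='./models/bert_qa_vCPU-sklearn.joblib',
    retrieve_by_doc=False,  # hypothetical flag: retrieve paragraphs instead of articles
    top_n=20,               # hand more candidate paragraphs to the Reader
)
cdqa_pipeline.fit_retriever(df=df)                               # assumed API
prediction = cdqa_pipeline.predict("Which paragraph is first?")  # assumed API
print(prediction)
```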
Hi @andrelmfarias - thank you for this new feature. I tried the new code as per the object below and did not get good results. I tried 20, 30 and 50 for top_n and am still not getting the correct paragraphs returned. Update - after further investigation, I'm getting mixed results. Need to test some more... cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU-sklearn.joblib',
@JimAva, I ran some tests varying several hyperparameters on an open version of SQuAD 1.1 and got the following results: 59% Exact Match and 65% F1-score. As you can see, although these are pretty good scores in comparison to other experiments (e.g. https://arxiv.org/abs/1902.01718), there are still a number of wrong answers. In order to deal with this, we recently incorporated a new arg. In my opinion, the results will also depend a lot on the structure of your own dataset.
Hi @andrelmfarias, I've been trying out the newly introduced retriever based on BM25 and it seems to be working great! Thank you for your efforts. Will provide more updates...
Thanks @JimAva! I am currently working on a new PR that will also include the possibility to compute a final score by taking a weighted average of the Retriever score and the Reader score. In my experiments with SQuAD-open I found that we can get better performance by doing this, instead of using only the Reader score for ranking the final answer. For SQuAD-dev the best weight was 0.35 for the Retriever score and 0.65 for the Reader score. I invite you to try to play with this parameter as well (it will be an arg of the predict method). Also, you will have the possibility to get the
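For intuition only, a minimal sketch in plain Python (not cdQA's API) of ranking candidate answers by a weighted average of Retriever and Reader scores; the candidate fields and the assumption that both scores are on comparable scales are illustrative.

```python
# Illustration only: combine retriever and reader scores with a weighted average.
# The 0.35 / 0.65 split mirrors the weights mentioned above for SQuAD-dev;
# the candidate fields below are assumptions, not cdQA's actual output format.
from typing import Dict, List

def rank_answers(candidates: List[Dict], retriever_weight: float = 0.35) -> List[Dict]:
    """Sort candidate answers by a weighted average of retriever and reader scores."""
    reader_weight = 1.0 - retriever_weight
    for c in candidates:
        c["final_score"] = (retriever_weight * c["retriever_score"]
                            + reader_weight * c["reader_score"])
    return sorted(candidates, key=lambda c: c["final_score"], reverse=True)

candidates = [
    {"answer": "three", "retriever_score": 0.82, "reader_score": 0.91},
    {"answer": "2016",  "retriever_score": 0.88, "reader_score": 0.40},
]
print(rank_answers(candidates)[0]["answer"])  # -> "three"
```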
I am closing this issue now since the last updates seem to solve it.
In my case, I cannot blame the retriever. I suspect it is the way the QAPipeline is calling the reader. When calling the reader directly with a single paragraph and question, it works great. I debugged the retriever component while using the pipeline to retrieve and predict (I fit the retriever on a small dataset, ~10 paragraphs in total, which contains the paragraph with the answer). The retriever successfully retrieves the most probable paragraph with the answer as the top-most match. But then, when QAPipeline calls the reader, I get very different logits for the paragraph compared to what I get when calling the reader directly. I also noticed that the resulting logits vary from execution to execution when running with the pipeline. I will investigate further.
Finally, I found the problem: the example call in readme.md is wrong and should be updated.
@radsimu Thanks for pointing that out.
I have been testing cdQA on paragraphs that have been generated from CSV. I convert the structured data into English, then predict answers using BERT.
I've described the approach here: https://datascience.stackexchange.com/questions/58186/transform-data-into-english-then-predict-an-answer-using-bert
I combine 2 or 3 sentences into paragraphs, then concatenate multiple paragraphs into one dataframe for the cdQA pipeline, then query the dataset, but the results are often incorrect. An example of a sentence:
And the query:
Could this form of English writing be too dissimilar to the corpora and datasets on which BERT was pre-trained and fine-tuned? Can you suggest how I could improve results? Thanks.
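As a rough illustration of the preprocessing described in this issue (the column names, wording, and grouping below are hypothetical, not taken from the project):

```python
# Hypothetical sketch of turning structured CSV rows into English sentences,
# then grouping them into paragraphs for a cdQA-style dataframe.
# Column names ('project', 'awards', 'completed') are made up for illustration.
import pandas as pd

rows = pd.DataFrame({
    "project":   ["Melbourne Convention Centre & South Wharf Precinct"],
    "awards":    [3],
    "completed": [2016],
})

def row_to_sentence(row: pd.Series) -> str:
    return (f"The {row['project']} project has won {row['awards']} awards "
            f"and was completed in {row['completed']}.")

sentences = rows.apply(row_to_sentence, axis=1).tolist()

# Group a few sentences per paragraph and build the dataframe shape used in
# this thread (one row per document, with a list of paragraph strings).
paragraphs = [" ".join(sentences[i:i + 3]) for i in range(0, len(sentences), 3)]
df = pd.DataFrame({"title": ["Projects"], "paragraphs": [paragraphs]})
print(df)
```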