I trained a bert-large SQuAD v1.1 model as a test, but it performs much worse than the denspi_sparc model from the project #1
Comments
Hi @AlecS12, it seems that you have reproduced DenSPI+Sparc with BERT-large well, but what's the problem?
Hi @jhyuklee, sorry for the confusion - the 55 EM score that I mentioned was just my ad hoc benchmark (not really the EM score) for comparing model performance in open-domain search, so it's irrelevant. The fact is that the model I trained performs worse than the denspi-sparc model in open-domain search. I suspect that the problem is with the phrase classifier model.
An open-domain evaluation similar to the one in covidAsk on the encoded SPARC dev set (with the exception that only the first ground-truth answers are used in evaluation; I used https://github.com/dmis-lab/covidAsk/blob/master/eval_utils.py) confirms the worse performance of my model:
What would you recommend doing next? Thank you,
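The exact-match metric used in the evaluation above (restricted to the first ground-truth answer) can be sketched roughly as follows. This is a simplified SQuAD-style sketch, not the actual code in covidAsk's eval_utils.py; `normalize_answer` and `exact_match_first_answer` are illustrative names:

```python
import re
import string


def normalize_answer(s):
    """SQuAD-style normalization: lower-case, drop punctuation,
    remove English articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match_first_answer(prediction, ground_truths):
    # Only the first ground-truth answer is considered,
    # matching the restriction described in the comment above.
    return normalize_answer(prediction) == normalize_answer(ground_truths[0])
```

A prediction that differs only in casing, articles, or punctuation still counts as an exact match under this normalization.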
Hi @AlecS12, if the number of phrases after filtering changes a lot, you may need to tune the …
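The phrase filtering being discussed here can be sketched as a simple threshold over classifier scores. This is a hypothetical illustration of the mechanism, not the repository's actual dump code; `filter_phrases` and its signature are assumptions:

```python
import numpy as np


def filter_phrases(phrase_vecs, filter_scores, filter_threshold=0.3):
    """Keep only phrase vectors whose phrase-classifier score passes
    the threshold.

    phrase_vecs: (N, D) array of encoded phrase vectors
    filter_scores: (N,) array of phrase-classifier scores

    Raising the threshold shrinks the phrase dump, but an
    under-trained classifier can then drop relevant phrases.
    """
    keep = filter_scores > filter_threshold
    return phrase_vecs[keep], int(keep.sum())
```

With a well-calibrated classifier, tuning the threshold trades dump size against recall; with a poorly trained one, no threshold recovers the lost phrases.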
Hi @jhyuklee, the value of filter_threshold is probably not the root cause. I had to set it to 0.3 to end up with a phrase dump file (encoding the second half of the SQuAD dev-set split, file 0001, with my trained model) roughly the size of the same file encoded with the denspi_sparc model (99000K). This made search worse, with many relevant paragraphs being lost. If the issue is with the train_neg procedure (I ran it, following the instructions), how could I troubleshoot it? The salient difference between the two models is the search result scores; my model's are generally 30-50% … For example, one passage returned by my model:

> He came to power by uniting many of the nomadic tribes of Northeast Asia. After founding the Mongol Empire and being proclaimed "Genghis Khan", he started the Mongol invasions that …

I wonder if this gives any hint of the underlying problem. Here is the log of my negative training:
Hi,
Thank you for the great project.
I followed the instructions to train the bert-large model on SQuAD, and encoded my dataset both with the model you provided (denspi_sparc) and with my trained model. However, the results with my model look significantly worse, though not completely off. For example, a single-word search returns 55 exact word matches in the search results with denspi-sparc, but only 27 with my trained model, and the best results are not found. I had to make one substantive code change: I replaced run_natkb.py in sparc/local_dump.py (line 14 at bee309b).
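The ad-hoc "exact word matches" count used above to compare the two models can be sketched like this. `count_exact_matches` is a hypothetical helper for illustration, not code from the repository:

```python
def count_exact_matches(query, results):
    """Count how many returned result texts contain the query
    as an exact (case-insensitive) token."""
    query = query.lower()
    return sum(query in r.lower().split() for r in results)
```

Comparing this count for the same query across the two phrase dumps gives the 55-vs-27 style comparison described above.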
The results of my model training are:
```
08/19/2020 15:57:05 - INFO - post - num vecs=45059736, num_words=1783576, nvpw=25.2637
08/19/2020 15:57:08 - INFO - main - [Validation] loss: 10.361, b'{"exact_match": 76.27246925260171, "f1": 84.389824154709}\n'
```
which are similar to your posted results:
```
04/28/2020 06:32:59 - INFO - post - num vecs=45059736, num_words=1783576, nvpw=25.2637
04/28/2020 06:33:01 - INFO - main - [Validation] loss: 8.700, b'{"exact_match": 75.10879848628193, "f1": 83.42143097917004}\n'
```
The sparse weights for input_examples.txt also look similar to yours.
Any help would be appreciated.