
I trained a bert-large SQUAD v1.1 model as a test, but it works much worse than the denspi_sparc model from the project #1

Open
AlecS12 opened this issue Oct 23, 2020 · 4 comments


@AlecS12

AlecS12 commented Oct 23, 2020

Hi,

Thank you for the great project.
I followed the instructions to train the bert-large model on SQUAD, and encoded my dataset both with the model you provided (denspi_sparc) and my trained model. However, the results with my model look significantly worse, though not completely off. For example, a single-word search returns 55 exact word matches in search results with denspi-sparc, but only 27 with my trained model, and the best results are not found. I had to make one substantive code change: replaced run_natkb.py in

return ["python", "run_natkb.py",
I replaced run_natkb.py with train.py, since run_natkb.py is not provided.
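Concretely, the substitution amounts to something like this (a sketch, not the repo's actual launcher; the function name and extra_args are hypothetical stand-ins for the surrounding code):

```python
import sys

def training_command(extra_args):
    # run_natkb.py is not in the repo, so launch the provided train.py instead;
    # extra_args stands for whatever arguments the original launcher appends.
    return [sys.executable, "train.py", *extra_args]
```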
The results of my model training are:

08/19/2020 15:57:05 - INFO - post - num vecs=45059736, num_words=1783576, nvpw=25.2637
08/19/2020 15:57:08 - INFO - main - [Validation] loss: 10.361, b'{"exact_match": 76.27246925260171, "f1": 84.389824154709}\n'

which are similar to your posted results:

04/28/2020 06:32:59 - INFO - post - num vecs=45059736, num_words=1783576, nvpw=25.2637
04/28/2020 06:33:01 - INFO - main - [Validation] loss: 8.700, b'{"exact_match": 75.10879848628193, "f1": 83.42143097917004}\n'

The sparse weights for input_examples.txt also look similar to yours.

Any help would be appreciated.

@AlecS12 AlecS12 changed the title from "I trained a bert-large SQUAD v1.1 model, but it works much worse than the denspi-sparc model you provided." to "I trained a bert-large SQUAD v1.1 model as a test, but it works much worse than the denspi_sparc model from the project" on Oct 23, 2020
@jhyuklee
Owner

Hi @AlecS12, it seems that you have reproduced DenSPI+Sparc with BERT-large well, but what's the problem? The local_dump.py file is used for the open-domain experiments, which I haven't open-sourced here (for openQA, see https://github.com/uwnlp/denspi), so it shouldn't be a problem in the single-passage training setting. I'm not sure where the 55 EM score you mentioned comes from. Thanks.

@AlecS12
Author

AlecS12 commented Oct 30, 2020

Hi @jhyuklee, sorry for the confusion: the 55 I mentioned was just my ad hoc benchmark for comparing model performance in open-domain search (not an actual EM score), so it's not directly relevant. The point is that the model I trained performs worse than the denspi_sparc model in open-domain search. I suspect the problem is with the phrase classifier model.
I encoded the file 0001 (the second half of the dev-v1.1.json split provided in https://github.com/uwnlp/denspi) to benchmark open-domain search on its QA pairs. Here is what I found:

  1. The phrase filter of the denspi_sparc model filters out many more phrases than my trained model's does.
    Phrase HDF5 file sizes (bytes):
    102239432 0-1.hdf5 (denspi_sparc)
    287615990 0-1.hdf5 (my model)
    FAISS indexing, denspi_sparc model:
    clustering 1553 points in 481D to 16 clusters, redo 1 times, 10 iterations
    index ntotal: 54792
    For the same data encoded with my model:
    Sampling a subset of 4096 / 4157 for training
    index ntotal: 150658
    (These numbers are only indirect indications of how many phrases were encoded; I don't know where to look for the exact counts. See the counting sketch after this list.)
    The dense search results for my model look like the denspi_sparc results interspersed with additional paragraphs that are absent from the denspi_sparc output (and my search scores are about 70% of the denspi_sparc scores, which is probably an artifact of the larger index).

  2. An open-domain evaluation similar to the one in covidAsk, run on the encoded SPARC dev set (with the exception that only the first ground-truth answer is used in evaluation; I used https://github.com/dmis-lab/covidAsk/blob/master/eval_utils.py, sketched after this list), confirms the worse performance of my model:
    Evaluating 245 answers.
    denspi_sparc:
    {'exact_match_top1': 53.87755102040816, 'f1_score_top1': 65.538931589952}
    {'exact_match_top10': 73.87755102040816, 'f1_score_top10': 83.85336373643698}
    my trained model:
    {'exact_match_top1': 47.3469387755102, 'f1_score_top1': 58.612080436570274}
    {'exact_match_top10': 71.42857142857143, 'f1_score_top10': 82.48391481800847}

  3. The closed-domain evaluations on the SQuAD dev set look essentially identical:
    denspi_sparc:
    10/28/2020 23:34:18 - INFO - post - num vecs=45059736, num_words=1783576, nvpw=25.2637
    10/28/2020 23:34:20 - INFO - main - [Validation] loss: 20.292, b'{"exact_match": 75.72374645222327, "f1": 84.43944614764544}\n'
    my final model:
    10/28/2020 23:26:00 - INFO - post - num vecs=45059736, num_words=1783576, nvpw=25.2637
    10/28/2020 23:26:03 - INFO - main - [Validation] loss: 14.765, b'{"exact_match": 75.03311258278146, "f1": 83.98681865355799}\n'
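
In case it helps with reproducing this, here is roughly how I got the numbers above. For the phrase counts in item 1, a minimal sketch that reads them straight from a dump, assuming the DenSPI-style HDF5 layout of one top-level group per document with a 'start' dataset of phrase vectors (adjust the dataset name if your dumps differ):

```python
import h5py

def count_phrases(dump_path):
    # Sum phrase start vectors across all document groups in a phrase dump.
    total = 0
    with h5py.File(dump_path, "r") as f:
        for doc_id in f:
            if "start" in f[doc_id]:
                total += f[doc_id]["start"].shape[0]
    return total

print(count_phrases("0-1.hdf5"))  # run once per dump to compare the two models
```

And the top-k scoring in item 2 is essentially this, assuming eval_utils.py exposes the usual SQuAD-style exact_match_score(prediction, ground_truth) and f1_score(prediction, ground_truth) helpers:

```python
from eval_utils import exact_match_score, f1_score  # covidAsk's eval_utils.py

def topk_em_f1(predictions, answer, k=10):
    # Best EM/F1 among the top-k predicted phrases against the first
    # ground-truth answer (the only one I use, as noted above).
    top = predictions[:k] or [""]
    em = max(exact_match_score(p, answer) for p in top)
    f1 = max(f1_score(p, answer) for p in top)
    return 100.0 * em, 100.0 * f1
```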

What would you recommend doing next?

Thank you,
Alec Segal

@jhyuklee
Owner

jhyuklee commented Nov 2, 2020

Hi @AlecS12, if the number of phrases after filtering changes a lot, you may need to tune filter_threshold; a lower value means more generous filtering (more phrases kept). If the closed-domain evaluation looks similar, the (semi) open-domain setup should give you similar results. Also, please check that you have run the train_neg procedure, which helps the model normalize the phrase representations over multiple phrases.
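
For reference, the filtering just keeps the phrase vectors whose filter logit clears the threshold; a minimal sketch with hypothetical variable names (filter_logits stands for the per-phrase scores from the phrase classifier):

```python
import numpy as np

def apply_filter(start_vecs, filter_logits, filter_threshold):
    # Keep phrases scoring at or above the threshold; a lower
    # filter_threshold keeps more phrases (more generous filtering).
    keep = filter_logits >= filter_threshold
    return start_vecs[keep]
```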

@AlecS12
Author

AlecS12 commented Nov 6, 2020

Hi @jhyuklee, the value of filter_threshold is probably not the root cause. I had to set it to 0.3 to get a phrase dump file (encoding the second half of the SQuAD dev set split, file 0001, with my trained model) of roughly the same size as the one encoded with the denspi_sparc model (99000K). That made the search worse, with many relevant paragraphs being lost. If the issue is with the train_neg procedure (I ran it, following the instructions), how can I troubleshoot it?

The salient difference between the two models is the search result scores: my model's are generally 30-50% lower for the same matches. Example:
Query: What do we call the empire that Genghis Khan founded?
Both models return the same first answer:

He came to power by uniting many of the nomadic tribes of Northeast Asia. After founding the Mongol Empire and being proclaimed "Genghis Khan", he started the Mongol invasions that
...
The denspi_sparc score is 138.0; mine is 86.9.
The next two paragraphs for denspi_sparc (scores 136.6 and 135.7):
The Mongol army under Genghis Khan, generals and his sons crossed the Tien Shan mountains by entering the area controlled by the Khwarezmian Empire. After compiling intelligence from many
...
There are conflicting views of Genghis Khan in the People's Republic of China with some viewing him positively in the Inner Mongolia region where there are a monument and buildings about him
...
My trained model returns the following two results, with scores 84.5 and 78.3:
He came to power by uniting many of the nomadic tribes of Northeast Asia. After founding the Mongol Empire and being proclaimed "Genghis Khan", he started the Mongol invasions that... (a repeat of the first paragraph with the same highlighted span; duplicates happen with the denspi_sparc model as well. Why does the SPARC search return identical answers with different scores?)
...
There are conflicting views of Genghis Khan in the People's Republic of China with some viewing him positively in the Inner Mongolia region where there are a monument and buildings about him
...

I wonder if this gives any hint about the underlying problem.

Here is the log of my negative training:
[Epoch 1] Train loss: 2.667: 100%|██████████| 7371/7371 [3:04:09<00:00, 1.45s/it]
10/23/2020 03:30:17 - INFO - main - [Epoch 1] Average train neg loss: 5.058
...
[Epoch 2] Train loss: 23.471: 100%|██████████| 7371/7371 [3:05:28<00:00, 1.41s/it]
10/23/2020 06:43:11 - INFO - main - [Epoch 2] Average train neg loss: 2.584
...
[Epoch 3] Train loss: 0.144: 100%|██████████| 7371/7371 [3:03:00<00:00, 1.49s/it]
10/23/2020 09:53:37 - INFO - main - [Epoch 3] Average train neg loss: 1.050
...
10/23/2020 10:00:55 - INFO - post - num vecs=45059736, num_words=1783576, nvpw=25.2637
10/23/2020 10:00:58 - INFO - main - [Validation] loss: 14.765, b'{"exact_match": 75.03311258278146, "f1": 83.98681865355799}\n'
10/23/2020 10:01:05 - INFO - main - Model saved at /mnt/home/sparc/models/trained_blu_squad_v1.1_neg/3/model.pt
