TREC-COVID doc2query Baselines

Important reproducibility notes: Anserini was upgraded to Lucene 9.3 at commit 272565 (8/2/2022). This upgrade created backward compatibility issues (see #1952), which means that the runs described on this page cannot be exactly reproduced with Lucene 9 code running on Lucene 8 indexes (since we need to disable consistent tie-breaking).

Following the Lucene upgrade, this page is no longer being maintained. For reproducibility purposes, however, runs with Lucene 8 (at v0.14.4) and Lucene 9 (at 5480dc) are captured and stored here. There are only minor differences in effectiveness between the two sets of runs.

In September 2023, the regression code was refactored such that the following commands run successfully (commits 88935f and 444eac):

python src/main/python/trec-covid/download_doc2query_indexes.py --date 2020-07-16 &
python src/main/python/trec-covid/download_doc2query_indexes.py --date 2020-06-19 &

nohup python src/main/python/trec-covid/generate_round5_doc2query_baselines.py >& logs/log.trec-covid.round5-docTTTTTquery &
nohup python src/main/python/trec-covid/generate_round4_doc2query_baselines.py >& logs/log.trec-covid.round4-docTTTTTquery &

Specifically, the effectiveness of the runs generated by the scripts matches the scores encoded in the scripts. However, those scores differ (in most cases, only slightly) from the scores reported below.


This document describes various doc2query baselines for the TREC-COVID Challenge, which uses the COVID-19 Open Research Dataset (CORD-19) from the Allen Institute for AI. Here, we focus on running retrieval experiments; for basic instructions on building Anserini indexes, see this page, and for instructions specific to building doc2query-expanded Anserini indexes, see this page.

doc2query describes a family of document expansion techniques:

The idea is conceptually simple: prior to indexing, for each document, we use a model to predict queries for which that document will be relevant. These predicted queries are then appended to the original document and indexed as usual.

For CORD-19, these predictions were made using only article titles and abstracts, with T5 trained on MS MARCO passage data. These expansions were then appended to the documents in the abstract, full-text, and paragraph index conditions, as described on this page.
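As a concrete illustration, below is a minimal sketch of the expansion step. The `predict_queries` function is a hypothetical stand-in for T5 inference (the expansions used on this page were generated offline with docTTTTTquery); everything else follows the recipe above: predict queries from the title and abstract, then append them to the document text before indexing.

```python
# Minimal sketch of doc2query-style expansion (illustrative only).
# `predict_queries` is a hypothetical stand-in for T5 inference; the
# expansions used on this page were generated offline (docTTTTTquery).
import json

def predict_queries(text, num_queries=40):
    """Sample queries for which `text` is likely to be relevant."""
    raise NotImplementedError  # placeholder for actual T5 inference

def expand_document(doc):
    # For CORD-19, predictions use only the title and abstract.
    predicted = predict_queries(f"{doc['title']} {doc['abstract']}")
    # Append the predicted queries to the document text; the expanded
    # documents are then indexed exactly as before.
    doc['contents'] = f"{doc['contents']} {' '.join(predicted)}"
    return doc

# Hypothetical JSONL corpus, one document per line:
with open('docs.jsonl') as fin, open('docs.expanded.jsonl', 'w') as fout:
    for line in fin:
        fout.write(json.dumps(expand_document(json.loads(line))) + '\n')
```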

All the runs referenced on this page are stored in this repo. As an alternative to downloading each run separately, clone the repo and you'll have everything.

Round 5

These are runs that can be easily reproduced with Anserini, from pre-built doc2query expanded CORD-19 indexes we have provided (version from 2020/07/16, the official corpus used in round 5). They were prepared for round 5 (for participants who wish to have a baseline run to rerank); to provide a sense of effectiveness, we present evaluation results with the cumulative qrels from rounds 1, 2, 3, and 4 (qrels_covid_d4_j0.5-4.txt provided by NIST, stored in our repo as qrels.covid-round4-cumulative.txt).

| | index | field(s) | nDCG@10 | J@10 | R@1k | run file | checksum |
|:-|:-|:-|:-|:-|:-|:-|:-|
| 1 | abstract | query+question | 0.4635 | 0.5300 | 0.4462 | [download] | 9923233a31ac004f84b7d563baf6543c |
| 2 | abstract | UDel qgen | 0.4548 | 0.5000 | 0.4527 | [download] | e0c7a1879e5b1742045bba0f5293d558 |
| 3 | full-text | query+question | 0.4450 | 0.6020 | 0.4473 | [download] | 78aa7f481de91d22192163ed934d02ee |
| 4 | full-text | UDel qgen | 0.4817 | 0.6040 | 0.4711 | [download] | 51cbae025bf90dadf8f26c5c31af9f66 |
| 5 | paragraph | query+question | 0.4904 | 0.5820 | 0.5004 | [download] | 0b80444c8a737748ba9199ddf0795421 |
| 6 | paragraph | UDel qgen | 0.4940 | 0.5700 | 0.5070 | [download] | 2040b9a4759af722d50610f26989c328 |
| 7 | - | reciprocal rank fusion(1, 3, 5) | 0.4908 | 0.5880 | 0.5119 | [download] | c0ffc7b1719f64d2f37ce99a9ef0413c |
| 8 | - | reciprocal rank fusion(2, 4, 6) | 0.4846 | 0.5740 | 0.5218 | [download] | 329f13267abf3f3d429a1593c1bd862f |
| 9 | abstract | UDel qgen + RF | 0.6095 | 0.6320 | 0.5280 | [download] | a5e016c84d5547519ffbcf74c9a24fc8 |

IMPORTANT NOTES!!!

  • These runs were performed at commit 539f7d (2020/07/24).
  • J@10 refers to Judged@10 and R@1k refers to Recall@1000.
  • The evaluation numbers are produced with the NIST-prepared cumulative qrels from rounds 1, 2, 3, and 4 (qrels_covid_d4_j0.5-4.txt provided by NIST, stored in our repo as qrels.covid-round4-cumulative.txt) on the round 5 collection (release of 7/16).
  • For the abstract and full-text indexes, we request up to 10k hits for each topic; the number of actual hits retrieved is fairly close to this (a bit less because of deduping). For the paragraph index, we request up to 50k hits for each topic; because multiple paragraphs are retrieved from the same document, the number of unique documents in each list of hits is much smaller. A cautionary note: our experience is that choosing the top k documents to rerank has a large impact on end-to-end effectiveness. Reranking the top 100 seems to provide higher precision than top 1000, but the likely tradeoff is lower recall. It is very likely the case that you don't want to rerank all available hits.
  • Row 9 represents the feedback baseline condition introduced in round 3: abstract index, UDel query generator, BM25+RM3 relevance feedback (100 feedback terms).
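For reference, rows 7 and 8 above are produced by reciprocal rank fusion (RRF) of the individual runs. The following is a minimal sketch of RRF over TREC-format run files, assuming the commonly used constant k = 60 and illustrative file names; the actual fused runs were generated with Anserini's own tooling.

```python
# Sketch of reciprocal rank fusion (RRF) over TREC-format run files,
# assuming the commonly used constant k = 60. File names are illustrative.
from collections import defaultdict

def read_run(path):
    """Read a TREC run file into {topic: [docid, ...]}, ordered by rank."""
    hits = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, _, docid, rank, _, _ = line.split()
            hits[topic].append((int(rank), docid))
    return {t: [d for _, d in sorted(h)] for t, h in hits.items()}

def rrf(runs, k=60, depth=1000):
    """Fuse ranked lists: score(d) = sum over runs of 1 / (k + rank(d))."""
    fused = defaultdict(lambda: defaultdict(float))
    for run in runs:
        for topic, docids in run.items():
            for rank, docid in enumerate(docids, start=1):
                fused[topic][docid] += 1.0 / (k + rank)
    return {t: sorted(scores.items(), key=lambda x: -x[1])[:depth]
            for t, scores in fused.items()}

# e.g., row 7 fuses rows 1, 3, and 5:
# fused = rrf([read_run('run.abstract.txt'),
#              read_run('run.full-text.txt'),
#              read_run('run.paragraph.txt')])
```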

The final runs, after removing judgments from rounds 1, 2, 3, and 4 (cumulatively), are as follows:

| runtag | run file | checksum |
|:-|:-|:-|
| r5.fusion1 = Row 7 | [download] | 2295216ed623d2621f00c294f7c389e1 |
| r5.fusion2 = Row 8 | [download] | a65fabe7b5b7bc4216be632296269ce6 |
| r5.rf = Row 9 | [download] | 24f0b75a25273b7b00d3e65065e98147 |
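The judgment-removal step is straightforward: for each topic, any document that already appears in the cumulative qrels is dropped and the remaining hits are re-ranked. A minimal sketch, assuming TREC-format run and qrels files with illustrative names:

```python
# Sketch of dropping previously judged documents from a run (how the r5.*
# runs are derived from rows 7-9); file names are illustrative.
from collections import defaultdict

judged = defaultdict(set)
with open('qrels.covid-round4-cumulative.txt') as f:
    for line in f:
        parts = line.split()          # topic, iteration, docid, judgment
        judged[parts[0]].add(parts[2])

rank = defaultdict(int)
with open('run.fusion1.txt') as fin, open('r5.fusion1.txt', 'w') as fout:
    for line in fin:
        topic, q0, docid, _, score, tag = line.split()
        if docid in judged[topic]:
            continue                  # already judged in rounds 1-4
        rank[topic] += 1              # re-rank the remaining hits
        fout.write(f'{topic} {q0} {docid} {rank[topic]} {score} {tag}\n')
```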

We have written scripts that automate the reproduction of these baselines:

$ python src/main/python/trec-covid/download_doc2query_indexes.py --date 2020-07-16
$ python src/main/python/trec-covid/generate_round5_doc2query_baselines.py

Evaluation with Round 5 Qrels

Since the above runs were prepared for round 5, we did not know how well they actually performed until the round 5 judgments from NIST were released. Here, we provide those evaluation results.

Note that the runs posted on the TREC-COVID archive are not exactly the same as the runs we submitted. According to NIST (from email to participants), they removed "documents that were previously judged but had id changes from the Round 5 submissions for scoring, even though the change in cord_uid was unknown at submission time." The actual evaluated runs are (mirrored from the TREC-COVID archive):

| group | runtag | run file | checksum |
|:-|:-|:-|:-|
| anserini | r5.d2q.fusion1 (NIST post-processed) | [download] | 03ad001d94c772649e17f4d164d4b2e2 |
| anserini | r5.d2q.fusion2 (NIST post-processed) | [download] | 4137c93e76970616e0eff2803501cd08 |
| anserini | r5.d2q.rf (NIST post-processed) | [download] | 3dfba85c0630865a7b581c4358cf4587 |

Effectiveness results (note that starting in Round 4, NIST changed from nDCG@10 to nDCG@20):

| group | runtag | nDCG@20 | J@20 | AP | R@1k |
|:-|:-|:-|:-|:-|:-|
| anserini | r5.d2q.fusion1 | 0.5374 | 0.8530 | 0.2236 | 0.5798 |
| anserini | r5.d2q.fusion1 (NIST post-processed) | 0.5414 | 0.8610 | 0.2246 | 0.5798 |
| anserini | r5.d2q.fusion2 | 0.5393 | 0.8650 | 0.2310 | 0.5861 |
| anserini | r5.d2q.fusion2 (NIST post-processed) | 0.5436 | 0.8700 | 0.2319 | 0.5861 |
| anserini | r5.d2q.rf | 0.6040 | 0.8370 | 0.2410 | 0.6039 |
| anserini | r5.d2q.rf (NIST post-processed) | 0.6124 | 0.8470 | 0.2433 | 0.6039 |

The scores of the post-processed runs match those reported by NIST. We see that NIST post-processing improves scores slightly.
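The nDCG@20, AP, and R@1k figures are standard trec_eval measures (J@20 is the fraction of judged documents in the top 20, computed separately). As an illustration of how one might score a run, here is a sketch using the pytrec_eval package; this is an assumption on our part for illustration, not the tooling used to produce the numbers on this page.

```python
# Sketch of scoring a run against the qrels with pytrec_eval; the numbers
# on this page were produced with trec_eval itself, so this is only an
# illustration (file names are examples).
from collections import defaultdict
import pytrec_eval

def parse_qrels(path):
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            parts = line.split()      # topic, iteration, docid, judgment
            qrels[parts[0]][parts[2]] = int(parts[3])
    return dict(qrels)

def parse_run(path):
    run = defaultdict(dict)
    with open(path) as f:
        for line in f:
            topic, _, docid, _, score, _ = line.split()
            run[topic][docid] = float(score)
    return dict(run)

qrels = parse_qrels('qrels.covid-complete.txt')
run = parse_run('r5.d2q.rf.txt')

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'ndcg_cut', 'recall'})
results = evaluator.evaluate(run)
for measure in ('ndcg_cut_20', 'map', 'recall_1000'):
    print(measure, sum(r[measure] for r in results.values()) / len(results))
```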

Below, we report the effectiveness of the runs using the "complete" cumulative qrels file (covering rounds 1 through 5). This qrels file, provided by NIST as qrels-covid_d5_j0.5-5.txt, is stored in our repo as qrels.covid-complete.txt.

| | index | field(s) | nDCG@10 | J@10 | nDCG@20 | J@20 | AP | R@1k | J@1k |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 1 | abstract | query+question | 0.6808 | 0.9980 | 0.6375 | 0.9600 | 0.2718 | 0.4550 | 0.3845 |
| 2 | abstract | UDel qgen | 0.6939 | 0.9920 | 0.6524 | 0.9610 | 0.2752 | 0.4595 | 0.3825 |
| 3 | full-text | query+question | 0.6300 | 0.9680 | 0.5843 | 0.9260 | 0.2475 | 0.4201 | 0.3921 |
| 4 | full-text | UDel qgen | 0.6611 | 0.9800 | 0.6360 | 0.9610 | 0.2746 | 0.4496 | 0.4073 |
| 5 | paragraph | query+question | 0.6827 | 0.9800 | 0.6477 | 0.9670 | 0.3080 | 0.4936 | 0.4360 |
| 6 | paragraph | UDel qgen | 0.7067 | 0.9960 | 0.6614 | 0.9760 | 0.3127 | 0.4985 | 0.4328 |
| 7 | - | reciprocal rank fusion(1, 3, 5) | 0.7072 | 1.0000 | 0.6731 | 0.9920 | 0.2964 | 0.5063 | 0.4528 |
| 8 | - | reciprocal rank fusion(2, 4, 6) | 0.7131 | 1.0000 | 0.6755 | 0.9910 | 0.3036 | 0.5166 | 0.4518 |
| 9 | abstract | UDel qgen + RF | 0.8160 | 1.0000 | 0.7787 | 0.9960 | 0.3421 | 0.5249 | 0.4107 |

Note that all of the results above can be reproduced with the following script:

$ python src/main/python/trec-covid/download_doc2query_indexes.py --date 2020-07-16
$ python src/main/python/trec-covid/generate_round5_doc2query_baselines.py

Round 4

Document expansion with doc2query was introduced in our round 4 submissions. The runs below correspond to our TREC-COVID baselines, except that they use pre-built CORD-19 indexes that have been expanded with doc2query (version from 2020/06/19, the official corpus used in round 4).

| | index | field(s) | run file | checksum |
|:-|:-|:-|:-|:-|
| 1 | abstract | query+question | [download] | d1d32cd6962c4e355a47e7f1fdfb0c74 |
| 2 | abstract | UDel qgen | [download] | 55ae93b92bae20ed64fc9f191c6ea667 |
| 3 | full-text | query+question | [download] | 512e14c6d15eb36f7fc9c537281badd3 |
| 4 | full-text | UDel qgen | [download] | 0901d7b083aa28afd431cf330fe7293c |
| 5 | paragraph | query+question | [download] | f8512ba33d5cc79176d71424d05f81cb |
| 6 | paragraph | UDel qgen | [download] | 123896c0af4cdbae471c21d2da7de1f7 |
| 7 | - | reciprocal rank fusion(1, 3, 5) | [download] | 77b619a2e6e87852b85d31637ceb6219 |
| 8 | - | reciprocal rank fusion(2, 4, 6) | [download] | 1e7bb2a6e483d3629378c3107457b216 |
| 9 | abstract | UDel qgen + RF | [download] | b6b1d949fff00e54b13e533e27455731 |

These runs were performed at commit 539f7d (2020/07/24). Note that they were created after the round 4 qrels became available, so this is a post-hoc simulation of "what would have happened".

The final runs, after removing judgments from rounds 1, 2, and 3 (cumulatively), are as follows:

| runtag | run file | checksum |
|:-|:-|:-|
| r4.fusion1 = Row 7 | [download] | ae7513f68e2ca82d8b0efdd244082046 |
| r4.fusion2 = Row 8 | [download] | 590400c12b72ce8ed3b5af2f4c45f039 |
| r4.rf = Row 9 | [download] | b9e7bb80fd8dc97f93908d895fb07f7f |

We have written scripts that automate the reproduction of these baselines:

$ python src/main/python/trec-covid/download_doc2query_indexes.py --date 2020-06-19
$ python src/main/python/trec-covid/generate_round4_doc2query_baselines.py

Effectiveness results, based on round 4 qrels:

| group | runtag | nDCG@20 | J@20 | AP | R@1k |
|:-|:-|:-|:-|:-|:-|
| anserini | r4.fusion1 | 0.5115 | 0.6944 | 0.2498 | 0.6717 |
| anserini | r4.fusion2 | 0.5175 | 0.6911 | 0.2550 | 0.6800 |
| anserini | r4.rf | 0.5606 | 0.6833 | 0.2658 | 0.6759 |

Below, we report the effectiveness of the runs using the cumulative qrels file from round 4. This qrels file, provided by NIST as qrels_covid_d4_j0.5-4.txt, is stored in our repo as qrels.covid-round4-cumulative.txt.

| | index | field(s) | nDCG@10 | J@10 | nDCG@20 | J@20 | AP | R@1k | J@1k |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 1 | abstract | query+question | 0.6115 | 0.8022 | 0.5823 | 0.7900 | 0.2499 | 0.5038 | 0.2676 |
| 2 | abstract | UDel qgen | 0.6321 | 0.8022 | 0.5922 | 0.7678 | 0.2528 | 0.5098 | 0.2672 |
| 3 | full-text | query+question | 0.6045 | 0.9044 | 0.5640 | 0.8522 | 0.2420 | 0.4996 | 0.3037 |
| 4 | full-text | UDel qgen | 0.6514 | 0.9289 | 0.5991 | 0.8711 | 0.2665 | 0.5240 | 0.3114 |
| 5 | paragraph | query+question | 0.6429 | 0.8622 | 0.6080 | 0.8333 | 0.2932 | 0.5635 | 0.3256 |
| 6 | paragraph | UDel qgen | 0.6694 | 0.8622 | 0.6229 | 0.8411 | 0.2953 | 0.5677 | 0.3232 |
| 7 | - | reciprocal rank fusion(1, 3, 5) | 0.6739 | 0.8778 | 0.6188 | 0.8533 | 0.2914 | 0.5750 | 0.3362 |
| 8 | - | reciprocal rank fusion(2, 4, 6) | 0.6618 | 0.8622 | 0.6331 | 0.8444 | 0.2974 | 0.5847 | 0.3344 |
| 9 | abstract | UDel qgen + RF | 0.7447 | 0.8933 | 0.7067 | 0.8589 | 0.3182 | 0.5812 | 0.2904 |