Merge branch 'master' into qbs
cmacdonald committed Jan 11, 2022
2 parents 19e6950 + 8a9a455 commit 6748f97
Showing 46 changed files with 3,085 additions and 2,058 deletions.
17 changes: 12 additions & 5 deletions .github/workflows/push.yml
@@ -14,7 +14,7 @@ jobs:

strategy:
matrix:
python-version: [3.6, 3.8]
python-version: ['3.7', '3.9']
java: [11, 13]
os: ['ubuntu-latest', 'macOs-latest', 'windows-latest'] #
architecture: ['x64']
@@ -32,6 +32,7 @@ jobs:
brew install libomp
- uses: actions/checkout@v2

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v1
with:
@@ -50,12 +51,18 @@ jobs:
cd terrier-core
mvn -B -DskipTests install
# follows https://medium.com/ai2-blog/python-caching-in-github-actions-e9452698e98d
- name: Loading Python & dependencies from cache
uses: actions/cache@v2
with:
path: ${{ env.pythonLocation }}
key: ${{ env.pythonLocation }}-${{ hashFiles('requirements.txt') }}-${{ hashFiles('requirements-test.txt') }}

- name: Install Python dependencies
run: |
python -m pip install --upgrade pip
pip install --upgrade git+https://github.com/kivy/pyjnius.git#egg=pyjnius
pip install -r requirements.txt
pip install -r requirements-test.txt
pip install --upgrade --upgrade-strategy eager -r requirements.txt
pip install --upgrade --upgrade-strategy eager -r requirements-test.txt
#install this software
pip install --timeout=120 .
pip install pytest
@@ -79,4 +86,4 @@ jobs:
env:
TERRIER_VERSION: ${{ matrix.terrier }}
run: |
pytest -p no:faulthandler
pytest --durations=20 -p no:faulthandler
1 change: 1 addition & 0 deletions .readthedocs-conda-environment.yml
@@ -4,6 +4,7 @@ channels:
- conda-forge
dependencies:
- openjdk
- python=3.7
- pip
- pip:
# In Read the Docs, installing 'requirements-dev.txt' seems to be ignored when Conda is used.
12 changes: 7 additions & 5 deletions README.md
@@ -5,7 +5,7 @@

# PyTerrier

A Python API for Terrier - v.0.6
A Python API for Terrier - v.0.7

# Installation

@@ -83,10 +83,10 @@ You can see examples of how to use these, including notebooks that run on Google

Complex retrieval pipelines, including for learning-to-rank, can be constructed using PyTerrier's operator language. For example, to combine two features and make them available for learning, we can use the `**` operator.
```python
two_features = BM25_br >> ( \
two_features = BM25_br >> (
pt.BatchRetrieve(indexref, wmodel="DirichletLM") **
pt.BatchRetrieve(indexref, wmodel="PL2") \
)
pt.BatchRetrieve(indexref, wmodel="PL2")
)
```

See also the [learning to rank documentation](https://pyterrier.readthedocs.io/en/latest/ltr.html), as well as the worked examples in the [learning-to-rank notebook](examples/notebooks/ltr.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/terrier-org/pyterrier/blob/master/examples/notebooks/ltr.ipynb). Some pipelines can be automatically optimised - more details about pipeline optimisation are included in our ICTIR 2020 paper.
@@ -159,5 +159,7 @@ By downloading and using PyTerrier, you agree to cite at the undernoted paper de
- Nicola Tonellotto, University of Pisa
- Arthur Câmara, Delft University
- Alberto Ueda, Federal University of Minas Gerais
- Sean MacAvaney, Georgetown University
- Sean MacAvaney, Georgetown University/University of Glasgow
- Chentao Xu, University of Glasgow
- Sarawoot Kongyoung, University of Glasgow
- Zhan Su, Copenhagen University
8 changes: 4 additions & 4 deletions docs/apply.rst
@@ -13,11 +13,11 @@ functions) to easily transform inputs.
The table below lists the main classes of transformation in the PyTerrier data
model, as well as the appropriate apply method to use in each case. In general,
if there is a one-to-one mapping between the input and the output, then the specific
pt.apply methods should be used (i.e. `query()`, `doc_score()`, `.doc_features()`).
pt.apply methods should be used (i.e. ``query()``, ``doc_score()``, ``.doc_features()``).
If the cardinality of the dataframe changes through applying the transformer,
then `generic()` or `by_query()` must be applied.
then ``generic()`` or ``by_query()`` must be applied.
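For example, a minimal sketch of a ``generic()`` transformer that changes the
cardinality of its input by keeping only the top 3 documents per query (the
cutoff value here is purely illustrative)::

    # assumes the input dataframe has a 'qid' column and is sorted by rank per query
    top3 = pt.apply.generic(lambda df: df.groupby("qid").head(3))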

In particular, through the use of `pt.apply.doc_score()`, any reranking method that can be expressed
In particular, through the use of ``pt.apply.doc_score()``, any reranking method that can be expressed
as a function of the text of the query and the text of the document can be used as a reranker
in a PyTerrier pipeline.
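
As a minimal sketch (assuming the input dataframe already carries ``query``
and ``text`` columns, e.g. after ``pt.text.get_text()``)::

    def term_overlap(row):
        # score each document by the number of query term occurrences in its text
        return sum(row["text"].count(t) for t in row["query"].split())

    overlap_reranker = pt.apply.doc_score(term_overlap)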

@@ -45,7 +45,7 @@ function.
+-------+---------+-------------+------------------+---------------------------+----------------------+-----------------------+

In each case, the result from calling a pyterrier.apply method is another PyTerrier transformer
(i.e. extends TransformerBase), which can be used for experimentation or combined with other
(i.e. extends ``pt.Transformer``), which can be used for experimentation or combined with other
PyTerrier transformers through the standard PyTerrier operators.

If `verbose=True` is passed to any pyterrier apply method (except `generic()`), then a `TQDM <https://tqdm.github.io/>`_
85 changes: 85 additions & 0 deletions docs/experiments.rst
@@ -147,6 +147,91 @@ This provides a dataframe where each row is the performance of a given system fo

NB: For brevity, we only show the top 5 rows of the returned table.

Saving and Reusing Results
~~~~~~~~~~~~~~~~~~~~~~~~~~

For some research tasks, it is considered good practice to save your results files when conducting experiments. This offers
several advantages:

- It permits additional evaluation (e.g. more measures, more significance tests) without re-applying potentially slow transformer pipelines.
- It allows transformer results to be made available for other experiments, perhaps as a virtual data appendix in a paper.

Saving can be enabled by adding ``save_dir`` as a kwarg to pt.Experiment::

pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"],
names=["TF_IDF", "BM25"],
save_dir="./",
)

This will save two files, namely, TF_IDF.res.gz and BM25.res.gz to the current directory. If these files already exist,
they will be "reused", i.e. loaded and evaluated in preference to application of the tfidf and/or bm25 transformers.
If experiments are being conducted on multiple different topic sets, care should be taken to ensure that previous
results for a different topic set are not reused for evaluation.
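
For instance, one way to avoid such accidental reuse is to use a separate
directory per topic set (the directory names and topic variables here are
hypothetical)::

    pt.Experiment(
        [tfidf, bm25],
        dl19_topics,
        dl19_qrels,
        eval_metrics=["map", "recip_rank"],
        names=["TF_IDF", "BM25"],
        save_dir="./results/dl19",
    )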

If a transformer has been updated, outdated results files can be mistakenly used. To prevent this, set the ``save_mode``
kwarg to ``"overwrite"``::

pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"],
names=["TF_IDF", "BM25"],
save_dir="./",
save_mode="overwrite"
)

Missing Topics and/or Qrels
~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is not always a one-to-one correspondence between the topic/query IDs (qids) that appear in
the provided ``topics`` and ``qrels``. Qids that appear in topics but not qrels can be due to incomplete judgments,
such as in sparsely labeled datasets or shared tasks that choose to omit some topics (e.g., due to cost).
Qids that appear in qrels but not in topics can happen when running a subset of topics for testing purposes
(e.g., ``topics.head(5)``).

The ``filter_by_qrels`` and ``filter_by_topics`` parameters control the behaviour of an experiment when topics and qrels
do not perfectly overlap. When ``filter_by_qrels=True``, topics are filtered down to only the ones that have qids in the
qrels. Similarly, when ``filter_by_topics=True``, qrels are filtered down to only the ones that have qids in the topics.

For example, consider topics that include qids ``A`` and ``B`` and qrels that include ``B`` and ``C``. The results with
each combination of settings are:

+----------------------+----------------------+------------------+--------------------------------------------------------------------+
| ``filter_by_topics`` | ``filter_by_qrels`` | Results consider | Notes |
+======================+======================+==================+====================================================================+
| ``True`` (default) | ``False`` (default) | ``A,B`` | ``C`` is removed because it does not appear in the topics. |
+----------------------+----------------------+------------------+--------------------------------------------------------------------+
| ``True`` (default) | ``True`` | ``B`` | Acts as an intersection of the qids found in the qrels and topics. |
+----------------------+----------------------+------------------+--------------------------------------------------------------------+
| ``False`` | ``False`` (default) | ``A,B,C`` | Acts as a union of the qids found in qrels and topics. |
+----------------------+----------------------+------------------+--------------------------------------------------------------------+
| ``False`` | ``True`` | ``B,C`` | ``A`` is removed because it does not appear in the qrels. |
+----------------------+----------------------+------------------+--------------------------------------------------------------------+

Note that, following IR evaluation conventions, topics that have no relevance judgments (``A`` in the above example)
do not contribute to relevance-based measures (e.g., ``map``), but still contribute to efficiency measures (e.g., ``mrt``).
As such, aggregate relevance-based measures will not change based on the value of ``filter_by_qrels``. When ``perquery=True``,
topics that have no relevance judgments (``A``) will give a value of ``NaN``, indicating that they are not defined
and should not contribute to the average.
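
As a sketch, this behaviour can be observed by requesting per-topic results
(assuming the ``bm25`` transformer and the example topics/qrels above)::

    per_topic = pt.Experiment(
        [bm25],
        topics,
        qrels,
        eval_metrics=["map"],
        perquery=True,
    )
    # topic A, which has no judgments, is reported with a NaN value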

The defaults (``filter_by_topics=True`` and ``filter_by_qrels=False``) were chosen because they likely reflect the intent
of the user in most cases. In particular, they run all requested topics and evaluate on only those topics. However, you
may want to change these settings in some circumstances. E.g.:

- If you want to save time and avoid running topics that will not be evaluated, set ``filter_by_qrels=True``.
This can be particularly helpful for large collections with many missing judgments, such as MS MARCO.
- If you want to evaluate across all topics from the qrels, set ``filter_by_topics=False``.

Note that in all cases, if a requested topic that appears in the qrels returns no results, it will properly contribute
a score of 0 for evaluation.
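
For example, a sketch of skipping unjudged topics entirely (assuming ``bm25``
and the example topics/qrels above)::

    pt.Experiment(
        [bm25],
        topics,
        qrels,
        eval_metrics=["map"],
        filter_by_qrels=True,  # with the example above, only topic B is run
    )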



Available Evaluation Measures
=============================

4 changes: 2 additions & 2 deletions docs/ltr.rst
@@ -193,13 +193,13 @@ Example::
# learn a model for all four features
full = pipeline >> pt.ltr.apply_learned_model(RandomForestRegressor(n_estimators=400))
full.fit(trainTopics, trainQrels, validTopics, validQrels)
ranker.append(full)
rankers.append(full)
# learn a model for 3 features, removing one each time
for fid in range(numf):
ablated = pipeline >> pt.ltr.ablate_features(fid) >> pt.ltr.apply_learned_model(RandomForestRegressor(n_estimators=400))
ablated.fit(trainTopics, trainQrels, validTopics, validQrels)
rankers.append(full)
rankers.append(ablated)

# evaluate the full (4 features) model, as well as each model containing only 3 features
pt.Experiment(
2 changes: 1 addition & 1 deletion docs/neural.rst
@@ -23,7 +23,7 @@ Available Neural Dense Retrieval and Re-ranking Integrations
============================================================

- `OpenNIR <https://opennir.net/>`_ has integration with PyTerrier - see its `notebook examples <https://github.com/Georgetown-IR-Lab/OpenNIR/tree/master/examples>`_.
- `PyTerrier_ColBERT <https://github.com/terrierteam/pyterrier_colbert>`_ contains a `ColBERT <https://github.com/stanford-futuredata/ColBERT/tree/v0.2>`_ integration, including both a text-scorer and an end-to-end dense retrieval.
- `PyTerrier_ColBERT <https://github.com/terrierteam/pyterrier_colbert>`_ contains a `ColBERT <https://github.com/stanford-futuredata/ColBERT>`_ integration, including both a text-scorer and an end-to-end dense retrieval.
- `PyTerrier_ANCE <https://github.com/terrierteam/pyterrier_ance>`_ contains an `ANCE <https://github.com/microsoft/ANCE/>`_ integration for end-to-end dense retrieval.
- `PyTerrier_T5 <https://github.com/terrierteam/pyterrier_t5>`_ contains a `monoT5 <https://arxiv.org/pdf/2101.05667.pdf>`_ integration.
- `PyTerrier_doc2query <https://github.com/terrierteam/pyterrier_doc2query>`_ contains a `docT5query <https://github.com/castorini/docTTTTTquery>`_ integration.
22 changes: 12 additions & 10 deletions docs/pipeline_examples.md
@@ -4,9 +4,8 @@

### Sequential Dependence Model


```python
pt.rewrite.SDM() >> pt.BatchRetrieve(indexref, wmodel="BM25")
pipe = pt.rewrite.SDM() >> pt.BatchRetrieve(indexref, wmodel="BM25")
```

Note that the SDM() rewriter has a number of constructor parameters:
@@ -18,16 +17,17 @@ Note that the SDM() rewriter has a number of constructor parameters:

A simple QE transformer can be achieved using
```python
pt.BatchRetrieve(indexref, wmodel="BM25", controls={"qe" : "on"})
qe = pt.BatchRetrieve(indexref, wmodel="BM25", controls={"qe" : "on"})
```

As this is pseudo-relevance feedback in nature, it identifies a set of documents, extracts informative terms from the top-ranked documents, and re-executes the query.

However, more control can be achieved by using the QueryExpansion transformer separately, as follows:
```python
pt.BatchRetrieve(indexref, wmodel="BM25") >> \
pt.rewrite.QueryExpansion(indexref) >> \
qe = (pt.BatchRetrieve(indexref, wmodel="BM25") >>
pt.rewrite.QueryExpansion(indexref) >>
pt.BatchRetrieve(indexref, wmodel="BM25")
)
```

The QueryExpansion() object has the following constructor parameters:
@@ -38,20 +38,22 @@ The QueryExpansion() object has the following constructor parameters:
Note that different indexes can be used to achieve query expansion using an external collection (sometimes called collection enrichment or external feedback). For example, expanding queries using Wikipedia as an external resource, in order to obtain higher-quality reweighted queries, would look like this:

```python
pt.BatchRetrieve(wikipedia_index, wmodel="BM25") >> \
pt.rewrite.QueryExpansion(wikipedia_index) >> \
pipe = (pt.BatchRetrieve(wikipedia_index, wmodel="BM25") >>
pt.rewrite.QueryExpansion(wikipedia_index) >>
pt.BatchRetrieve(local_index, wmodel="BM25")
)
```

### RM3 Query Expansion

We also provide RM3 query expansion, by virtue of an external plugin to Terrier called [terrier-prf](https://github.com/terrierteam/terrier-prf). This needs to be loaded at initialisation time.

```python
pt.init(boot_packages=["org.terrier:terrier-prf:0.0.1-SNAPSHOT"])
pt.BatchRetrieve(indexref, wmodel="BM25") >> \
pt.rewrite.RM3(indexref) >> \
pt.init(boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"])
pipe = (pt.BatchRetrieve(indexref, wmodel="BM25") >>
pt.rewrite.RM3(indexref) >>
pt.BatchRetrieve(indexref, wmodel="BM25")
)
```
## Combining Rankings

53 changes: 50 additions & 3 deletions docs/terrier-indexing.rst
@@ -84,7 +84,22 @@ IterDictIndexer
.. autoclass:: pyterrier.IterDictIndexer
:members: index

Example indexing MSMARCO Passage Ranking dataset::
**Examples using IterDictIndexer**

An iterdict can just be a list of dictionaries::

docs = [ { 'docno' : 'doc1', 'text' : 'a b c' } ]
iter_indexer = pt.IterDictIndexer("./index")
indexref1 = iter_indexer.index(docs, meta=['docno', 'text'], meta_lengths=[20, 4096])

A dataframe can also be used, by virtue of its ``.to_dict()`` method::

df = pd.DataFrame([['doc1', 'a b c']], columns=['docno', 'text'])
iter_indexer = pt.IterDictIndexer("./index")
indexref2 = iter_indexer.index(df.to_dict(orient="records"))

However, the main power of using IterDictIndexer is for processing indefinite iterables, such as those returned by generator functions.
For example, the tsv file of the MSMARCO Passage Ranking corpus can be indexed as follows::

dataset = pt.get_dataset("trec-deep-learning-passages")
def msmarco_generate():
@@ -96,10 +111,42 @@ Example indexing MSMARCO Passage Ranking dataset::
iter_indexer = pt.IterDictIndexer("./passage_index")
indexref3 = iter_indexer.index(msmarco_generate(), meta=['docno', 'text'], meta_lengths=[20, 4096])

On UNIX-based systems, you can also perform multi-threaded indexing::
IterDictIndexer can be used in connection with :ref:`indexing_pipelines`.

Similarly, indexing JSONL files takes only a few lines of Python::

def iter_file(filename):
import json
with open(filename, 'rt') as file:
for l in file:
# assumes that each line contains 'docno', 'text' attributes
# yields a dictionary for each json line
yield json.loads(l)

indexref4 = pt.IterDictIndexer("./index").index(iter_file("/path/to/file.jsonl"), meta=['docno', 'text'], meta_lengths=[20, 4096])

NB: Use ``pt.io.autoopen()`` as a drop-in replacement for ``open()`` that supports files compressed by gzip etc.
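
For instance, a sketch of the same generator over a gzip-compressed file
(the filename here is hypothetical)::

    def iter_gz_file(filename):
        import json
        # pt.io.autoopen transparently decompresses .gz files
        with pt.io.autoopen(filename, 'rt') as file:
            for l in file:
                yield json.loads(l)

    indexref5 = pt.IterDictIndexer("./index").index(iter_gz_file("/path/to/file.jsonl.gz"), meta=['docno', 'text'], meta_lengths=[20, 4096])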

**Indexing TREC-formatted files using IterDictIndexer**

If you have TREC-formatted files that you wish to use with an IterDictIndexer-like indexer, ``pt.index.treccollection2textgen()`` can be used
as a helper function to aid in parsing such files.

.. autofunction:: pyterrier.index.treccollection2textgen

Example using Indexing Pipelines::

files = pt.io.find_files("/path/to/Disk45")
gen = pt.index.treccollection2textgen(files)
indexer = pt.text.sliding() >> pt.IterDictIndexer("./index45")
index = indexer.index(gen)

**Threading**

On UNIX-based systems, IterDictIndexer can also perform multi-threaded indexing::

iter_indexer = pt.IterDictIndexer("./passage_index_8", threads=8)
indexref4 = iter_indexer.index(msmarco_generate(), meta=['docno', 'text'], meta_lengths=[20, 4096])
indexref6 = iter_indexer.index(msmarco_generate(), meta=['docno', 'text'], meta_lengths=[20, 4096])

Note that the resulting index ordering with multiple threads is non-deterministic; if you need
deterministic behavior you must index in single-threaded mode. Furthermore, indexing can only go