Merge branch 'master' into qbs
cmacdonald committed Jan 11, 2022
2 parents 19e6950 + 8a9a455 commit 6748f97
Showing 46 changed files with 3,085 additions and 2,058 deletions.
17 changes: 12 additions & 5 deletions .github/workflows/push.yml
@@ -14,7 +14,7 @@ jobs:

strategy:
matrix:
python-version: [3.6, 3.8]
python-version: ['3.7', '3.9']
java: [11, 13]
os: ['ubuntu-latest', 'macOs-latest', 'windows-latest'] #
architecture: ['x64']
@@ -32,6 +32,7 @@ jobs:
brew install libomp
- uses: actions/checkout@v2

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v1
with:
@@ -50,12 +51,18 @@ jobs:
cd terrier-core
mvn -B -DskipTests install
# follows https://medium.com/ai2-blog/python-caching-in-github-actions-e9452698e98d
- name: Loading Python & dependencies from cache
uses: actions/cache@v2
with:
path: ${{ env.pythonLocation }}
key: ${{ env.pythonLocation }}-${{ hashFiles('requirements.txt') }}-${{ hashFiles('requirements-test.txt') }}

- name: Install Python dependencies
run: |
python -m pip install --upgrade pip
pip install --upgrade git+https://github.com/kivy/pyjnius.git#egg=pyjnius
pip install -r requirements.txt
pip install -r requirements-test.txt
pip install --upgrade --upgrade-strategy eager -r requirements.txt
pip install --upgrade --upgrade-strategy eager -r requirements-test.txt
#install this software
pip install --timeout=120 .
pip install pytest
@@ -79,4 +86,4 @@ jobs:
env:
TERRIER_VERSION: ${{ matrix.terrier }}
run: |
pytest -p no:faulthandler
pytest --durations=20 -p no:faulthandler
1 change: 1 addition & 0 deletions .readthedocs-conda-environment.yml
@@ -4,6 +4,7 @@ channels:
- conda-forge
dependencies:
- openjdk
- python=3.7
- pip
- pip:
# In Read the Docs, installing 'requirements-dev.txt' seems to be ignored when Conda is used.
12 changes: 7 additions & 5 deletions README.md
@@ -5,7 +5,7 @@

# PyTerrier

A Python API for Terrier - v.0.6
A Python API for Terrier - v.0.7

# Installation

@@ -83,10 +83,10 @@ You can see examples of how to use these, including notebooks that run on Google

Complex retrieval pipelines, including for learning-to-rank, can be constructed using PyTerrier's operator language. For example, to combine two features and make them available for learning, we can use the `**` operator.
```python
two_features = BM25_br >> ( \
two_features = BM25_br >> (
pt.BatchRetrieve(indexref, wmodel="DirichletLM") **
pt.BatchRetrieve(indexref, wmodel="PL2") \
)
pt.BatchRetrieve(indexref, wmodel="PL2")
)
```

See also the [learning to rank documentation](https://pyterrier.readthedocs.io/en/latest/ltr.html), as well as the worked examples in the [learning-to-rank notebook](examples/notebooks/ltr.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/terrier-org/pyterrier/blob/master/examples/notebooks/ltr.ipynb). Some pipelines can be automatically optimised - more details about pipeline optimisation are included in our ICTIR 2020 paper.
@@ -159,5 +159,7 @@ By downloading and using PyTerrier, you agree to cite at the undernoted paper de
- Nicola Tonellotto, University of Pisa
- Arthur Câmara, Delft University
- Alberto Ueda, Federal University of Minas Gerais
- Sean MacAvaney, Georgetown University
- Sean MacAvaney, Georgetown University/University of Glasgow
- Chentao Xu, University of Glasgow
- Sarawoot Kongyoung, University of Glasgow
- Zhan Su, Copenhagen University
8 changes: 4 additions & 4 deletions docs/apply.rst
@@ -13,11 +13,11 @@ functions) to easily transform inputs.
The table below lists the main classes of transformation in the PyTerrier data
model, as well as the appropriate apply method to use in each case. In general,
if there is a one-to-one mapping between the input and the output, then the specific
pt.apply methods should be used (i.e. `query()`, `doc_score()`, `.doc_features()`).
pt.apply methods should be used (i.e. ``query()``, ``doc_score()``, ``.doc_features()``).
If the cardinality of the dataframe changes through applying the transformer,
then `generic()` or `by_query()` must be applied.
then ``generic()`` or ``by_query()`` must be applied.
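For example, a minimal sketch of a ``generic()`` transformer that changes the
cardinality of its input by keeping only the top 3 documents per query (the
cutoff value here is purely illustrative)::

    # assumes the input dataframe has a 'qid' column and is sorted by rank per query
    top3 = pt.apply.generic(lambda df: df.groupby("qid").head(3))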

In particular, through the use of `pt.apply.doc_score()`, any reranking method that can be expressed
In particular, through the use of ``pt.apply.doc_score()``, any reranking method that can be expressed
as a function of the text of the query and the text of the document can be used as a reranker
in a PyTerrier pipeline.
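
As a minimal sketch (assuming the input dataframe already carries ``query``
and ``text`` columns, e.g. after ``pt.text.get_text()``)::

    def term_overlap(row):
        # score each document by the number of query term occurrences in its text
        return sum(row["text"].count(t) for t in row["query"].split())

    overlap_reranker = pt.apply.doc_score(term_overlap)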

@@ -45,7 +45,7 @@ function.
+-------+---------+-------------+------------------+---------------------------+----------------------+-----------------------+

In each case, the result from calling a pyterrier.apply method is another PyTerrier transformer
(i.e. extends TransformerBase), which can be used for experimentation or combined with other
(i.e. extends ``pt.Transformer``), which can be used for experimentation or combined with other
PyTerrier transformers through the standard PyTerrier operators.

If `verbose=True` is passed to any pyterrier apply method (except `generic()`), then a `TQDM <https://tqdm.github.io/>`_
85 changes: 85 additions & 0 deletions docs/experiments.rst
@@ -147,6 +147,91 @@ This provides a dataframe where each row is the performance of a given system fo

NB: For brevity, we only show the top 5 rows of the returned table.

Saving and Reusing Results
~~~~~~~~~~~~~~~~~~~~~~~~~~

For some research tasks, it is considered good practice to save your results files when conducting experiments. This offers
several advantages:

- It permits additional evaluation (e.g. more measures, more significance tests) without re-applying potentially slow transformer pipelines.
- It allows transformer results to be made available for other experiments, perhaps as a virtual data appendix in a paper.

Saving can be enabled by adding ``save_dir`` as a kwarg to pt.Experiment::

pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"],
names=["TF_IDF", "BM25"],
save_dir="./",
)

This will save two files, namely, TF_IDF.res.gz and BM25.res.gz to the current directory. If these files already exist,
they will be "reused", i.e. loaded and evaluated in preference to application of the tfidf and/or bm25 transformers.
If experiments are being conducted on multiple different topic sets, care should be taken to ensure that previous
results for a different topic set are not reused for evaluation.
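
For instance, one way to avoid such accidental reuse is to use a separate
directory per topic set (the directory names and topic variables here are
hypothetical)::

    pt.Experiment(
        [tfidf, bm25],
        dl19_topics,
        dl19_qrels,
        eval_metrics=["map", "recip_rank"],
        names=["TF_IDF", "BM25"],
        save_dir="./results/dl19",
    )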

If a transformer has been updated, outdated results files can be mistakenly used. To prevent this, set the ``save_mode``
kwarg to ``"overwrite"``::

pt.Experiment(
[tfidf, bm25],
dataset.get_topics(),
dataset.get_qrels(),
eval_metrics=["map", "recip_rank"],
names=["TF_IDF", "BM25"],
save_dir="./",
save_mode="overwrite"
)

Missing Topics and/or Qrels
~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is not always a one-to-one correspondence between the topic/query IDs (qids) that appear in
the provided ``topics`` and ``qrels``. Qids that appear in topics but not qrels can be due to incomplete judgments,
such as in sparsely labeled datasets or shared tasks that choose to omit some topics (e.g., due to cost).
Qids that appear in qrels but not in topics can happen when running a subset of topics for testing purposes
(e.g., ``topics.head(5)``).

The ``filter_by_qrels`` and ``filter_by_topics`` parameters control the behaviour of an experiment when topics and qrels
do not perfectly overlap. When ``filter_by_qrels=True``, topics are filtered down to only the ones that have qids in the
qrels. Similarly, when ``filter_by_topics=True``, qrels are filtered down to only the ones that have qids in the topics.

For example, consider topics that include qids ``A`` and ``B`` and qrels that include ``B`` and ``C``. The results with
each combination of settings are:

+----------------------+----------------------+------------------+--------------------------------------------------------------------+
| ``filter_by_topics`` | ``filter_by_qrels`` | Results consider | Notes |
+======================+======================+==================+====================================================================+
| ``True`` (default) | ``False`` (default) | ``A,B`` | ``C`` is removed because it does not appear in the topics. |
+----------------------+----------------------+------------------+--------------------------------------------------------------------+
| ``True`` (default) | ``True`` | ``B`` | Acts as an intersection of the qids found in the qrels and topics. |
+----------------------+----------------------+------------------+--------------------------------------------------------------------+
| ``False`` | ``False`` (default) | ``A,B,C`` | Acts as a union of the qids found in qrels and topics. |
+----------------------+----------------------+------------------+--------------------------------------------------------------------+
| ``False`` | ``True`` | ``B,C`` | ``A`` is removed because it does not appear in the qrels. |
+----------------------+----------------------+------------------+--------------------------------------------------------------------+

Note that, following IR evaluation conventions, topics that have no relevance judgments (``A`` in the above example)
do not contribute to relevance-based measures (e.g., ``map``), but still contribute to efficiency measures (e.g., ``mrt``).
As such, aggregate relevance-based measures will not change based on the value of ``filter_by_qrels``. When ``perquery=True``,
topics that have no relevance judgments (``A``) will give a value of ``NaN``, indicating that they are not defined
and should not contribute to the average.
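
As a sketch, this behaviour can be observed by requesting per-topic results
(assuming the ``bm25`` transformer and the example topics/qrels above)::

    per_topic = pt.Experiment(
        [bm25],
        topics,
        qrels,
        eval_metrics=["map"],
        perquery=True,
    )
    # topic A, which has no judgments, is reported with a NaN value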

The defaults (``filter_by_topics=True`` and ``filter_by_qrels=False``) were chosen because they likely reflect the intent
of the user in most cases. In particular, they run all requested topics and evaluate on only those topics. However, you
may want to change these settings in some circumstances. E.g.:

- If you want to save time and avoid running topics that will not be evaluated, set ``filter_by_qrels=True``.
This can be particularly helpful for large collections with many missing judgments, such as MS MARCO.
- If you want to evaluate across all topics from the qrels, set ``filter_by_topics=False``.

Note that in all cases, if a requested topic that appears in the qrels returns no results, it will properly contribute
a score of 0 for evaluation.
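
For example, a sketch of skipping unjudged topics entirely (assuming ``bm25``
and the example topics/qrels above)::

    pt.Experiment(
        [bm25],
        topics,
        qrels,
        eval_metrics=["map"],
        filter_by_qrels=True,  # with the example above, only topic B is run
    )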



Available Evaluation Measures
=============================

4 changes: 2 additions & 2 deletions docs/ltr.rst
@@ -193,13 +193,13 @@ Example::
# learn a model for all four features
full = pipeline >> pt.ltr.apply_learned_model(RandomForestRegressor(n_estimators=400))
full.fit(trainTopics, trainQrels, validTopics, validQrels)
ranker.append(full)
rankers.append(full)
# learn a model for 3 features, removing one each time
for fid in range(numf):
ablated = pipeline >> pt.ltr.ablate_features(fid) >> pt.ltr.apply_learned_model(RandomForestRegressor(n_estimators=400))
ablated.fit(trainTopics, trainQrels, validTopics, validQrels)
rankers.append(full)
rankers.append(ablated)

# evaluate the full (4 features) model, as well as each model containing only 3 features
pt.Experiment(
2 changes: 1 addition & 1 deletion docs/neural.rst
@@ -23,7 +23,7 @@ Available Neural Dense Retrieval and Re-ranking Integrations
============================================================

- `OpenNIR <https://opennir.net/>`_ has integration with PyTerrier - see its `notebook examples <https://github.com/Georgetown-IR-Lab/OpenNIR/tree/master/examples>`_.
- `PyTerrier_ColBERT <https://github.com/terrierteam/pyterrier_colbert>`_ contains a `ColBERT <https://github.com/stanford-futuredata/ColBERT/tree/v0.2>`_ integration, including both a text-scorer and an end-to-end dense retrieval.
- `PyTerrier_ColBERT <https://github.com/terrierteam/pyterrier_colbert>`_ contains a `ColBERT <https://github.com/stanford-futuredata/ColBERT>`_ integration, including both a text-scorer and an end-to-end dense retrieval.
- `PyTerrier_ANCE <https://github.com/terrierteam/pyterrier_ance>`_ contains an `ANCE <https://github.com/microsoft/ANCE/>`_ integration for end-to-end dense retrieval.
- `PyTerrier_T5 <https://github.com/terrierteam/pyterrier_t5>`_ contains a `monoT5 <https://arxiv.org/pdf/2101.05667.pdf>`_ integration.
- `PyTerrier_doc2query <https://github.com/terrierteam/pyterrier_doc2query>`_ contains a `docT5query <https://github.com/castorini/docTTTTTquery>`_ integration.
22 changes: 12 additions & 10 deletions docs/pipeline_examples.md
@@ -4,9 +4,8 @@

### Sequential Dependence Model


```python
pt.rewrite.SDM() >> pt.BatchRetrieve(indexref, wmodel="BM25")
pipe = pt.rewrite.SDM() >> pt.BatchRetrieve(indexref, wmodel="BM25")
```

Note that the SDM() rewriter has a number of constructor parameters:
@@ -18,16 +17,17 @@ Note that the SDM() rewriter has a number of constructor parameters:

A simple QE transformer can be achieved using
```python
pt.BatchRetrieve(indexref, wmodel="BM25", controls={"qe" : "on"})
qe = pt.BatchRetrieve(indexref, wmodel="BM25", controls={"qe" : "on"})
```

As this is pseudo-relevance feedback in nature, it identifies a set of documents, extracts informative terms from the top-ranked documents, and re-executes the query.

However, more control can be achieved by using the QueryExpansion transformer separately, as follows:
```python
pt.BatchRetrieve(indexref, wmodel="BM25") >> \
pt.rewrite.QueryExpansion(indexref) >> \
qe = (pt.BatchRetrieve(indexref, wmodel="BM25") >>
pt.rewrite.QueryExpansion(indexref) >>
pt.BatchRetrieve(indexref, wmodel="BM25")
)
```

The QueryExpansion() object has the following constructor parameters:
@@ -38,20 +38,22 @@ The QueryExpansion() object has the following constructor parameters:
Note that different indexes can be used to achieve query expansion using an external collection (sometimes called collection enrichment or external feedback). For example, expanding queries using Wikipedia as an external resource, in order to obtain higher-quality reweighted queries, would look like this:

```python
pt.BatchRetrieve(wikipedia_index, wmodel="BM25") >> \
pt.rewrite.QueryExpansion(wikipedia_index) >> \
pipe = (pt.BatchRetrieve(wikipedia_index, wmodel="BM25") >>
pt.rewrite.QueryExpansion(wikipedia_index) >>
pt.BatchRetrieve(local_index, wmodel="BM25")
)
```

### RM3 Query Expansion

We also provide RM3 query expansion, by virtue of an external plugin to Terrier called [terrier-prf](https://github.com/terrierteam/terrier-prf). This needs to be loaded at initialisation time.

```python
pt.init(boot_packages=["org.terrier:terrier-prf:0.0.1-SNAPSHOT"])
pt.BatchRetrieve(indexref, wmodel="BM25") >> \
pt.rewrite.RM3(indexref) >> \
pt.init(boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"])
pipe = (pt.BatchRetrieve(indexref, wmodel="BM25") >>
pt.rewrite.RM3(indexref) >>
pt.BatchRetrieve(indexref, wmodel="BM25")
)
```
## Combining Rankings

53 changes: 50 additions & 3 deletions docs/terrier-indexing.rst
@@ -84,7 +84,22 @@ IterDictIndexer
.. autoclass:: pyterrier.IterDictIndexer
:members: index

Example indexing MSMARCO Passage Ranking dataset::
**Examples using IterDictIndexer**

An iterdict can just be a list of dictionaries::

docs = [ { 'docno' : 'doc1', 'text' : 'a b c' } ]
iter_indexer = pt.IterDictIndexer("./index")
indexref1 = iter_indexer.index(docs, meta=['docno', 'text'], meta_lengths=[20, 4096])

A dataframe can also be used, by virtue of its ``.to_dict()`` method::

df = pd.DataFrame([['doc1', 'a b c']], columns=['docno', 'text'])
iter_indexer = pt.IterDictIndexer("./index")
indexref2 = iter_indexer.index(df.to_dict(orient="records"))

However, the main power of using IterDictIndexer is for processing indefinite iterables, such as those returned by generator functions.
For example, the tsv file of the MSMARCO Passage Ranking corpus can be indexed as follows::

dataset = pt.get_dataset("trec-deep-learning-passages")
def msmarco_generate():
@@ -96,10 +111,42 @@ Example indexing MSMARCO Passage Ranking dataset::
iter_indexer = pt.IterDictIndexer("./passage_index")
indexref3 = iter_indexer.index(msmarco_generate(), meta=['docno', 'text'], meta_lengths=[20, 4096])

On UNIX-based systems, you can also perform multi-threaded indexing::
IterDictIndexer can be used in connection with :ref:`indexing_pipelines`.

Similarly, indexing JSONL files takes only a few lines of Python::

def iter_file(filename):
import json
with open(filename, 'rt') as file:
for l in file:
# assumes that each line contains 'docno', 'text' attributes
# yields a dictionary for each json line
yield json.loads(l)

indexref4 = pt.IterDictIndexer("./index").index(iter_file("/path/to/file.jsonl"), meta=['docno', 'text'], meta_lengths=[20, 4096])

NB: Use ``pt.io.autoopen()`` as a drop-in replacement for ``open()`` that supports files compressed by gzip etc.
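
For instance, a sketch of the same generator over a gzip-compressed file
(the filename here is hypothetical)::

    def iter_gz_file(filename):
        import json
        # pt.io.autoopen transparently decompresses .gz files
        with pt.io.autoopen(filename, 'rt') as file:
            for l in file:
                yield json.loads(l)

    indexref5 = pt.IterDictIndexer("./index").index(iter_gz_file("/path/to/file.jsonl.gz"), meta=['docno', 'text'], meta_lengths=[20, 4096])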

**Indexing TREC-formatted files using IterDictIndexer**

If you have TREC-formatted files that you wish to use with an IterDictIndexer-like indexer, ``pt.index.treccollection2textgen()`` can be used
as a helper function to aid in parsing such files.

.. autofunction:: pyterrier.index.treccollection2textgen

Example using Indexing Pipelines::

files = pt.io.find_files("/path/to/Disk45")
gen = pt.index.treccollection2textgen(files)
indexer = pt.text.sliding() >> pt.IterDictIndexer("./index45")
index = indexer.index(gen)

**Threading**

On UNIX-based systems, IterDictIndexer can also perform multi-threaded indexing::

iter_indexer = pt.IterDictIndexer("./passage_index_8", threads=8)
indexref4 = iter_indexer.index(msmarco_generate(), meta=['docno', 'text'], meta_lengths=[20, 4096])
indexref6 = iter_indexer.index(msmarco_generate(), meta=['docno', 'text'], meta_lengths=[20, 4096])

Note that the resulting index ordering with multiple threads is non-deterministic; if you need
deterministic behavior you must index in single-threaded mode. Furthermore, indexing can only go