DFIndexer error messages #348

maxhenze · 2022-12-06T10:43:48Z

While reproducing the PyTerrier indexing notebook, I get a lot of JVM exceptions.

Specifically, I'm loading an ir_datasets dataset and trying to index it:

import ir_datasets
import pandas as pd
import pyterrier as pt

if not pt.started():
  pt.init()

dataset_msmarco_document_trec_dl_2019_judged = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
docstore = dataset_msmarco_document_trec_dl_2019_judged.docs_store()

df_queries = pd.DataFrame(dataset_msmarco_document_trec_dl_2019_judged.queries_iter())
df_qrels = pd.DataFrame(dataset_msmarco_document_trec_dl_2019_judged.qrels_iter())
df_docs = pd.DataFrame(docstore.get_many_iter(df_qrels.doc_id.unique()))

index_path = "./pd_index"
indexer = pt.DFIndexer(index_path, verbose=True)

indexref = indexer.index(df_docs["body"])

This results in the following error:


  0%|                                                                               | 0/16043 [00:00<?, ?documents/s]

---------------------------------------------------------------------------
JavaException                             Traceback (most recent call last)
Input In [43], in <cell line: 1>()
----> 1 indexref = indexer.index(df_docs["body"])

File /opt/conda/lib/python3.8/site-packages/pyterrier/index.py:667, in DFIndexer.index(self, text, *args, **kwargs)
    665     javaDocCollection = TQDMSizeCollection(javaDocCollection, len(text)) 
    666 index = self.createIndexer()
--> 667 index.index(autoclass("org.terrier.python.PTUtils").makeCollection(javaDocCollection))
    668 global lastdoc
    669 lastdoc = None

File jnius/jnius_export_class.pxi:1177, in jnius.JavaMultipleMethod.__call__()

File jnius/jnius_export_class.pxi:885, in jnius.JavaMethod.__call__()

File jnius/jnius_export_class.pxi:982, in jnius.JavaMethod.call_method()

File jnius/jnius_utils.pxi:91, in jnius.check_exception()

JavaException: JVM exception occurred: For input string: "" java.lang.NumberFormatException

Additionally, if I try to use the indexer as follows:

indexref = indexer.index(df_docs["body"], df_docs["doc_id"])

the following error occurs:

  0%|                                                                               | 0/16043 [00:00<?, ?documents/s]

---------------------------------------------------------------------------
JavaException                             Traceback (most recent call last)
Input In [41], in <cell line: 1>()
----> 1 indexref = indexer.index(df_docs["body"], df_docs["doc_id"])

File /opt/conda/lib/python3.8/site-packages/pyterrier/index.py:667, in DFIndexer.index(self, text, *args, **kwargs)
    665     javaDocCollection = TQDMSizeCollection(javaDocCollection, len(text)) 
    666 index = self.createIndexer()
--> 667 index.index(autoclass("org.terrier.python.PTUtils").makeCollection(javaDocCollection))
    668 global lastdoc
    669 lastdoc = None

File jnius/jnius_export_class.pxi:1177, in jnius.JavaMultipleMethod.__call__()

File jnius/jnius_export_class.pxi:885, in jnius.JavaMethod.__call__()

File jnius/jnius_export_class.pxi:982, in jnius.JavaMethod.call_method()

File jnius/jnius_utils.pxi:91, in jnius.check_exception()

JavaException: JVM exception occurred: Could not instantiate MetaIndexBuilder org.terrier.structures.indexing.ZstdMetaIndexBuilder java.lang.IllegalArgumentException

PyTerrier is updated to the newest version. The documents I want to index are at least 3 characters long.

When trying to index only the first 5 documents, the problem still persists.

Because I'm following the notebook, I would expect the code to work as stated there.

Could it be a problem with my Java version or with pt.init()? Please let me know if additional information is needed.

I have checked the PyTerrier documentation for relevant content
I have checked for previous relevant PyTerrier issues


cmacdonald · 2022-12-06T10:50:39Z

Hi Max

This smells like a mismatch between the PyTerrier version and the underlying Java jar files. Can you show us what PyTerrier says after pt.init()?

maxhenze · 2022-12-06T10:51:49Z

Hi Craig,

of course.

It reports:

PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.

cmacdonald · 2022-12-06T12:45:38Z

OK, so it's not a version problem. We don't use DFIndexer much now, as its functionality is subsumed by IterDictIndexer (discussed below). Specifically, for your first example:

indexref = indexer.index(df_docs["body"])

Pretty sure this is wrong, as you need a docno.

Instead, the following works:

indexref = indexer.index(df_docs["body"], docno=df_docs['doc_id'])

I would encourage you to:
(a) Use pt.IterDictIndexer, as you can also index dataframes with it. DFIndexer promotes use of corpora as dataframes, which assumes they can be held in memory. Instead, a dataframe can be converted to "iter-dict" form and indexed as that:

 pt.IterDictIndexer('./idi_index').index(df_docs.rename(columns={'doc_id':'docno', 'body' : 'text'}).to_dict(orient='records'))

(b) Not create a dataframe in the first place for a large collection like MSMARCO, as you can index directly from a generator that yields the documents.
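
As a minimal sketch of option (b): the generator below turns document records into the dicts that IterDictIndexer consumes, optionally keeping only selected documents, so no dataframe is ever built. The `Doc` tuple and the `iter_docs_as_dicts` helper are hypothetical names used for illustration; the `doc_id` and `body` fields mirror the columns used earlier in this thread.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Optional, Set


class Doc(NamedTuple):
    """Hypothetical stand-in for a document record with doc_id and body fields."""
    doc_id: str
    body: str


def iter_docs_as_dicts(
    docs: Iterable[Doc], keep_ids: Optional[Set[str]] = None
) -> Iterator[Dict[str, str]]:
    """Lazily yield {'docno', 'text'} dicts, one per document.

    If keep_ids is given, documents whose doc_id is not in it are skipped.
    """
    for doc in docs:
        if keep_ids is not None and doc.doc_id not in keep_ids:
            continue
        yield {"docno": doc.doc_id, "text": doc.body}


# Small in-memory demonstration; a real corpus would be streamed instead.
sample = [Doc("D1", "first document"), Doc("D2", "second document")]
records = list(iter_docs_as_dicts(sample, keep_ids={"D2"}))
print(records)
```

Assuming PyTerrier is initialised and `docstore` and the judged document ids come from the first post, the generator could be handed straight to the indexer, for example `pt.IterDictIndexer('./idi_index').index(iter_docs_as_dicts(docstore.get_many_iter(judged_ids)))`, where `judged_ids` is a hypothetical name for the unique doc ids in the qrels.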

PS: I agree that for both your options DFIndexer could have had better error handling.
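
Until clearer errors are available, a pre-flight check on the inputs can catch the missing-docno case before the JVM does. The sketch below is not part of PyTerrier; `check_docnos` is a hypothetical helper, and the 20-character limit is an assumption about the default docno metadata length that should be adjusted to match the indexer's settings.

```python
from typing import Iterable, List, Optional


def check_docnos(
    texts: Iterable[str],
    docnos: Optional[Iterable[object]],
    max_docno_len: int = 20,  # assumed default meta length; adjust as needed
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    texts = list(texts)
    if docnos is None:
        return ["no docnos given; every document needs a docno"]

    docnos = [str(d) for d in docnos]
    problems = []
    if len(docnos) != len(texts):
        problems.append(f"{len(texts)} texts but {len(docnos)} docnos")
    if any(d == "" for d in docnos):
        problems.append("empty docno found")
    if len(set(docnos)) != len(docnos):
        problems.append("duplicate docnos found")
    too_long = [d for d in docnos if len(d) > max_docno_len]
    if too_long:
        problems.append(
            f"{len(too_long)} docnos longer than {max_docno_len} characters"
        )
    return problems


# The first call mirrors indexing text without any docno.
print(check_docnos(["some text"], None))
print(check_docnos(["some text", "more text"], ["D1", "D2"]))
```

Because the helper only iterates over its arguments, dataframe columns such as `df_docs["body"]` and `df_docs["doc_id"]` can be passed in directly.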

maxhenze · 2022-12-06T13:02:51Z

Works like a charm. Thank you for your fast replies.

The problem is resolved, but I have a follow-up question.

Let's say I'm doing an experiment as follows:

import pyterrier as pt

dataset = pt.get_dataset("trec-deep-learning-docs")

bm25 = pt.BatchRetrieve.from_dataset(dataset, "terrier_stemmed", wmodel="BM25")

pt.Experiment(
    [bm25],
    dataset.get_topics("test"),
    dataset.get_qrels("test"),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10"],
    names=["BM25"],
)

This would rank 3.2M documents for 43 queries and evaluate them against 16K qrels. This results in the following scores:

But of course the 16K qrels don't cover all 3.2M documents, so I could do the scoring only on the documents occurring in the qrels.

This is, by the way, the problem I'm trying to solve: I want to manipulate the document text, and that would be too intensive (and wasteful) if I did it on all 3.2M documents.

With your mentioned approach I would thus do the following:

import ir_datasets
import pandas as pd  # needed for pd.DataFrame below
import pyterrier as pt

dataset_msmarco_document_trec_dl_2019_judged = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
docstore = dataset_msmarco_document_trec_dl_2019_judged.docs_store()

df_queries = pt.get_dataset("trec-deep-learning-docs").get_topics("test")
df_qrels = pt.get_dataset("trec-deep-learning-docs").get_qrels("test")
df_docs = pd.DataFrame(docstore.get_many_iter(df_qrels.docno.unique()))

df_docs = df_docs.rename(columns={'body':'text', 'doc_id':'docno'})

index_path = "./pd_index"
indexer = pt.IterDictIndexer(index_path)

indexref = indexer.index(df_docs.to_dict(orient="records"))

bm25 = pt.BatchRetrieve(indexref, wmodel="BM25")

pt.Experiment(
    [bm25],
    df_queries,
    df_qrels,
    eval_metrics=["map", "recip_rank", "ndcg_cut_10"],
    names=["BM25"],
)

But this results in:

What might be the part I'm missing? The stemmer and stopword settings should be the defaults, so I didn't set them manually in the indexer.

cmacdonald · 2022-12-06T13:15:53Z

Your post doesn't make clear what is unexpected in the results.

To debug...

you can report the number of retrieved documents, number of relevant docs, recall etc.
You may want to set the cutoff levels for the eval measures

You may also want to change the number of results retrieved:

index = pt.IndexFactory.of(indexref)
bm25 = pt.BatchRetrieve(index, wmodel='BM25', num_results=len(index))
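
The first debugging suggestion (retrieved count, relevant count, recall) can be computed by hand as a sanity check. This is a minimal sketch under stated assumptions, not PyTerrier's evaluation code: `per_query_recall` is a hypothetical helper, results are taken as (qid, docno) pairs, qrels as (qid, docno, label) triples, and a label of 1 or more is assumed to mean relevant.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_query_recall(
    results: Iterable[Tuple[str, str]],
    qrels: Iterable[Tuple[str, str, int]],
    min_label: int = 1,  # assumed relevance threshold
) -> Dict[str, Dict[str, float]]:
    """Per query: number retrieved, number relevant, relevant retrieved, recall.

    Queries without any relevant document in the qrels are left out.
    """
    relevant = defaultdict(set)
    for qid, docno, label in qrels:
        if int(label) >= min_label:
            relevant[str(qid)].add(str(docno))

    retrieved = defaultdict(set)
    for qid, docno in results:
        retrieved[str(qid)].add(str(docno))

    report = {}
    for qid, rel in relevant.items():
        ret = retrieved.get(qid, set())
        found = len(rel & ret)
        report[qid] = {
            "num_ret": len(ret),
            "num_rel": len(rel),
            "num_rel_ret": found,
            "recall": found / len(rel),
        }
    return report


# Toy data: two of three judged documents are relevant, one of them is retrieved.
toy_results = [("q1", "d1"), ("q1", "d2")]
toy_qrels = [("q1", "d1", 1), ("q1", "d3", 2), ("q1", "d2", 0)]
print(per_query_recall(toy_results, toy_qrels))
```

With dataframes that have `qid`, `docno` and `label` columns, the inputs could be produced with `df[["qid", "docno"]].itertuples(index=False, name=None)` and the analogous three-column call for the qrels. A recall well below 1.0 on an index built only from judged documents would point to a docno mismatch or a too-small `num_results`.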

maxhenze · 2022-12-06T14:10:36Z

Never mind. It seems I mixed a few things up. Nevertheless, thanks for your help 👍

…ons (#349) * improve error messages for invalid indexing configurations * type checking iterdict, as requested * refactored, error when docno too long (rather than warn), support for fifo Co-authored-by: Sean MacAvaney <sean.macavaney@gmail.com>

maxhenze added the bug label Dec 6, 2022

cmacdonald changed the title ~~Problems with replicating Indexing Notebook~~ DFIndexer error messages Dec 6, 2022

maxhenze closed this as completed Dec 6, 2022

cmacdonald mentioned this issue Dec 6, 2022

improve error messages for invalid indexing configurations #349

Merged
