DFIndexer error messages #348

Closed
2 tasks done
maxhenze opened this issue Dec 6, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@maxhenze

maxhenze commented Dec 6, 2022

While reproducing the PyTerrier Indexing Notebook, a lot of JVM exceptions occur in my notebook.

Specifically, I'm loading a dataset from ir_datasets and trying to index it:

import ir_datasets
import pandas as pd
import pyterrier as pt

if not pt.started():
  pt.init()

dataset_msmarco_document_trec_dl_2019_judged = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
docstore = dataset_msmarco_document_trec_dl_2019_judged.docs_store()

df_queries = pd.DataFrame(dataset_msmarco_document_trec_dl_2019_judged.queries_iter())
df_qrels = pd.DataFrame(dataset_msmarco_document_trec_dl_2019_judged.qrels_iter())
df_docs = pd.DataFrame(docstore.get_many_iter(df_qrels.doc_id.unique()))

index_path = "./pd_index"
indexer = pt.DFIndexer(index_path, verbose=True)

indexref = indexer.index(df_docs["body"])

This results in the following error:


  0%|                                                                               | 0/16043 [00:00<?, ?documents/s]

---------------------------------------------------------------------------
JavaException                             Traceback (most recent call last)
Input In [43], in <cell line: 1>()
----> 1 indexref = indexer.index(df_docs["body"])

File /opt/conda/lib/python3.8/site-packages/pyterrier/index.py:667, in DFIndexer.index(self, text, *args, **kwargs)
    665     javaDocCollection = TQDMSizeCollection(javaDocCollection, len(text)) 
    666 index = self.createIndexer()
--> 667 index.index(autoclass("org.terrier.python.PTUtils").makeCollection(javaDocCollection))
    668 global lastdoc
    669 lastdoc = None

File jnius/jnius_export_class.pxi:1177, in jnius.JavaMultipleMethod.__call__()

File jnius/jnius_export_class.pxi:885, in jnius.JavaMethod.__call__()

File jnius/jnius_export_class.pxi:982, in jnius.JavaMethod.call_method()

File jnius/jnius_utils.pxi:91, in jnius.check_exception()

JavaException: JVM exception occurred: For input string: "" java.lang.NumberFormatException

Additionally, if I try to use the indexer as follows:

indexref = indexer.index(df_docs["body"], df_docs["doc_id"])

the following error occurs:

  0%|                                                                               | 0/16043 [00:00<?, ?documents/s]

---------------------------------------------------------------------------
JavaException                             Traceback (most recent call last)
Input In [41], in <cell line: 1>()
----> 1 indexref = indexer.index(df_docs["body"], df_docs["doc_id"])

File /opt/conda/lib/python3.8/site-packages/pyterrier/index.py:667, in DFIndexer.index(self, text, *args, **kwargs)
    665     javaDocCollection = TQDMSizeCollection(javaDocCollection, len(text)) 
    666 index = self.createIndexer()
--> 667 index.index(autoclass("org.terrier.python.PTUtils").makeCollection(javaDocCollection))
    668 global lastdoc
    669 lastdoc = None

File jnius/jnius_export_class.pxi:1177, in jnius.JavaMultipleMethod.__call__()

File jnius/jnius_export_class.pxi:885, in jnius.JavaMethod.__call__()

File jnius/jnius_export_class.pxi:982, in jnius.JavaMethod.call_method()

File jnius/jnius_utils.pxi:91, in jnius.check_exception()

JavaException: JVM exception occurred: Could not instantiate MetaIndexBuilder org.terrier.structures.indexing.ZstdMetaIndexBuilder java.lang.IllegalArgumentException

PyTerrier is updated to the newest version. The documents I want to index have at least a length of 3 characters.

When trying to index only the first 5 documents, the problem still persists.

Because I'm following the notebook, I would expect the code to work as stated there.

Could it be a problem with my Java version or with pt.init()? Please let me know if additional information is needed.

@maxhenze maxhenze added the bug Something isn't working label Dec 6, 2022
@cmacdonald
Contributor

Hi Max

This smells like a mismatch between the PyTerrier version and the underlying Java jar files. Can you show us what PyTerrier says after pt.init()?

@maxhenze
Author

maxhenze commented Dec 6, 2022

Hi Craig,

of course.

It reports:

PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.

@cmacdonald
Contributor

cmacdonald commented Dec 6, 2022

OK, so it is not a version problem. We don't use DFIndexer much now, as its functionality is subsumed by IterDictIndexer (discussed below). Specifically, for your first example:

indexref = indexer.index(df_docs["body"]) 

Pretty sure this is wrong, as you need a docno.

Instead, the following works:

indexref = indexer.index(df_docs["body"], docno=df_docs['doc_id'])
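
The docno requirement can be checked before anything is handed to the indexer. A minimal sketch, assuming documents are plain dicts: `check_docnos` is a hypothetical helper and not part of PyTerrier, and both the 20-character limit and the uniqueness check are assumptions that should be adjusted to your own indexing configuration.

```python
def check_docnos(records, max_len=20):
    """Raise ValueError if any record lacks a usable docno.

    Hypothetical helper, not part of PyTerrier. max_len=20 is an
    assumed limit; set it to match your indexer's metadata settings.
    """
    seen = set()
    for i, rec in enumerate(records):
        docno = rec.get("docno")
        # A docno must be a non-empty string.
        if not isinstance(docno, str) or not docno:
            raise ValueError(f"record {i} has no docno")
        if len(docno) > max_len:
            raise ValueError(f"docno {docno!r} is longer than {max_len} characters")
        # Assumption: docnos are expected to be unique.
        if docno in seen:
            raise ValueError(f"duplicate docno {docno!r}")
        seen.add(docno)
    return len(seen)


records = [
    {"docno": "D1", "text": "first document"},
    {"docno": "D2", "text": "second document"},
]
print(check_docnos(records))
```

Running a check like this first turns a missing identifier into a readable Python error rather than a JVM exception.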

I would encourage you to:
(a) Use pt.IterDictIndexer, as with it you can also index dataframes. DFIndexer promotes holding corpora as dataframes, which assumes they fit in memory. Instead, any dataframe can be converted to "iter-dict" form and indexed as that:

 pt.IterDictIndexer('./idi_index').index(df_docs.rename(columns={'doc_id':'docno', 'body' : 'text'}).to_dict(orient='records'))

(b) Avoid creating a dataframe in the first place for a large collection like MSMARCO, as you can index a generator directly.

PS: I agree that for both your options DFIndexer could have had better error handling.
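
Option (b) can be sketched as a generator that renames fields on the fly, so no dataframe is ever built. This is an illustration under assumptions: `iter_docs` is a hypothetical helper, the input documents are shown as plain dicts with `doc_id` and `body` keys (mirroring the column names used above), and the PyTerrier call appears only as a comment because it needs a running JVM.

```python
def iter_docs(docs):
    """Lazily yield dicts in the docno/text layout used in this thread."""
    for doc in docs:
        yield {"docno": doc["doc_id"], "text": doc["body"]}


# Stand-in for a real corpus iterator (hypothetical sample data).
sample = [
    {"doc_id": "D1", "body": "terrier is a search engine"},
    {"doc_id": "D2", "body": "pyterrier wraps terrier"},
]

for record in iter_docs(sample):
    print(record["docno"], len(record["text"]))

# With PyTerrier started, the generator could be passed straight to the indexer:
# indexref = pt.IterDictIndexer("./idi_index").index(iter_docs(sample))
```

Because the generator yields one record at a time, memory use stays flat regardless of corpus size.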

@maxhenze
Author

maxhenze commented Dec 6, 2022

Works like a charm. Thank you for your fast replies.

The problem is resolved, but I have a follow-up question.

Let's say I'm doing an experiment like follows:

import pyterrier as pt

dataset = pt.get_dataset("trec-deep-learning-docs")

bm25 = pt.BatchRetrieve.from_dataset(dataset, "terrier_stemmed", wmodel="BM25")

pt.Experiment(
    [bm25],
    dataset.get_topics("test"),
    dataset.get_qrels("test"),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10"],
    names=["BM25"],
)

This would rank 43 queries, with 16K qrels, against 3.2M documents. This results in the following scores:
[screenshot: grafik]

But of course the 16K qrels don't cover all 3.2M documents, so I could do the scoring only on the documents occurring in the qrels.

This is, by the way, the problem I'm trying to solve: I want to manipulate the document text, and doing that on all 3.2M documents would be too intensive (and wasteful).

With your mentioned approach I would thus do the following:

import ir_datasets
import pandas as pd  # needed for pd.DataFrame below
import pyterrier as pt

dataset_msmarco_document_trec_dl_2019_judged = ir_datasets.load("msmarco-document/trec-dl-2019/judged")
docstore = dataset_msmarco_document_trec_dl_2019_judged.docs_store()

df_queries = pt.get_dataset("trec-deep-learning-docs").get_topics("test")
df_qrels = pt.get_dataset("trec-deep-learning-docs").get_qrels("test")
df_docs = pd.DataFrame(docstore.get_many_iter(df_qrels.docno.unique()))

df_docs = df_docs.rename(columns={'body':'text', 'doc_id':'docno'})

index_path = "./pd_index"
indexer = pt.IterDictIndexer(index_path)

indexref = indexer.index(df_docs.to_dict(orient="records"))

bm25 = pt.BatchRetrieve(indexref, wmodel="BM25")

pt.Experiment(
    [bm25],
    df_queries,
    df_qrels,
    eval_metrics=["map", "recip_rank", "ndcg_cut_10"],
    names=["BM25"],
)

But this results in:
[screenshot: grafik]

What might I be missing? The stemmer and stopword settings should be the defaults, so I didn't set them manually in the indexer.

@cmacdonald cmacdonald changed the title from "Problems with replicating Indexing Notebook" to "DFIndexer error messages" Dec 6, 2022
@cmacdonald
Contributor

Your post doesn't make clear what is unexpected in the results.

To debug...

  • You can report the number of retrieved documents, the number of relevant documents, recall, etc.
  • You may want to set the cutoff levels for the eval measures.

You may also want to change the number of results retrieved:

index = pt.IndexFactory.of(indexref)
bm25 = pt.BatchRetrieve(index, wmodel='BM25', num_results=len(index))
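
The first debugging suggestion can be approximated without any PyTerrier machinery. A rough sketch, assuming results are available as (qid, docno) pairs and qrels as (qid, docno, label) triples: `retrieval_stats` is a hypothetical helper, and treating any label above zero as relevant is an assumption.

```python
from collections import defaultdict


def retrieval_stats(results, qrels):
    """Per-query counts of retrieved, relevant, and relevant-retrieved docs."""
    retrieved = defaultdict(set)
    for qid, docno in results:
        retrieved[qid].add(docno)

    relevant = defaultdict(set)
    for qid, docno, label in qrels:
        if label > 0:  # assumption: label > 0 means relevant
            relevant[qid].add(docno)

    stats = {}
    # Only queries with at least one relevant document are reported.
    for qid in relevant:
        hits = retrieved[qid] & relevant[qid]
        stats[qid] = {
            "num_ret": len(retrieved[qid]),
            "num_rel": len(relevant[qid]),
            "num_rel_ret": len(hits),
            "recall": len(hits) / len(relevant[qid]),
        }
    return stats


# Hypothetical toy data, for illustration only.
results = [("q1", "D1"), ("q1", "D3"), ("q2", "D2")]
qrels = [("q1", "D1", 1), ("q1", "D2", 2), ("q1", "D3", 0), ("q2", "D2", 1)]
for qid, row in sorted(retrieval_stats(results, qrels).items()):
    print(qid, row)
```

If the number of retrieved documents is capped at the same value for every query, the retrieval cutoff is a plausible place to look first.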

@maxhenze
Author

maxhenze commented Dec 6, 2022

Never mind. It seems I mixed a few things up. Nevertheless, thanks for your help 👍

@maxhenze maxhenze closed this as completed Dec 6, 2022
cmacdonald added a commit that referenced this issue Dec 8, 2022
…ons (#349)

* improve error messages for invalid indexing configurations
* type checking iterdict, as requested
* refactored, error when docno too long (rather than warn), support for fifo

Co-authored-by: Sean MacAvaney <sean.macavaney@gmail.com>