Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global properties for BatchRetrieve #188

Closed
2 tasks done
seanmacavaney opened this issue Jun 11, 2021 · 4 comments
Closed
2 tasks done

Global properties for BatchRetrieve #188

seanmacavaney opened this issue Jun 11, 2021 · 4 comments
Labels
bug Something isn't working upstream
Milestone

Comments

@seanmacavaney
Copy link
Collaborator

Describe the bug

It seems that when building multiple BatchRetrieve objects, only the properties from the newest one are applied. E.g., if I do:

bm25_nostem = pt.BatchRetrieve(index_nostem, wmodel='BM25', verbose=True, properties={
    'tokeniser': 'UTFTokeniser',
    'termpipelines': '',
})
bm25_stem = pt.BatchRetrieve(index_stem, wmodel='BM25', verbose=True, properties={
    'tokeniser': 'UTFTokeniser',
    'termpipelines': 'SpanishSnowballStemmer',
})

termpipelines=SpanishSnowballStemmer is applied to both retrievers.

To Reproduce
Steps to reproduce the behavior:

  1. Which index: E.g., any two indices that used different termpipelines when constructed
  2. Which retrieval: Observed with BM25
  3. What pipeline: BatchRetrieve

Expected behavior

The properties should be applied to the corresponding retrievers only.

Documentation and Issues

Additional context
Observed when conducting experiments on irds:wikir/es13k/test, but this behavior seems like it would happen regardless of language.

@seanmacavaney seanmacavaney added the bug Something isn't working label Jun 11, 2021
@cmacdonald
Copy link
Contributor

20 years of technical debt around a global configuration state... 👎

I think the workaround is if you start using the bm25_nostem before instantiating bm25_stem, the problem goes away.

Stemming variations are the most common places where this can arise for BatchRetrieve. There may be other cases for FeaturesBatchRetrieve

The upstream issue is terrier-org/terrier-core#122

@cmacdonald
Copy link
Contributor

@seanmacavaney when you have a chance, could you review terrier-org/terrier-core#181

@cmacdonald
Copy link
Contributor

cmacdonald commented Sep 9, 2021

So I think this is now addressed, and the notebooks could be updated. Two points:

  • I dont think the tokeniser property needs to be adjusted during retrieval?
  • termpipelines are identified automatically from the index, and can be overridden by controls.

Does the documentation need updated?

@cmacdonald cmacdonald reopened this Sep 9, 2021
@cmacdonald cmacdonald added this to the 0.7 milestone Sep 17, 2021
@cmacdonald
Copy link
Contributor

notebook now updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working upstream
Projects
None yet
Development

No branches or pull requests

2 participants