
improved caching #165

Closed
wants to merge 7 commits
Conversation

cmacdonald (Contributor)

  • new caching backend
  • new access via transformer.cache()
  • cache on multiple columns

@seanmacavaney (Collaborator) left a comment


A few thoughts about the broad approach here:

  1. This places a lot of importance on the __repr__s, as they are used to identify file keys. There's some potential for problems here if changes are made to components. For instance, BatchRetrieve's looks like this:
BR(/Users/sean/.pyterrier/corpora/vaswani/index/data.properties,{'terrierql': 'on', 'parsecontrols': 'on', [SNIP], 'querying.processes': 'terrierql:TerrierQLParser,parsecontrols:TerrierQLToControls,parseql:TerrierQLToMatchingQueryTerms,matchopql:MatchingOpQLParser,apply[SNIP] 'termpipelines': 'Stopwords,PorterStemmer'})

If the controls here change (Maybe one is added? Or a class is renamed in core?) the cache record would become invalid, even if it means the same thing semantically. I'm not sure if there is any way around this, though.

The person who writes __repr__ for a component may be unaware of this use of the value. Within the main pyterrier library, it can be controlled well enough, but in extension libraries it will be harder to moderate (e.g., onir_pt).

Maybe it's worth having a completely separate method for this, like .cache_key()? This could then handle two cases: (1) transformers that should not be cached could throw an exception, rather than falling back to the "object at 0x" representation, and (2) there could be some attempt to normalize the key across software revisions. (A rough sketch appears at the end of this comment.)

  2. The current setup cannot really handle caching the results of neural re-rankers because the keys can conflict across datasets. (E.g., qid=1 means different things in different datasets.) Caching here can be beneficial, such as if you wanted to test different passage aggregation strategies. There are even some edge cases for BatchRetrieve that could cause conflicts: transformer.search() assigns a qid of 1 by default, conflicting with past search invocations.

It seems like a more direct approach would be to have the user explicitly specify a cache key for the entire dataframe. For instance:

pt.Experiment([
  bm25 >> cache('my-cache'),
  bm25 >> cache('my-cache') >> RM3 >> bm25
], ...)

The cache transformer would either call the previous transformer (if my-cache doesn't exist) or read from the cache (if it does). my-cache could just be a file path, with the dataframes stored as pickles (maybe compressed?) or similar.
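Something like the following, roughly (a sketch only; `cache`/`CacheTransformer` is hypothetical and not an existing PyTerrier API, and the wrapped-transformer form stands in for however the `>>` spelling would be wired up internally):

```python
import os
import pickle

import pandas as pd


class CacheTransformer:
    """Hypothetical sketch: serve a whole result dataframe from disk if the
    named cache exists, otherwise run the wrapped transformer and store it."""

    def __init__(self, key, inner, cache_dir="./df_cache"):
        self.path = os.path.join(cache_dir, key + ".pkl")
        self.inner = inner  # the transformer whose output we cache, e.g. bm25
        os.makedirs(cache_dir, exist_ok=True)

    def transform(self, topics: pd.DataFrame) -> pd.DataFrame:
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:   # cache hit: skip retrieval entirely
                return pickle.load(f)
        res = self.inner.transform(topics)     # cache miss: run the wrapped stage
        with open(self.path, "wb") as f:       # and persist its result
            pickle.dump(res, f)
        return res
```

The `bm25 >> cache('my-cache')` spelling would need the composition to hand the preceding transformer to the cache so it can short-circuit on a hit, but the storage behaviour would be the same.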

This approach would have other benefits too -- I feel like I often want to keep a copy of the resulting dataframes, not just get back the performance. By throwing a cache("something") at the end of my pipelines, I could be confident I could just pull it back up later if needed --- even by just using cache("something") as the first element of my pipeline.
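To make the .cache_key() idea from point 1 concrete, a rough sketch (hypothetical class and method names, not part of this PR):

```python
class MyRetriever:
    """Hypothetical component that opts in to caching by exposing a stable key."""

    def __init__(self, index_path, controls):
        self.index_path = index_path
        self.controls = controls

    def cache_key(self) -> str:
        # Only settings that affect the output go into the key, sorted so the
        # key stays stable across revisions that merely reorder them.
        settings = ",".join(f"{k}={v}" for k, v in sorted(self.controls.items()))
        return f"MyRetriever({self.index_path};{settings})"


class MyUncacheableTransformer:
    """Hypothetical component whose results should never be cached."""

    def cache_key(self) -> str:
        # Refuse loudly rather than falling back to a memory-address repr.
        raise TypeError("MyUncacheableTransformer results cannot be cached")
```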

for index, row in input_res.iterrows():
    qid = str(row["qid"])
import pyterrier as pt
iter = input_res.itertuples(index=False)
Collaborator
the global iter function is shadowed here; rename to it or iterator?

  self.requests += 1
  try:
-     df = self.chest.get(qid, None)
+     df = self.chest.get(key, None)
  except:
Collaborator
more specific exception handler here?
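For example, the lookup above could become something like this (just a sketch; the exact exception types that chest can raise here are an assumption):

```python
import logging  # assumed to be available at module level

try:
    df = self.chest.get(key, None)
except (KeyError, OSError) as e:
    # Swallow only the errors expected from a missing or unreadable cache
    # entry; anything unexpected should propagate instead of being hidden.
    logging.warning("could not read cached result for %s: %s", key, e)
    df = None
```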

@cmacdonald (Contributor, Author)

If the controls here change (Maybe one is added? Or a class is renamed in core?) the cache record would become invalid, even if it means the same thing semantically. I'm not sure if there is any way around this, though.

I think this is by design - in the case of BR, if the controls change, then the results are assumed to change. Cache results are not intended to be long-lived; think more of a series of experiments that use the same baseline.

The person who writes __repr__ for a component may be unaware of this use of the value. Within the main pyterrier library, it can be controlled well enough, but in extension libraries it will be harder to moderate (e.g., onir_pt).

I think that's why it will throw an error if you try to cache something that isn't cachable, i.e. that doesn't define a reasonable __repr__. Isn't this a case of making it easy to do the first integration, then exposing more functionality incrementally?

Relatedly, pipelines using pt.apply cannot be cached, as a lambda cannot be guaranteed to be the same between invocations.

In general, there will need to be more documentation on how to build custom transformers, particularly if we integrate type checking.

The current setup cannot really handle caching the results of neural re-rankers because the keys can conflict across datasets. (E.g., qid=1 means different things in different datasets.)

In this case the user should cache on=['qid', 'query'].
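For illustration, with the API proposed in this PR (the exact cache() signature may still change, and bm25/monoT5/topics are placeholders for whatever stages and topics you are using):

```python
# Key cached rows on both qid and query text, so qid collisions across
# datasets (or across .search() calls) do not return the wrong cached rows.
cached_monot5 = monoT5.cache(on=["qid", "query"])
res = (bm25 % 100 >> cached_monot5).transform(topics)
```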

I feel like I often want to keep a copy of the resulting dataframes, not just get back the performance

See also #163. I guess here we are discussing if caching is the same use case as #163 or not.

There are even some edge cases for BatchRetrieve that could cause conflicts: transformer.search() assigns a qid of 1 by default, conflicting with past search invocations.

Yes, perhaps .search() should use "search01" or something. I didn't want to have a counter though. 'search-%d' % time?
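i.e. something like this (sketch only):

```python
import time

# Give each ad-hoc .search() call a qid derived from the current time, so
# repeated searches do not all collide on the default qid of "1".
qid = 'search-%d' % time.time()
```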

bm25 >> cache('my-cache')

I always thought of a cache transformer as wrapping a transformer rather than being composed with it. Now I'm not so sure that it can't be addressed, at least internally, by composition.

While this bm25 >> cache form may semantically work, I think my idea was to make this faster to write using the ~ operator. Inevitably, it would end up being bm25 >> pt.cache().
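i.e. the two candidate spellings would be (a sketch; bm25 and RM3 are assumed to be defined as in the earlier example, and pt.cache() is the hypothetical form under discussion rather than an existing API):

```python
import pyterrier as pt

cached = ~bm25                 # operator form
# cached = bm25 >> pt.cache()  # composed form, hypothetical spelling
pipeline = cached >> RM3 >> bm25
```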

What is being keyed on may be addressed by the pipeline validation, when pipeline components have to declare what their inputs and outputs are. Would you prefer to park this until pipeline validation is ready, and then revisit it?

@cmacdonald marked this pull request as draft on May 20, 2021 at 11:25
@CodingTil

What would be required to make the ChestCacheTransformer also work on rerankers?

Caching results on expensive multi-stage rerankers (such as monoT5 & duoT5) is crucial for good computational performance.

Currently, the concatenation operator (^) is insufficient for this task.
If I had the pipeline (BM25%1000 ^ (BM25%1000>>monoT5%500) ^ (BM25%1000>>monoT5%500>>duoT5%25)),
i.e., I want the top 1000 documents, but the top 500 even better sorted, and the top 25 best sorted, then BM25 would run 3 times and monoT5 twice (very expensive).
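In other words, the behaviour I want amounts to roughly the following (a sketch; bm25, monoT5, duoT5 and topics are assumed to be defined as in the pipeline above):

```python
# Run each expensive stage exactly once and let later stages rerank the
# already-materialised frame, instead of re-running BM25 three times and
# monoT5 twice inside the concatenated pipeline.
bm25_res = (bm25 % 1000)(topics)      # BM25 runs once
mono_res = (monoT5 % 500)(bm25_res)   # monoT5 reranks that output once
duo_res = (duoT5 % 25)(mono_res)      # duoT5 reranks the monoT5 output once
# The three frames can then be combined (with the ^ semantics) without
# recomputing any of the earlier stages.
```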

For a project of mine I am doing the optimized version manually (see https://github.com/CodingTil/2023_24---IRTM---Group-Project/blob/main/py_css/models/baseline.py#L71 and https://github.com/CodingTil/2023_24---IRTM---Group-Project/blob/main/py_css/models/base.py#L100), but I believe it would help a lot of users if this was a native feature.

@cmacdonald (Contributor, Author)

@CodingTil see https://github.com/seanmacavaney/pyterrier-caching for a reranking cacher
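Rough usage of that package looks like the following; the class and constructor arguments shown are recalled from its README and should be checked there rather than taken as definitive (monot5_scorer and bm25 are placeholders):

```python
from pyterrier_caching import ScorerCache

# Sketch only: wrap an expensive neural scorer so that (query, docno) pairs
# scored once are served from the cache on later runs. The constructor
# arguments (a cache path and the scorer to wrap) are assumptions based on
# the package's documentation.
cached_monot5 = ScorerCache('monot5.cache', monot5_scorer)
pipeline = bm25 % 1000 >> cached_monot5 % 500
```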

We're going to mature that separate package until it's ready to integrate properly into pyterrier.

@seanmacavaney (Collaborator)

^ Note that ScorerCache from the package will not be suitable for caching results from DuoT5, since it only caches results based on query and docno. The "caveats" section hints at this, but I'll make it more direct.

@cmacdonald (Contributor, Author)

I think this is stale and superseded by pyterrier-caching; let's close.

@cmacdonald closed this Aug 23, 2024