
improved caching #165

Closed
wants to merge 7 commits
Conversation

cmacdonald (Contributor)

  • new caching backend
  • new access via transformer.cache()
  • cache on multiple columns

@seanmacavaney (Collaborator) left a comment


A few thoughts about the broad approach here:

  1. This places a lot of importance on the __repr__s, as they are used to identify file keys. There's some potential for problems here if changes are made to components. For instance, BatchRetrieve's looks like this:
BR(/Users/sean/.pyterrier/corpora/vaswani/index/data.properties,{'terrierql': 'on', 'parsecontrols': 'on', [SNIP], 'querying.processes': 'terrierql:TerrierQLParser,parsecontrols:TerrierQLToControls,parseql:TerrierQLToMatchingQueryTerms,matchopql:MatchingOpQLParser,apply[SNIP] 'termpipelines': 'Stopwords,PorterStemmer'})

If the controls here change (Maybe one is added? Or a class is renamed in core?) the cache record would become invalid, even if it means the same thing semantically. I'm not sure if there is any way around this, though.

The person who writes __repr__ for a component may be unaware of this use of the value. Within the main pyterrier library, it can be controlled well enough, but in extension libraries it will be harder to moderate (e.g., onir_pt).

Maybe it's worth having a completely separate method for this, like .cache_key()? This could then handle two cases: (1) transformers that should not be cached could throw an exception, rather than falling back to the "object at 0x" representation, and (2) there could be some attempt to normalize the key across software revisions. (A rough sketch appears at the end of this comment.)

  2. The current setup cannot really handle caching the results of neural re-rankers because the keys can conflict across datasets. (E.g., qid=1 means different things in different datasets.) Caching here can be beneficial, such as if you wanted to test different passage aggregation strategies. There are even some edge cases for BatchRetrieve that could cause conflicts: transformer.search() assigns a qid of 1 by default, conflicting with past search invocations.

It seems like a more direct approach would be to have the user explicitly specify a cache key for the entire dataframe. For instance:

pt.Experiment([
  bm25 >> cache('my-cache'),
  bm25 >> cache('my-cache') >> RM3 >> bm25
], ...)

The cache transformer would either call the previous transformer (if my-cache doesn't exist) or read from the cache (if it does). my-cache could just be a file path, with the dataframes stored as pickles (maybe compressed?) or similar.
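Something like the following, roughly (a sketch only; `cache`/`CacheTransformer` is hypothetical and not an existing PyTerrier API, and the wrapped-transformer form stands in for however the `>>` spelling would be wired up internally):

```python
import os
import pickle

import pandas as pd


class CacheTransformer:
    """Hypothetical sketch: serve a whole result dataframe from disk if the
    named cache exists, otherwise run the wrapped transformer and store it."""

    def __init__(self, key, inner, cache_dir="./df_cache"):
        self.path = os.path.join(cache_dir, key + ".pkl")
        self.inner = inner  # the transformer whose output we cache, e.g. bm25
        os.makedirs(cache_dir, exist_ok=True)

    def transform(self, topics: pd.DataFrame) -> pd.DataFrame:
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:   # cache hit: skip retrieval entirely
                return pickle.load(f)
        res = self.inner.transform(topics)     # cache miss: run the wrapped stage
        with open(self.path, "wb") as f:       # and persist its result
            pickle.dump(res, f)
        return res
```

The `bm25 >> cache('my-cache')` spelling would need the composition to hand the preceding transformer to the cache so it can short-circuit on a hit, but the storage behaviour would be the same.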

This approach would have other benefits too -- I feel like I often want to keep a copy of the resulting dataframes, not just get back the performance. By throwing a cache("something") at the end of my pipelines, I could be confident I could just pull it back up later if needed --- even by just using cache("something") as the first element of my pipeline.
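To make the .cache_key() idea from point 1 concrete, a rough sketch (hypothetical class and method names, not part of this PR):

```python
class MyRetriever:
    """Hypothetical component that opts in to caching by exposing a stable key."""

    def __init__(self, index_path, controls):
        self.index_path = index_path
        self.controls = controls

    def cache_key(self) -> str:
        # Only settings that affect the output go into the key, sorted so the
        # key stays stable across revisions that merely reorder them.
        settings = ",".join(f"{k}={v}" for k, v in sorted(self.controls.items()))
        return f"MyRetriever({self.index_path};{settings})"


class MyUncacheableTransformer:
    """Hypothetical component whose results should never be cached."""

    def cache_key(self) -> str:
        # Refuse loudly rather than falling back to a memory-address repr.
        raise TypeError("MyUncacheableTransformer results cannot be cached")
```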

for index, row in input_res.iterrows():
    qid = str(row["qid"])
import pyterrier as pt
iter = input_res.itertuples(index=False)
Collaborator
the global iter function is shadowed here; rename to it or iterator?

  self.requests += 1
  try:
-     df = self.chest.get(qid, None)
+     df = self.chest.get(key, None)
  except:
Collaborator
more specific exception handler here?
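For example, the lookup above could become something like this (just a sketch; the exact exception types that chest can raise here are an assumption):

```python
import logging  # assumed to be available at module level

try:
    df = self.chest.get(key, None)
except (KeyError, OSError) as e:
    # Swallow only the errors expected from a missing or unreadable cache
    # entry; anything unexpected should propagate instead of being hidden.
    logging.warning("could not read cached result for %s: %s", key, e)
    df = None
```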

@cmacdonald (Contributor, Author)

If the controls here change (Maybe one is added? Or a class is renamed in core?) the cache record would become invalid, even if it means the same thing semantically. I'm not sure if there is any way around this, though.

I think this is by design - in the case of BR, if the controls change, then the results are assumed to change. Cache results are not intended to be long-lived; think more of a series of experiments that use the same baseline.

The person who writes __repr__ for a component may be unaware of this use of the value. Within the main pyterrier library, it can be controlled well enough, but in extension libraries it will be harder to moderate (e.g., onir_pt).

I think that's why it will throw an error if you try to cache something that isn't cachable, i.e. that doesn't define a reasonable __repr__. Isn't this a case of making it easy to do the first integration, then exposing more functionality incrementally?

Relatedly, pipelines using pt.apply cannot be cached, as a lambda cannot be guaranteed to be the same between invocations.

In general, there will need to be more documentation on how to build custom transformers, particularly if we integrate type checking.

The current setup cannot really handle caching the results of neural re-rankers because the keys can conflict across datasets. (E.g., qid=1 means different things in different datasets.)

In this case the user should cache on=['qid', 'query'].
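For illustration, with the API proposed in this PR (the exact cache() signature may still change, and bm25/monoT5/topics are placeholders for whatever stages and topics you are using):

```python
# Key cached rows on both qid and query text, so qid collisions across
# datasets (or across .search() calls) do not return the wrong cached rows.
cached_monot5 = monoT5.cache(on=["qid", "query"])
res = (bm25 % 100 >> cached_monot5).transform(topics)
```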

I feel like I often want to keep a copy of the resulting dataframes, not just get back the performance

See also #163. I guess here we are discussing if caching is the same use case as #163 or not.

There are even some edge cases for BatchRetrieve that could cause conflicts: transformer.search() assigns a qid of 1 by default, conflicting with past search invocations.

Yes, perhaps .search() should use "search01" or something. I didn't want to have a counter though. 'search-%d' % time?
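i.e. something like this (sketch only):

```python
import time

# Give each ad-hoc .search() call a qid derived from the current time, so
# repeated searches do not all collide on the default qid of "1".
qid = 'search-%d' % time.time()
```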

bm25 >> cache('my-cache')

I always thought of a cache transformer as wrapping a transformer rather than being composed with it. Now I'm not so sure that it can't be addressed, at least internally, by composition.

While this bm25 >> cache form may semantically work, I think my idea was to make this faster to write using the ~ operator. Inevitably, it would end up being bm25 >> pt.cache().
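i.e. the two candidate spellings would be (a sketch; bm25 and RM3 are assumed to be defined as in the earlier example, and pt.cache() is the hypothetical form under discussion rather than an existing API):

```python
import pyterrier as pt

cached = ~bm25                 # operator form
# cached = bm25 >> pt.cache()  # composed form, hypothetical spelling
pipeline = cached >> RM3 >> bm25
```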

What is being keyed on may be addressed by the pipeline validation, when pipeline components have to declare what their inputs and outputs are. Would you prefer to park this until pipeline validation is ready, and then revisit it?

@cmacdonald marked this pull request as draft on May 20, 2021 at 11:25
@CodingTil

What would be required to make the ChestCacheTransformer also work on rerankers?

Caching results on expensive multi-stage rerankers (such as monoT5 & duoT5) is crucial for good computational performance.

Currently, the concatenation operator (^) is insufficient for this task.
If I had the pipeline (BM25%1000 ^ (BM25%1000>>monoT5%500) ^ (BM25%1000>>monoT5%500>>duoT5%25)),
i.e., I want the top 1000 documents, but the top 500 even better sorted, and the top 25 best sorted, then BM25 would run 3 times and monoT5 twice (very expensive).
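In other words, the behaviour I want amounts to roughly the following (a sketch; bm25, monoT5, duoT5 and topics are assumed to be defined as in the pipeline above):

```python
# Run each expensive stage exactly once and let later stages rerank the
# already-materialised frame, instead of re-running BM25 three times and
# monoT5 twice inside the concatenated pipeline.
bm25_res = (bm25 % 1000)(topics)      # BM25 runs once
mono_res = (monoT5 % 500)(bm25_res)   # monoT5 reranks that output once
duo_res = (duoT5 % 25)(mono_res)      # duoT5 reranks the monoT5 output once
# The three frames can then be combined (with the ^ semantics) without
# recomputing any of the earlier stages.
```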

For a project of mine I am doing the optimized version manually (see https://github.com/CodingTil/2023_24---IRTM---Group-Project/blob/main/py_css/models/baseline.py#L71 and https://github.com/CodingTil/2023_24---IRTM---Group-Project/blob/main/py_css/models/base.py#L100), but I believe it would help a lot of users if this was a native feature.

@cmacdonald (Contributor, Author)

@CodingTil see https://github.com/seanmacavaney/pyterrier-caching for a reranking cacher
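Rough usage of that package looks like the following; the class and constructor arguments shown are recalled from its README and should be checked there rather than taken as definitive (monot5_scorer and bm25 are placeholders):

```python
from pyterrier_caching import ScorerCache

# Sketch only: wrap an expensive neural scorer so that (query, docno) pairs
# scored once are served from the cache on later runs. The constructor
# arguments (a cache path and the scorer to wrap) are assumptions based on
# the package's documentation.
cached_monot5 = ScorerCache('monot5.cache', monot5_scorer)
pipeline = bm25 % 1000 >> cached_monot5 % 500
```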

We're going to mature that separate package until it's ready to integrate properly into pyterrier.

@seanmacavaney (Collaborator)

^ Note that ScorerCache from the package will not be suitable for caching results from DuoT5, since it only caches results based on query and docno. The "caveats" section hints at this, but I'll make it more direct.

@cmacdonald (Contributor, Author)

I think this is stale and superseded by pyterrier-caching; let's close.

@cmacdonald closed this Aug 23, 2024