
multi_cached not preserving original keys when calling the underlying function #563

Closed

haizhou opened this issue Apr 11, 2022 · 12 comments

@haizhou
Contributor

haizhou commented Apr 11, 2022

@jkgenser and I were testing multi_cached in our project, and found that transformed keys are being passed into the underlying function call after key_builder is called. Created #564 to track.
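For illustration, a minimal sketch of what we observed (names are hypothetical; assumes the documented key_builder(key, f, *args, **kwargs) signature):

    from aiocache import multi_cached

    # key_builder transforms each key before it is used as a cache key.
    @multi_cached("ids", key_builder=lambda key, f, *args, **kwargs: f"ns:{key}")
    async def fetch(ids):
        # Expected to receive the original keys, e.g. [1, 2], but the
        # transformed keys ["ns:1", "ns:2"] are passed in instead.
        return {i: i * 10 for i in ids}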

@Dreamsorcerer
Member

Needs a test/reproducer.

@padraic-shafer
Contributor

padraic-shafer commented Jan 14, 2023

I don't want to disrupt PR #564, so I'm moving this bit of discussion here instead.

If I'm understanding correctly, you're suggesting in Interpretation 2 that a function can return keys which don't match the input? Like:

def foo(keys=(1, 2)):
    return {"a": "foo", "b": "bar"}

Well, that's an extreme example, but I suppose "legal" under Interpretation 2. :)

I was expecting something more like this...

def sales(mail_codes=(10101, 90210)):
    return {"sales_10101": 546_295, "sales_90210": 133_420}

def financials(year=(2021, 2022, 2023), stats=("revenue", "expenses")):
    return {
        "2021.revenue": 245_364, "2021.expenses": 546_295,
        "2022.revenue": 1_216_892, "2022.expenses": 748_398,
        "2023.revenue": 83_294, "2023.expenses": 56_492,
    }

I agree that these functions should know nothing about the cache.

My interpretation was that separation of concerns is achieved by:

  1. `key_builder(key, func, *args, **kwargs)` in `@multi_cached(..., key_builder=...)` converts the input keys from the decorated function into the keys that are returned by that function.
  2. The cache used by `multi_cached` calls its `build_key(key, namespace)` member to optionally add a namespace to the keys returned by the decorated function to avoid a key naming clash.
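(For reference, build_key here is roughly the following; a simplified sketch, not the exact implementation:)

    def build_key(self, key, namespace=None):
        # Prepend the namespace, if any, to avoid clashes between callers.
        ns = namespace if namespace is not None else self.namespace
        return f"{ns}{key}" if ns else key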

On subsequent review, I'm realizing why this is confusing.

In decorator, before reading from the cache, the cache keys are built by get_cache_keys(), which calls key_builder() on the keys referenced by keys_from_attr. The keys returned by the decorated function are not used here.

def get_cache_keys(self, f, args, kwargs):
    args_dict = _get_args_dict(f, args, kwargs)
    orig_keys = args_dict.get(self.keys_from_attr, []) or []
    cache_keys = [self.key_builder(key, f, *args, **kwargs) for key in orig_keys]

When set_in_cache() writes to the cache, it calls key_builder() on the keys returned by the decorated function. keys_from_attr and get_cache_keys() are not used here.

result = await f(*new_args, **kwargs)
result.update(partial)

if cache_write:
    if aiocache_wait_for_write:
        await self.set_in_cache(result, f, args, kwargs)

This means that the cache keys used to set and get generally do not match, unless the keys returned by the decorated function are the same as the keys listed in the arg named by keys_from_attr.
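To illustrate, with a hypothetical key_builder:

    # A key_builder that prefixes every key:
    key_builder = lambda key, f, *args, **kwargs: f"ns:{key}"

    # get: keys come from keys_from_attr, so the read looks up "ns:a", "ns:b".
    # set: keys come from the returned dict, so if f returns {"result_a": ...},
    #      the write goes to "ns:result_a", a key the read never looks up.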

Conclusion

So, that seems to imply that Interpretation 1 needs to be correct. It's done a bit oddly at the moment though, because the cache is read with the keys_from_attr keys, whereas the cache is set with the keys returned by the decorated function.

Should there be a check that the decorated function actually returns keys that match those in the keys_from_attr arg? Otherwise it seems like there might be potential to corrupt/abuse the cache when these keys do not match. Maybe when calling set_in_cache(), the keys of result should be overwritten with the keys_from_attr keys?

@Dreamsorcerer
Member

Dreamsorcerer commented Jan 14, 2023

def sales(mail_codes=(10101, 90210)):
    return {"sales_10101": 546_295, "sales_90210": 133_420}

def financials(year=(2021, 2022, 2023), stats=("revenue", "expenses")):
    return {
        "2021.revenue": 245_364, "2021.expenses": 546_295,
        "2022.revenue": 1_216_892, "2022.expenses": 748_398,
        "2023.revenue": 83_294, "2023.expenses": 56_492,
    }

My impression of this @multi_cached decorator is that it is fairly specific and expects a function which returns the keys it was passed, i.e. a function specifically designed to receive a list of inputs and return values indexed by those inputs.

So, supporting arbitrary functions like these seems out of scope (and the second wouldn't work anyway, since keys_from_attr only accepts a single argument). To do something more complex like the above, I think you'd just wrap the fetching function in another function:

def sales(mail_codes=(10101, 90210)):
    @multi_cached("mail_codes")
    def fetch_codes(mail_codes):
        ...

    return {"sales_{}".format(k): v for k, v in fetch_codes(mail_codes).items()}

Or something to that extent (i.e. splitting your fetching logic from your application-specific logic).

I agree that these functions should know nothing about the cache.

1. `key_builder(key, func, *args, **kwargs)` in `@multi_cached(..., key_builder=...)` converts the input keys from the decorated function into the keys that are returned by that function.

But, now your cache code needs to understand how the function transforms its input to its output, which seems to me like you're mixing the cache and function code again. I don't think they should be coupled that tightly.

I can see the logic of only transforming it on the get and expecting the function to return keys that match this, but the code actually transforms it both on the get and the set, so this behaviour didn't work before either.

2. The cache used by `multi_cached` calls its `build_key(key, namespace)` member to optionally add a namespace to the keys returned by the decorated function to avoid a key naming clash.

The only use of key_builder mentioned in the docs is as an alternative to namespace, not a totally different use case. Note that the same parameter is used on the other decorators and does the same thing (i.e. it generates the key used for the cache); there is just a little extra complication in @multi_cached's behaviour.
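i.e. these two are meant to have roughly the same effect (a sketch; assumes multi_cached also accepts a namespace argument like the other decorators, and the exact prefix format depends on the cache implementation):

    # Option 1: namespacing via the cache's namespace argument
    @multi_cached("ids", namespace="users")
    async def fetch_a(ids):
        ...

    # Option 2: a similar effect via key_builder
    @multi_cached("ids", key_builder=lambda key, f, *args, **kwargs: f"users:{key}")
    async def fetch_b(ids):
        ...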

In decorator, before reading from the cache, the cache keys are built by get_cache_keys(), which calls key_builder() on the keys referenced by keys_from_attr. The keys returned by the decorated function are not used here.

It can't use the keys returned by the function or it wouldn't be a cache; it needs to look them up without calling the function.

When set_in_cache() writes to the cache, it calls key_builder() on the keys returned by the decorated function. keys_from_attr and get_cache_keys() are not used here.

Likewise, we need to save whatever was returned from the function, so what would either of these have to do with that?

I'm not sure I'm following the confusion on that part...

This means that the cache keys used to set and get generally do not match, unless the keys returned by the decorated function are the same as the keys listed in the arg named by keys_from_attr.

Yes, I think it should be treated as a simple API that is not uncommon, e.g.:
FetchUsers(1, 4, 7) => {1: {"user_id": 1, ...}, ...}
That's a fairly common pattern, seen in REST APIs, some DB ORM code, etc.

If a function does not implement this simple API design, then it is not appropriate to put the @multi_cached decorator on it.
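For instance, a conforming function looks something like this (a sketch; fetch_user is hypothetical):

    @multi_cached("user_ids")
    async def fetch_users(user_ids):
        # Returns a dict indexed by exactly the keys it was passed.
        return {uid: await fetch_user(uid) for uid in user_ids}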

So, that seems to imply that Interpretation 1 needs to be correct. It's done a bit oddly at the moment though, because the cache is read with the keys_from_attr keys, whereas the cache is set with the keys returned by the decorated function.

I'm not sure what else you could propose. keys_from_attr just specifies where the keys are being passed to the function. I wonder if it's even needed; maybe it should just take the first argument or something. I'm not sure why you'd need additional arguments in this kind of example; it should probably just accept only a sequence of keys or use all arguments (*args) as keys...

Should there be a check that the decorated function actually returns keys that match those in the keys_from_attr arg? Otherwise it seems like there might be potential to corrupt/abuse the cache when these keys do not match.

There's some discussion on that here: #490 (comment)

Currently, I'd say that returning anything other than the requested keys is undefined behaviour, which we should narrow down and decide how it should work. Current ideas that could support returning contradictory values include eager loading (a function returns extra keys to cache, expecting a call for those keys in the future) and a no-cache (excluding keys that shouldn't be cached, maybe because the request timed out or something).

Maybe when calling set_in_cache(), the keys of result should be overwritten with the keys_from_attr keys?

I'm not sure what you're suggesting. If the result doesn't include the correct keys, how am I supposed to know which output key relates to which input key?

@Dreamsorcerer
Member

Closing because the issue is completed, but feel free to continue the discussion.

@padraic-shafer
Contributor

@Dreamsorcerer Thank you for your thoughtful consideration of these points.

My impression of this @multi_cached decorator is that it is fairly specific and expects a function which returns the keys it was passed, i.e. a function specifically designed to receive a list of inputs and return values indexed by those inputs.

I agree with you. This paradigm makes the most sense, all considered.

@padraic-shafer
Contributor

I'll add a few responses for posterity, but I think none of it changes the outcome of this discussion, i.e. that the @multi_cached decorator expects a function which returns the keys it was passed.

@padraic-shafer
Contributor

So, supporting arbitrary functions like these seems out of scope (and the second wouldn't work anyway, since keys_from_attr only accepts a single argument). To do something more complex like the above, I think you'd just wrap the fetching function in another function:

def sales(mail_codes=(10101, 90210)):
    @multi_cached("mail_codes")
    def fetch_codes(mail_codes):
        ...

    return {"sales_{}".format(k): v for k, v in fetch_codes(mail_codes).items()}

Or something to that extent (i.e. splitting your fetching logic from your application-specific logic).

I had been thinking about the inverse of that situation, where the decorator is being applied to an existing function that returns complex keys. But either way, I think you're right that a simple wrapper could/should be applied to keep the caching logic separate from the function logic.

@padraic-shafer
Contributor

But, now your cache code needs to understand how the function transforms its input to its output, which seems to me like you're mixing the cache and function code again. I don't think they should be coupled that tightly.

I originally thought this was the problem that key_builder was meant to solve: letting the cache construct the output keys (according to the function logic) without actually needing to hard code the function logic into the cache/decorator code.

But of course it's even better to keep these responsibilities completely separate, as you're suggesting.

I can see the logic of only transforming it on the get and expecting the function to return keys that match this, but the code actually transforms it both on the get and the set, so this behaviour didn't work before either.

Right. I missed this distinction until now.

@padraic-shafer
Contributor

In decorator, before reading from the cache, the cache keys are built by get_cache_keys(), which calls key_builder() on the keys referenced by keys_from_attr. The keys returned by the decorated function are not used here.

It can't use the keys returned by the function or it wouldn't be a cache; it needs to look them up without calling the function.

Well, ok, yeah...I suppose I was stating the obvious there. ;)

When set_in_cache() writes to the cache, it calls key_builder() on the keys returned by the decorated function. keys_from_attr and get_cache_keys() are not used here.

Likewise, we need to save whatever was returned from the function, so what would either of these have to do with that?

Sort of. For the other decorators, the cache serializes the exact output of the function. With multi_cached, each item of the output gets serialized individually.

For the complex-key functions I was considering in Interpretation 2, key_builder() was a translator between the function's input keys and output keys (not directly creating cache keys). But by now that's clearly an erroneous interpretation. The other option in my mind was that key_builder() should be omitted in set_in_cache(), but that's a bad idea for multiple reasons.

@padraic-shafer
Contributor

Maybe when calling set_in_cache(), the keys of result should be overwritten with the keys_from_attr keys?

I'm not sure what you're suggesting. If the result doesn't include the correct keys, how am I supposed to know which output key relates to which input key?

I had been thinking of a straight 1-to-1 replacement of the keys, something like:

response = {attr_key: response[resp_key] for (attr_key, resp_key) in zip(orig_keys, response.keys())}

...but of course that's no good because there's no guarantee that the response keys have the same order as the attribute keys. Maybe set logic would be better.

if set(response.keys()).intersection(set(orig_keys)) != set(orig_keys):
    logger.warning(f"Keys being cached by multi_cached for {f} do not match keys in {keys_from_attr}")

It's better, but now it's getting into the territory of the discussion in #490 that you mentioned. For instance, it doesn't consider the case where some keys are returned by f and other keys are not.
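A fuller check might separate the cases; something like this sketch:

    returned = set(response)
    expected = set(orig_keys)
    missing = expected - returned  # keys f did not return (partial response)
    extra = returned - expected    # keys f returned that were not requested
    # The open question is policy: warn, drop extras, error on missing, etc.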

At the moment, I don't have substantial suggestions for handling the empty / partial response cases.

@zkewal

zkewal commented Jun 14, 2024

@Dreamsorcerer The fix for this issue is not merged into the 0.12 branch, so it is not available in the latest pip package.

@Dreamsorcerer
Member

I don't think it's backwards compatible.
