This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix #8518 (sync requests being cached wrongly on timeout) #9358

Merged
merged 7 commits into from
Feb 24, 2021

Conversation

ShadowJonathan
Contributor

@ShadowJonathan ShadowJonathan commented Feb 9, 2021

This fixes #8518 by adding a conditional check on SyncResult, as a sanity check, that detects when prev_stream_token == current_stream_token. In CachedResponse.set(), the result is immediately popped from the cache if the conditional function returns False.

This prevents a timed-out SyncResult (whose next_key is the same stream key that produced it) from being cached. Without this, the cache could return a SyncResult that makes the client request the same stream key over and over again, leaving it stuck in a loop of requesting and immediately receiving a response for as long as the cache keeps those values.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
  • Pull request includes a sign off
  • Code style is correct (run the linters)

Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>

Fixes #8518
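The eviction described above can be sketched roughly as follows (the class and method names here are illustrative stand-ins, not Synapse's actual ResponseCache, which is Deferred-based):

```python
from typing import Any, Callable, Dict


class ConditionalCacheSketch:
    """Illustrative stand-in for the behaviour described above."""

    def __init__(self) -> None:
        self._cache: Dict[str, Any] = {}

    def set(self, key: str, result: Any,
            conditional: Callable[[Any], bool]) -> Any:
        # Store the result, then immediately pop it again if the
        # conditional rejects it, e.g. a timed-out sync whose next
        # token equals the token the client sent.
        self._cache[key] = result
        if not conditional(result):
            self._cache.pop(key, None)
        return result

    def get(self, key: str) -> Any:
        return self._cache.get(key)


cache = ConditionalCacheSketch()
since_token = "s5"
# Timed-out sync: next_batch == since_token, so it must not be cached.
cache.set("sync-req", {"next_batch": "s5"},
          lambda r: since_token != r["next_batch"])
assert cache.get("sync-req") is None
```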

@ShadowJonathan
Contributor Author

ShadowJonathan commented Feb 9, 2021

I could make this cleaner by using contextvars, but I've heard that it does not yet work correctly with Twisted, so I'm not doing that now; this could be replaced with contextvars once it does work. (See twisted/twisted#1262 for the fixing PR.)

@ShadowJonathan ShadowJonathan changed the title Add in NoTimedCache to prevent sync loops Add in NoTimedCache to prevent cached SyncResult looping through self-reference Feb 9, 2021
@ShadowJonathan ShadowJonathan changed the title Add in NoTimedCache to prevent cached SyncResult looping through self-reference Fix #8518 (sync requests being cached wrongly on timeout) Feb 9, 2021
@clokep clokep requested a review from a team February 9, 2021 18:57
@ShadowJonathan
Contributor Author

(Note: I am fully aware this is kind of a hack, and I'm prodding with this PR to see if this hack is acceptable for the problem.)

@erikjohnston
Member

Thanks for digging into the bug. I think your approach is probably fine; it doesn't really fill me with joy, but oh well. Perhaps one refinement/alternative would be to have ResponseCache take a should_cache: Callable[[T], bool] arg, which is called on every returned result to see if it should be cached? The default would effectively be lambda _: True (i.e. cache everything), and for the sync cache it could be either lambda r: since_token != r.next_batch or even just bool (as the sync results are falsey if they're empty, and we may simply just not want to cache empty results).

Jonathan/et al.: Thoughts?

(This will also need a test)
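The suggested should_cache hook could look roughly like this synchronous sketch (CacheWithHook is a made-up name; the real ResponseCache works with Deferreds):

```python
from typing import Any, Callable, Dict


class CacheWithHook:
    """Made-up synchronous stand-in for a cache with a should_cache hook."""

    def __init__(self, should_cache: Callable[[Any], bool] = lambda _: True):
        self._should_cache = should_cache
        self._data: Dict[str, Any] = {}

    def wrap(self, key: str, fn: Callable[[], Any]) -> Any:
        if key in self._data:
            return self._data[key]
        result = fn()
        # Only store the result if the hook accepts it.
        if self._should_cache(result):
            self._data[key] = result
        return result


# The default hook caches everything, including empty results:
c1 = CacheWithHook()
c1.wrap("k", lambda: [])
assert "k" in c1._data

# Using plain `bool` skips falsey (empty) results, as suggested:
c2 = CacheWithHook(should_cache=bool)
c2.wrap("k", lambda: [])
assert "k" not in c2._data
```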

@ShadowJonathan
Contributor Author

ShadowJonathan commented Feb 22, 2021

(as the sync results are falsey if they're empty, and we may simply just not want to cache empty results)

I don't want to bet on this and make the response depend on falsey/truthy evaluation, which can be wonky, but I'll look at should_cache. Thanks for the in-depth reply!

I'll also look at making a test somewhere

@ShadowJonathan
Contributor Author

I made it so that ResponseCache has a second method, wrap_conditional, which takes a callable as a positional argument; all such conditionals have to return True for the result to actually be cached in a timed fashion.

@ShadowJonathan
Contributor Author

ShadowJonathan commented Feb 22, 2021

@erikjohnston there seems to be no test_responsecache in tests.util.caches. Do you:

a. want me to make one, and only add a test for the conditional;
b. want me to make one, add a test for the conditional, and then some other tests (e.g. timeout, and actually checking whether it's cached or not);
c. want me to do nothing for now, and have the tests arrive in another PR later?

@erikjohnston
Member

Ugh, I really thought there were tests already. I think adding tests should be easy enough, so if you could do that then that would be great. Some simple tests plus some tests of the new logic would be fab, assuming that by the time you've done the tests for the new logic it'll be trivial to do some simple tests as well.

@ShadowJonathan ShadowJonathan changed the title Fix #8518 (sync requests being cached wrongly on timeout) Fix #8518 (sync requests being cached wrongly on timeout) and add ResponseCache tests Feb 22, 2021
@ShadowJonathan

This comment has been minimized.

@ShadowJonathan

This comment has been minimized.

@ShadowJonathan
Contributor Author

Some conversation in #synapse-dev uncovered that adding tests in this PR probably isn't the best option, so I'll make another PR that depends on this one and adds tests for ResponseCache.

@ShadowJonathan ShadowJonathan changed the title Fix #8518 (sync requests being cached wrongly on timeout) and add ResponseCache tests Fix #8518 (sync requests being cached wrongly on timeout) Feb 22, 2021
@ShadowJonathan ShadowJonathan mentioned this pull request Feb 22, 2021
@ShadowJonathan

This comment has been minimized.

Member

@anoadragon453 anoadragon453 left a comment

This looks fine to me, thanks for clearly describing what caused the bug in the description!

synapse/util/caches/response_cache.py (outdated, resolved)
synapse/util/caches/response_cache.py (outdated, resolved)
@ShadowJonathan
Contributor Author

ShadowJonathan commented Feb 24, 2021

Oh, the description is actually outdated right now, let me update that.

Edit: The previous approach used a new type called NoTimedCache; the new approach just adds a new function, wrap_conditional, to ResponseCache, to optionally signal not to cache the result when it is returned.

@anoadragon453
Member

Thanks! Merging as the tests in #9458 are passing.

@anoadragon453 anoadragon453 merged commit f5c93fc into matrix-org:develop Feb 24, 2021
sync_config.request_key,
lambda result: since_token != result.next_batch,
Member

this should have had a comment. Why does doing this give the right behaviour? (I shouldn't have to go and find the PR that added it, and even having done so it's hard to correlate the description to the code.)

Contributor Author

@ShadowJonathan ShadowJonathan Mar 2, 2021

It was described in the PR description, but I'll add a comment here in another PR, i.e. "small fixes for ResponseCache".

Member

It was described in the PR description,

yeah as I said: having to go and find the PR and grok it is a pain.

Contributor Author

Alright, I'll add a comment describing what that conditional does 👍
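A sketch of the comment being asked for, with illustrative names (neither should_cache_sync nor this SyncResult is Synapse's real definition):

```python
from collections import namedtuple

# Illustrative stand-in for the real SyncResult type.
SyncResult = namedtuple("SyncResult", ["next_batch"])


def should_cache_sync(result: SyncResult, since_token: str) -> bool:
    # A timed-out /sync comes back empty, with next_batch equal to the
    # since token the client sent. Caching such a result would hand the
    # client the same token straight back on its next request, making it
    # loop. Only cache results whose stream position actually advanced.
    return since_token != result.next_batch


assert should_cache_sync(SyncResult(next_batch="s6"), "s5")
assert not should_cache_sync(SyncResult(next_batch="s5"), "s5")
```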

Comment on lines +120 to +121
def add_conditional(self, key: T, conditional: Callable[[Any], bool]):
self.pending_conditionals.setdefault(key, set()).add(conditional)
Member

this seems like a dangerous thing to add to the public interface. It seems like it would be very easy to use in racy way.

Contributor Author

Should I signal that add_conditional is private by prefixing it with _ (in a new PR)?

Member

since it's only called in one place, I'd just inline it.

# See if there's already a result on this key that hasn't yet completed. Due to the single-threaded nature of
# python, adding a key immediately in the same execution thread will not cause a race condition.
result = self.get(key)
if not result or isinstance(result, defer.Deferred) and not result.called:
Member

I think this is correct, but I had to go and look up the precedence of and vs or to check it. Please use parens to disambiguate in future.
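The precedence in question can be verified exhaustively: `not` binds tighter than `and`, which binds tighter than `or`, so the unparenthesised condition parses as the explicit form below.

```python
from itertools import product

# Check all eight combinations (a ~ result truthiness, b ~
# isinstance(result, Deferred), c ~ result.called):
for a, b, c in product([False, True], repeat=3):
    assert (not a or b and not c) == ((not a) or (b and (not c)))
```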

Member

@richvdh richvdh left a comment

sorry to write a bunch of comments after the event; I wanted to review it in the light of #9507 and these were a few things I spotted.

TL;DR is that this could be easier to grok but I can't see any obvious bugs here.

Comment on lines +140 to +141
if not result or isinstance(result, defer.Deferred) and not result.called:
self.add_conditional(key, should_cache)
Member

@richvdh richvdh Mar 2, 2021

is it correct that the conditional is added when not result.called ? I'd argue that cachability of the result should be a property of the callback being used (and hence set at the same time as that callback is set) rather than any subsequent callbacks (which are discarded).

Contributor Author

@ShadowJonathan ShadowJonathan Mar 2, 2021

Operator precedence is not -> and -> or (tightest first), so this is actually

(not result) or (isinstance(result, defer.Deferred) and (not result.called))

I had to check after your previous comment. It could actually be if not result and not result.called, but because result can be a plain return value, I need the isinstance(..., defer.Deferred) check.

Yes, the conditional is added only to be executed after the result is called. I do now realise that any future call can say "hey, this shouldn't be cached anyway, because the returned value doesn't agree with what I already got from somewhere else", in which case I need to rewrite this to execute the conditional locally whenever it's called, and evict the cache entry when it's not valid according to the new conditional. (The callLater already in flight would simply execute anyway; to my knowledge there is no way to properly garbage-collect it early at that point. If it were an asyncio Task, I could map it somewhere and then .cancel() it; if this is possible with callLater handles or something, please tell me.)

(This last approach (letting callLater still fire) could potentially evict a cache entry early if a subsequent quick wrap call has the same key, though I don't know how much of a problem that'd be, seeing as it'd evict the cache shortly after the call, and only if:

  • the result was deemed to be time-cached (with conditionals or not)
  • after the function's return, another function calls wrap_conditional with a conditional that evaluates to False
  • shortly after that wrap_conditional call and the then-immediate cache eviction, a function (within the remaining timeout) calls wrap* with the same key... wait.

Oh, this might actually be a problem: if the function (with the same key) hasn't returned yet on the second wrap* call, then the previous callLater will evict the deferred, which could lead to more wonky stuff, but the fact that it evicts a call in flight is already bad enough.)

Member

As discussed at considerable length in the room: I don't agree with your logic here. I think that supporting multiple should_cache callbacks per key significantly complicates the implementation, and is semantically dubious at best.

This comment was marked as outdated.

Contributor Author

@ShadowJonathan ShadowJonathan Mar 2, 2021

Alright, I'll just make it a simple Dict[T, Callable] mapping then 👍

Contributor Author

Actually, more discussion in #9522 on this.

Comment on lines +139 to +141
result = self.get(key)
if not result or isinstance(result, defer.Deferred) and not result.called:
self.add_conditional(key, should_cache)
Member

I feel like this is duplicating half the logic of wrap, which is kinda the worst of all worlds. Could you not make should_cache nullable, and have wrap call wrap_conditional rather than the other way around?

Contributor Author

I initially did that, but that became a bit of a mess, and you'd have to shift around the positional arguments on any wrap calls currently in place.

I also didn't want to make it a keyword argument, because that'd shadow any potential keyword arguments to the inner call. While I know it is trivial to "just add it", I didn't want to, because it is non-explicit to anyone not aware of this change to ResponseCache, and so it could become a bug.

Member

I initially did that, but that became a bit of a mess, and you'd have to shift around the positional arguments on any wrap calls currently in place.

I'm not suggesting changing the signature of wrap.

Wrap would be:

    def wrap(
        self, key: T, callback: "Callable[..., Any]", *args: Any, **kwargs: Any
    ) -> defer.Deferred:
        return self.wrap_conditional(key, None, callback, *args, **kwargs)

why is that a mess?

Contributor Author

Hmmm, I'll change it to that in that PR, then 👍

@clokep clokep mentioned this pull request Mar 2, 2021
clokep added a commit that referenced this pull request Mar 2, 2021
Revert "Fix #8518 (sync requests being cached wrongly on timeout) (#9358)"

This reverts commit f5c93fc.

This is being backed out due to a regression (#9507) and additional
review feedback being provided.
@clokep
Member

clokep commented Mar 2, 2021

I've backed this out in aee1076. I think we'll want to:

  • Re-land this with the additional edge-case handled.
  • Ensure the tests cover the additional edge-case.

I think the current known issue of #8518 is better than the unknown impact of the regression issue #9507. Unfortunately with this change we did not have confidence in creating a new release candidate or deploying to matrix.org, which are necessities for the team.


Successfully merging this pull request may close these issues.

Sync requests immediately return with empty payload