This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix #8518 (sync requests being cached wrongly on timeout) #9358

Merged
merged 7 commits into from
Feb 24, 2021

Conversation

ShadowJonathan
Contributor

@ShadowJonathan ShadowJonathan commented Feb 9, 2021

This fixes #8518 by adding a conditional check on SyncResult, as a sanity check, that detects when prev_stream_token == current_stream_token. In CachedResponse.set(), the result is immediately popped from the cache if the conditional function returns False.

This prevents a timed-out SyncResult (whose next_key is the same stream key that produced it) from being cached. Without this, the cache could return a SyncResult that makes the client request the same stream key over and over again, leaving it stuck in a loop of requesting and immediately receiving a response for as long as the cache keeps those values.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
  • Pull request includes a sign off
  • Code style is correct (run the linters)

Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>

Fixes #8518
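The eviction described above can be sketched roughly as follows (the class and method names here are illustrative stand-ins, not Synapse's actual ResponseCache, which is Deferred-based):

```python
from typing import Any, Callable, Dict


class ConditionalCacheSketch:
    """Illustrative stand-in for the behaviour described above."""

    def __init__(self) -> None:
        self._cache: Dict[str, Any] = {}

    def set(self, key: str, result: Any,
            conditional: Callable[[Any], bool]) -> Any:
        # Store the result, then immediately pop it again if the
        # conditional rejects it, e.g. a timed-out sync whose next
        # token equals the token the client sent.
        self._cache[key] = result
        if not conditional(result):
            self._cache.pop(key, None)
        return result

    def get(self, key: str) -> Any:
        return self._cache.get(key)


cache = ConditionalCacheSketch()
since_token = "s5"
# Timed-out sync: next_batch == since_token, so it must not be cached.
cache.set("sync-req", {"next_batch": "s5"},
          lambda r: since_token != r["next_batch"])
assert cache.get("sync-req") is None
```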

@ShadowJonathan
Contributor Author

ShadowJonathan commented Feb 9, 2021

I could make this cleaner by using contextvars, but I've heard that it does not yet work correctly with Twisted, so I'm not doing that now; this could be replaced with contextvars once it does work. (See twisted/twisted#1262 for the fixing PR.)

@ShadowJonathan ShadowJonathan changed the title Add in NoTimedCache to prevent sync loops Add in NoTimedCache to prevent cached SyncResult looping through self-reference Feb 9, 2021
@ShadowJonathan ShadowJonathan changed the title Add in NoTimedCache to prevent cached SyncResult looping through self-reference Fix #8518 (sync requests being cached wrongly on timeout) Feb 9, 2021
@clokep clokep requested a review from a team February 9, 2021 18:57
@ShadowJonathan
Contributor Author

(Note: I am fully aware this is kind of a hack, and I'm prodding with this PR to see if this hack is acceptable for the problem.)

@erikjohnston
Member

Thanks for digging into the bug. I think your approach is probably fine; it doesn't really fill me with joy, but oh well. Perhaps one refinement/alternative would be to have ResponseCache take a should_cache: Callable[[T], bool] arg, which is called on every returned result to see if it should be cached? The default would effectively be lambda _: True (i.e. cache everything), and for the sync cache it could be either lambda r: since_token != r.next_batch or even just bool (as the sync results are falsey if they're empty, and we may simply just not want to cache empty results).

Jonathan/et al.: Thoughts?

(This will also need a test)
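The suggested should_cache hook could look roughly like this synchronous sketch (CacheWithHook is a made-up name; the real ResponseCache works with Deferreds):

```python
from typing import Any, Callable, Dict


class CacheWithHook:
    """Made-up synchronous stand-in for a cache with a should_cache hook."""

    def __init__(self, should_cache: Callable[[Any], bool] = lambda _: True):
        self._should_cache = should_cache
        self._data: Dict[str, Any] = {}

    def wrap(self, key: str, fn: Callable[[], Any]) -> Any:
        if key in self._data:
            return self._data[key]
        result = fn()
        # Only store the result if the hook accepts it.
        if self._should_cache(result):
            self._data[key] = result
        return result


# The default hook caches everything, including empty results:
c1 = CacheWithHook()
c1.wrap("k", lambda: [])
assert "k" in c1._data

# Using plain `bool` skips falsey (empty) results, as suggested:
c2 = CacheWithHook(should_cache=bool)
c2.wrap("k", lambda: [])
assert "k" not in c2._data
```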

@ShadowJonathan
Contributor Author

ShadowJonathan commented Feb 22, 2021

(as the sync results are falsey if they're empty, and we may simply just not want to cache empty results)

I don't want to bet on this and make the response depend on falsey/truthy evaluation, which can be wonky, but I'll look at should_cache. Thanks for the in-depth reply!

I'll also look at making a test somewhere

@ShadowJonathan
Contributor Author

I made it so that ResponseCache has a second method, wrap_conditional, which takes a callable as a positional argument; all such conditionals have to return True for the result to actually be cached in a timed fashion.

@ShadowJonathan
Contributor Author

ShadowJonathan commented Feb 22, 2021

@erikjohnston there seems to be no test_responsecache in tests.util.caches. Do you:

a. want me to make one, and only add a test for the conditional;
b. want me to make one, add a test for the conditional, and then some other tests (e.g. timeout, and actually checking whether it's cached or not);
c. want me to do nothing for now, and have the tests arrive in another PR later?

@erikjohnston
Member

Ugh, I really thought there were tests already. I think adding tests should be easy enough, so if you could do that then that would be great. Some simple tests plus some tests of the new logic would be fab, assuming that by the time you've done the tests for the new logic it'll be trivial to do some simple tests as well.

@ShadowJonathan ShadowJonathan changed the title Fix #8518 (sync requests being cached wrongly on timeout) Fix #8518 (sync requests being cached wrongly on timeout) and add ResponseCache tests Feb 22, 2021
@ShadowJonathan

This comment has been minimized.

@ShadowJonathan

This comment has been minimized.

@ShadowJonathan
Contributor Author

Some conversation in #synapse-dev uncovered that adding tests in this PR probably isn't the best option, so I'll make another PR that depends on this one and adds tests for ResponseCache.

@ShadowJonathan ShadowJonathan changed the title Fix #8518 (sync requests being cached wrongly on timeout) and add ResponseCache tests Fix #8518 (sync requests being cached wrongly on timeout) Feb 22, 2021
@ShadowJonathan ShadowJonathan mentioned this pull request Feb 22, 2021
@ShadowJonathan

This comment has been minimized.

Member

@anoadragon453 anoadragon453 left a comment

This looks fine to me, thanks for clearly describing what caused the bug in the description!

synapse/util/caches/response_cache.py (outdated, resolved)
synapse/util/caches/response_cache.py (outdated, resolved)
@ShadowJonathan
Contributor Author

ShadowJonathan commented Feb 24, 2021

Oh, the description is actually outdated right now, let me update that.

Edit: The previous approach used a new type called NoTimedCache; the new approach just adds a new function, wrap_conditional, to ResponseCache, to optionally signal not to cache the result when it is returned.

@anoadragon453
Member

Thanks! Merging as the tests in #9458 are passing.

@anoadragon453 anoadragon453 merged commit f5c93fc into matrix-org:develop Feb 24, 2021
sync_config.request_key,
lambda result: since_token != result.next_batch,
Member

this should have had a comment. Why does doing this give the right behaviour? (I shouldn't have to go and find the PR that added it, and even having done so it's hard to correlate the description to the code.)

Contributor Author

@ShadowJonathan ShadowJonathan Mar 2, 2021

It was described in the PR description, but I'll add a comment here in another PR, i.e. "small fixes for ResponseCache".

Member

It was described in the PR description,

yeah as I said: having to go and find the PR and grok it is a pain.

Contributor Author

Alright, I'll add a comment describing what that conditional does 👍
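A sketch of the comment being asked for, with illustrative names (neither should_cache_sync nor this SyncResult is Synapse's real definition):

```python
from collections import namedtuple

# Illustrative stand-in for the real SyncResult type.
SyncResult = namedtuple("SyncResult", ["next_batch"])


def should_cache_sync(result: SyncResult, since_token: str) -> bool:
    # A timed-out /sync comes back empty, with next_batch equal to the
    # since token the client sent. Caching such a result would hand the
    # client the same token straight back on its next request, making it
    # loop. Only cache results whose stream position actually advanced.
    return since_token != result.next_batch


assert should_cache_sync(SyncResult(next_batch="s6"), "s5")
assert not should_cache_sync(SyncResult(next_batch="s5"), "s5")
```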

Comment on lines +120 to +121
def add_conditional(self, key: T, conditional: Callable[[Any], bool]):
self.pending_conditionals.setdefault(key, set()).add(conditional)
Member

this seems like a dangerous thing to add to the public interface. It seems like it would be very easy to use in racy way.

Contributor Author

Should I signal that add_conditional is private by prefixing it with _ (in a new PR)?

Member

since it's only called in one place, I'd just inline it.

# See if there's already a result on this key that hasn't yet completed. Due to the single-threaded nature of
# python, adding a key immediately in the same execution thread will not cause a race condition.
result = self.get(key)
if not result or isinstance(result, defer.Deferred) and not result.called:
Member

I think this is correct, but I had to go and look up the precedence of and vs or to check it. Please use parens to disambiguate in future.
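The precedence in question can be verified exhaustively: `not` binds tighter than `and`, which binds tighter than `or`, so the unparenthesised condition parses as the explicit form below.

```python
from itertools import product

# Check all eight combinations (a ~ result truthiness, b ~
# isinstance(result, Deferred), c ~ result.called):
for a, b, c in product([False, True], repeat=3):
    assert (not a or b and not c) == ((not a) or (b and (not c)))
```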

Member

@richvdh richvdh left a comment

sorry to write a bunch of comments after the event; I wanted to review it in the light of #9507 and these were a few things I spotted.

TL;DR is that this could be easier to grok but I can't see any obvious bugs here.

Comment on lines +140 to +141
if not result or isinstance(result, defer.Deferred) and not result.called:
self.add_conditional(key, should_cache)
Member

@richvdh richvdh Mar 2, 2021

is it correct that the conditional is added when not result.called ? I'd argue that cachability of the result should be a property of the callback being used (and hence set at the same time as that callback is set) rather than any subsequent callbacks (which are discarded).

Contributor Author

@ShadowJonathan ShadowJonathan Mar 2, 2021

Operator precedence is not -> and -> or (tightest first), so this is actually

(not result) or (isinstance(result, defer.Deferred) and (not result.called))

I had to check after your previous comment. It could actually be if not result and not result.called, but because result can be a plain return value, I need the isinstance(..., defer.Deferred) check.

Yes, the conditional is added only to be executed after the result is called. I do now realise that any future call can say "hey, this shouldn't be cached anyway, because the returned value doesn't agree with what I already got from somewhere else", in which case I need to rewrite this to execute the conditional locally whenever it's called, and evict the cache entry when it's not valid according to the new conditional. (The callLater already in flight would simply execute anyway; to my knowledge there is no way to properly garbage-collect it early at that point. If it were an asyncio Task, I could map it somewhere and then .cancel() it; if this is possible with callLater handles or something, please tell me.)

(This last approach (letting callLater still fire) could potentially evict a cache entry early if a subsequent quick wrap call has the same key, though I don't know how much of a problem that'd be, seeing as it'd evict the cache shortly after the call, and only if:

  • the result was deemed to be time-cached (with conditionals or not)
  • after the function's return, another function calls wrap_conditional with a conditional that evaluates to False
  • shortly after that wrap_conditional call and the then-immediate cache eviction, a function (within the remaining timeout) calls wrap* with the same key... wait.

Oh, this might actually be a problem: if the function (with the same key) hasn't returned yet on the second wrap* call, then the previous callLater will evict the deferred, which could lead to more wonky stuff, but the fact that it evicts a call in flight is already bad enough.)

Member

As discussed at considerable length in the room: I don't agree with your logic here. I think that supporting multiple should_cache callbacks per key significantly complicates the implementation, and is semantically dubious at best.

This comment was marked as outdated.

Contributor Author

@ShadowJonathan ShadowJonathan Mar 2, 2021

Alright, I'll just make it a simple Dict[T, Callable] mapping then 👍

Contributor Author

Actually, more discussion in #9522 on this.

Comment on lines +139 to +141
result = self.get(key)
if not result or isinstance(result, defer.Deferred) and not result.called:
self.add_conditional(key, should_cache)
Member

I feel like this is duplicating half the logic of wrap, which is kinda the worst of all worlds. Could you not make should_cache nullable, and have wrap call wrap_conditional rather than the other way around?

Contributor Author

I initially did that, but that became a bit of a mess, and you'd have to shift around the positional arguments on any wrap calls currently in place.

I also didn't want to make it a keyword argument, because that'd shadow any potential keyword arguments to the inner call. While I know it is trivial to "just add it", I didn't want to, because it is non-explicit to anyone not aware of this change to ResponseCache, and so it could become a bug.

Member

I initially did that, but that became a bit of a mess, and you'd have to shift around the positional arguments on any wrap calls currently in place.

I'm not suggesting changing the signature of wrap.

Wrap would be:

    def wrap(
        self, key: T, callback: "Callable[..., Any]", *args: Any, **kwargs: Any
    ) -> defer.Deferred:
        return self.wrap_conditional(key, None, callback, *args, **kwargs)

why is that a mess?

Contributor Author

Hmmm, I'll change it to that in that PR, then 👍

@clokep clokep mentioned this pull request Mar 2, 2021
clokep added a commit that referenced this pull request Mar 2, 2021
Revert "Fix #8518 (sync requests being cached wrongly on timeout) (#9358)"

This reverts commit f5c93fc.

This is being backed out due to a regression (#9507) and additional
review feedback being provided.
@clokep
Member

clokep commented Mar 2, 2021

I've backed this out in aee1076. I think we'll want to:

  • Re-land this with the additional edge-case handled.
  • Ensure the tests cover the additional edge-case.

I think the current known issue of #8518 is better than the unknown impact of the regression issue #9507. Unfortunately with this change we did not have confidence in creating a new release candidate or deploying to matrix.org, which are necessities for the team.


Successfully merging this pull request may close these issues.

Sync requests immediately return with empty payload