This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

SerializationFailure in notify_interested_services_ephemeral under heavy load #11195

Closed
Fizzadar opened this issue Oct 27, 2021 · 1 comment · Fixed by #11207
Labels
A-Application-Service (Related to AS support), S-Minor (Blocks non-critical functionality, workarounds exist), T-Defect (Bugs, crashes, hangs, security vulnerabilities, or other reported issues)

Comments

@Fizzadar
Contributor

Fizzadar commented Oct 27, 2021

Description

We are seeing periodic SerializationFailure errors in Sentry from the notify_interested_services_ephemeral background process. This runs in a dedicated appservice-pusher worker instance. At peak we see around 30 ephemeral events per second being sent to the appservice (presence is disabled, so these should all be read receipts), and the error appears much more regularly during these peaks. Some initial investigation:

The query in question is invoked in set_type_stream_id_for_appservice_txn, updating the stream position for the appservice:

def set_type_stream_id_for_appservice_txn(txn):
    stream_id_type = "%s_stream_id" % type
    txn.execute(
        "UPDATE application_services_state SET %s = ? WHERE as_id=?"
        % stream_id_type,
        (pos, service.id),
    )

This itself is called in the appservice handler:

elif stream_key == "receipt_key":
    events = await self._handle_receipts(service)
    if events:
        self.scheduler.submit_ephemeral_events_for_as(service, events)
    # Persist the latest handled stream token for this appservice
    await self.store.set_type_stream_id_for_appservice(
        service, "read_receipt", new_token
    )

I'm new to most of the Synapse codebase, so I may have this wrong, but this looks like a race condition between two parallel executions of the notify_interested_services_ephemeral process for the same appservice/stream. Am I correct in thinking this means some events are being sent twice to the appservice, because the position is not always updated? (Could events also be missed somehow?)
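For illustration, here is a minimal standalone sketch of the suspected interleaving (a deliberate simplification with made-up token values, not Synapse code): two concurrent runs read the same stored stream position, both send the overlapping receipts, and both then write their own new token, so the later write can clobber the newer one. Presumably the two concurrent UPDATEs against the same application_services_state row are also what surface as SerializationFailure under load.

import asyncio

# Pretend value of application_services_state.read_receipt_stream_id.
stored_token = 10

async def handle(run_id: int, new_token: int) -> None:
    """One run of the ephemeral-notification handling for a single appservice."""
    global stored_token
    from_token = stored_token  # analogous to get_type_stream_id_for_appservice
    await asyncio.sleep(0)     # yield, letting the other run read the same from_token
    print(f"run {run_id}: sending receipts in ({from_token}, {new_token}]")
    stored_token = new_token   # analogous to set_type_stream_id_for_appservice

async def main() -> None:
    # Both runs read from_token=10, so the (10, 12] receipts are sent twice,
    # and the stored token ends up at 12 even though 15 was already handled.
    await asyncio.gather(handle(1, 15), handle(2, 12))

asyncio.run(main())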

We'd like to figure out a way to fix this issue, currently only two possible solutions have come to mind:

  • Implement some kind of locking on (appservice, stream_id). This would probably need to cover the entire handler, though, which could have a significant performance impact. Not a fan of this!
  • Assuming transaction IDs are sequential, a global object tracking the highest value seen across the parallel background processes could be used to skip the update whenever it would move the table to a lower number (see the sketch after this list). This becomes harder in a future where multiple appservice push workers may exist.
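As a rough illustration of the second idea, a database-side variant (my assumption, not how Synapse actually does it) would be to make the UPDATE itself refuse to move the position backwards, so a stale writer becomes a no-op instead of overwriting a newer token. Note this only addresses the lost-update/ordering side of the problem; concurrent updates to the same row could presumably still raise SerializationFailure, so it might need combining with some form of locking.

def set_type_stream_id_for_appservice_txn(txn):
    stream_id_type = "%s_stream_id" % type
    # Only ever advance the stored position: a run that lost the race and
    # tries to write an older token matches zero rows instead of clobbering
    # a newer value.
    txn.execute(
        "UPDATE application_services_state SET %s = ? "
        "WHERE as_id = ? AND (%s IS NULL OR %s < ?)"
        % (stream_id_type, stream_id_type, stream_id_type),
        (pos, service.id, pos),
    )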

Keen to hear thoughts, and I can most likely find time to work on any potential solution.

@anoadragon453
Member

anoadragon453 commented Oct 27, 2021

Hi there @Fizzadar. This has been noticed before, but it had not yet been pulled out into its own issue, so thank you for writing this up!

As the linked comment suggests, using a Linearizer can work. This allows method calls to be queued by a given key, for which, as you suggest, (appservice, stream_id) seems appropriate.
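For concreteness, a rough sketch of what that might look like (the placement, method names, and exact Linearizer API below are assumptions and have changed across Synapse versions, so treat this as an outline rather than the actual fix):

from synapse.util.async_helpers import Linearizer

class ApplicationServicesHandler:
    def __init__(self, hs):
        ...
        # One queue slot per (appservice ID, stream key), so two runs for the
        # same pair cannot interleave their read/send/update of the stream token.
        self._ephemeral_events_linearizer = Linearizer(
            name="appservice_ephemeral_events"
        )

    async def _notify_interested_services_ephemeral(self, service, stream_key, new_token):
        async with self._ephemeral_events_linearizer.queue((service.id, stream_key)):
            # Read the last handled token, submit the events to the scheduler,
            # then persist new_token, all while holding the slot for this key.
            ...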

Am I correct in thinking this means some events are getting sent twice to the appservice as the position is not always updated?

It's quite likely. If two events with the same stream_key are sent to the same appservice, they'll likely both end up with the same value from get_type_stream_id_for_appservice.

which may be a significant performance impact

It's true that this would prevent us from sending requests involving a particular (appservice, stream_id) tuple in parallel. However, updating the stream token is not blocked on sending the request to the appservice. Requests are enqueued in the background, then set_type_stream_id_for_appservice_txn is called.

Currently the situation is that these requests may be sent in parallel, but contain duplicate information. So I'm not sure if sticking in a Linearizer would be much worse.

Yes, the linearizer would need to be replaced with something that works across multiple workers if we eventually got to the point of being able to parallelize streams (unless we pin (appservice, stream_id) tuples to certain workers).

I'm not sure how to improve upon that further. We'll also need to take reliability of these requests into account with the upcoming device EDU work. #11150 talks a little about that.

@anoadragon453 added the S-Minor and T-Defect labels Oct 27, 2021
@MadLittleMods added the A-Application-Service label Dec 23, 2022