This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Cross-signing signatures not being always federated correctly #7418

Closed
babolivier opened this issue May 5, 2020 · 14 comments · Fixed by #7453
Labels: z-bug (Deprecated Label), z-cross-signing (Deprecated Label), z-p2 (Deprecated Label)

Comments

@babolivier (Contributor) commented May 5, 2020

Over the past couple of days I've seen a few occurrences of people on other servers verifying a new device, but my server not receiving the resulting signature, thus leaving them with a red shield from my pov. I've also seen people saying they were seeing others with a red shield whereas I would see them with a green one.

@bwindels had a look at my Riot logs upon failing to see one of @anoadragon453's devices as verified, and could see that it was indeed missing a signature, with the signatures on Andrew's device being:

"signatures": {
  "@andrewm:amorgan.xyz": {
    "ed25519:ZYMXYYQZTP": "aBAnNnO3Qzz8xMLXtSjnwDummc4F4vewygUBvdRKJWd2/326Se4knYkJkp/TKa0sXjeGKIUko8JXNorN2m9TDw",
    "ed25519:QjSD8srN17RiDzIBgzVbncj+NMdvDHRY4N2b8w+oq9Y": "FV5bCKEdp1q9SFvy7MzUgTj3SzAeHwuK4vvtgnIp/WFxdnYQ4XL2C64dW28oJsK4yJ95NkuFcL9uCTPaDKPoAw"
  }
}

and the ones my Riot would see for that same device being:

"signatures": {
  "@andrewm:amorgan.xyz": {
    "ed25519:ZYMXYYQZTP": "aBAnNnO3Qzz8xMLXtSjnwDummc4F4vewygUBvdRKJWd2/326Se4knYkJkp/TKa0sXjeGKIUko8JXNorN2m9TDw"
  }
}

Looking at the device_lists_remote_cache table in my homeserver's database, I can see that it is indeed missing the signature from Andrew's self-signing key (ed25519:QjSD8srN17RiDzIBgzVbncj+NMdvDHRY4N2b8w+oq9Y).
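
For anyone wanting to check the same thing on their own homeserver, a query along these lines should show the cached keys (a sketch: it assumes device_lists_remote_cache stores the cached device key JSON in a content column keyed by user_id and device_id, and it uses this issue's user as the example):

-- Sketch: dump the cached device keys (including their signatures) for a remote user.
SELECT device_id, content
FROM device_lists_remote_cache
WHERE user_id = '@andrewm:amorgan.xyz';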

https://github.com/matrix-org/riot-web-rageshakes/issues/2740#issuecomment-623992322 provides more info about that specific occurrence.

babolivier added the z-bug (Deprecated Label), z-p2 (Deprecated Label) and z-cross-signing (Deprecated Label) labels on May 5, 2020
@babolivier (Contributor, Author)

Possibly related: #7350

jryans added the story:1 label on May 6, 2020
babolivier self-assigned this on May 6, 2020
@babolivier (Contributor, Author)

Since this issue was opened, Andrew connected a new device to his account and verified it from Riot Web. I can now see ZYMXYYQZTP as verified, but not the new device.

As far as I understand, one of two things could be happening here:

  • either we're using the wrong stream ID to send device updates and we're one ID short of the entry containing the new signature
  • or we don't correctly update the devices cache upon receiving an update

@babolivier (Contributor, Author)

Right, I think I've figured out what's happening here.

I've noticed that Andrew's device_lists_outbound_last_success table had a higher stream ID for my homeserver than the one my homeserver had stored in its device_lists_remote_extremeties table. This likely means that the latest update was sent to my homeserver, that something happened while my homeserver was processing that update which caused it to abort the processing, but that it still returned a 200 to Andrew's HS.
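
For anyone wanting to reproduce that comparison, the two checks look roughly like this (a sketch: table and column names are assumed to match Synapse's schema at the time, and the server and user names are just this issue's example):

-- On the sending server (Andrew's HS): the last stream ID it believes the
-- remote server successfully processed for this user.
SELECT stream_id
FROM device_lists_outbound_last_success
WHERE destination = 'abolivier.bzh'
  AND user_id = '@andrewm:amorgan.xyz';

-- On the receiving server (my HS): the stream ID it has actually reached for
-- that user. If this is lower than the value above, an update was lost.
SELECT stream_id
FROM device_lists_remote_extremeties
WHERE user_id = '@andrewm:amorgan.xyz';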

So I think the explanation is the following: when server A sends a device list update to server B, server B looks at the update's stream IDs to see whether it has missed any previous updates. If it has, it tries to fetch the missing keys via /_matrix/federation/v1/user/devices/....

If that request fails, server B will:

  1. Not retry it (as pointed out by this comment)
  2. Abort the update (because otherwise Synapse forgets that it's out of sync and will never care about updates it might have missed)
  3. Return a 200 to server A (because it's returning here rather than raising an exception), causing server A to record that update as successful and never resend it

In this specific case, it looks like abolivier.bzh's /devices/... requests to amorgan.xyz timed out, which produced this failure mode:

2020-05-04 18:15:50,575 - synapse.federation.transport.server - 406 - INFO - PUT-1487720- Received txn 1588613504503 from amorgan.xyz. (PDUs: 0, EDUs: 1)
2020-05-04 18:15:50,581 - synapse.handlers.device - 625 - INFO - PUT-1487720- Received device list update for @andrewm:amorgan.xyz, requiring resync: True. Devices: ZYMXYYQZTP
2020-05-04 18:15:50,582 - synapse.http.matrixfederationclient - 408 - INFO - PUT-1487720- {GET-O-118918} [amorgan.xyz] Sending request: GET matrix://amorgan.xyz/_matrix/federation/v1/user/devices/%40andrewm%3Aamorgan.xyz; timeout 30.000000s
2020-05-04 18:15:50,584 - synapse.http.federation.matrix_federation_agent - 242 - INFO - PUT-1487720- Connecting to matrix.amorgan.xyz:8448
2020-05-04 18:16:20,588 - synapse.http.matrixfederationclient - 434 - INFO - PUT-1487720- Failed to send request: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
2020-05-04 18:16:20,589 - synapse.http.matrixfederationclient - 491 - WARNING - PUT-1487720- {GET-O-118918} [amorgan.xyz] Request failed: GET matrix://amorgan.xyz/_matrix/federation/v1/user/devices/%40andrewm%3Aamorgan.xyz: ResponseNeverReceived:[CancelledError()]
2020-05-04 18:16:20,589 - synapse.util.retryutils - 218 - INFO - PUT-1487720- Connection to amorgan.xyz was unsuccessful (<class 'synapse.api.errors.RequestSendFailed'>(Failed to send request: ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >])); backoff now 600000
2020-05-04 18:16:20,590 - synapse.handlers.device - 696 - WARNING - PUT-1487720- Failed to handle device list update for @andrewm:amorgan.xyz
2020-05-04 18:16:20,601 - synapse.access.http.8008 - 302 - INFO - PUT-1487720- 127.0.0.1 - 8008 - {amorgan.xyz} Processed request: 30.028sec/0.001sec (0.011sec, 0.001sec) (0.004sec/0.008sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1588613504503 HTTP/1.1" "Synapse/1.12.4" [0 dbevts]

Thus server B is stuck with an outdated device list that it will never try to refresh, unless server A happens to send another update at a moment when the /devices/... request doesn't time out again.

The fix here would be either to fail the whole transaction (which would also fail the PDUs, so is probably not the right approach) or to have server B still send a 200 back but retry /devices/... periodically (which would give server A a slightly wrong idea of which updates server B has successfully processed).

I believe the latter solution is the preferable one. I'm unfamiliar with the existing backoff/retry mechanism, so I might be stating the obvious here, but we should absolutely persist that retry schedule in the database, otherwise merely restarting Synapse could leave device lists out of sync indefinitely.
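
To illustrate what persisting that schedule could look like, here is a minimal sketch of a pending-resync table (the eventual fix, #7453, introduces a table named device_lists_remote_resync which is used later in this thread; the column names below are assumptions for illustration only):

-- Minimal sketch of a persisted "needs resync" marker (column names assumed).
CREATE TABLE IF NOT EXISTS device_lists_remote_resync (
    user_id TEXT NOT NULL,    -- remote user whose device list is out of sync
    added_ts BIGINT NOT NULL  -- when the list was marked stale, in ms since the epoch
);

CREATE UNIQUE INDEX IF NOT EXISTS device_lists_remote_resync_idx
    ON device_lists_remote_resync (user_id);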

@babolivier (Contributor, Author)

This issue would be fixed by #7453

@ara4n (Member) commented May 12, 2020

woo, thanks for hunting this down. device list sync is the most wobbly bit of synapse, imo :(

@babolivier (Contributor, Author)

device list sync is the most wobbly bit of synapse, imo :(

Agreed. Hopefully #7453 will make it way less wobbly.

@flackr commented May 13, 2020

Not sure if this is related, but I recently set up a new homeserver B and am missing everyone's master_keys and self_signing_keys except for people who have since set up their E2E cross signing keys.

When I compared a sample user in device_lists_remote_extremeties I found that their stream_id was much larger (59005256) than on my long-running homeserver A (26308567). Unfortunately I can't check the other end (as it's matrix.org).

I tested checking out #7453 and inserted their user_id into device_lists_remote_resync. After it was removed from the table, I still didn't have master_keys or self_signing_keys for the user; however, I did notice a timeout entry for the user_id in the log:

2020-05-12 22:04:53,641 - synapse.http.matrixfederationclient - 408 - INFO -  - {GET-O-3986} [matrix.org] Sending request: GET matrix://matrix.org/_matrix/federation/v1/user/devices/<user_id>; timeout 30.000000s
2020-05-12 22:04:53,651 - synapse.handlers.presence - 343 - INFO - persist_presence_changes-1 - Persisting 3 unpersisted presence updates
2020-05-12 22:04:53,809 - synapse.http.matrixfederationclient - 164 - INFO -  - {GET-O-3986} [matrix.org] Completed: 200 OK
2020-05-12 22:04:53,810 - synapse.storage.database - 527 - WARNING -  - Starting db txn 'update_remote_device_list_cache' from sentinel context
2020-05-12 22:04:53,810 - synapse.storage.database - 566 - WARNING -  - Starting db connection from sentinel context: metrics will be lost
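
A quick way to confirm whether that pending-resync row was actually consumed (rather than never inserted) is to check whether it is still in the table; a sketch, using the same placeholder:

-- Sketch: no rows means the resync marker has been picked up (or never inserted);
-- a remaining row means the resync is still pending.
SELECT *
FROM device_lists_remote_resync
WHERE user_id = '<user_id>';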

@babolivier (Contributor, Author) commented May 13, 2020

Not sure if your problem is the same issue. However, thanks to your logs it's obvious that I didn't do the right thing with regard to logging contexts in #7453; I'll try to fix that soon.

@flackr commented May 13, 2020

Glad I could help. I am currently stuck in this situation of missing many people's signing keys, so let me know if there's anything else I can check.

If you determine it's not the same we should reopen #7350.

@flackr commented May 14, 2020

I checked with a friend of mine who runs a homeserver. My server does not appear in their device_lists_outbound_last_success table, but their server's users do exist in my device_lists_remote_extremeties table, even though their keys aren't on my homeserver.

Is there any way to safely insert the keys into the database or force a refresh?

@babolivier (Contributor, Author)

Hey, so after thinking about it, I don't think your issue shares a common cause with this one (nor with #7350). I've opened #7504 to track its progress.

Is there any way to safely insert the keys into the database or force a refresh?

The only way I see of forcing a refresh would be to run #7453 and do the insertion mentioned at #7453 (comment), but if that doesn't solve it I can't think of anything else without more investigation.

@flackr commented May 14, 2020

Is there any way to safely insert the keys into the database or force a refresh?

The only way I see of forcing a refresh would be to run #7453 and do the insertion mentioned at #7453 (comment), but if that doesn't solve it I can't think of anything else without more investigation.

Thanks. Sadly it didn't seem to solve the issue, though I'll try again when this lands. I'm a little hesitant to run development versions of Synapse against my regular HS, since they seem to perform a non-reversible upgrade to the database (which I'm guessing could also get federation out of sync when I revert the DB), and I'd like to be able to keep running the stable version. However, if there are other things you'd like me to try, I could set up a development HS. Feel free to reach out to me at @flackr:serializer.ca for higher-bandwidth discussion.

@babolivier (Contributor, Author)

Will do when I get to investigating it 👍

babolivier added a commit that referenced this issue May 21, 2020
When a call to `user_device_resync` fails, we don't currently mark the remote user's device list as out of sync, nor do we retry syncing it.

#6776 introduced some code infrastructure to mark device lists as stale/out of sync.

This commit uses that code infrastructure to mark device lists as out of sync if processing an incoming device list update makes the device handler realise that the device list is out of sync, but we can't resync right now.

It also adds a looping call to retry all failed resyncs every 30s. This shouldn't cause too much spam in the logs, as this commit also removes the "Failed to handle device list update for..." warning logs when catching `NotRetryingDestination`.

Fixes #7418
@babolivier (Contributor, Author) commented May 21, 2020

For people getting bitten by this issue after 1.14.0 is out, or people seeing users with stale device lists from before 1.14.0, just run the following SQL in Synapse's database:

INSERT INTO device_lists_remote_resync
VALUES ('USER_ID', (EXTRACT(epoch FROM NOW()) * 1000)::BIGINT);

where USER_ID is the Matrix ID of the user whose device list has gone stale. I've tried this successfully on PostgreSQL, not yet on SQLite.
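
For reference, an untested SQLite equivalent (an assumption based on SQLite's strftime; the table name is the same) would be something like:

-- Untested sketch for SQLite: strftime gives seconds since the epoch,
-- converted here to milliseconds.
INSERT INTO device_lists_remote_resync
VALUES ('USER_ID', CAST(strftime('%s', 'now') AS INTEGER) * 1000);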

phil-flex pushed a commit to phil-flex/synapse that referenced this issue Jun 16, 2020