Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'Guest user can set display names' is flaky #836

Closed
richvdh opened this issue Mar 19, 2020 · 12 comments · Fixed by #871
Closed

'Guest user can set display names' is flaky #836

richvdh opened this issue Mar 19, 2020 · 12 comments · Fixed by #871

Comments

@richvdh
Copy link
Member

richvdh commented Mar 19, 2020

this seems to be failing on about 50% of test runs currently. Haven't yet had a chance to investigate

@richvdh
Copy link
Member Author

richvdh commented Mar 19, 2020

could be just limited to worker configs?

example: https://buildkite.com/matrix-dot-org/synapse/builds/7637#1a3300b7-7acb-404a-8664-a28f0fefb722

@clokep
Copy link
Member

clokep commented Mar 19, 2020

Meh I thought we deflaked this, see #819 / #825.

@babolivier babolivier self-assigned this Mar 25, 2020
@babolivier
Copy link
Contributor

It looks like the flakiness is caused by the updated join event failing to send in time, while the profile is correctly updated in the database (i.e. GET /[...]/displayname returns the right thing). It also looks like it's indeed only happening with workers, and a good way to reproduce the failure locally is by turning the replication torture flag up until it fails (almost) reliably.
I'll continue looking for a reason this happens and a way to improve that.

@babolivier
Copy link
Contributor

babolivier commented Apr 2, 2020

So since Monday I haven't been managing to reproduce it, both on my machine (even with torture = 300) and on buildkite. I got it once on buildkite yesterday when testing another PR, so it's still somewhat flaky, but definitely not as much as it used to be, which makes it quite difficult to debug and test, and much less annoying.

I've opened #846, which has the extra logging I used to test, and can always be useful, but with that I've just described it looks like going any further on it right now is going to turn into a time sink that wouldn't gain us a lot.

@anoadragon453
Copy link
Member

I run into this pretty consistently, and it's annoyed me to the point where I want to investigate it.

I found some logs from a bad test run, in server-0's homeserver.log:

2020-04-12 16:02:07,716 - synapse.replication.http.send_event - 119 - INFO - POST-1176 - Got event to send with ID: $Wyh5zl914K2GIfm3X9WIN5xRp1CSK7q6RMz5RXqVsEQ into room: !UJfEiVhWyjgjLYLDjv:localhost:8831
2020-04-12 16:02:07,733 - synapse.access.http.8801 - 301 - INFO - POST-1176 - 127.0.0.1 - 8801 - {@anon-20200412_155942-173:localhost:8831} Processed request: 0.016sec/0.000sec (0.001sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "POST /_synapse/replication/send_event/%24Wyh5zl914K2GIfm3X9WIN5xRp1CSK7q6RMz5RXqVsEQ/LetjjSSVwI HTTP/1.1" "Synapse/1.12.3 homeserver[0]" [0 dbevts]
2020-04-12 16:02:07,742 - synapse.replication.tcp.resource - 178 - INFO - replication_notifier-1443 - Streaming: events -> 775
2020-04-12 16:02:07,870 - synapse.access.http.8801 - 301 - INFO - PUT-1177 - 127.0.0.1 - 8801 - {@anon-20200412_155942-173:localhost:8831} Processed request: 0.037sec/0.000sec (0.005sec, 0.002sec) (0.002sec/0.006sec/4) 59B 200 "PUT /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.guest_access?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:07,877 - synapse.replication.tcp.resource - 178 - INFO - replication_notifier-1444 - Streaming: caches -> 1365
2020-04-12 16:02:07,894 - synapse.replication.tcp.resource - 178 - INFO - replication_notifier-1444 - Streaming: events -> 776
2020-04-12 16:02:07,956 - synapse.replication.http.send_event - 119 - INFO - POST-1178 - Got event to send with ID: $F_t8Jzte71HEfYW-bBM4pLqG26bNivEyJEt_1HV4k9E into room: !UJfEiVhWyjgjLYLDjv:localhost:8831
2020-04-12 16:02:07,987 - synapse.access.http.8801 - 301 - INFO - POST-1178 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.031sec/0.000sec (0.002sec, 0.000sec) (0.001sec/0.001sec/1) 2B 200 "POST /_synapse/replication/send_event/%24F_t8Jzte71HEfYW-bBM4pLqG26bNivEyJEt_1HV4k9E/oOLVSKmtSl HTTP/1.1" "Synapse/1.12.3 homeserver[0]" [0 dbevts]
2020-04-12 16:02:07,992 - synapse.replication.http.membership - 193 - INFO - POST-1179 - user membership change: @7:localhost:8831 in !UJfEiVhWyjgjLYLDjv:localhost:8831
2020-04-12 16:02:07,994 - synapse.access.http.8801 - 301 - INFO - POST-1179 - 127.0.0.1 - 8801 - {None} Processed request: 0.001sec/0.001sec (0.001sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "POST /_synapse/replication/membership_change/%21UJfEiVhWyjgjLYLDjv%3Alocalhost%3A8831/%407%3Alocalhost%3A8831/joined HTTP/1.1" "Synapse/1.12.3 homeserver[0]" [0 dbevts]
2020-04-12 16:02:07,996 - synapse.replication.tcp.resource - 178 - INFO - replication_notifier-1445 - Streaming: caches -> 1366
2020-04-12 16:02:08,014 - synapse.replication.tcp.resource - 178 - INFO - replication_notifier-1445 - Streaming: events -> 777
2020-04-12 16:02:08,056 - synapse.replication.tcp.resource - 178 - INFO - replication_notifier-1446 - Streaming: federation -> 203
2020-04-12 16:02:08,078 - synapse.access.http.8801 - 301 - INFO - GET-1180 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.009sec/0.000sec (0.003sec, 0.000sec) (0.001sec/0.002sec/2) 72B 200 "GET /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.member/@7:localhost:8831?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:08,239 - synapse.access.http.8801 - 301 - INFO - GET-1181 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.002sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 72B 200 "GET /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.member/@7:localhost:8831?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:08,471 - synapse.access.http.8801 - 301 - INFO - GET-1182 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.003sec/0.000sec (0.003sec, 0.000sec) (0.000sec/0.000sec/0) 72B 200 "GET /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.member/@7:localhost:8831?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:08,817 - synapse.access.http.8801 - 301 - INFO - GET-1183 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.003sec/0.000sec (0.003sec, 0.000sec) (0.000sec/0.000sec/0) 72B 200 "GET /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.member/@7:localhost:8831?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:09,331 - synapse.access.http.8801 - 301 - INFO - GET-1184 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.003sec/0.000sec (0.003sec, 0.000sec) (0.000sec/0.000sec/0) 72B 200 "GET /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.member/@7:localhost:8831?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:10,099 - synapse.access.http.8801 - 301 - INFO - GET-1185 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.004sec/0.000sec (0.002sec, 0.000sec) (0.000sec/0.000sec/0) 72B 200 "GET /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.member/@7:localhost:8831?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:11,075 - twisted - 192 - INFO -  - Timing out client: IPv4Address(type='TCP', host='127.0.0.1', port=57918)
2020-04-12 16:02:11,247 - synapse.access.http.8801 - 301 - INFO - GET-1186 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.004sec/0.000sec (0.002sec, 0.000sec) (0.000sec/0.000sec/0) 72B 200 "GET /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.member/@7:localhost:8831?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:12,965 - synapse.access.http.8801 - 301 - INFO - GET-1187 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.003sec/0.000sec (0.001sec, 0.000sec) (0.000sec/0.000sec/0) 72B 200 "GET /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.member/@7:localhost:8831?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:15,536 - synapse.access.http.8801 - 301 - INFO - GET-1188 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.004sec/0.000sec (0.003sec, 0.000sec) (0.000sec/0.000sec/0) 72B 200 "GET /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.member/@7:localhost:8831?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:17,016 - synapse.metrics - 464 - INFO -  - Collecting gc 1

The master process is notified that the user's membership changed in the room, but even requesting the state of the room 8 seconds later we still send back the old value.

Seeing as the event_creator is supposedly the one handling these /state/ requests, according to workers.md, I took a look at its logs and found the following:

2020-04-12 16:02:07,714 - synapse.http.client - 281 - INFO - POST-214 - Sending request POST http://localhost:8801/_synapse/replication/send_event/%24Wyh5zl914K2GIfm3X9WIN5xRp1CSK7q6RMz5RXqVsEQ/LetjjSSVwI
2020-04-12 16:02:07,739 - synapse.http.client - 318 - INFO - POST-214 - Received response to POST http://localhost:8801/_synapse/replication/send_event/%24Wyh5zl914K2GIfm3X9WIN5xRp1CSK7q6RMz5RXqVsEQ/LetjjSSVwI: 200
2020-04-12 16:02:07,741 - synapse.access.http.8825 - 301 - INFO - POST-214 - 127.0.0.1 - 8825 - {@anon-20200412_155942-173:localhost:8831} Processed request: 0.051sec/0.001sec (0.006sec, 0.002sec) (0.004sec/0.008sec/12) 59B 200 "POST /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/send/m.room.test?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:07,950 - synapse.http.client - 281 - INFO - POST-215 - Sending request POST http://localhost:8801/_synapse/replication/send_event/%24F_t8Jzte71HEfYW-bBM4pLqG26bNivEyJEt_1HV4k9E/oOLVSKmtSl
2020-04-12 16:02:07,989 - synapse.http.client - 318 - INFO - POST-215 - Received response to POST http://localhost:8801/_synapse/replication/send_event/%24F_t8Jzte71HEfYW-bBM4pLqG26bNivEyJEt_1HV4k9E/oOLVSKmtSl: 200
2020-04-12 16:02:07,990 - synapse.http.client - 281 - INFO - POST-215 - Sending request POST http://localhost:8801/_synapse/replication/membership_change/%21UJfEiVhWyjgjLYLDjv%3Alocalhost%3A8831/%407%3Alocalhost%3A8831/joined
2020-04-12 16:02:07,996 - synapse.http.client - 318 - INFO - POST-215 - Received response to POST http://localhost:8801/_synapse/replication/membership_change/%21UJfEiVhWyjgjLYLDjv%3Alocalhost%3A8831/%407%3Alocalhost%3A8831/joined: 200
2020-04-12 16:02:07,997 - synapse.access.http.8825 - 301 - INFO - POST-215 - 127.0.0.1 - 8825 - {@7:localhost:8831} Processed request: 0.121sec/0.001sec (0.027sec, 0.001sec) (0.014sec/0.025sec/17) 48B 200 "POST /_matrix/client/r0/join/!UJfEiVhWyjgjLYLDjv:localhost:8831?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [1 dbevts]

It receives the membership change and the profile requests, but it doesn't seem to be receiving the /state/ ones...

At this point I realised it was the main homeserver process that was handling these requests. Which is fine, but it's odd that the main process, which logs that it receives the membership update for the room, would still be responding with the old displayname...

However, I did notice that in homeserver.log there is a request for the state of that room just before the membership change:

2020-04-12 16:02:07,870 - synapse.access.http.8801 - 301 - INFO - PUT-1177 - 127.0.0.1 - 8801 - {@anon-20200412_155942-173:localhost:8831} Processed request: 0.037sec/0.000sec (0.005sec, 0.002sec) (0.002sec/0.006sec/4) 59B 200 "PUT /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.guest_access?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]
2020-04-12 16:02:07,877 - synapse.replication.tcp.resource - 178 - INFO - replication_notifier-1444 - Streaming: caches -> 1365
2020-04-12 16:02:07,894 - synapse.replication.tcp.resource - 178 - INFO - replication_notifier-1444 - Streaming: events -> 776
2020-04-12 16:02:07,956 - synapse.replication.http.send_event - 119 - INFO - POST-1178 - Got event to send with ID: $F_t8Jzte71HEfYW-bBM4pLqG26bNivEyJEt_1HV4k9E into room: !UJfEiVhWyjgjLYLDjv:localhost:8831
2020-04-12 16:02:07,987 - synapse.access.http.8801 - 301 - INFO - POST-1178 - 127.0.0.1 - 8801 - {@7:localhost:8831} Processed request: 0.031sec/0.000sec (0.002sec, 0.000sec) (0.001sec/0.001sec/1) 2B 200 "POST /_synapse/replication/send_event/%24F_t8Jzte71HEfYW-bBM4pLqG26bNivEyJEt_1HV4k9E/oOLVSKmtSl HTTP/1.1" "Synapse/1.12.3 homeserver[0]" [0 dbevts]
2020-04-12 16:02:07,992 - synapse.replication.http.membership - 193 - INFO - POST-1179 - user membership change: @7:localhost:8831 in !UJfEiVhWyjgjLYLDjv:localhost:8831

Of course our methods for grabbing the latest state from a room are cached:

https://github.com/matrix-org/synapse/blob/4a17a647a9508b70de35130fd82e3e21474270a9/synapse/storage/data_stores/main/event_federation.py#L386-L393

Could it be that we're not invalidating this cache on membership change?

The method that's run after the user_membership_change log line doesn't inspire much confidence:

https://github.com/matrix-org/synapse/blob/118b58f0c9897243f98072755cbc6606528dfa6c/synapse/util/distributor.py#L30-L32

@anoadragon453
Copy link
Member

anoadragon453 commented Apr 22, 2020

An easy way to reproduce this is with the following:

while true; do ./sytest-workers.sh tests/30rooms/13guestaccess.pl --stop || break; done 

where sytest-workers.sh is:

docker run --rm -it -e POSTGRES=true -e WORKERS=true -v /home/user/code/synapse\:/src:ro -v /home/user/code/sytest/logs\:/logs -v /home/user/code/sytest\:/sytest:ro matrixdotorg/sytest-synapse:py35 "$@"

@richvdh
Copy link
Member Author

richvdh commented Apr 22, 2020

The master process is notified that the user's membership changed in the room, but even requesting the state of the room 8 seconds later we still send back the old value.

I think the membership change notification is just meant to update the list of current members in the room (ie, local_current_membership); this is a slightly different question to "what is the state of the room" (current_state_events), though obviously the former is generally based on the latter.

However, I did notice that in homeserver.log there is a request for the state of that room just before the membership change:

2020-04-12 16:02:07,870 - synapse.access.http.8801 - 301 - INFO - PUT-1177 - 127.0.0.1 - 8801 - {@anon-20200412_155942-173:localhost:8831} Processed request: 0.037sec/0.000sec (0.005sec, 0.002sec) (0.002sec/0.006sec/4) 59B 200 "PUT /_matrix/client/r0/rooms/!UJfEiVhWyjgjLYLDjv:localhost:8831/state/m.room.guest_access?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]

this is a PUT request: it's updating the state of the room, which is why it goes to the master.

Of course our methods for grabbing the latest state from a room are cached:

https://github.com/matrix-org/synapse/blob/4a17a647a9508b70de35130fd82e3e21474270a9/synapse/storage/data_stores/main/event_federation.py#L386-L393

Could it be that we're not invalidating this cache on membership change?

Well, the get_latest_event_ids_in_room cache is supposed to be invalidated by the arrival of the event (ie, POST /_synapse/replication/send_event. It happens at https://github.com/matrix-org/synapse/blob/4a17a647a9508b70de35130fd82e3e21474270a9/synapse/storage/data_stores/main/events.py#L650). So we shouldn't need to invalidate the cache on membership change.

@richvdh
Copy link
Member Author

richvdh commented Apr 22, 2020

oh, 🤦 . The problem is happening during the processing of PUT /profile/.../displayname. This should show the event creator sending an updated membership event to the master, but it does not:

$ grep PUT-11 logs/server-0/event_creator.log
2020-04-22 22:21:33,625 - synapse.access.http.8825 - 301 - INFO - PUT-11 - 127.0.0.1 - 8825 - {@6:localhost:8831} Processed request: 0.017sec/0.000sec (0.002sec, 0.000sec) (0.007sec/0.002sec/3) 2B 200 "PUT /_matrix/client/r0/profile/@6:localhost:8831/displayname?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]

From a successful test run:

2020-04-22 22:24:22,898 - synapse.http.client - 281 - INFO - PUT-11 - Sending request POST http://localhost:8801/_synapse/replication/send_event/%24PdohsecGxHWFvkQka_Aw6z7L7cCwNtw29wziVLo2AzQ/wtfYQCTEGj
2020-04-22 22:24:22,917 - synapse.http.client - 318 - INFO - PUT-11 - Received response to POST http://localhost:8801/_synapse/replication/send_event/%24PdohsecGxHWFvkQka_Aw6z7L7cCwNtw29wziVLo2AzQ/wtfYQCTEGj: 200
2020-04-22 22:24:22,919 - synapse.access.http.8825 - 301 - INFO - PUT-11 - 127.0.0.1 - 8825 - {@6:localhost:8831} Processed request: 0.047sec/0.000sec (0.007sec, 0.000sec) (0.004sec/0.008sec/13) 2B 200 "PUT /_matrix/client/r0/profile/@6:localhost:8831/displayname?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.45" [0 dbevts]

so the problem is that at the point PUT /profile/.../displayname is being called, the event creator doesn't know that the user is in in the room and hence that it needs to send a new membership event to the room. That sounds like a wrong situation, but I don't yet know why it's happening.

Sorry if everybody else had already figured that out.

@anoadragon453
Copy link
Member

Sorry if everybody else had already figured that out.

I hadn't, so thank you 🙂

@anoadragon453 anoadragon453 self-assigned this Apr 28, 2020
@babolivier babolivier removed their assignment Apr 30, 2020
@anoadragon453
Copy link
Member

So I became unable to reproduce this bug with the setup described in #836 (comment), however switching to using redis-workers works again.

New command:

while true; do ./sytest-redis-workers.sh tests/30rooms/13guestaccess.pl --stop || break; done 

Contents of sytest-redis-workers.sh:

docker run --rm -it -e POSTGRES=1 -e WORKERS=1 -e BLACKLIST="synapse-blacklist-with-workers" -v /home/user/code/synapse\:/src -v /home/user/code/sytest/logs\:/logs -v /home/user/code/sytest\:/sytest:ro --network testfoobar matrixdotorg/sytest-synapse:py37 --redis-host testredis "$@"

Note that you'll need to run the following to set up the docker network:

docker network create testfoobar
docker run --network testfoobar --name testredis -d redis:5.0

as described in sytest's docs.

Eventually the process will die on a failing test case :)

@anoadragon453
Copy link
Member

Hm, well scratch that. It's not reproducing anymore. There must be something that's getting left around that's allowing this test to succeed, but deleting all docker containers/images and retrying still produces passing tests.

I think I'll have to start using CI :|

@richvdh
Copy link
Member Author

richvdh commented May 15, 2020

so basically what happens is:

  • we send a join; that is processed and persisted by the master
  • the master sends the join event out over the replication stream, but before it is received...
  • we send the profile update; the event creator notes that we are in zero rooms
  • the event creator receives the event over the replication stream.

Easy fix: use matrix_join_room_synced

richvdh added a commit that referenced this issue May 15, 2020
*Hopefully*, fixes #836
richvdh added a commit that referenced this issue May 15, 2020
pull bot pushed a commit to valkum/sytest that referenced this issue May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants