Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ConnectionClose sometimes hangs when called from different thread #1668

Closed
wfurt opened this issue Jun 1, 2021 · 7 comments
Closed

ConnectionClose sometimes hangs when called from different thread #1668

wfurt opened this issue Jun 1, 2021 · 7 comments
Labels
Bug: App A bug in the application above MsQuic question Further information is requested
Milestone

Comments

@wfurt
Copy link
Member

wfurt commented Jun 1, 2021

I run into this when working on dotnet/runtime#52048
I made fixes to keep the handle open to avoid crash. However sometimes when we try to close the connection the API calls hangs for > 15s and test fail with timeout. I collected traces and when looking at the api I only see

[0][53805.5381b][11:22:52.888561][ api] Enter 18 (0x7fd3700b1090).
[0][53805.5381b][11:22:52.888564][ api] Exit 0
[0][53805.5381b][11:22:52.888586][ api] Enter 20 (0x7fd36c004450).
[0][53805.5381b][11:22:52.888598][ api] Exit 4294967294
[0][53805.5381b][11:22:52.888929][ api] Enter 1 (0x7fd36c004450).
[0][53805.5381b][11:22:52.888931][ api] Waiting on operation
[0][53805.5381b][11:22:52.889431][ api] Exit 0
[0][53805.5381b][11:22:52.890968][ api] Enter 19 (0x7fd36c004450).
[0][53805.5381b][11:22:52.890980][ api] Waiting on operation
[0][53805.5381b][11:22:52.891579][ api] Exit
[0][53805.5381b][11:22:52.896762][ api] Enter 13 (0x7fd36c0019c0).
[0][53805.5381b][11:22:52.896765][ api] Waiting on operation

full trace attached. I can reproduce this semi-reliably on my Ubuntu machine.
quic.log.gz

@nibanks
Copy link
Member

nibanks commented Jun 1, 2021

Are you calling close on an MsQuic thread, not owned/used by the connection you are closing? That could definitely cause a deadlock and is not allowed.

@nibanks
Copy link
Member

nibanks commented Jun 1, 2021

Or, are you blocking the MsQuic thread in a callback, waiting for this thread? That can also cause a deadlock and is not allowed.

@wfurt
Copy link
Member Author

wfurt commented Jun 1, 2021

I'm not sure what that reapply means. With .NET async lot of code runs on thread pool without us controlling it.
In this particular case the QUIC connection is purposely left behind so the close is called from runtime finalizer queue thread.

@wfurt
Copy link
Member Author

wfurt commented Jun 1, 2021

AFAIK there is no concept of thread affinity or thread ownership for given connection.
Is there any doc for the threading constrains?
cc: @geoffkizer @scalablecory

@nibanks
Copy link
Member

nibanks commented Jun 1, 2021

Most of the documentation related to this is found here. These are two rules to follow:

  1. Do not block the MsQuic thread/callback for any length of time. You may acquire a lock/mutex, but you must guarantee very quick execution. Do not grab a lock that you also hold (on a different thread) when calling back into MsQuic.
  2. Do not call MsQuic APIs cross-object on MsQuic the thread/callbacks. For instance, if you're in a callback for Connection A, do not call ConnectionClose for Connection B.

@nibanks nibanks added the question Further information is requested label Jun 1, 2021
@nibanks
Copy link
Member

nibanks commented Jun 1, 2021

I just looked at the logs and I see the following (simply added a filter of 7fd36c0019c0):

[0][53805.5381b][11:22:52.896762][ api] Enter 13 (0x7fd36c0019c0).

[0][53805.53815][11:22:52.896796][conn][0x7fd36c0019c0] Scheduling: 2
[0][53805.53815][11:22:52.896797][conn][0x7fd36c0019c0] Execute: 1
[0][53805.53815][11:22:52.896797][conn][0x7fd36c0019c0] Recv 1 UDP datagrams
[0][53805.53815][11:22:52.896825][conn][0x7fd36c0019c0] IN: BytesRecv=2901
[0][53805.53815][11:22:52.896826][conn][0x7fd36c0019c0] Batch Recv 1 UDP datagrams
[0][53805.53815][11:22:52.896851][conn][0x7fd36c0019c0][RX][4] � (21 bytes)
[0][53805.53815][11:22:52.896854][strm][0x7fd378028360] Created, Conn=0x7fd36c0019c0 ID=0 IsLocal=0
[0][53805.53815][11:22:52.896856][conn][0x7fd36c0019c0] Indicating QUIC_CONNECTION_EVENT_PEER_STREAM_STARTED [0x7fd378028360, 0x0]

[1][53805.53813][11:22:53.142398][conn][0x7fd36c0019c0] Queuing 1 UDP datagrams
[1][53805.53813][11:22:53.392819][conn][0x7fd36c0019c0] Queuing 1 UDP datagrams
[1][53805.53813][11:22:53.644259][conn][0x7fd36c0019c0] Queuing 1 UDP datagrams

You will notice 3 different threads:

  • 5381b - The app (.NET) thread that is calling in to close the connection.
  • 53815 - The MsQuic worker thread that drives execution for the connection.
  • 53813 - The MsQuic UDP thread that is processing received packets and queuing them on the connection.

As you can see, the last event/log on the MsQuic worker was an indication of a QUIC_CONNECTION_EVENT_PEER_STREAM_STARTED event to the app. Since I see nothing else after that, I assume .NET is blocking that callback/thread.

@nibanks nibanks added the Bug: App A bug in the application above MsQuic label Jun 1, 2021
@nibanks nibanks added this to the Future milestone Jun 1, 2021
@wfurt
Copy link
Member Author

wfurt commented Jun 2, 2021

I think you are right. I changed how ConnectionClose is invoked and the problem is gone.

@wfurt wfurt closed this as completed Jun 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug: App A bug in the application above MsQuic question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants