
HttpStress: connection failures with HTTP 1.1 caused by Windows Firewall #50854

Closed
antonfirsov opened this issue Apr 7, 2021 · 41 comments · Fixed by #52381
Labels: area-System.Net.Http, test-bug (Problem in test source code (most likely))
@antonfirsov
Member

antonfirsov commented Apr 7, 2021

Currently, the vast majority of our HttpStress failures (#42211) are HTTP 1.1 socket connection failures (socket error 10060 when running inside Docker, 10061 when running outside):
https://dev.azure.com/dnceng/public/_build/results?buildId=1045441&view=logs&j=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&t=ac0b0b0f-051f-52f7-8fb3-a7e384b0dde9&l=1244
https://gist.github.com/antonfirsov/dc7af2d9cb4213c835a41f59f09a0775#file-type1-txt

Update 1: I no longer think this is an issue with TIME_WAIT, see #50854 (comment).

Update 2: Disabling the Windows Firewall seems to fix the issue


While running these tests, there are thousands of Kestrel sockets lingering in TIME_WAIT. This may lead to port exhaustion, which I believe explains the CI failures.

It doesn't look like the stress client's ClientOperations are doing anything unconventional -- HttpRequestMessage instances seem to be properly disposed.

Things I tried:

  • Reducing -cancelRate to 0, which did not change the number of lingering sockets
  • Limiting MaxConnectionsPerServer to 100, which still left thousands of lingering sockets (unless I made a mistake in my experiment)

I'm not familiar enough with the SocketsHttpHandler codebase and Kestrel (or web servers in general) to assess whether what I'm seeing is extreme for a stress scenario.

@wfurt @geoffkizer @scalablecory @halter73 @stephentoub thoughts?

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-System.Net and untriaged (New issue has not been triaged by the area owner) labels Apr 7, 2021
@ghost
ghost commented Apr 7, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


@benaadams
Member

Non-clean shutdown of the sockets from the client?

Does the HttpClient ever get disposed or is the client just terminated?

@wfurt
Member

wfurt commented Apr 7, 2021

@benaadams all the code lives in https://github.com/dotnet/runtime/tree/main/src/libraries/System.Net.Http/tests/StressTests/HttpStress

I assume the TIME_WAIT is on the server side? E.g. Kestrel closed the socket but HttpClient did not. Can you please confirm?
HttpClient will not see the EOF unless it tries to read from the socket. We may set a more aggressive idle timeout policy and/or check whether we are failing to dispose when we should. We should be able to find the particular HttpClient instance from the socket.

@benaadams
Member

As I understand it, the side that closes the socket moves to TIME_WAIT and the other side moves to CLOSE_WAIT; so if the server has lots of sockets in TIME_WAIT, it would suggest either lots of bad requests that the server terminated, or a non-clean shutdown on the client (e.g. just exiting the process without closing the socket normally). Hence why I was asking if the client is being disposed, though it looks like it is (not sure what the underlying handler does when the client is disposed, though).

@ManickaP
Member

ManickaP commented Apr 7, 2021

Hmmm, could the partial reads on the client side be a culprit here?

("GET Partial",
    async ctx =>
    {
        using var req = new HttpRequestMessage(HttpMethod.Get, "/slow");
        int expectedLength = ctx.SetExpectedResponseContentLengthHeader(req.Headers, minLength: 2);
        using HttpResponseMessage m = await ctx.SendAsync(req, HttpCompletionOption.ResponseHeadersRead);
        ValidateStatusCode(m);
        using (Stream s = await m.Content.ReadAsStreamAsync())
        {
            s.ReadByte(); // read single byte from response and throw the rest away
        }
    }),

I'd maybe try the test without these "ugly" operations and see if it reproduces.

edit: Not saying the code is wrong per se, just that there might be an error in the client/server that manifests under these conditions.

@geoffkizer
Contributor

As I understand it, the side that closes the socket moves to TIME_WAIT and the other side moves to CLOSE_WAIT; so if the server has lots of sockets in TIME_WAIT, it would suggest either lots of bad requests that the server terminated, or a non-clean shutdown on the client (e.g. just exiting the process without closing the socket normally).

Yeah, exactly.

We need to understand why Kestrel is closing the connections on their end.

@antonfirsov
Member Author

antonfirsov commented Apr 8, 2021

OK, seeing the server sockets at the first part of my netstat -b output confused me. Here is the actual full story:

  1. Normally, there are a few hundred lingering sockets on the server, and about 15k on the client, which means that we are opening lots of HTTP 1.1 connections in the stress test.
  2. Setting MaxConnectionsPerServer = 2 flips this around (16k on the server, and just a few client sockets)
  3. Setting MaxConnectionsPerServer = 50 balances the situation (something like 5k on the server, 10k on the client)

It seems like there are always around 16k sockets lingering in total, which is the size of the ephemeral port range on Windows.

Seeing both server and client sockets lingering for a while after the connections have been terminated is normal AFAIK, even if the sockets are properly closed. I can't decide whether what we are seeing is normal (and the CI machine is simply unable to handle this kind of load), or whether there is actual room for improvement, either in the stress test implementation or in SocketsHttpHandler itself (or maybe even in Kestrel).

@geoffkizer @benaadams @scalablecory @ManickaP any further insights / ideas / suggestions?

Edit: Removing GET Partial as suggested in #50854 (comment) does not change the picture.

@antonfirsov antonfirsov changed the title HttpStress: connection failures with HTTP 1.1 -- thousands of server sockets in TIME_WAIT HttpStress: connection failures with HTTP 1.1 -- 16k sockets lingering in TIME_WAIT Apr 8, 2021
@benaadams
Member

Could you increase the number of IP addresses used, to increase the available ports?

e.g. if it's localhost (127.0.0.1), you can actually use anything from 127.0.0.1 to 127.255.255.255

@antonfirsov
Member Author

antonfirsov commented Apr 8, 2021

@benaadams not a bad idea. Note that in our stress test design there is only one Kestrel instance on a pre-configured endpoint, and only one HttpClient instance on the client side. We may consider overriding ConnectCallback to round-robin over an IP range, so the client sockets are bound to different IP addresses.
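
A rough sketch of what that ConnectCallback override could look like (the loopback round-robin is only an idea at this point; the variable names are illustrative and not from the stress code):

// using System.Net; using System.Net.Http; using System.Net.Sockets; using System.Threading;
int counter = 0;
var handler = new SocketsHttpHandler();
handler.ConnectCallback = async (context, cancellationToken) =>
{
    // Round-robin the local bind address over 127.0.0.1 ... 127.0.0.254 so each connection
    // draws its ephemeral port from a different (local address, port) space.
    int n = Interlocked.Increment(ref counter);
    var localAddress = new IPAddress(new byte[] { 127, 0, 0, (byte)(1 + (n % 254)) });

    var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
    try
    {
        socket.Bind(new IPEndPoint(localAddress, 0)); // port 0: let the OS pick an ephemeral port
        await socket.ConnectAsync(context.DnsEndPoint, cancellationToken);
        return new NetworkStream(socket, ownsSocket: true);
    }
    catch
    {
        socket.Dispose();
        throw;
    }
};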

As a first step, I'm really looking for confirmation from more experienced folks: is this just normal HTTP 1.1 behavior we need to accept and work around in the tests, or do we think this might be a potential customer issue we need to address in the product?

@benaadams
Member

is this just normal HTTP 1.1 behavior we need to accept and work around in the tests, or do we think this might be a potential customer issue we need to address in the product?

If it's doing keep-alives (the HTTP 1.1 default), it should be reusing the same connection and not creating lots of sockets; and it doesn't look like it's setting Connection: close.

_config.ConcurrentRequests should be the max number of sockets, plus potentially the ones randomly cancelled via _config.CancellationProbability if they aren't cleaned up and returned to the pool?

else if (GetRandomBoolean(_config.CancellationProbability))
{
    // trigger a random cancellation
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(_globalToken);
    Task<HttpResponseMessage> task = _client.SendAsync(request, httpCompletion, cts.Token);

@halter73
Member

halter73 commented Apr 8, 2021

While running these tests, there are thousands of Kestrel sockets lingering in TIME_WAIT

I agree with @benaadams that this means that Kestrel initiated the FIN: See https://en.wikipedia.org/wiki/Transmission_Control_Protocol#/media/File:TCP_CLOSE.svg

For HTTP/1.1, this should only happen if the app explicitly set the "Connection: close" header, the app called HttpContext.Abort(), Kestrel timed out the connection or Kestrel determined the request body was too large. If you can collect Debug-level ASP.NET logs, that should make the exact cause clear.
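
For reference, a minimal standalone sketch of turning that on (this is not how the stress server wires up its host; it only illustrates the log filter):

// Assumes a plain ASP.NET Core app with top-level statements.
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);
builder.Logging.AddFilter("Microsoft.AspNetCore", LogLevel.Debug); // surfaces Kestrel's connection-close reasons

var app = builder.Build();
app.MapGet("/", () => "hello");
app.Run();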

As a first step, I'm really looking for confirmation from more experienced folks: is this just normal HTTP 1.1 behavior we need to accept and work around in the tests, or do we think this might be a potential customer issue we need to address in the product?

Port exhaustion is a well known issue with quickly opening and closing TCP connections to a single endpoint. This (along with improved perf in general) is partly why HTTP/1.1 keeps the TCP connection alive by default and Kestrel only closes the connection if it's explicitly told to by the app or if there's an error that makes reusing the connection infeasible.

@benaadams
Member

benaadams commented Apr 8, 2021

or if there's an error that makes reusing the connection infeasible.

Wondering if that's the // trigger a random cancellation path above, #50854 (comment) (not sure of the percentage or how many requests are made in stress)

@halter73
Member

halter73 commented Apr 8, 2021

Wouldn't the // trigger a random cancellation path cause the client to initiate the FIN?

@benaadams
Member

Wouldn't the // trigger a random cancellation path cause the client to initiate the FIN?

🤷‍♂️ From a brief glance at the code, it didn't look like any of the requests were bad requests, and it didn't look like Connection: close was being sent; so just guessing that cancelling without reading the response might be something to look at?

@antonfirsov
Member Author

I think I managed to track it down:

Client-side lingering is caused entirely by cancellation. Setting -cancelRate 0 completely eliminates the lingering sockets, setting -cancelRate 0.2 will instantly exhaust the port range in the first minute of the run. I assume we are closing down HTTP 1.1 connections whenever a cancellation happens, and there is no good way to avoid that with HTTP 1.1. (@geoffkizer @scalablecory can you confirm?)

Server-side lingering is caused by the GET Aborted workload. On the server, this triggers HttpContext.Abort() on each request, which - I assume - will close the server-side socket. Eventually, this leads to thousands of new HTTP 1.1 connections being created and aborted as we round-robin through the stress workloads.
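
For context, a rough illustration of the server-side abort (a hypothetical minimal endpoint, not the actual stress-server code):

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var app = WebApplication.Create(args);
app.MapGet("/abort", async (HttpContext context) =>
{
    await context.Response.WriteAsync("partial content before the abort");
    context.Abort(); // server-side close: this HTTP 1.1 connection cannot be reused
});
app.Run();

Every request like this forces the client to establish a fresh connection for subsequent requests, which is what drives the connection churn described above.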

In short: it seems to me that nothing unexpected is happening, and we need to work around this in the stress test code.

@geoffkizer
Contributor

Sounds right to me.

@antonfirsov
Member Author

I think there are two things we can do:

  1. Reduce the cancelRate used with HTTP 1.1 to reduce the client-side lingering. I'm currently running experiments to find the minimum necessary change, but something like 8% will likely do the job.
  2. Currently the 15 client operations are executed with the same uniform frequency. I would add the ability to define weights for each, so we can distribute their runs unequally instead of giving each a 1/15 slot. That would let us reduce the number of GET Aborted runs and thus reduce server-side lingering. (I'm not sure how important that scenario is in terms of its ability to find bugs.) See the sketch right below this list.
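
A minimal sketch of the weighting idea in point 2 (hypothetical names, not the stress suite's actual types):

using System;
using System.Collections.Generic;
using System.Linq;

public sealed class WeightedOperationPicker
{
    private readonly (string Name, double Weight)[] _ops;
    private readonly double _total;
    private readonly Random _random;

    public WeightedOperationPicker(IEnumerable<(string Name, double Weight)> ops, int seed)
    {
        _ops = ops.ToArray();
        _total = _ops.Sum(o => o.Weight);
        _random = new Random(seed);
    }

    // Weighted random choice: operations with a larger weight are picked proportionally more often.
    public string Next()
    {
        double roll = _random.NextDouble() * _total;
        foreach ((string name, double weight) in _ops)
        {
            if (roll < weight) return name;
            roll -= weight;
        }
        return _ops[^1].Name; // guard against floating-point drift
    }
}

// Usage idea: give "GET Aborted" a much smaller share than the uniform 1/15.
// var picker = new WeightedOperationPicker(new[] { ("GET", 1.0), ("GET Partial", 1.0), ("GET Aborted", 0.1) }, seed: 42);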

(1) is most likely enough to address this, but I'm curious about your thoughts regarding (2).

@ManickaP @alnikola @eiriktsarpalis @stephentoub?

@alnikola
Contributor

I would try a minimum working solution, which means starting with (1) and seeing if it's enough. Only go to (2) if it's not.

@antonfirsov
Member Author

This is blocked on #51117 now 😞

@geoffkizer
Contributor

+1 to what @alnikola said.

Also: re 1, I think it's fine to reduce the cancel rate even further, like just make it 5% or whatever. We should still get plenty of coverage of cancellation, and having a bit of extra room here is probably a good thing.

@eiriktsarpalis
Member

For context, many of these defaults were informed by stress testing specific to HTTP 2, which multiplexes requests. There, adding frequent cancellation was a relatively inexpensive way of introducing entropy into the socket handler; that is not necessarily true in the case of HTTP 1.1.

@geoffkizer
Contributor

And HTTP2 would not have the same issue of TIME_WAIT on cancellation, because it is multiplexed. So perhaps we should have a higher rate for cancellation on HTTP2 than HTTP/1.1.

@ghost ghost added the in-pr (There is an active PR which will close this issue when it is merged) label Apr 14, 2021
@antonfirsov
Member Author

Unfortunately, reducing the client cancellation is not enough on the actual CI machine, but disabling the GET Aborted (server-side abort) scenario in addition seems to do the job.

@geoffkizer @ManickaP @alnikola how important is that scenario in its ability to find bugs in HTTP 1.1? A relatively easy fix could be to introduce a switch to completely disable it.

@geoffkizer
Contributor

I don't think we should disable it entirely. It does seem like it induces code paths that aren't otherwise hit.

Can we just reduce it significantly?

@karelz karelz removed the untriaged (New issue has not been triaged by the area owner) label Apr 15, 2021
@karelz karelz added this to the 6.0.0 milestone Apr 15, 2021
@karelz karelz added the test-bug (Problem in test source code (most likely)) label Apr 15, 2021
@antonfirsov
Member Author

Can we just reduce it significantly?

Yes, but I wouldn't pollute the code with hacks specific to GET Aborted; instead I would introduce general logic that allows assigning weights to operations, as described in point 2 of #50854 (comment).

@ManickaP
Member

Yeah, I like the per-operation weights, and reducing GET Aborted through that. Seems like a good feature to have.

@antonfirsov
Member Author

According to experiments in #51233, failures may still happen even with cancelRate=0.5 and P(GET Aborted)=0.2.

Summary of the symptoms again:

  • The socket error is WSAETIMEDOUT (10060) when hosting the server and the client in separate containers, and WSAECONNREFUSED (10061) when running the tests without Docker.
    • This is weird, since port exhaustion typically results in WSAEADDRINUSE (10048) or similar
    • But: getting rid of the lingering sockets entirely (by removing cancellation and the GET Aborted operation) fixes the issue
  • The failure is machine-dependent in a weird way. I have another VM from the same image where it occurs much less often, and a third VM (again from the same image) where I cannot reproduce the failure at all
  • The ephemeral port range is already extended to ~65k on the machines

Feels like the problem is rooted in some weird OS behavior I don't understand. @wfurt do you have any ideas?

@wfurt
Member

wfurt commented Apr 20, 2021

I know this is late in the discussion, but why does HttpClient keep the half-closed sockets after cancellation?
I would expect that, especially when the client tries to abort/cancel, it would dispose the socket as quickly as possible so it can be reused. It seems to me that playing with the cancelRate is just a workaround.

@antonfirsov
Member Author

antonfirsov commented Apr 20, 2021

@wfurt it does dispose (close) the sockets. The sockets then sit in TIME_WAIT for a while after closure.

@benaadams
Member

I know this is late in the discussion but why HttpClient keeps the half-closed sockets after cancellation?

It's the operating system that keeps them open as part of the normal operation of TCP

@wfurt
Member

wfurt commented Apr 20, 2021

Yeah, I misread the conversation @benaadams.
We should be able to tune this on Linux via

/proc/sys/net/ipv4/tcp_tw_recycle
/proc/sys/net/ipv4/tcp_tw_reuse

and on Windows via HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TCPIP\Parameters\TcpTimedWaitDelay.

That should help with the lingering sockets and port exhaustion.
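
On Windows, that registry value could be set like this (sketch only; requires elevation, takes effect after a reboot, and may need the Microsoft.Win32.Registry package depending on the target framework; the valid range is commonly documented as 30-300 seconds):

using Microsoft.Win32;

// Shortens how long closed sockets sit in TIME_WAIT before the port is released.
Registry.SetValue(
    @"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters",
    "TcpTimedWaitDelay",
    30,                      // seconds
    RegistryValueKind.DWord);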

@antonfirsov
Member Author

antonfirsov commented Apr 20, 2021

The problem is that I'm starting to question my original assumptions.

Port exhaustion should end with WSAEADDRINUSE on Windows. This is deterministically reproducible with an application that uses up the port range by connecting and closing socket pairs. Seeing WSAETIMEDOUT and WSAECONNREFUSED, in a manner that is not reproducible across machines, is very weird.
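
For completeness, here's the kind of standalone repro I mean (a toy program, not part of the stress suite; it relies on the WSAEADDRINUSE behavior described above):

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class PortExhaustionRepro
{
    static async Task Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        int port = ((IPEndPoint)listener.LocalEndpoint).Port;

        // Accept connections and close them slightly later than the client does, so the
        // client is the side that initiates the FIN and its socket lands in TIME_WAIT.
        _ = Task.Run(async () =>
        {
            while (true)
            {
                TcpClient accepted = await listener.AcceptTcpClientAsync();
                _ = Task.Delay(50).ContinueWith(t => accepted.Close());
            }
        });

        for (int i = 0; ; i++)
        {
            try
            {
                using var client = new TcpClient();
                await client.ConnectAsync(IPAddress.Loopback, port); // each connect consumes one ephemeral port
                // Dispose closes the socket right away; the port stays reserved while in TIME_WAIT.
            }
            catch (SocketException e) when (e.SocketErrorCode == SocketError.AddressAlreadyInUse)
            {
                Console.WriteLine($"WSAEADDRINUSE after {i} connections");
                break;
            }
        }
    }
}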

What made me think it's port exhaustion is that the issue disappears if I get rid of the connection-abort scenarios. But the whole picture tells me there is some further weirdness in the OS that I fail to understand.

@antonfirsov
Member Author

antonfirsov commented Apr 23, 2021

I think I now have proof that this is very unlikely to be an issue with TIME_WAIT:

  1. After reducing cancelRate and the amount of GET Aborted by an order of magnitude, there are only a few thousand sockets lingering, but I'm still seeing WSAETIMEDOUT (Docker) and WSAECONNREFUSED (outside of Docker). From what I see, the probability of those errors scales with the amount of connection recreation in a non-linear manner. They disappear when cancelRate=0 and P(GET Aborted)=0, but even with a very small rate of connection recreation there is a significant chance for them to happen.

  2. By increasing cancelRate to a very high value, while also setting TcpTimedWaitDelay = 5 minutes, actual port exhaustion starts to happen after a while, with the expected WSAEADDRINUSE error as a symptom.


The next step would be to collect ETW traces and packet captures from the test lab VM; however, this is going to be a difficult chore, since disk space is limited there and the data we need to analyze is on the scale of several tens of GB.

@antonfirsov antonfirsov changed the title HttpStress: connection failures with HTTP 1.1 -- 16k sockets lingering in TIME_WAIT HttpStress: connection failures with HTTP 1.1 -- related to high connection recreation rate Apr 23, 2021
@karelz
Member

karelz commented Apr 23, 2021

@antonfirsov just an idea: what about collecting ETW traces (and hopefully a packet capture as well) into a circular buffer? I know ETW has the option ...
If we modify the tests/runtime to FailFast as soon as it happens, that might work ...

@antonfirsov
Member Author

Disabling the Windows firewall makes the connection failures disappear, which practically rules out the possibility of a product bug for this failure type.

The probability of the issue does not scale linearly with the amount of connection recreation, so even with a very low (~0.5%) cancellation/server abortion rate there will be a significant chance for connection failures.

We need a quick, temporary workaround to remove this failure type ASAP, since the high number of these false negatives renders our stress test reports practically useless. Considering other priorities, I would go with the easiest solution and disable cancellation and GET Aborted for HTTP 1.1, until we are able to find a more sophisticated fix by root-causing the firewall issue with the help of the Windows Firewall team. @geoffkizer @ManickaP @alnikola agreed?

@ManickaP
Member

ManickaP commented May 4, 2021

Instead of completely turning off those scenarios, could we rather catch and filter the exceptions by the known error codes for this issue? That way we don't miss potentially different problems.
But I don't know how hard it would be to achieve, so if it's too complicated, don't bother; it's probably not worth the effort.

@antonfirsov
Member Author

@ManickaP this is the second thing I was thinking about; it might actually be a better solution. It could be done with a switch, -doNotFailOnConnectionTimeout, which would make us filter out SocketException(10060). My main problem with this approach is not the extra effort, but rather the (otherwise unnecessary) arbitrary complexity we would add to the test code.
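
For illustration, the filtering could look roughly like this (hypothetical helper; the -doNotFailOnConnectionTimeout switch does not exist yet, and this assumes the SocketException surfaces as the HttpRequestException's inner exception, which is the usual shape of these failures):

using System;
using System.Net.Http;
using System.Net.Sockets;

static class KnownFailureFilter
{
    // Classifies the connect failures seen in this issue (WSAETIMEDOUT inside Docker,
    // WSAECONNREFUSED outside) so they can be counted separately instead of failing the run.
    public static bool IsKnownConnectFailure(Exception ex) =>
        ex is HttpRequestException hre &&
        hre.InnerException is SocketException se &&
        (se.SocketErrorCode == SocketError.TimedOut ||          // WSAETIMEDOUT (10060)
         se.SocketErrorCode == SocketError.ConnectionRefused);  // WSAECONNREFUSED (10061)
}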

I wonder about other opinions?

@benaadams
Member

benaadams commented May 4, 2021

Reminds me of this SignalR/SignalR/../WebSocketHandler.cs#L245-L255

// If this exception is due to the underlying TCP connection going away, treat as a normal close
// rather than a fatal exception.
COMException ce = ex as COMException;
if (ce != null)
{
    switch ((uint)ce.ErrorCode)
    {
        // These are the three error codes we've seen in testing which can be caused by the TCP connection going away unexpectedly.
        case 0x800703e3:
        case 0x800704cd:
        case 0x80070026:
            return false;
    }
}

@antonfirsov
Member Author

@benaadams this is interesting, do you know more details about that story? Do those COMException error codes map to WSA error codes?

/cc @davidfowl

@antonfirsov antonfirsov changed the title HttpStress: connection failures with HTTP 1.1 -- related to high connection recreation rate HttpStress: connection failures with HTTP 1.1 caused by Windows Firewall May 6, 2021
@davidfowl
Member

@GrabYourPitchforks wrote that code a long time ago 😄

antonfirsov added a commit that referenced this issue May 11, 2021
#50854 seems to be rooted in an unknown firewall problem, and from the POV of HTTP stress testing it's a false negative. We need a quick solution to remove this failure type, because the high number of failures renders our stress test reports useless. Since the build agents do not have a public IP, it seems OK to disable the Windows Firewall for the runs.
@ghost ghost removed the in-pr (There is an active PR which will close this issue when it is merged) label May 22, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 21, 2021