
Start new per-host queue when an async over sync Dns call is cancelled #92862

Closed
wants to merge 4 commits

Conversation

antonfirsov
Member

@antonfirsov antonfirsov commented Oct 1, 2023

We had several customer reports of #81023 -- long-running getaddrinfo lookups stalling Dns.*Async calls as a result of #49171. This is a simple mitigation of the problem: it starts a new per-hostname Task queue when a cancellation fires, and it also fixes #92054 so that the CancellationToken flows down to GetHostAddressesAsync.

The downside of this mitigation is that it needs an actual cancellation (~ConnectTimeout) to work. IMO this is still better than having no solution to the problem at the .NET level at all. We could implement a more sophisticated fix (see #92863), but I'm assuming that we are looking for a low-risk change to service .NET 8 and maybe earlier versions. A rough sketch of the idea is included below.

Fixes #92054
Fixes #92045
Mitigates #81023 (doesn't fully fix the issue)
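
A rough sketch of the mechanism, for illustration only -- the type and member names below (SerializedDnsSketch, s_queues, QueueGetHostAddressesAsync) are made up for this description and differ from the actual identifiers in Dns.cs:

using System.Collections.Concurrent;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

internal static class SerializedDnsSketch
{
    // One queue (task chain) per host name.
    private static readonly ConcurrentDictionary<string, Task> s_queues = new();

    public static Task<IPAddress[]> QueueGetHostAddressesAsync(string host, CancellationToken cancellationToken)
    {
        // Serialize lookups for the same host: chain the new async-over-sync
        // getaddrinfo call after the current tail of this host's queue.
        Task prev = s_queues.GetOrAdd(host, Task.CompletedTask);
        Task<IPAddress[]> lookup = prev.ContinueWith(
            static (_, state) => Dns.GetHostAddresses((string)state!),
            host,
            CancellationToken.None,
            TaskContinuationOptions.DenyChildAttach,
            TaskScheduler.Default);
        s_queues[host] = lookup;

        if (cancellationToken.CanBeCanceled)
        {
            // The mitigation: when a caller's token fires (typically driven by
            // ConnectTimeout), drop this host's queue so subsequent requests start
            // a fresh chain instead of waiting behind a stalled lookup.
            // (Registration disposal is omitted for brevity.)
            cancellationToken.UnsafeRegister(static state =>
            {
                var (queues, key) = ((ConcurrentDictionary<string, Task>, string))state!;
                queues.TryRemove(key, out _);
            }, (s_queues, host));
        }

        return lookup;
    }
}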

@ghost

ghost commented Oct 1, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


Author: antonfirsov
Assignees: -
Labels:

area-System.Net.Sockets

Milestone: -

Comment on lines -644 to -646
/// serialized. Once the data for that host is cached locally by the OS, the subsequent requests should all complete
/// very quickly, and if the head-of-line request is taking a long time due to the connection to the server, we won't
/// block lots of threads all getting data for that one host. We also still want to issue the request to the OS, rather
Member Author

@antonfirsov antonfirsov Oct 1, 2023


Once the data for that host is cached locally by the OS, the subsequent requests should all complete very quickly, and if the head-of-line request is taking a long time due to the connection to the server, we won't block lots of threads all getting data for that one host

This turned out to be a false assumption. On Ubuntu there is no caching by default, and each getaddrinfo call sends an A & AAAA DNS request and takes 5-50 ms. If a packet is lost, it waits 5 seconds before retrying.

Member


This turned out to be a false assumption. In Ubuntu there is no caching by default, and each getaddrinfo call sends an A & AAAA DNS request and takes 5-50ms.

On what are you basing this? On Ubuntu for:

[Benchmark]
public object GetHostEntry() => Dns.GetHostEntry("www.microsoft.com");

I get:

Method        Mean
GetHostEntry  498.9 us

Member Author


[Benchmark]
public void GetHostAddresses() => Dns.GetHostAddresses("fb.lt.main.ml.mdn.skype.net");

averages 24.84 ms in my local Ubuntu 22.04 environment and 9.051 ms in the Azure D32 VM from #92054.

Member

@MihaZupan MihaZupan Oct 1, 2023


fb.lt.main.ml.mdn.skype.net

That name resolves through a CNAME of fb.lt.main.ml.mdn.trafficmanager.net, which advertises a CNAME for fileserver-backend-lb.lt-westus-01.ic3-middle-lane-main-lt.westus-test.cosmic-int.office.net with a 0 second TTL. So my guess is the OS is just honoring the TTLs in this case.

The www.microsoft.com example above, on the other hand, uses non-zero TTLs.

(that's in no way saying that this isn't a real scenario, DNS-based connection load balancing works this way)

Member

@stephentoub stephentoub Oct 1, 2023


You can also check whether DNS caching in your system is enabled, e.g.

systemctl is-active systemd-resolved

My point is simply that stating that it's a false assumption that OSes cache DNS information is itself a false assumption. Every major OS I'm aware of is capable of doing DNS caching. Whether it's enabled is a different question, though to my knowledge it is by default in Ubuntu 22.04. And we shouldn't just be killing the associated comment. If you want to augment the comment to say that caching might be disabled, that's fine.

Member Author


I was wrong in thinking that answers are never cached on Ubuntu by default. However, to me the comment suggests that we can generally rely on short completion times because of the caching, which is also not true; e.g., as Miha pointed out, answers with TTL=0 are not cached, which then increases the probability of hitting #81023.

I will add the comment back & augment it.

if (cancellationToken.CanBeCanceled)
{
task.ContinueWith((task, key) =>
task.ContinueWith(delegate
Member


Previously this wasn't allocating a new delegate and a new closure on every call. Now it is, as it's closing over the key, the task, and the startingTimestamp. It'll result in that allocation even if !CanBeCanceled, because the state it's closing over is at the top-level method scope.

Member Author


This is a fix for #92045. I don't see a way to capture startingTimestamp without an allocation. We can pass the info in a tuple to avoid the allocation in the !CanBeCanceled case. Or we can create a private class & cache instances to reduce the allocations in general.

Member


At a minimum you can pass the required state in via the object state parameter. That'll require a tuple allocation/boxing, but the one allocation for that is still better than the two allocations for the closure and delegate. And it'll make it specific to the CanBeCanceled case, whereas with how it's currently written the closure allocation will happen first thing in the method body.
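
Roughly something like the following -- the surrounding method and the exact member names in the PR will differ; this standalone sketch only illustrates the pattern (the Console.WriteLine body is a placeholder):

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

internal static class ContinuationSketch
{
    // Hypothetical helper; in the PR the continuation is attached inline.
    internal static void OnLookupCompleted(Task task, object key, long startingTimestamp, CancellationToken cancellationToken)
    {
        if (cancellationToken.CanBeCanceled)
        {
            // A static lambda captures nothing, so its delegate can be cached by
            // the compiler; the boxed tuple passed via the state parameter is the
            // only per-call allocation, and it only happens on this branch.
            task.ContinueWith(static (_, state) =>
            {
                var (hostKey, start) = ((object, long))state!;
                Console.WriteLine($"Lookup for {hostKey} took {Stopwatch.GetElapsedTime(start)}");
            },
            (key, startingTimestamp),
            CancellationToken.None,
            TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);
        }
    }
}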

using (cancellationToken.UnsafeRegister(static s =>
{
// In case a request is cancelled, start a new queue for this key.
// This helps avoid congestion when a getaddrinfo call runs for too long.
Member


What's the rationale for why a cancellation request means we no longer want to serialize?

Member Author


If configured well, a cancellation would likely kick in for a long-running or stalled request. Closing the existing queue and starting a new one makes sure that requests following the cancellation are not blocked by the long-running request. This helped in my experiments, and validations are in progress in customer environments.
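
For reference, "configured well" on the HttpClient side could look like the sketch below (the 10-second value is arbitrary). SocketsHttpHandler.ConnectTimeout is infinite by default, so without setting it no cancellation fires for a stalled connection attempt and this mitigation never kicks in:

using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Default is Timeout.InfiniteTimeSpan; a finite value cancels a stalled
    // connection attempt (which includes the DNS lookup), and that cancellation
    // is what triggers the queue reset in this PR.
    ConnectTimeout = TimeSpan.FromSeconds(10)
};
using var client = new HttpClient(handler);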

Member

@stephentoub stephentoub Oct 2, 2023


One of the original motivations for doing this queueing was a DNS request that could get stalled, and thus all requests for that DNS endpoint would end up blocking all threads in the pool. Doesn't this effectively put us back into the same place?

Member Author

@antonfirsov antonfirsov Oct 2, 2023


From what I see, #48566 doesn't talk about stalled (up to several seconds) requests but about slow (tens of milliseconds?) requests using up the ThreadPool threads. This change does not remove the serialization of those: the queue is only reset when a cancellation kicks in, which is typically triggered by a ConnectTimeout and therefore usually in the multiple-seconds range. If the customer mentioned in #34633 had had stalled requests, they would likely have come back to us reporting #81023 much earlier.

cc @davidfowl if you can recall how long the slow requests took in #34633. Without that, I would assume it's a case similar to what we discussed in #92862 (comment) and that "slow" means the requests are in the 5-50 ms range.

Member


Right, so if every request for a given destination ends up taking a longer period of time and ends up being canceled as a result, every cancellation is going to cause subsequent requests to run concurrently with ones already there, and if it happens enough, won't it similarly spread out over all of the pool threads?

when a cancellation kicks in which is typically triggered by a ConnectTimeout

That's one source. The Dns APIs that support cancellation are public.
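
For example, any caller can pass its own token to the public API (the host name and timeout below are arbitrary):

using System;
using System.Net;
using System.Threading;

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(1));
try
{
    // Public, cancellable API; with this change, a cancellation here would also
    // reset the per-host queue.
    IPAddress[] addresses = await Dns.GetHostAddressesAsync("example.com", cts.Token);
}
catch (OperationCanceledException)
{
    // Cancelled by the caller's token, independent of any ConnectTimeout.
}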

Member Author

@antonfirsov antonfirsov Oct 2, 2023


if every request for a given destination ends up taking a longer period of time and ends up being canceled as a result, every cancellation is going to cause subsequent requests to run concurrently with ones already there, and if it happens enough, won't it similarly spread out over all of the pool threads?

Yes, but how likely is this scenario, and is it more probable than hitting #81023? To me it looks like having all requests to the same host stalled for a long enough period to starve the ThreadPool is a catastrophic event that would seriously impact the application's functionality even without the pool starvation, and the customer has to go and fix the environmental setup anyway. Occasional stalled requests are more common IMO, and the problem is that they currently kill a possibly long chain of follow-up requests.

That's one source. The Dns APIs that support cancellation are public.

IMO this doesn't really change things. Regularly occurring cases of cancellation are most likely timeouts, and cancelling a significant percentage of DNS lookups by setting an overly short timeout doesn't look like a sane scenario to me.

Member


Occasional stalled requests are more common IMO

What is the cause of the stalled requests you're seeing that makes this the case?

Member Author


In #92054 we see the OS resolver (successfully) retrying a query after the default timeout of 5 seconds, with getaddrinfo blocking for this 5-second period. The retries are sporadic and do not come in bursts. In my understanding they are caused by packet loss, which shouldn't be very uncommon.
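
For reference, the relevant glibc resolver defaults look roughly like this (the nameserver address is a placeholder):

# /etc/resolv.conf -- glibc resolver; timeout:5 and attempts:2 are the defaults,
# so a single lost UDP packet blocks getaddrinfo for ~5 seconds before the retry.
nameserver 10.0.0.2
options timeout:5 attempts:2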

I don't have equally confident information from the other cases, because this typically happens in production systems in a sporadic manner. Currently I'm working with @matt-augustine to check whether a private build of the changes in this PR helps with their symptoms. Would that bring enough confidence for you that the change is net positive?

@matt-augustine

We see the issue in a production service with a containerized application running in a Kubernetes cluster on AKS. It usually occurs in new pods shortly after they are created. The cluster is using the default Kubernetes DNS configuration, which as I understand it means that DNS lookups are not cached in the container/pod, so every DNS query is routed to the CoreDNS service as the nameserver, which might be running on a different node.

Our theory is that UDP packet loss might be responsible for failed/retried DNS queries when new pods are created, but we have not confirmed that. We are currently testing a private build to see if it helps, and I'm going to try to put together a simpler, more repeatable way to test it.

@karelz karelz added this to the 9.0.0 milestone Oct 24, 2023
@antonfirsov
Member Author

Currently #81023 is no longer blocking the customers who originally reported the issue. Closing the PR in favor of a broader investigation into potential DNS improvements for 9.0.

@antonfirsov antonfirsov closed this Jan 4, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Feb 4, 2024