
Start new per-host queue when an async over sync Dns call is cancelled #92862

Closed
wants to merge 4 commits

Conversation

antonfirsov
Member

@antonfirsov antonfirsov commented Oct 1, 2023

We had several customer reports of #81023 -- long-running getaddrinfo lookups stalling Dns.*Async calls as a result of #49171. This is a simple mitigation of the problem: it starts a new per-hostname Task queue when a cancellation fires, and it also fixes #92054 so that the CancellationToken flows down to GetHostAddressesAsync.

The downside of this mitigation is that it needs an actual cancellation (~ConnectTimeout) to work. IMO this is still better than having no solution to the problem at the .NET level at all. We could implement a more sophisticated fix (see #92863), but I'm assuming that we are looking for a low-risk change to service .NET 8 and maybe earlier versions. A rough sketch of the idea is included below.

Fixes #92054
Fixes #92045
Mitigates #81023 (doesn't fully fix the issue)
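
A rough sketch of the mechanism, for illustration only -- the type and member names below (SerializedDnsSketch, s_queues, QueueGetHostAddressesAsync) are made up for this description and differ from the actual identifiers in Dns.cs:

using System.Collections.Concurrent;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

internal static class SerializedDnsSketch
{
    // One queue (task chain) per host name.
    private static readonly ConcurrentDictionary<string, Task> s_queues = new();

    public static Task<IPAddress[]> QueueGetHostAddressesAsync(string host, CancellationToken cancellationToken)
    {
        // Serialize lookups for the same host: chain the new async-over-sync
        // getaddrinfo call after the current tail of this host's queue.
        Task prev = s_queues.GetOrAdd(host, Task.CompletedTask);
        Task<IPAddress[]> lookup = prev.ContinueWith(
            static (_, state) => Dns.GetHostAddresses((string)state!),
            host,
            CancellationToken.None,
            TaskContinuationOptions.DenyChildAttach,
            TaskScheduler.Default);
        s_queues[host] = lookup;

        if (cancellationToken.CanBeCanceled)
        {
            // The mitigation: when a caller's token fires (typically driven by
            // ConnectTimeout), drop this host's queue so subsequent requests start
            // a fresh chain instead of waiting behind a stalled lookup.
            // (Registration disposal is omitted for brevity.)
            cancellationToken.UnsafeRegister(static state =>
            {
                var (queues, key) = ((ConcurrentDictionary<string, Task>, string))state!;
                queues.TryRemove(key, out _);
            }, (s_queues, host));
        }

        return lookup;
    }
}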

@ghost

ghost commented Oct 1, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


Author: antonfirsov
Assignees: -
Labels:

area-System.Net.Sockets

Milestone: -

Comment on lines -644 to -646
/// serialized. Once the data for that host is cached locally by the OS, the subsequent requests should all complete
/// very quickly, and if the head-of-line request is taking a long time due to the connection to the server, we won't
/// block lots of threads all getting data for that one host. We also still want to issue the request to the OS, rather
Member Author

@antonfirsov antonfirsov Oct 1, 2023


Once the data for that host is cached locally by the OS, the subsequent requests should all complete very quickly, and if the head-of-line request is taking a long time due to the connection to the server, we won't block lots of threads all getting data for that one host

This turned out to be a false assumption. On Ubuntu there is no caching by default, and each getaddrinfo call sends an A & AAAA DNS request and takes 5-50 ms. If a packet is lost, it waits 5 seconds before retrying.

Member


This turned out to be a false assumption. In Ubuntu there is no caching by default, and each getaddrinfo call sends an A & AAAA DNS request and takes 5-50ms.

On what are you basing this? On Ubuntu for:

[Benchmark]
public object GetHostEntry() => Dns.GetHostEntry("www.microsoft.com");

I get:

Method        Mean
GetHostEntry  498.9 us

Member Author


[Benchmark]
public void GetHostAddresses() => Dns.GetHostAddresses("fb.lt.main.ml.mdn.skype.net");

averages 24.84 ms in my local Ubuntu 22.04 environment and 9.051 ms in the Azure D32 VM from #92054.

Member

@MihaZupan MihaZupan Oct 1, 2023


fb.lt.main.ml.mdn.skype.net

That name resolves through a CNAME of fb.lt.main.ml.mdn.trafficmanager.net, which advertises a CNAME for fileserver-backend-lb.lt-westus-01.ic3-middle-lane-main-lt.westus-test.cosmic-int.office.net with a 0 second TTL. So my guess is the OS is just honoring the TTLs in this case.

The www.microsoft.com example above, on the other hand, uses non-zero TTLs.

(that's in no way saying that this isn't a real scenario, DNS-based connection load balancing works this way)

Member

@stephentoub stephentoub Oct 1, 2023


You can also check whether DNS caching in your system is enabled, e.g.

systemctl is-active systemd-resolved

My point is simply that stating that it's a false assumption that OSes cache DNS information is itself a false assumption. Every major OS I'm aware of is capable of doing DNS caching. Whether it's enabled is a different question, though to my knowledge it is by default in Ubuntu 22.04. And we shouldn't just be killing the associated comment. If you want to augment the comment to say that caching might be disabled, that's fine.

Member Author


I was wrong in thinking that answers are never cached on Ubuntu by default. However, to me the comment suggests that we can generally rely on short completion times because of the caching, which is also not true; e.g., as Miha pointed out, answers with TTL=0 are not cached, which then increases the probability of hitting #81023.

I will add the comment back & augment it.

if (cancellationToken.CanBeCanceled)
{
task.ContinueWith((task, key) =>
task.ContinueWith(delegate
Member


Previously this wasn't allocating a new delegate and a new closure on every call. Now it is, as it's closing over the key, the task, and the startingTimestamp. It'll result in that allocation even if !CanBeCanceled, because the state it's closing over is at the top-level method scope.

Member Author


This is a fix for #92045. I don't see a way to capture startingTimestamp without an allocation. We can pass the info in a tuple to avoid the allocation in the !CanBeCanceled case. Or we can create a private class & cache instances to reduce the allocations in general.

Member


At a minimum you can pass the required state in via the object state parameter. That'll require a tuple allocation/boxing, but the one allocation for that is still better than the two allocations for the closure and delegate. And it'll make it specific to the CanBeCanceled case, whereas with how it's currently written the closure allocation will happen first thing in the method body.
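
Roughly something like the following -- the surrounding method and the exact member names in the PR will differ; this standalone sketch only illustrates the pattern (the Console.WriteLine body is a placeholder):

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

internal static class ContinuationSketch
{
    // Hypothetical helper; in the PR the continuation is attached inline.
    internal static void OnLookupCompleted(Task task, object key, long startingTimestamp, CancellationToken cancellationToken)
    {
        if (cancellationToken.CanBeCanceled)
        {
            // A static lambda captures nothing, so its delegate can be cached by
            // the compiler; the boxed tuple passed via the state parameter is the
            // only per-call allocation, and it only happens on this branch.
            task.ContinueWith(static (_, state) =>
            {
                var (hostKey, start) = ((object, long))state!;
                Console.WriteLine($"Lookup for {hostKey} took {Stopwatch.GetElapsedTime(start)}");
            },
            (key, startingTimestamp),
            CancellationToken.None,
            TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);
        }
    }
}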

using (cancellationToken.UnsafeRegister(static s =>
{
// In case a request is cancelled, start a new queue for this key.
// This helps avoid congestion when a getaddrinfo call runs for too long.
Member


What's the rationale for why a cancellation request means we no longer want to serialize?

Member Author


If configured well, a cancellation would likely kick in for a long-running or stalled request. Closing the existing queue and starting a new one makes sure that requests following the cancellation are not blocked by the long-running request. This helped in my experiments, and validations are in progress in customer environments.
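
For reference, "configured well" on the HttpClient side could look like the sketch below (the 10-second value is arbitrary). SocketsHttpHandler.ConnectTimeout is infinite by default, so without setting it no cancellation fires for a stalled connection attempt and this mitigation never kicks in:

using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Default is Timeout.InfiniteTimeSpan; a finite value cancels a stalled
    // connection attempt (which includes the DNS lookup), and that cancellation
    // is what triggers the queue reset in this PR.
    ConnectTimeout = TimeSpan.FromSeconds(10)
};
using var client = new HttpClient(handler);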

Member

@stephentoub stephentoub Oct 2, 2023


One of the original motivations for doing this queueing was a DNS request that could get stalled, and thus all requests for that DNS endpoint would end up blocking all threads in the pool. Doesn't this effectively put us back into the same place?

Member Author

@antonfirsov antonfirsov Oct 2, 2023


From what I see, #48566 doesn't talk about stalled (up to several seconds) requests but about slow (tens of milliseconds?) requests using up the ThreadPool threads. This change does not remove the serialization of those: the queue is only reset when a cancellation kicks in, which is typically triggered by a ConnectTimeout and therefore usually in the multiple-seconds range. If the customer mentioned in #34633 had had stalled requests, they would likely have come back to us reporting #81023 much earlier.

cc @davidfowl if you can recall how long the slow requests took in #34633. Without that, I would assume it's a case similar to what we discussed in #92862 (comment) and that "slow" means the requests are in the 5-50 ms range.

Member


Right, so if every request for a given destination ends up taking a longer period of time and ends up being canceled as a result, every cancellation is going to cause subsequent requests to run concurrently with ones already there, and if it happens enough, won't it similarly spread out over all of the pool threads?

when a cancellation kicks in which is typically triggered by a ConnectTimeout

That's one source. The Dns APIs that support cancellation are public.
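
For example, any caller can pass its own token to the public API (the host name and timeout below are arbitrary):

using System;
using System.Net;
using System.Threading;

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(1));
try
{
    // Public, cancellable API; with this change, a cancellation here would also
    // reset the per-host queue.
    IPAddress[] addresses = await Dns.GetHostAddressesAsync("example.com", cts.Token);
}
catch (OperationCanceledException)
{
    // Cancelled by the caller's token, independent of any ConnectTimeout.
}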

Member Author

@antonfirsov antonfirsov Oct 2, 2023


if every request for a given destination ends up taking a longer period of time and ends up being canceled as a result, every cancellation is going to cause subsequent requests to run concurrently with ones already there, and if it happens enough, won't it similarly spread out over all of the pool threads?

Yes, but how likely is this scenario, and is it more probable than hitting #81023? To me it looks like having all requests to the same host stalled for a long enough period to starve the ThreadPool is a catastrophic event that would seriously impact the application's functionality even without the pool starvation, and the customer has to go and fix the environmental setup anyway. Occasional stalled requests are more common IMO, and the problem is that they currently kill a possibly long chain of follow-up requests.

That's one source. The Dns APIs that support cancellation are public.

IMO this doesn't really change things. Regularly occurring cases of cancellation are most likely timeouts, and cancelling a significant percentage of DNS lookups by setting an overly short timeout doesn't look like a sane scenario to me.

Member


Occasional stalled requests are more common IMO

What is the cause of the stalled requests you're seeing that makes this the case?

Member Author


In #92054 we see the OS resolver (successfully) retrying a query after the default timeout of 5 seconds, with getaddrinfo blocking for this 5-second period. The retries are sporadic and do not come in bursts. In my understanding they are caused by packet loss, which shouldn't be very uncommon.
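
For reference, the relevant glibc resolver defaults look roughly like this (the nameserver address is a placeholder):

# /etc/resolv.conf -- glibc resolver; timeout:5 and attempts:2 are the defaults,
# so a single lost UDP packet blocks getaddrinfo for ~5 seconds before the retry.
nameserver 10.0.0.2
options timeout:5 attempts:2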

I don't have equally confident information from the other cases, because this typically happens in production systems in a sporadic manner. Currently I'm working with @matt-augustine to check whether a private build of the changes in this PR helps with their symptoms. Would that bring enough confidence for you that the change is net positive?

@matt-augustine

We see the issue in a production service with a containerized application running in a Kubernetes cluster on AKS. It usually occurs in new pods shortly after they are created. The cluster is using the default Kubernetes DNS configuration, which as I understand it means that DNS lookups are not cached in the container/pod, so every DNS query is routed to the CoreDNS service as the nameserver, which might be running on a different node.

Our theory is that UDP packet loss might be responsible for failed/retried DNS queries when new pods are created, but we have not confirmed that. We are currently testing a private build to see if it helps, and I'm going to try to put together a simpler, more repeatable way to test it.

@karelz karelz added this to the 9.0.0 milestone Oct 24, 2023
@antonfirsov
Member Author

Currently #81023 is no longer blocking the customers who originally reported the issue. Closing the PR in favor of a broader investigation into potential DNS improvements for 9.0.

@antonfirsov antonfirsov closed this Jan 4, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Feb 4, 2024