
reopen issue #10117 - IIS app pool recycle throws 503 errors #41340

Closed
1 task done
alex-jitbit opened this issue Apr 23, 2022 · 84 comments
Labels: area-networking, feature-iis

@alex-jitbit
Contributor

alex-jitbit commented Apr 23, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

The IIS app pool throws 503 errors during recycles. This is a known issue with the ANCM module, previously reported in #10117, which gathered 33 likes and a 3-year discussion. It was never fixed; instead it was automatically closed by a bot "as a clean-up due to lack of discussion".

P.S. This is not a deployment problem. There are many scenarios where an IIS app pool is recycled outside of our control (adding/removing SSL certificates, changing the IP addresses to listen on, etc. - basically, touching any IIS setting causes a recycle), and 503 errors are unacceptable for high-availability scenarios.

.NET Framework was free of this bug.

Expected Behavior

No errors during recycles.

Steps To Reproduce

See the linked issue #10117.

Exceptions (if any)

No response

.NET Version

5.0, 6.0, 7.0, 8.0

Anything else?

No response

@Tratcher
Member

#10117 (comment)

Using separate app pools and a load balancer is our recommended approach for high-availability as it allows you full flexibility over deployment process and the ability to easily revert versions.

Trying to achieve high-availability with a single instance is not recommended.

@alex-jitbit
Contributor Author

@Tratcher Like I indicated above, this is not about deployments. There are a lot of scenarios where IIS recycles the pool (see above).

@Tratcher
Member

Deployments are just one example of events that disrupt availability. A single instance is not advised for high-availability for many reasons.

@ghost

ghost commented Apr 25, 2022

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process.

@benjamin-stern

@Tratcher Even with more than a single instance, this would still strongly affect a service, as all the requests going to the server that's recycling would be returned a 503 error.

@c0shea

c0shea commented Apr 26, 2022

We have two instances running behind a load balancer and have still experienced this issue intermittently. When the app pool inevitably recycles (due to deployment, config change, etc), it starts returning 503 instead of queuing up the requests.

The load balancer doesn't immediately treat the 503s as the server being down and take it out of the rotation. Instead, it uses a polling mechanism that calls an endpoint (i.e. /status) on each instance and checks for a successful response. Even though that status endpoint is monitored fairly frequently, there is obviously plenty of time during a recycle in which a bunch of requests will fail with 503. We can't have the load balancer take the instance out of the rotation as soon as it sees 503s being returned because (1) it's not an available option in NetScaler and (2) if both instances happen to recycle at the same time, both servers would be taken out of the rotation and the service would be completely down until manual intervention told the load balancer that the requests aren't failing anymore.

@RomBrz

RomBrz commented Apr 26, 2022

In this scenario (using two or more instances behind a load balancer), since you're using a "/status" endpoint to check availability, I suggest adding a routine to your deployment or maintenance process that, before doing anything else, forces "/status" to report an "unhealthy" status, so the load balancer can remove the node from rotation before the changes are made.
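
A rough sketch of that idea with ASP.NET Core minimal APIs (the endpoint name and the marker-file mechanism are illustrative assumptions, not something prescribed in this thread):

// Sketch: a /status endpoint that can be forced "unhealthy" before planned
// maintenance by dropping a marker file next to the app (names are illustrative).
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

var drainMarker = Path.Combine(app.Environment.ContentRootPath, "drain.marker");

app.MapGet("/status", () =>
    File.Exists(drainMarker)
        ? Results.StatusCode(StatusCodes.Status503ServiceUnavailable) // load balancer pulls the node
        : Results.Ok("healthy"));

app.MapGet("/", () => "Hello from the app");

app.Run();

Before maintenance you would create the marker, wait for the health probe to mark the node down, make the changes, and then delete the marker.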

About the issue itself: the default IIS behavior on a recycle is to first start a new worker process for the application pool, route new requests to it, wait the configured time for the current requests to end, and then close/finish the old process, keeping only the new one.

Recycling an application pool could be "expected behavior" from an ASP.NET Core point of view, but looking at IIS, throwing 503s during a recycle isn't "normal behavior".

@c0shea

c0shea commented Apr 26, 2022

The problem is that ASP.NET Core doesn't use the overlapping recycle behavior that .NET Framework did. While the old worker process is being shut down (especially if it was handling a lot of in-flight requests), the new process hasn't started yet, and the requests caught in the middle get the 503.

@alex-jitbit
Contributor Author

TIL that StackOverflow also runs ASP.NET Core under IIS

[screenshot]

@luizfbicalho

luizfbicalho commented Jul 5, 2022

Is there any way to minimize this problem, or at least to detect what is causing it?
Any configuration in the application pool to minimize it?

@peter-bertok

peter-bertok commented Sep 20, 2022

Deployments are just one example of events that disrupt availability. A single instance is not advised for high-availability for many reasons.

High availability and throwing 503s from otherwise "perfectly fine servers" are separate concerns.

Most load balancers do not hide HTTP errors! If the IIS process responds to an HTTP request with 503, then that's what the user will see. In particular, none of the Azure load balancer offerings hide errors from the users. They pass them on faithfully.

If a previously working server throws 503 errors then it will take significant time for the load balancers to detect this. Minutes even, or 10+ minutes if using CDN-type solutions such as Azure Front Door.

This behavior is triggered by many actions, not all of which are resolved via hosting on multiple server instances. Scheduled recycling, for example, has been mentioned by many people as a common trigger.

Similarly, many people have pointed out in the previous thread that it's not just 503 errors that are seen, but slow uploads are also unceremoniously terminated.

@luizfbicalho

My problem seems to occur when an IIS recycle takes too long; it's not related to CPU or to memory. I asked the IT infrastructure team to add more performance counters to Grafana, but they haven't added them yet.

I'm inclined to think that the problem is with network connections. What do you think I should monitor in Grafana?

@alex-jitbit
Contributor Author

TIL that fuget.org also uses IIS + ASP.NET Core, just caught them during a recycle.
[screenshot]

How many more examples do we need?

@AndreasJilvero

In my experience, this also happens when simply setting the physical path of a website. The response is somewhat different though - a recycle renders the text "The service is unavailable", while setting the physical path just gives an empty 503 result.

https://stackoverflow.com/questions/74326315/iis-setting-physical-path-gives-503-status-for-a-few-seconds

@luizfbicalho

In my case it's not just for a few seconds; after the first 503, only an iisreset solves the problem.

@kadamgreene

We are also experiencing this issue. Is there any expectation of a fix? This is going to keep us from moving forward with .NET migration / any new work in .NET 6+. It's not the deployments (blue/green can handle that); it's the "unexpected" recycles in the run of a day that are the issue.

@cun-dp

cun-dp commented Apr 24, 2023

We are also seeing this with our nightly apppool recycles in the production environment, and with all our aspnetcore microservices.

Example of IIS Logs (anonymized):
2023-04-23 22:47:03 W3SVC5 HOSTNAME1 [IP censored] GET /app/health 443 - HTTP/1.1 - - my.fqdn.example.com 503 0 1255 202 130 15

HTTP 503 with win32 substatus 1255. According to https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--1000-1299- code 1255 matches: 1255 (0x4E7) ERROR_SERVER_SHUTDOWN_IN_PROGRESS: The server machine is shutting down.

This is a critical issue for us, since we are running a 24/7 platform.

@cun-dp

cun-dp commented May 31, 2023

I did a few tests with an almost empty aspnetcore 6 application and with some of our production applications on IIS10.
Conclusion: the problem always occurs, no matter how simple the application is.
Sync or Async controller actions as well as InProcess or OutOfProcess hosting make no difference in my tests.

Across all my tests, the problem seemed to be exacerbated the most when the application does work during the ApplicationStopped event:

app.Lifetime.ApplicationStopped.Register(() => { Thread.Sleep(TimeSpan.FromMilliseconds(1000)); });

Using this snippet in my almost empty aspnetcore 6 application leads to 20-50 times as many HTTP 503 responses during an app pool recycle compared to using no ApplicationStopped event handler.
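
For context, the kind of near-empty test app described here might look roughly like the following (a sketch, not the exact code used in these tests):

// Sketch of a near-empty ASP.NET Core 6 app with a deliberately slow
// ApplicationStopped handler, as described above (illustrative only).
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapGet("/", () => "ok");

// Simulated shutdown work; longer delays here appeared to amplify the
// number of 503s observed during an app pool recycle.
app.Lifetime.ApplicationStopped.Register(
    () => Thread.Sleep(TimeSpan.FromMilliseconds(1000)));

app.Run();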

@Tratcher
Member

@cun-dp that makes sense, the application has stopped serving traffic and can't re-start until the current process exits.

@cun-dp

cun-dp commented May 31, 2023

@Tratcher It makes sense only in the way that my test further confirms the bug with aspnetcore app pool recycling:
I can see that a second W3SVC instance gets spawned the moment the app pool is asked to recycle, so the application is restarting. But routing of new requests to the new application instance simply does not work. Instead, the requests get routed to the old application instance, which is shutting down and therefore is rejecting requests.

This supports the conclusion that IIS (or the ASP.NET Core Module V2?) does not handle app pool recycles correctly, i.e. it does not overlap both application instances and route new requests to the new instance the moment the application pool is asked to recycle, the way IIS does with ASP.NET Framework 4.x (and earlier) applications.

And to make this abundantly clear, because it has been misinterpreted in #10117 a lot:
This is not about deployments of applications. This bug is triggered by only recycling an app pool running an existing, unchanged, aspnetcore application.

The behaviour can be reproduced by just running Restart-WebAppPool MyAppPool in PowerShell while continuously sending requests to the app running in "MyAppPool".
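
For reference, the "continuously sending requests" part can be as simple as the following sketch (the URL, request count, and delay are placeholders):

// Sketch of a request loop for observing 503s while the target app pool is
// recycled with Restart-WebAppPool (URL and counts are placeholders).
using System.Net;
using System.Net.Http;

using var client = new HttpClient();
int errors = 0;
const int total = 5000;

for (int i = 0; i < total; i++)
{
    try
    {
        using var response = await client.GetAsync("https://localhost/app/health");
        if (response.StatusCode == HttpStatusCode.ServiceUnavailable) errors++;
    }
    catch (HttpRequestException)
    {
        errors++; // dropped connections during the recycle count as failures too
    }
    await Task.Delay(20);
}

Console.WriteLine($"{errors} failed responses out of {total} requests");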

@luizfbicalho

@cun-dp that makes sense, the application has stopped serving traffic and can't re-start until the current process exits.

But wouldn't the correct approach be to start a new process, redirect the new connections to that process, and let the old process die in peace, however long it takes?

Is there a way for me to see what's blocking the old process from exiting?

Is there a way to force all the threads in the old process to exit?

Is there any workaround to help with this problem?
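
There doesn't seem to be a built-in answer for this, but one app-side way to at least see what the old process is still busy with is to track in-flight requests and log them when shutdown starts. A rough sketch (all names are illustrative, and this is not an official ANCM diagnostic):

// Sketch: record in-flight request paths and log them when the host begins
// shutting down, to get a hint of what may be holding up the old worker process.
using System.Collections.Concurrent;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

var inFlight = new ConcurrentDictionary<string, DateTime>();

app.Use(async (context, next) =>
{
    var key = $"{context.TraceIdentifier} {context.Request.Method} {context.Request.Path}";
    inFlight[key] = DateTime.UtcNow;
    try { await next(context); }
    finally { inFlight.TryRemove(key, out _); }
});

app.Lifetime.ApplicationStopping.Register(() =>
{
    foreach (var (request, started) in inFlight)
        app.Logger.LogWarning("Still in flight at shutdown: {Request} (started {Started:O})",
            request, started);
});

app.MapGet("/", () => "ok");
app.Run();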

@gcbenjamin

gcbenjamin commented Jun 7, 2023

And to make this abundantly clear, because it has been misinterpreted in #10117 a lot: This is not about deployments of applications. This bug is triggered by only recycling an app pool running an existing, unchanged, aspnetcore application.

The behaviour can be reproduced by just running Restart-WebAppPool MyAppPool in PowerShell while continuously sending requests to the app running in "MyAppPool".

We see the exact same thing, and it's easy to reproduce as explained by @cun-dp: just recycle while under load and requests will fail with 503s. This behaviour is not seen in any of our Framework APIs, only .NET Core. I've tried all different combinations of IIS/app pool settings and nothing has worked. I ran my .NET Core API continuously using k6 and always get hit with 503s when recycling under load (this API is a port from .NET Framework, which never has this issue running the same load tests).

While this doesn't fix the problem, it has helped: in the app pool's advanced settings, setting Disable Overlapped Recycle to TRUE gives me around 90% fewer 503s (in my tests, from 60 down to 7).

@luizfbicalho

And to make this abundantly clear, because it has been misinterpreted in #10117 a lot: This is not about deployments of applications. This bug is triggered by only recycling an app pool running an existing, unchanged, aspnetcore application.
The behaviour can be reproduced by just running Restart-WebAppPool MyAppPool in PowerShell while continuously sending requests to the app running in "MyAppPool".

We see the exact same thing, and it's easy to reproduce as explained by @cun-dp: just recycle while under load and requests will fail with 503s. This behaviour is not seen in any of our Framework APIs, only .NET Core. I've tried all different combinations of IIS/app pool settings and nothing has worked. I ran my .NET Core API continuously using k6 and always get hit with 503s when recycling under load (this API is a port from .NET Framework, which never has this issue running the same load tests).

While this doesn't fix the problem, it has helped: in the app pool's advanced settings, setting Disable Overlapped Recycle to TRUE gives me around 90% fewer 503s (in my tests, from 60 down to 7).

Nice, I'll try that solution. Is there a way to find what is locked in the old process, whether it is a file lock or another resource?

@estebanorellana

At work, the exact same thing happens to us when we install the .NET 6 application on a server with IIS: after the first 503 error, it does not recover without an iisreset.

But if we host this site with Kestrel, we have no problem.

@alex-jitbit
Contributor Author

Looking at the ANCM commit history, I see it has had 2 commits in 1 year, so I'm thinking MS priorities are elsewhere.

(or the C++ guru who wrote it has left the company and now everyone's just afraid to touch it)

@luizfbicalho

I saw that @BrennanConroy is committing code. It would be great if we could get a better error message about what is preventing the shutdown of the ASP.NET Core app.

@mtanksl

mtanksl commented Jan 24, 2024

@BrennanConroy Perfect, got it. After testing the dll for a few hours in production, no 503 error appeared. It seems to be working perfectly. Thank you.

@paddlepaw

The experimental DLL works for us only if we extend the shutdownDelay; before that, our Umbraco CMS (v9) site would still return 503s on recycle, so the DLL does seem to be working. Thanks.

Now it only goes down when we deploy the site, and I don't have a fix for that yet because it's not happening every time. I tried installing Application Initialization and turning on preload, but that seems to make it worse, in that I have to recycle or shut down/restart the app pool more times to bring it back online. We are making use of app_offline.htm on deploy, and I can see a recycle event occur in the event log when it detects app_offline.htm. It'll show the app shut down and restart successfully most of the time, but sometimes I get an error that it failed to shut down gracefully. The only recourse is to shut down the app pool and then restart it, and normally when this error occurs I have to shut down/restart multiple times. I'll keep working on it but just wanted to put my experience out there.

@randyshoopman

randyshoopman commented Feb 2, 2024

@BrennanConroy It's great to hear this really old issue is finally seeing a resolution. Since work is being done on this, is there any way to improve the error:

Failed to gracefully shutdown application 'MACHINE/WEBROOT/APPHOST/XXXX'.
Process Id: 9112.
File Version: 17.0.23296.14. Description: IIS ASP.NET Core Module V2 Request Handler. Commit: 0a715692d8e2536c899faa0bb4f0cec2c2e33e36

Like adding something to indicate what request or ApplicationStopped callback is holding up the shutdown? It would really go a long way to help address problematic code. Request Route, stack trace, anything to know what code is blocking the shutdown.

Thanks!

Edit: Credit to @luizfbicalho for making the same request earlier in this thread

@divil5000

@randyshoopman what a brilliant idea. We've always suffered with that error, and despite wasting hours trying to figure out what's happening, have not been successful in finding the source. We just grudgingly accept it.

@paddlepaw

An update to my last post. So I thought the DLL fix was working for us but it turns out that manual recycles no longer cause crashes but automatic recycles do. I left automatic recycling on with the default values for the last 3 days. Checking the IIS logs, the site went down 3 days ago until this morning when the normal daily 4 a.m. recycle brought it back online. We have another vanilla .net core site that is not crashing at all so this leads me to believe it's either something that we have implemented ourselves on this particular site or Umbraco itself.

@Phlow2001

Still waiting and still affecting production. What's the status of this? What do we have to do to get some priority and resources on it? Is it on the team's radar at all? Is there a plan to gain confidence in the approach in the modified v2 module and then to get it published officially in a hosting bundle update?

@adityamandaleeka
Member

Yes, as mentioned above, it's not only on our radar, but we intend to release this fix in .NET 9.

We're also trying to figure out a way to include the fix in a .NET 8 servicing release.

@cun-dp

cun-dp commented Mar 18, 2024

Yes, as mentioned above, it's not only on our radar, but we intend to release this fix in .NET 9.

We're also trying to figure out a way to include the fix in a .NET 8 servicing release.

Thanks, that sounds promising. We would really appreciate a fix for .NET 8, as we are only able to run LTS releases.

@MV10

MV10 commented Mar 18, 2024

We're in the same situation, only LTS releases are permitted. What would happen if it is only released for .NET 9 and later, but a .NET 8 application was running behind the new version? If it isn't available for .NET 8, this is actually pretty likely to happen for us once .NET 10 is approved for internal use. I suppose the alternate question is, can .NET 9 and later run under the old (now current) ANCM until everything is updated to a version supported by the one with recycle improvements?

(Hopefully it'll make the cut and my questions will be completely irrelevant!)

@adityamandaleeka
Member

Good question @MV10. The particular component that the fix is in is actually shared across installs, so our approach to servicing it is to ensure that it is backwards compatible with older supported versions.

This comes with ups and downs obviously... on the bright side, fixes that we make can apply down-level easily. On the other hand, we have to be super careful to ensure that we don't break existing apps. We don't want anyone's existing app to break without them doing anything just because they installed a newer hosting bundle on their server.

This is why we're being careful and ensuring we have enough bake time and testing (and considering things like opt-in switches for down-level) before getting this out there. We have heard loud and clear that people want this fix and we can't wait to solve this problem for everyone.

@bluntspoon

Thanks for starting the PR for this issue @BrennanConroy. Keen to know which versions of DotNet will get this fix once merged. Seeing updates to dotnet6+ would be awesome.

@BrennanConroy
Member

This will be opt-out in 9.0 and we're backporting as opt-in for 8.0. Since the change is in the IIS module which is installed globally on the machine and the module is compatible with all supported versions of ASP.NET Core you can use the 8.0 or 9.0 module with 6.0 and still get the fix.

When installing the hosting bundle (which is what installs the IIS module) you can choose to ignore the runtime that comes with the installer if you want to continue using 6.0 but get the newer module. See https://learn.microsoft.com/aspnet/core/host-and-deploy/iis/hosting-bundle?view=aspnetcore-8.0#options for details.

@RomBrz

RomBrz commented Apr 23, 2024

Great to see this problem being solved! Regarding the Hosting Bundle packages: some time ago I tried that "choice to ignore the runtime", but for some reason "install the Hosting Bundle with runtimes and uninstall both runtimes afterwards" gave me a different setup than "install the Hosting Bundle without the runtime". Even the size of the Hosting Bundle install was different. This "bug" seems to persist:
(After install with OPT_NO_RUNTIME=1) [screenshot]

(After full install) [screenshot]

On our servers, all the .NET Core apps only work with the 145 MB server hosting install (so I have to do a full install and then remove the runtimes).

Considering that you are working on ANCM: some time ago I posted about a problem involving IIS, .NET Core and Windows Update (#41377). The problem persists; for some reason I need to set IIS to manual start, reboot the server, and then do the full Windows Update install (Windows security patches, .NET Framework security patches and .NET Core patches).

@MV10

MV10 commented Apr 30, 2024

A little bit off-topic, but until we have this fix deployed, our monitoring is being configured to ignore Failed to gracefully shut down application XYZ error messages (which correlate to the recycle 503s, of which there are many since corporate policy still requires early-AM staggered-schedule recycles of all pools).

Is there a reference somewhere of the various warnings and errors ANCM can emit?

@simon-biber

Did this fix get into .NET 8.0.5 hosting bundle?

@MV10

MV10 commented May 15, 2024

The way I read it, it'll be released with .NET 9 in November, and backported to .NET 8.x.x at that time...

@cun-dp

cun-dp commented May 15, 2024

According to the milestone tags, the fix is slated for the next release, 8.0.6: https://github.com/dotnet/aspnetcore/issues?q=milestone%3A8.0.6%20is%3Aclosed%20label%3Aservicing-approved

🤞

@adityamandaleeka
Member

@cun-dp is right, it'll ship in the 8.0.6 release in June.

@BrennanConroy
Member

BrennanConroy commented May 21, 2024

This fix is out in 9.0-preview4 now, and will be available in 8.0.7 in July 2024.

I'm going to close and lock this issue so that this comment, which will include detailed instructions below, is the last one. For those of you who have been posting semi-unrelated issues, please file separate issues for them.

Instructions

When installing the hosting bundle (which is what installs the global IIS module) you can choose to ignore the runtime that comes with the installer if you want to continue using an older runtime but still get the newer module's improvements. See https://learn.microsoft.com/aspnet/core/host-and-deploy/iis/hosting-bundle?view=aspnetcore-8.0#options for details. Otherwise, install the hosting bundle as normal to get the IIS module installed.

There are 2 ways to modify the behavior of this change.

  1. Modify the shutdownDelay option in your web.config handlerSettings.
<aspNetCore processPath="dotnet" arguments="myapp.dll" stdoutLogEnabled="false" stdoutLogFile=".\logs\stdout">
  <handlerSettings>
    <!--
    Milliseconds to delay shutdown of the old app instance while the new instance starts.
    Note: This doesn't delay the handling of incoming requests.
    -->
    <handlerSetting name="shutdownDelay" value="5000" />
  </handlerSettings>
</aspNetCore>
  2. Set the ANCM_shutdownDelay environment variable, also in milliseconds.

If you set the value to 0, then it goes back to the old behavior.

For 9.0 and later, the fix is on by default and can be disabled by setting the config to 0.

For 8.0 and earlier, the fix is off by default and can be enabled by setting the config to a value > 0.

The default is 1000 milliseconds (1 second). For busy/slow machines you may want to increase this value.

dotnet locked as resolved and limited conversation to collaborators on May 21, 2024