
reopen issue #10117 - IIS app pool recycle throws 503 errors #41340

Closed
1 task done
alex-jitbit opened this issue Apr 23, 2022 · 84 comments
Labels: area-networking, feature-iis

@alex-jitbit
Contributor

alex-jitbit commented Apr 23, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

The IIS app pool throws 503 errors during recycles. This is a known issue with the ANCM module, previously reported in #10117, which gathered 33 likes and a 3-year discussion. It was never fixed; instead it was automatically closed by a bot "as a clean-up due to lack of discussion".

P.S. This is not a deployment problem. There are many scenarios where an IIS app pool is recycled outside of our control (adding/removing SSL certificates, changing the IP addresses to listen on, etc. - basically, touching any IIS setting causes a recycle), and 503 errors are unacceptable for high-availability scenarios.

.NET Framework was free of this bug.

Expected Behavior

No errors during recycles.

Steps To Reproduce

See the linked issue #10117.

Exceptions (if any)

No response

.NET Version

5.0, 6.0, 7.0, 8.0

Anything else?

No response

@Tratcher
Member

#10117 (comment)

Using separate app pools and a load balancer is our recommended approach for high-availability as it allows you full flexibility over deployment process and the ability to easily revert versions.

Trying to achieve high-availability with a single instance is not recommended.

@alex-jitbit
Contributor Author

@Tratcher Like I indicated above, this is not about deployments. There are a lot of scenarios where IIS recycles the pool (see above).

@Tratcher
Member

Deployments are just one example of events that disrupt availability. A single instance is not advised for high-availability for many reasons.

@ghost

ghost commented Apr 25, 2022

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process.

@benjamin-stern

@Tratcher Even with more than a single instance, this would still strongly affect a service, as all the requests going to the server that's recycling would be returned a 503 error.

@c0shea

c0shea commented Apr 26, 2022

We have two instances running behind a load balancer and have still experienced this issue intermittently. When the app pool inevitably recycles (due to deployment, config change, etc), it starts returning 503 instead of queuing up the requests.

The load balancer doesn't immediately treat the 503s as the server being down and take it out of the rotation. Instead, it uses a polling mechanism that calls an endpoint (i.e. /status) on each instance and checks for a successful response. Even though that status endpoint is monitored fairly frequently, there is obviously plenty of time during a recycle in which a bunch of requests will fail with 503. We can't have the load balancer take the instance out of the rotation as soon as it sees 503s being returned because (1) it's not an available option in NetScaler and (2) if both instances happen to recycle at the same time, both servers would be taken out of the rotation and the service would be completely down until manual intervention told the load balancer that the requests aren't failing anymore.

@RomBrz

RomBrz commented Apr 26, 2022

In this scenario (using two or more instances behind a load balancer), since you're using a "/status" endpoint to check availability, I suggest adding a routine to your deployment or maintenance process that, before doing anything else, forces "/status" to report an "unhealthy" status, so the load balancer can remove the node from rotation before the changes are made.
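
A rough sketch of that idea with ASP.NET Core minimal APIs (the endpoint name and the marker-file mechanism are illustrative assumptions, not something prescribed in this thread):

// Sketch: a /status endpoint that can be forced "unhealthy" before planned
// maintenance by dropping a marker file next to the app (names are illustrative).
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

var drainMarker = Path.Combine(app.Environment.ContentRootPath, "drain.marker");

app.MapGet("/status", () =>
    File.Exists(drainMarker)
        ? Results.StatusCode(StatusCodes.Status503ServiceUnavailable) // load balancer pulls the node
        : Results.Ok("healthy"));

app.MapGet("/", () => "Hello from the app");

app.Run();

Before maintenance you would create the marker, wait for the health probe to mark the node down, make the changes, and then delete the marker.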

About the issue itself: the default IIS behavior on a recycle is to first start a new worker process for the application pool, route new requests to it, wait the configured time for the current requests to end, and then close/finish the old process, keeping only the new one.

Recycling an application pool could be "expected behavior" from an ASP.NET Core point of view, but looking at IIS, throwing 503s during a recycle isn't "normal behavior".

@c0shea

c0shea commented Apr 26, 2022

The problem is that ASP.NET Core doesn't use the overlapping recycle behavior that .NET Framework did. While the old worker process is being shut down (especially if it was handling a lot of in-flight requests), the new process hasn't started yet, and the requests caught in the middle get the 503.

@alex-jitbit
Contributor Author

TIL that StackOverflow also runs ASP.NET Core under IIS

[screenshot]

@luizfbicalho

luizfbicalho commented Jul 5, 2022

Is there any way to minimize this problem, or at least to detect what is causing it?
Any configuration in the application pool to minimize it?

@peter-bertok

peter-bertok commented Sep 20, 2022

Deployments are just one example of events that disrupt availability. A single instance is not advised for high-availability for many reasons.

High availability and throwing 503s from otherwise "perfectly fine servers" are separate concerns.

Most load balancers do not hide HTTP errors! If the IIS process responds to an HTTP request with 503, then that's what the user will see. In particular, none of the Azure load balancer offerings hide errors from the users. They pass them on faithfully.

If a previously working server throws 503 errors then it will take significant time for the load balancers to detect this. Minutes even, or 10+ minutes if using CDN-type solutions such as Azure Front Door.

This behavior is triggered by many actions, not all of which are resolved via hosting on multiple server instances. Scheduled recycling, for example, has been mentioned by many people as a common trigger.

Similarly, many people have pointed out in the previous thread that it's not just 503 errors that are seen, but slow uploads are also unceremoniously terminated.

@luizfbicalho

My problem seems to occur when an IIS recycle takes too long; it's not related to CPU or to memory. I asked the IT infrastructure team to add more performance counters to Grafana, but they haven't added them yet.

I'm inclined to think that the problem is with network connections. What do you think I should monitor in Grafana?

@alex-jitbit
Contributor Author

TIL that fuget.org also uses IIS + ASP.NET Core, just caught them during a recycle.
[screenshot]

How many more examples do we need?

@AndreasJilvero

In my experience, this also happens when simply setting the physical path of a website. The response is somewhat different though - a recycle renders the text "The service is unavailable", while setting the physical path just gives an empty 503 result.

https://stackoverflow.com/questions/74326315/iis-setting-physical-path-gives-503-status-for-a-few-seconds

@luizfbicalho

In my case it's not just for a few seconds; after the first 503, only an iisreset solves the problem.

@kadamgreene

We are also experiencing this issue. Is there any expectation of a fix? This is going to keep us from moving forward with .NET migration / any new work in .NET 6+. It's not the deployments (blue/green can handle that); it's the "unexpected" recycles in the run of a day that are the issue.

@cun-dp

cun-dp commented Apr 24, 2023

We are also seeing this with our nightly apppool recycles in the production environment, and with all our aspnetcore microservices.

Example of IIS Logs (anonymized):
2023-04-23 22:47:03 W3SVC5 HOSTNAME1 [IP censored] GET /app/health 443 - HTTP/1.1 - - my.fqdn.example.com 503 0 1255 202 130 15

HTTP 503 with win32 substatus 1255. According to https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--1000-1299- code 1255 matches: 1255 (0x4E7) ERROR_SERVER_SHUTDOWN_IN_PROGRESS: The server machine is shutting down.

This is a critical issue for us, since we are running a 24/7 platform.

@cun-dp

cun-dp commented May 31, 2023

I did a few tests with an almost empty aspnetcore 6 application and with some of our production applications on IIS10.
Conclusion: the problem always occurs, no matter how simple the application is.
Sync or Async controller actions as well as InProcess or OutOfProcess hosting make no difference in my tests.

Across all my tests, the problem seemed to be exacerbated the most when the application does work during the ApplicationStopped event:

app.Lifetime.ApplicationStopped.Register(() => { Thread.Sleep(TimeSpan.FromMilliseconds(1000)); });

Using this snippet in my almost empty aspnetcore 6 application leads to 20-50 times as many HTTP 503 responses during an app pool recycle compared to using no ApplicationStopped event handler.
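
For context, the kind of near-empty test app described here might look roughly like the following (a sketch, not the exact code used in these tests):

// Sketch of a near-empty ASP.NET Core 6 app with a deliberately slow
// ApplicationStopped handler, as described above (illustrative only).
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapGet("/", () => "ok");

// Simulated shutdown work; longer delays here appeared to amplify the
// number of 503s observed during an app pool recycle.
app.Lifetime.ApplicationStopped.Register(
    () => Thread.Sleep(TimeSpan.FromMilliseconds(1000)));

app.Run();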

@Tratcher
Member

@cun-dp that makes sense, the application has stopped serving traffic and can't re-start until the current process exits.

@cun-dp

cun-dp commented May 31, 2023

@Tratcher It makes sense only in the way that my test further confirms the bug with aspnetcore app pool recycling:
I can see that a second W3SVC instance gets spawned the moment the app pool is asked to recycle, so the application is restarting. But routing of new requests to the new application instance simply does not work. Instead, the requests get routed to the old application instance, which is shutting down and therefore is rejecting requests.

This supports the conclusion that IIS (or the ASP.NET Core Module V2?) does not handle app pool recycles correctly, i.e. it does not overlap both application instances and route new requests to the new instance the moment the application pool is asked to recycle, the way IIS does with ASP.NET Framework 4.x (and earlier) applications.

And to make this abundantly clear, because it has been misinterpreted in #10117 a lot:
This is not about deployments of applications. This bug is triggered by only recycling an app pool running an existing, unchanged, aspnetcore application.

The behaviour can be reproduced by just running Restart-WebAppPool MyAppPool in PowerShell while continuously sending requests to the app running in "MyAppPool".
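
For reference, the "continuously sending requests" part can be as simple as the following sketch (the URL, request count, and delay are placeholders):

// Sketch of a request loop for observing 503s while the target app pool is
// recycled with Restart-WebAppPool (URL and counts are placeholders).
using System.Net;
using System.Net.Http;

using var client = new HttpClient();
int errors = 0;
const int total = 5000;

for (int i = 0; i < total; i++)
{
    try
    {
        using var response = await client.GetAsync("https://localhost/app/health");
        if (response.StatusCode == HttpStatusCode.ServiceUnavailable) errors++;
    }
    catch (HttpRequestException)
    {
        errors++; // dropped connections during the recycle count as failures too
    }
    await Task.Delay(20);
}

Console.WriteLine($"{errors} failed responses out of {total} requests");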

@luizfbicalho

@cun-dp that makes sense, the application has stopped serving traffic and can't re-start until the current process exits.

But wouldn't the correct approach be to start a new process, redirect the new connections to that process, and let the old process die in peace, however long it takes?

Is there a way for me to see what's blocking the old process from exiting?

Is there a way to force all the threads in the old process to exit?

Is there any workaround to help with this problem?
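
There doesn't seem to be a built-in answer for this, but one app-side way to at least see what the old process is still busy with is to track in-flight requests and log them when shutdown starts. A rough sketch (all names are illustrative, and this is not an official ANCM diagnostic):

// Sketch: record in-flight request paths and log them when the host begins
// shutting down, to get a hint of what may be holding up the old worker process.
using System.Collections.Concurrent;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

var inFlight = new ConcurrentDictionary<string, DateTime>();

app.Use(async (context, next) =>
{
    var key = $"{context.TraceIdentifier} {context.Request.Method} {context.Request.Path}";
    inFlight[key] = DateTime.UtcNow;
    try { await next(context); }
    finally { inFlight.TryRemove(key, out _); }
});

app.Lifetime.ApplicationStopping.Register(() =>
{
    foreach (var (request, started) in inFlight)
        app.Logger.LogWarning("Still in flight at shutdown: {Request} (started {Started:O})",
            request, started);
});

app.MapGet("/", () => "ok");
app.Run();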

@gcbenjamin

gcbenjamin commented Jun 7, 2023

And to make this abundantly clear, because it has been misinterpreted in #10117 a lot: This is not about deployments of applications. This bug is triggered by only recycling an app pool running an existing, unchanged, aspnetcore application.

The behaviour can be reproduced by just running Restart-WebAppPool MyAppPool in PowerShell while continuously sending requests to the app running in "MyAppPool".

We see the exact same thing, and it's easy to reproduce as explained by @cun-dp: just recycle while under load and requests will fail with 503s. This behaviour is not seen in any of our Framework APIs, only .NET Core. I've tried all different combinations of IIS/app pool settings and nothing has worked. I ran my .NET Core API continuously using k6 and always get hit with 503s when recycling under load (this API is a port from .NET Framework, which never has this issue running the same load tests).

While this doesn't fix the problem, it has helped: in the app pool's advanced settings, setting Disable Overlapped Recycle to TRUE gives me around 90% fewer 503s (in my tests, from 60 down to 7).

@luizfbicalho

And to make this abundantly clear, because it has been misinterpreted in #10117 a lot: This is not about deployments of applications. This bug is triggered by only recycling an app pool running an existing, unchanged, aspnetcore application.
The behaviour can be reproduced by just running Restart-WebAppPool MyAppPool in PowerShell while continuously sending requests to the app running in "MyAppPool".

We see the exact same thing, and it's easy to reproduce as explained by @cun-dp: just recycle while under load and requests will fail with 503s. This behaviour is not seen in any of our Framework APIs, only .NET Core. I've tried all different combinations of IIS/app pool settings and nothing has worked. I ran my .NET Core API continuously using k6 and always get hit with 503s when recycling under load (this API is a port from .NET Framework, which never has this issue running the same load tests).

While this doesn't fix the problem, it has helped: in the app pool's advanced settings, setting Disable Overlapped Recycle to TRUE gives me around 90% fewer 503s (in my tests, from 60 down to 7).

Nice, I'll try that solution. Is there a way to find what is locked in the old process, whether it is a file lock or another resource?

@estebanorellana

At work, the exact same thing happens to us when we install the .NET 6 application on a server with IIS: after the first 503 error, it does not recover without an iisreset.

But if we host this site with Kestrel, we have no problem.

@alex-jitbit
Contributor Author

Looking at the ANCM commit history, I see it has had 2 commits in 1 year, so I'm thinking MS priorities are elsewhere.

(or the C++ guru who wrote it has left the company and now everyone's just afraid to touch it)

@luizfbicalho

I saw that @BrennanConroy is committing code. It would be great if we could get a better error message about what is preventing the shutdown of the ASP.NET Core app.

@mtanksl

mtanksl commented Jan 24, 2024

@BrennanConroy Perfect, got it. After testing the dll for a few hours in production, no 503 error appeared. It seems to be working perfectly. Thank you.

@paddlepaw

The experimental DLL works for us only if we extend the shutdownDelay; before that, our Umbraco CMS (v9) site would still return 503s on recycle, so the DLL does seem to be working. Thanks.

Now it only goes down when we deploy the site, and I don't have a fix for that yet because it's not happening every time. I tried installing Application Initialization and turning on preload, but that seems to make it worse, in that I have to recycle or shut down/restart the app pool more times to bring it back online. We are making use of app_offline.htm on deploy, and I can see a recycle event occur in the event log when it detects app_offline.htm. It'll show the app shut down and restart successfully most of the time, but sometimes I get an error that it failed to shut down gracefully. The only recourse is to shut down the app pool and then restart it, and normally when this error occurs I have to shut down/restart multiple times. I'll keep working on it but just wanted to put my experience out there.

@randyshoopman

randyshoopman commented Feb 2, 2024

@BrennanConroy It's great to hear this really old issue is finally seeing a resolution. Since work is being done on this, is there any way to improve the error:

Failed to gracefully shutdown application 'MACHINE/WEBROOT/APPHOST/XXXX'.
Process Id: 9112.
File Version: 17.0.23296.14. Description: IIS ASP.NET Core Module V2 Request Handler. Commit: 0a715692d8e2536c899faa0bb4f0cec2c2e33e36

Like adding something to indicate what request or ApplicationStopped callback is holding up the shutdown? It would really go a long way to help address problematic code. Request Route, stack trace, anything to know what code is blocking the shutdown.

Thanks!

Edit: Credit to @luizfbicalho for making the same request earlier in this thread

@divil5000

@randyshoopman what a brilliant idea. We've always suffered with that error, and despite wasting hours trying to figure out what's happening, have not been successful in finding the source. We just grudgingly accept it.

@paddlepaw

An update to my last post. So I thought the DLL fix was working for us but it turns out that manual recycles no longer cause crashes but automatic recycles do. I left automatic recycling on with the default values for the last 3 days. Checking the IIS logs, the site went down 3 days ago until this morning when the normal daily 4 a.m. recycle brought it back online. We have another vanilla .net core site that is not crashing at all so this leads me to believe it's either something that we have implemented ourselves on this particular site or Umbraco itself.

@Phlow2001

Still waiting and still affecting production. What's the status of this? What do we have to do to get some priority and resources on it? Is it on the team's radar at all? Is there a plan to gain confidence in the approach in the modified v2 module and then to get it published officially in a hosting bundle update?

@adityamandaleeka
Member

Yes, as mentioned above, it's not only on our radar, but we intend to release this fix in .NET 9.

We're also trying to figure out a way to include the fix in a .NET 8 servicing release.

@cun-dp

cun-dp commented Mar 18, 2024

Yes, as mentioned above, it's not only on our radar, but we intend to release this fix in .NET 9.

We're also trying to figure out a way to include the fix in a .NET 8 servicing release.

Thanks, that sounds promising. We would really appreciate a fix for .NET 8, as we are only able to run LTS releases.

@MV10

MV10 commented Mar 18, 2024

We're in the same situation, only LTS releases are permitted. What would happen if it is only released for .NET 9 and later, but a .NET 8 application was running behind the new version? If it isn't available for .NET 8, this is actually pretty likely to happen for us once .NET 10 is approved for internal use. I suppose the alternate question is, can .NET 9 and later run under the old (now current) ANCM until everything is updated to a version supported by the one with recycle improvements?

(Hopefully it'll make the cut and my questions will be completely irrelevant!)

@adityamandaleeka
Member

Good question @MV10. The particular component that the fix is in is actually shared across installs, so our approach to servicing it is to ensure that it is backwards compatible with older supported versions.

This comes with ups and downs obviously... on the bright side, fixes that we make can apply down-level easily. On the other hand, we have to be super careful to ensure that we don't break existing apps. We don't want anyone's existing app to break without them doing anything just because they installed a newer hosting bundle on their server.

This is why we're being careful and ensuring we have enough bake time and testing (and considering things like opt-in switches for down-level) before getting this out there. We have heard loud and clear that people want this fix and we can't wait to solve this problem for everyone.

@bluntspoon

Thanks for starting the PR for this issue @BrennanConroy. Keen to know which versions of DotNet will get this fix once merged. Seeing updates to dotnet6+ would be awesome.

@BrennanConroy
Member

This will be opt-out in 9.0 and we're backporting as opt-in for 8.0. Since the change is in the IIS module which is installed globally on the machine and the module is compatible with all supported versions of ASP.NET Core you can use the 8.0 or 9.0 module with 6.0 and still get the fix.

When installing the hosting bundle (which is what installs the IIS module) you can choose to ignore the runtime that comes with the installer if you want to continue using 6.0 but get the newer module. See https://learn.microsoft.com/aspnet/core/host-and-deploy/iis/hosting-bundle?view=aspnetcore-8.0#options for details.

@RomBrz

RomBrz commented Apr 23, 2024

Great to see this problem being solved! Regarding the Hosting Bundle packages: some time ago I tried that "choice to ignore the runtime", but for some reason "install the Hosting Bundle with runtimes and uninstall both runtimes afterwards" gave me a different setup than "install the Hosting Bundle without the runtime". Even the size of the Hosting Bundle install was different. This "bug" seems to persist:
(After install with OPT_NO_RUNTIME=1) [screenshot]

(After full install) [screenshot]

On our servers, all the .NET Core apps only work with the 145 MB server hosting install (so I have to do a full install and then remove the runtimes).

Considering that you are working on ANCM: some time ago I posted about a problem involving IIS, .NET Core and Windows Update (#41377). The problem persists; for some reason I need to set IIS to manual start, reboot the server, and then do the full Windows Update install (Windows security patches, .NET Framework security patches and .NET Core patches).

@MV10

MV10 commented Apr 30, 2024

A little bit off-topic, but until we have this fix deployed, our monitoring is being configured to ignore Failed to gracefully shut down application XYZ error messages (which correlate to the recycle 503s, of which there are many since corporate policy still requires early-AM staggered-schedule recycles of all pools).

Is there a reference somewhere of the various warnings and errors ANCM can emit?

@simon-biber

Did this fix get into .NET 8.0.5 hosting bundle?

@MV10

MV10 commented May 15, 2024

The way I read it, it'll be released with .NET 9 in November, and backported to .NET 8.x.x at that time...

@cun-dp

cun-dp commented May 15, 2024

According to the milestone tags, the fix is slated for the next release, 8.0.6: https://github.com/dotnet/aspnetcore/issues?q=milestone%3A8.0.6%20is%3Aclosed%20label%3Aservicing-approved

🤞

@adityamandaleeka
Member

@cun-dp is right, it'll ship in the 8.0.6 release in June.

@BrennanConroy
Member

BrennanConroy commented May 21, 2024

This fix is out in 9.0-preview4 now, and will be available in 8.0.7 in July 2024.

I'm going to close and lock this issue so that this comment, which will include detailed instructions below, is the last one. For those of you who have been posting semi-unrelated issues, please file separate issues for them.

Instructions

When installing the hosting bundle (which is what installs the global IIS module) you can choose to ignore the runtime that comes with the installer if you want to continue using an older runtime but still get the newer module's improvements. See https://learn.microsoft.com/aspnet/core/host-and-deploy/iis/hosting-bundle?view=aspnetcore-8.0#options for details. Otherwise, install the hosting bundle as normal to get the IIS module installed.

There are 2 ways to modify the behavior of this change.

  1. Modify the shutdownDelay option in your web.config handlerSettings.
<aspNetCore processPath="dotnet" arguments="myapp.dll" stdoutLogEnabled="false" stdoutLogFile=".\logs\stdout">
  <handlerSettings>
    <!--
    Milliseconds to delay shutdown of the old app instance while the new instance starts.
    Note: This doesn't delay the handling of incoming requests.
    -->
    <handlerSetting name="shutdownDelay" value="5000" />
  </handlerSettings>
</aspNetCore>
  2. Set the ANCM_shutdownDelay environment variable, also in milliseconds.

If you set the value to 0, then it goes back to the old behavior.

For 9.0 and later, the fix is on by default and can be disabled by setting the config to 0.

For 8.0 and earlier, the fix is off by default and can be enabled by setting the config to a value > 0.

The default is 1000 milliseconds (1 second). For busy/slow machines you may want to increase this value.

dotnet locked as resolved and limited conversation to collaborators on May 21, 2024