Investigate very small percentage of requests returning 503 server errors #207

Closed

GUI opened this issue Mar 23, 2015 · 1 comment
GUI commented Mar 23, 2015

While looking at some Regulations.gov API error logs, I noticed that we sometimes return a few 503 errors for the Regulations APIs per day. This amounts to a very small percentage of overall traffic (around 0.0001% of requests), but there are a few odd things about these 503 errors that would be good to get to the bottom of.

When checking other 503 errors seen across our APIs, all the others look like 503 errors returned by the underlying API service, so those seem like "normal" errors (at least from api.data.gov's perspective). However, what's unique about this handful of 503 errors seen on Regulations is that these requests never make it to the API backend server. They make it as far in our API Umbrella stack as the Varnish caching server, but they never actually reach the internal nginx "api_backend_router" server that should then be responsible for forwarding the request to the underlying API backend. So it would appear that Varnish is dying while trying to connect to our internal nginx server.
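For context, the hop in question is where Varnish fetches cache misses from the local nginx "api_backend_router" listener. A minimal sketch of what that kind of backend definition looks like in Varnish 3-era VCL (the host, port, and timeout values here are illustrative, not our actual api-umbrella configuration):

```
# Illustrative backend definition for the internal nginx router that Varnish
# forwards cache misses to. Host, port, and timeouts are placeholders, not
# the real api-umbrella settings.
backend api_backend_router {
  .host = "127.0.0.1";
  .port = "14002";
  .connect_timeout = 10s;
  .first_byte_timeout = 60s;
}
```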

That's as far as I've currently gotten in debugging, but we'll need to dive into this a bit more to get to the bottom of things. I can't find anything else obvious in the error logs, so we may need to enable some Varnish debugging logs to try to figure out what's going on. We may also want to look at overall traffic patterns to see if there's any rhyme or reason to when these errors occur (for example, does it happen when we're experiencing high load overall, or does it coincide with nginx being reloaded due to nightly log rotation or periodic DNS changes?).
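If we do dig deeper, one option (assuming the varnishlog flags from the Varnish 3.x series we're running) would be to watch the shared memory log for backend fetch failures and correlate their timestamps with nginx reloads:

```
# Hypothetical debugging commands using Varnish 3.x varnishlog flags; the
# output path is a placeholder. Watch live for backend fetch failures:
varnishlog -m "FetchError:.*"

# Or record only backend-side log entries for later correlation with nginx
# reloads (nightly log rotation, DNS changes):
varnishlog -b -w /tmp/varnish-backend.log
```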

GUI added a commit to NREL/api-umbrella-router that referenced this issue Apr 4, 2015
See: 18F/api.data.gov#207

I'm not totally convinced this fixes things, but it seems like it might
make it even more rare. Still need to explore more. Varnish is holding
open keepalive connections and then erroneously trying to reuse those
connections after nginx gets reloaded.

See:
http://comments.gmane.org/gmane.comp.web.varnish.misc/3621
http://comments.gmane.org/gmane.comp.web.varnish.misc/8566
http://comments.gmane.org/gmane.comp.web.varnish.dev/2352

See more comments in lib/config_reloader/worker.js's reloadVarnish
method.
GUI added a commit to NREL/api-umbrella-router that referenced this issue Apr 5, 2015
Disabling the server-side keepalive connection between Varnish and the
nginx backend held the key to preventing Varnish from trying to reuse
bad connections and sometimes generating 503s (after nginx was
reloaded).

This reverts some earlier attempts from
60d694d

This solution seems to actually fix the problem, and it holds up through
2 hours of our long-running "grunt multiLongConnectionDrops" task.

See 18F/api.data.gov#207

GUI commented Apr 6, 2015

While rare, I did discover it was a reproducible problem for a very small percentage of requests (a few requests per day), so I got to the bottom of this. This should be fixed by NREL/api-umbrella-router@2e9007c

The culprit was how the Varnish server in our stack interacted with the nginx API routing server while that nginx server was being reloaded. Since we reload nginx with some frequency due to DNS changes, this is when the error cropped up. The issue is that Varnish holds open persistent, server-side keep-alive connections to the nginx backend server, but when nginx is reloaded, Varnish doesn't always cope with this correctly. Varnish is supposed to retry these failed connections, but that doesn't appear to work in all situations. We at least don't seem to be the only ones experiencing these periodic problems with how Varnish handles this:

http://comments.gmane.org/gmane.comp.web.varnish.misc/8566
http://comments.gmane.org/gmane.comp.web.varnish.misc/3621
And here's a potential patch to address the underlying issue in Varnish: http://comments.gmane.org/gmane.comp.web.varnish.dev/2352

While this seems like an issue with Varnish's connection handling (and that last patch might help), the way I've sidestepped the issue is to disable Varnish's server-side keep-alive to the backend server. Since this is a localhost connection, this shouldn't be a huge loss. Clients can still hold open keep-alive connections to our server, and our backend nginx server also holds open keep-alive connections to the API backend servers (and these are the two areas where keep-alive is more important to us).
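For reference, the general shape of that workaround in Varnish 3-era VCL is to send "Connection: close" on backend requests so nginx tears the connection down after every fetch and Varnish never has a stale keep-alive socket to reuse. This is a sketch of the technique, not necessarily the exact change in NREL/api-umbrella-router@2e9007c:

```
# Sketch: disable server-side keep-alive to the backend by forcing each
# backend fetch onto a fresh connection. Not necessarily the exact VCL in
# the commit referenced above.
sub vcl_miss {
  set bereq.http.Connection = "close";
}

sub vcl_pass {
  set bereq.http.Connection = "close";
}
```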

In terms of testing, we actually had a test in place that caught this, but since it happened so rarely, it was only triggered a couple of times. Still, that test should at least help to make sure this doesn't regress in some obvious way. I've also added a more specific testing task (grunt multiLongConnectionDrops) to help pinpoint this, but you need to run that task for 30-60 minutes to fully test it.

@GUI GUI closed this as completed Apr 6, 2015
@GUI GUI added this to the Sprint 18 (3/23-4/3) milestone Apr 6, 2015
@GUI GUI self-assigned this Apr 6, 2015