Investigate very small percentage of requests returning 503 server errors #207

Closed

GUI opened this issue Mar 23, 2015 · 1 comment
GUI commented Mar 23, 2015

While looking at some Regulations.gov API error logs, I noticed that we sometimes return a few 503 errors for the Regulations APIs per day. This amounts to a very small percentage of overall traffic (around 0.0001% of requests), but there are a few odd things about these 503 errors that would be good to get to the bottom of.

When checking other 503 errors seen across our APIs, all the others look like 503 errors returned by the underlying API service, so those seem like "normal" errors (at least from api.data.gov's perspective). However, what's unique about this handful of 503 errors seen on Regulations is that these requests never make it to the API backend server. They make it as far in our API Umbrella stack as the Varnish caching server, but they never actually reach the internal nginx "api_backend_router" server that should then be responsible for forwarding the request to the underlying API backend. So it would appear that Varnish is dying while trying to connect to our internal nginx server.
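For context, the hop in question is where Varnish fetches cache misses from the local nginx "api_backend_router" listener. A minimal sketch of what that kind of backend definition looks like in Varnish 3-era VCL (the host, port, and timeout values here are illustrative, not our actual api-umbrella configuration):

```
# Illustrative backend definition for the internal nginx router that Varnish
# forwards cache misses to. Host, port, and timeouts are placeholders, not
# the real api-umbrella settings.
backend api_backend_router {
  .host = "127.0.0.1";
  .port = "14002";
  .connect_timeout = 10s;
  .first_byte_timeout = 60s;
}
```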

That's as far as I've currently gotten in debugging, but we'll need to dive into this a bit more to get to the bottom of things. I can't find anything else obvious in the error logs, so we may need to enable some Varnish debugging logs to try to figure out what's going on. We may also want to look at overall traffic patterns to see if there's any rhyme or reason to when these errors occur (for example, does it happen when we're experiencing high load overall, or does it coincide with nginx being reloaded due to nightly log rotation or periodic DNS changes?).
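If we do dig deeper, one option (assuming the varnishlog flags from the Varnish 3.x series we're running) would be to watch the shared memory log for backend fetch failures and correlate their timestamps with nginx reloads:

```
# Hypothetical debugging commands using Varnish 3.x varnishlog flags; the
# output path is a placeholder. Watch live for backend fetch failures:
varnishlog -m "FetchError:.*"

# Or record only backend-side log entries for later correlation with nginx
# reloads (nightly log rotation, DNS changes):
varnishlog -b -w /tmp/varnish-backend.log
```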

GUI added a commit to NREL/api-umbrella-router that referenced this issue Apr 4, 2015
See: 18F/api.data.gov#207

I'm not totally convinced this fixes things, but it seems like it might
make it even more rare. Still need to explore more. Varnish is holding
open keepalive connections and then erroneously trying to reuse those
connections after nginx gets reloaded.

See:
http://comments.gmane.org/gmane.comp.web.varnish.misc/3621
http://comments.gmane.org/gmane.comp.web.varnish.misc/8566
http://comments.gmane.org/gmane.comp.web.varnish.dev/2352

See more comments in lib/config_reloader/worker.js's reloadVarnish
method.
GUI added a commit to NREL/api-umbrella-router that referenced this issue Apr 5, 2015
Disabling the server-side keepalive connection between Varnish and the
nginx backend held the key to preventing Varnish from trying to reuse
bad connections and sometimes generating 503s (after nginx was
reloaded).

This reverts some earlier attempts from
60d694d

This solution seems to actually fix the problem, and it holds up through
2 hours of our long-running "grunt multiLongConnectionDrops" task.

See 18F/api.data.gov#207

GUI commented Apr 6, 2015

While rare, I did discover it was a reproducible problem for a very small percentage of requests (a few requests per day), so I got to the bottom of this. This should be fixed by NREL/api-umbrella-router@2e9007c

The culprit was how the Varnish server in our stack interacted with the nginx API routing server while that nginx server was being reloaded. Since we reload nginx with some frequency due to DNS changes, this is when the error cropped up. The issue is that Varnish holds open persistent, server-side keep-alive connections to the nginx backend server, but when nginx is reloaded, Varnish doesn't always cope with this correctly. Varnish is supposed to retry these failed connections, but that doesn't appear to work in all situations. We at least don't seem to be the only ones experiencing these periodic problems with how Varnish handles this:

http://comments.gmane.org/gmane.comp.web.varnish.misc/8566
http://comments.gmane.org/gmane.comp.web.varnish.misc/3621
And here's a potential patch to address the underlying issue in Varnish: http://comments.gmane.org/gmane.comp.web.varnish.dev/2352

While this seems like an issue with Varnish's connection handling (and that last patch might help), the way I've sidestepped the issue is to disable Varnish's server-side keep-alive to the backend server. Since this is a localhost connection, this shouldn't be a huge loss. Clients can still hold open keep-alive connections to our server, and our backend nginx server also holds open keep-alive connections to the API backend servers (and these are the two areas where keep-alive is more important to us).
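For reference, the general shape of that workaround in Varnish 3-era VCL is to send "Connection: close" on backend requests so nginx tears the connection down after every fetch and Varnish never has a stale keep-alive socket to reuse. This is a sketch of the technique, not necessarily the exact change in NREL/api-umbrella-router@2e9007c:

```
# Sketch: disable server-side keep-alive to the backend by forcing each
# backend fetch onto a fresh connection. Not necessarily the exact VCL in
# the commit referenced above.
sub vcl_miss {
  set bereq.http.Connection = "close";
}

sub vcl_pass {
  set bereq.http.Connection = "close";
}
```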

In terms of testing, we actually had a test in place that caught this, but since it happened so rarely, it was only triggered a couple of times. Still, that test should at least help to make sure this doesn't regress in some obvious way. I've also added a more specific testing task (grunt multiLongConnectionDrops) to help pinpoint this, but you need to run that task for 30-60 minutes to fully test it.

@GUI GUI closed this as completed Apr 6, 2015
@GUI GUI added this to the Sprint 18 (3/23-4/3) milestone Apr 6, 2015
@GUI GUI self-assigned this Apr 6, 2015