Routing changes and Traffic Server upgrades #410

Merged: 39 commits merged into master from updates-trafficserver on May 29, 2018

Conversation

GUI (Member) commented on May 29, 2018

We've had a few different branches in development for a while. In an attempt to clean things up and start merging some of those changes more piecemeal, this pull request integrates some of the changes from the in-progress branches.

This upgrades our bundled version of Traffic Server (the HTTP caching layer) from 5.3 to 7.1. As part of this change, we've made a semi-significant change to how we route API requests internally (although this internal change shouldn't really be noticeable to API Umbrella users or administrators).

Previously, API requests were routed like this:

[ incoming request ] => [ nginx ] => [ trafficserver ] => [ nginx ] => [ API backend ]

With these changes, they are now routed like this:

[ incoming request ] => [ nginx ] => [ trafficserver ] => [ API backend ]

Basically, we've eliminated the second hop back through nginx to handle the API proxying for requests we're passing along. Instead, we proxy directly to the API backend with Traffic Server.

In addition to simplifying things by removing an extra hop, this change also lets us simplify our codebase by removing some complicated pieces related to DNS resolution and nginx upstream management, since we can leverage features built into Traffic Server as a replacement. To properly support API backends with keepalive connections and dynamic DNS entries, we previously jumped through several hoops inside nginx (it involved ngx_dyups, lua-resty-dns-cache, and custom Lua code to resolve DNS entries and keep nginx upstreams in sync). Traffic Server has DNS resolution/caching built in and handles keepalive connections for all upstreams automatically, so by leveraging Traffic Server for the upstream proxying, all that code can go away in favor of some much simpler Traffic Server config.
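Roughly, the nginx layer's job now reduces to picking a backend server and telling Traffic Server about it. A minimal OpenResty sketch of that hand-off (the header names, the `api["servers"]` structure, and the shared dict are illustrative assumptions, not the exact implementation):

```lua
-- access_by_lua sketch: choose a backend server and pass it along to
-- Traffic Server in request headers, instead of resolving DNS and managing
-- nginx upstreams in Lua.
-- Assumes a `lua_shared_dict round_robin 1m;` declaration in nginx.conf.
local api = {
  id = "example",
  servers = {
    { host = "api1.example.com", port = 8080 },
    { host = "api2.example.com", port = 8080 },
  },
}

-- Simple round-robin across the backend's servers.
local counter = ngx.shared.round_robin
local index = counter:incr(api["id"], 1, 0)
local server = api["servers"][(index % #api["servers"]) + 1]

ngx.req.set_header("X-Api-Umbrella-Backend-Server-Host", server["host"])
ngx.req.set_header("X-Api-Umbrella-Backend-Server-Port", tostring(server["port"]))

-- nginx then proxy_passes to the local Traffic Server instance, which handles
-- DNS resolution, keepalive, and caching for the chosen origin.
```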

Summary of changes:

  • Shift existing Traffic Server configuration to Lua code, now that Traffic Server also supports Lua (which aligns nicely with our more extensive Lua usage within nginx/OpenResty).
  • Remove the code related to nginx DNS resolution and upstream management. We now proxy directly to the API backends with Traffic Server. Our configuration and routing still mostly live in the nginx layer, but we now simply pass along which API backend server to connect to via an HTTP header to Traffic Server, and Traffic Server takes care of everything from there (DNS resolution, caching, keepalives, etc.); see the sketch after this list.
  • Functionally, proxying to API backends should remain the same. The one piece of functionality this removes is support for more complicated load balancing algorithms (e.g., least connections or IP hash); load balancing across multiple servers within an API backend will now always use round-robin balancing. In practice, using API Umbrella as a load balancer for individual API backends seemed rare (load balancing was more frequently handled by the API backend or at other layers), so simplifying this to round-robin only seemed acceptable.
  • Simplify how routing to the "web-app" works, so that it's treated as just another API backend rather than requiring special routing. To achieve this, we've added a couple of options so it continues to be routed like it was before (e.g., disabling analytics, since we don't necessarily want to gather analytics hits for the admin CSS/JS files). Leveraging the existing API backend configuration simplifies our routing setup and reduces the need for code that specially handles web-app routing. It also has nice side benefits, like being able to apply our rate limiting functionality to the admin login endpoints to potentially prevent abuse.
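On the Traffic Server side, the routing header can be consumed in a small ts_lua remap script. A minimal sketch, assuming the hypothetical header names from the sketch above (not the exact script in this branch):

```lua
-- trafficserver remap sketch (ts_lua plugin): send the request to whichever
-- backend server the nginx layer selected via request headers.
function do_remap()
  local host = ts.client_request.header["X-Api-Umbrella-Backend-Server-Host"]
  local port = ts.client_request.header["X-Api-Umbrella-Backend-Server-Port"]

  if host then
    -- Point the origin connection at the chosen backend; Traffic Server then
    -- handles DNS resolution, caching, and keepalive for that origin.
    ts.client_request.set_url_host(host)
    if port then
      ts.client_request.set_url_port(tonumber(port))
    end
    return TS_LUA_REMAP_DID_REMAP
  end

  return TS_LUA_REMAP_NO_REMAP
end
```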

GUI added 30 commits May 16, 2018 23:19
Rather than hopping back to nginx. Also migrate trafficserver rewrite
logic to Lua. Also experimenting/debugging trafficserver v7 collapsed
connections.
While the behavior might be a bit different from what we had in
TrafficServer v5 and the collapsed_connection plugin, that plugin is
deprecated (and seems to cause errors if used), and the options around
all this have changed a bit. These new tests also cover the connection
collapsing behavior in more detail, testing various situations we
didn't previously test.
Since Traffic Server has DNS resolution built in, using Traffic Server as
the proxy layer connecting to API backends becomes simpler in some ways
(since we can rely on all of its built-in functionality for DNS
resolution, caching, etc).
Since we're now routing to APIs directly with Traffic Server, switch to
a text log instead of a binary log for easier searching, since this log
file might be more useful for debugging now.
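For illustration, Traffic Server 7's logging.config is itself Lua, so a text access log can be declared with something like the following minimal sketch (the fields shown are generic Traffic Server log fields, not necessarily the exact format used here):

```lua
-- logging.config sketch: an ASCII (text) access log that can be grepped
-- directly, instead of a binary log that needs traffic_logcat to read.
access = format {
  Format = "%<cqtq> %<chi> %<cqhm> %<cquc> %<pssc> %<ttms> %<crc>"
}

log.ascii {
  Format = access,
  Filename = "access"
}
```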
- Treat the web-app as an API backend, to simplify the routing so that
  everything goes through the same proxy flow, rather than having
  special routing just for the web-app (particularly since most of the
  web-app, aside from some login pieces, is just APIs). This is mostly
  taken from the "postgres-lapis" branch, where we had already done this
  refactoring to simplify the routing to that new version of the
  web-app.
- Also route website backends through trafficserver, since trafficserver
  handles dynamic DNS lookups with keepalive connections a bit more
  easily out of the box (and again, just to normalize the flow through
  the proxy stack). Previously, website backends didn't have keepalive
  connections enabled (since we were relying on nginx's own proxy_pass
  workaround to deal with dynamic DNS, which doesn't enable keepalive),
  so this should also improve website backend performance a bit (and
  also allow normal HTTP caching too).
- Cleanup some of the nginx config files and remove unused pieces.
- Remove some of the temporary HTTP headers used before sending requests
  to API backends.
- Tweak trafficserver config to match our existing retry behavior (don't
  retry API backend requests if connections time out).
- Shift X-Cache header setup into Traffic Server's lua hooks, since this
  allows for easier and more efficient generation rather than doing this
  with regexes at the nginx layer.
- Set trafficserver's proxy_name setting to eliminate the need to
  rewrite Via headers at nginx's layer.
The trafficserver approach to setting X-Cache won't work in some
edge cases, so revert to parsing the Via header for this (but tweak
the regex implementation a bit for better performance).
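As a rough illustration of the Via-based approach (a simplified sketch; the regex and the cache-result codes it looks for are assumptions based on Traffic Server's Via tag documentation, not the actual implementation):

```lua
-- header_filter_by_lua sketch: derive an X-Cache response header from the
-- Via tag Traffic Server appends, which looks roughly like:
--   Via: http/1.1 api-umbrella (ApacheTrafficServer/7.1.2 [cHs f ])
-- The bracketed field encodes the cache result; here "cH" or "cR" is treated
-- as a hit (an assumption -- see Traffic Server's Via header decoding docs).
local cache_status = "MISS"
local via = ngx.resp.get_headers()["via"]
if type(via) == "table" then
  via = table.concat(via, ", ")
end
if via then
  local matches = ngx.re.match(via, "ApacheTrafficServer/\\S+ \\[([^\\]]*)\\]", "jo")
  if matches and string.find(matches[1], "c[HR]") then
    cache_status = "HIT"
  end
end
ngx.header["X-Cache"] = cache_status
```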
With the shift to using Traffic Server for DNS resolution and upstream
connections, we no longer need all of this custom code at our nginx
layer. These were some of the thornier things to deal with (DNS
resolution, caching, and configuration of upstreams via the ngx_dyups
plugin), so being able to rely on Traffic Server for this functionality
allows for some nice cleanup.
- Fix redirects performed by the "redirect_https" setting so they're not
  subject to redirect rewriting.
- Remove the unused web_app_matcher middleware.
- Rework the https configuration tests to better account for the changes
  in how the web-app is now routed as an API backend with some
  configuration options.
Skipping redirect rewriting whenever the redirect_https setting was merely
present wasn't correct, since that skipped rewriting for all responses
when this setting was present, even if they came from the backend.

Instead, tweak things so the redirect rewriting is only skipped when
redirect_https actually triggers a redirect. Also shift how
redirect_https triggers the redirect, treating it as an error, which
short-circuits the rest of the middleware rather than letting it
execute (since in the case of these redirects, we don't need to check
for rate limits, etc).
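A minimal sketch of the error-style short-circuit described above (the function and return-value names are illustrative, not the actual middleware API):

```lua
-- Hypothetical middleware sketch: redirect_https returns an "error" value
-- instead of redirecting inline, and the top-level handler turns that value
-- into the 301 before any later middleware (rate limiting, etc.) runs.
local function redirect_https_middleware(settings)
  if settings["redirect_https"] and ngx.var.scheme ~= "https" then
    return "redirect_https"
  end
end

-- Top-level request handler (sketch): stop on the first middleware "error".
local settings = { redirect_https = true }
local err = redirect_https_middleware(settings)
if err == "redirect_https" then
  return ngx.redirect("https://" .. ngx.var.host .. ngx.var.request_uri, ngx.HTTP_MOVED_PERMANENTLY)
end
```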
GUI added 9 commits May 27, 2018 17:27
Now that we're using Traffic Server for establishing backend
connections, tune its configuration to better match our existing
keepalive settings.

- This removes the ability to set "keepalive_connections" on a
  per-server basis, but that ability was never really exposed through the
  admin interface, so I don't think this is a huge loss.
- There's an idle timeout for Traffic Server's keepalive settings, which,
  while different, actually seems preferable.
- Improve the "reload" signal to trafficserver so it works reliably (while
  a SIGHUP should have been equivalent, it didn't seem to pick up changes,
  so using traffic_ctl seems more reliable).
require_https is normally on by default, but it had been disabled in the
test environment. Now that it's at least re-enabled for the web-app
backend, update the tests to account for this (which better matches
production behavior anyway).
With Traffic Server now proxying directly to the API backend, there have
been some subtle shifts in gzip behavior, since the nginx layer behind
Traffic Server no longer potentially gzips things. This actually seems
more predictable, but adjust our tests, particularly the ones
surrounding gzip behavior between the first and second request.
This was due to the changes in the caching helper that makes duplicate
requests so it can be called multiple times within a single test.
This fixes things so the helper can still be called multiple times, but
with predictable URLs.
GUI merged commit 9102d42 into master on May 29, 2018
GUI deleted the updates-trafficserver branch on May 29, 2018 04:41
GUI added this to the v0.15.0 milestone on May 12, 2019