Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diego client tries all bbs domain IPs #3048

Merged
merged 4 commits into from
Nov 11, 2022

Conversation

will-gant
Copy link
Contributor

@will-gant will-gant commented Nov 4, 2022

A short explanation of the proposed change:

Diego::Client resolves the bbs URL directly and tries each IP in turn. I couldn't identify any (maintained) HTTP client that has the same behaviour in the event of a timeout when hitting the first IP that a domain resolves to.

I've switched from httpclient to the standard library's Net::HTTP, as these changes rely on the latter allowing you to set the SNI of your request. Without this, you'd hit a TLS name validation error when trying to hit a bare IP address over HTTPS.

Unfortunately there seems to be no way to validate this behaviour with unit tests when stubbing requests with Webmock (or any alternative library that I can find). Therefore the specs prove only that we're setting ipaddr on the Net::HTTP client in the required way prior to making each request. I have however run this through the cf-acceptance-tests (details below) and it's passing.

An explanation of the use cases your change solves

Replaces #3002 as a solution to the 'Runner unavailable' errors that occur during an outage that leaves at least one bbs server unreachable. See that PR for more details of this issue.

Links to any other associated PRs

#3002

  • I have reviewed the contributing guide

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the main branch

  • I have run all the unit tests using bundle exec rake

  • I have run CF Acceptance Tests
    *Run against the following cf-acceptance-test suites with the PR branch swapped into CAPI 1.140.0: apps, container_networking, detect, docker, internet_dependent, routing, service_instance_sharing, services, ssh, sso, tasks, v3

Mitigates the risk of 'Runner is unavailable' errors when an outage makes some (but not all) bbs instances unreachable, by trying all IPs before re-resolving.

bbs operates as a cluster, with only one instance at a time 'active' and able to respond to API requests. At present, whenever the Diego::Client needs to communicate with bbs it hits the URL passed in from config.diego.bbs.url. The default value passed in for this from the capi-release is https://bbs.service.cf.internal:8889. This domain is defined by cf-deployment as a bosh-dns alias for the diego-api instance group with query q-s4 (i.e. resolving to all instances, whether healthy or unhealthy, in a random order).

The diego client previously made three attempts to reach bbs. Upon a network timeout its HTTP client would bail and a retry would be attempted with a fresh DNS resolution. This returned a list of bbs IPs in a random order, and with two instances there was a 1/8 chance to get the same unreachable instance's IP first in the list on all three attempts.
@will-gant
Copy link
Contributor Author

will-gant commented Nov 8, 2022

Just noting I've tested this on one of our dev foundations with two diego-api instances, both in a steady state and, to recreate the timeout issue that led to this PR, with one of the two instances behind a security rule blocking all ingress (including from its own subnet):

api/920f2ddf-e0a6-42c3-ac2e-eac87b4d78c0:~$ nslookup bbs.service.cf.internal
Server:		169.254.0.2
Address:	169.254.0.2#53

Name:	bbs.service.cf.internal
Address: 10.2.1.64
Name:	bbs.service.cf.internal
Address: 10.1.1.64
api/920f2ddf-e0a6-42c3-ac2e-eac87b4d78c0:~$ nc 10.2.1.64 8889 -v -w 10
nc: connect to 10.2.1.64 port 8889 (tcp) timed out: Operation now in progress
api/920f2ddf-e0a6-42c3-ac2e-eac87b4d78c0:~$ nc 10.1.1.64 8889 -v -w 10
Connection to 10.1.1.64 8889 port [tcp/*] succeeded!

Leaving one of the instances unreachable, I've been able to repeatedly cf push without 'Runner unavailable' errors, and the cf-acceptance-tests also all pass (list of specific acceptance test suites in the PR description).

Running the client from irb:

irb(main):162:0> client.ping
attempt 1: trying bbs endpoint /v1/ping on 10.1.1.64
=> nil

...

irb(main):170:0> client.ping
attempt 1: trying bbs endpoint /v1/ping on 10.2.1.64
attempt 1: failed to reach bbs server on 10.2.1.64, removing from list
attempt 1: trying bbs endpoint /v1/ping on 10.1.1.64
=> nil

First attempt gets the right IP by chance. Second attempt has a fresh client instance, as it'll only switch the IP it's targeting if the last request failed.

@will-gant will-gant marked this pull request as ready for review November 8, 2022 15:05
@will-gant
Copy link
Contributor Author

@sethboyles @jpalermo Opinions welcome ☺️

lib/diego/client.rb Outdated Show resolved Hide resolved
lib/diego/client.rb Outdated Show resolved Hide resolved
lib/diego/client.rb Show resolved Hide resolved
lib/diego/client.rb Show resolved Hide resolved
lib/diego/client.rb Outdated Show resolved Hide resolved
lib/diego/client.rb Show resolved Hide resolved
lib/diego/client.rb Outdated Show resolved Hide resolved
Copy link
Member

@sethboyles sethboyles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for revisiting this.

@will-gant
Copy link
Contributor Author

No problem!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants