-
Notifications
You must be signed in to change notification settings - Fork 357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Diego client tries all bbs domain IPs #3048
Diego client tries all bbs domain IPs #3048
Conversation
Mitigates the risk of 'Runner is unavailable' errors when an outage makes some (but not all) bbs instances unreachable, by trying all IPs before re-resolving. bbs operates as a cluster, with only one instance at a time 'active' and able to respond to API requests. At present, whenever the Diego::Client needs to communicate with bbs it hits the URL passed in from config.diego.bbs.url. The default value passed in for this from the capi-release is https://bbs.service.cf.internal:8889. This domain is defined by cf-deployment as a bosh-dns alias for the diego-api instance group with query q-s4 (i.e. resolving to all instances, whether healthy or unhealthy, in a random order). The diego client previously made three attempts to reach bbs. Upon a network timeout its HTTP client would bail and a retry would be attempted with a fresh DNS resolution. This returned a list of bbs IPs in a random order, and with two instances there was a 1/8 chance to get the same unreachable instance's IP first in the list on all three attempts.
Just noting I've tested this on one of our dev foundations with two api/920f2ddf-e0a6-42c3-ac2e-eac87b4d78c0:~$ nslookup bbs.service.cf.internal
Server: 169.254.0.2
Address: 169.254.0.2#53
Name: bbs.service.cf.internal
Address: 10.2.1.64
Name: bbs.service.cf.internal
Address: 10.1.1.64
api/920f2ddf-e0a6-42c3-ac2e-eac87b4d78c0:~$ nc 10.2.1.64 8889 -v -w 10
nc: connect to 10.2.1.64 port 8889 (tcp) timed out: Operation now in progress
api/920f2ddf-e0a6-42c3-ac2e-eac87b4d78c0:~$ nc 10.1.1.64 8889 -v -w 10
Connection to 10.1.1.64 8889 port [tcp/*] succeeded! Leaving one of the instances unreachable, I've been able to repeatedly Running the client from irb(main):162:0> client.ping
attempt 1: trying bbs endpoint /v1/ping on 10.1.1.64
=> nil
...
irb(main):170:0> client.ping
attempt 1: trying bbs endpoint /v1/ping on 10.2.1.64
attempt 1: failed to reach bbs server on 10.2.1.64, removing from list
attempt 1: trying bbs endpoint /v1/ping on 10.1.1.64
=> nil First attempt gets the right IP by chance. Second attempt has a fresh client instance, as it'll only switch the IP it's targeting if the last request failed. |
5039700
to
f786f7a
Compare
f786f7a
to
a21a76b
Compare
@sethboyles @jpalermo Opinions welcome |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for revisiting this.
No problem! |
A short explanation of the proposed change:
Diego::Client resolves the bbs URL directly and tries each IP in turn. I couldn't identify any (maintained) HTTP client that has the same behaviour in the event of a timeout when hitting the first IP that a domain resolves to.
I've switched from
httpclient
to the standard library'sNet::HTTP
, as these changes rely on the latter allowing you to set the SNI of your request. Without this, you'd hit a TLS name validation error when trying to hit a bare IP address over HTTPS.Unfortunately there seems to be no way to validate this behaviour with unit tests when stubbing requests with Webmock (or any alternative library that I can find). Therefore the specs prove only that we're setting
ipaddr
on theNet::HTTP
client in the required way prior to making each request. I have however run this through the cf-acceptance-tests (details below) and it's passing.An explanation of the use cases your change solves
Replaces #3002 as a solution to the 'Runner unavailable' errors that occur during an outage that leaves at least one bbs server unreachable. See that PR for more details of this issue.
Links to any other associated PRs
#3002
I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the
main
branchI have run all the unit tests using
bundle exec rake
I have run CF Acceptance Tests
*Run against the following cf-acceptance-test suites with the PR branch swapped into CAPI 1.140.0: apps, container_networking, detect, docker, internet_dependent, routing, service_instance_sharing, services, ssh, sso, tasks, v3