Diego client tries all bbs domain IPs #3048

will-gant · 2022-11-04T18:32:36Z

A short explanation of the proposed change:

Diego::Client resolves the bbs URL directly and tries each IP in turn. I couldn't identify any (maintained) HTTP client that has the same behaviour in the event of a timeout when hitting the first IP that a domain resolves to.

I've switched from httpclient to the standard library's Net::HTTP, as these changes rely on the latter allowing you to set the SNI of your request. Without this, you'd hit a TLS name validation error when trying to hit a bare IP address over HTTPS.

Unfortunately there seems to be no way to validate this behaviour with unit tests when stubbing requests with Webmock (or any alternative library that I can find). Therefore the specs prove only that we're setting ipaddr on the Net::HTTP client in the required way prior to making each request. I have however run this through the cf-acceptance-tests (details below) and it's passing.

An explanation of the use cases your change solves

Replaces #3002 as a solution to the 'Runner unavailable' errors that occur during an outage that leaves at least one bbs server unreachable. See that PR for more details of this issue.

Links to any other associated PRs

#3002

I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the main branch
I have run all the unit tests using bundle exec rake
I have run CF Acceptance Tests
*Run against the following cf-acceptance-test suites with the PR branch swapped into CAPI 1.140.0: apps, container_networking, detect, docker, internet_dependent, routing, service_instance_sharing, services, ssh, sso, tasks, v3

Mitigates the risk of 'Runner is unavailable' errors when an outage makes some (but not all) bbs instances unreachable, by trying all IPs before re-resolving. bbs operates as a cluster, with only one instance at a time 'active' and able to respond to API requests. At present, whenever the Diego::Client needs to communicate with bbs it hits the URL passed in from config.diego.bbs.url. The default value passed in for this from the capi-release is https://bbs.service.cf.internal:8889. This domain is defined by cf-deployment as a bosh-dns alias for the diego-api instance group with query q-s4 (i.e. resolving to all instances, whether healthy or unhealthy, in a random order). The diego client previously made three attempts to reach bbs. Upon a network timeout its HTTP client would bail and a retry would be attempted with a fresh DNS resolution. This returned a list of bbs IPs in a random order, and with two instances there was a 1/8 chance to get the same unreachable instance's IP first in the list on all three attempts.

will-gant · 2022-11-08T13:14:16Z

Just noting I've tested this on one of our dev foundations with two diego-api instances, both in a steady state and, to recreate the timeout issue that led to this PR, with one of the two instances behind a security rule blocking all ingress (including from its own subnet):

api/920f2ddf-e0a6-42c3-ac2e-eac87b4d78c0:~$ nslookup bbs.service.cf.internal
Server:		169.254.0.2
Address:	169.254.0.2#53

Name:	bbs.service.cf.internal
Address: 10.2.1.64
Name:	bbs.service.cf.internal
Address: 10.1.1.64
api/920f2ddf-e0a6-42c3-ac2e-eac87b4d78c0:~$ nc 10.2.1.64 8889 -v -w 10
nc: connect to 10.2.1.64 port 8889 (tcp) timed out: Operation now in progress
api/920f2ddf-e0a6-42c3-ac2e-eac87b4d78c0:~$ nc 10.1.1.64 8889 -v -w 10
Connection to 10.1.1.64 8889 port [tcp/*] succeeded!

Leaving one of the instances unreachable, I've been able to repeatedly cf push without 'Runner unavailable' errors, and the cf-acceptance-tests also all pass (list of specific acceptance test suites in the PR description).

Running the client from irb:

irb(main):162:0> client.ping
attempt 1: trying bbs endpoint /v1/ping on 10.1.1.64
=> nil

...

irb(main):170:0> client.ping
attempt 1: trying bbs endpoint /v1/ping on 10.2.1.64
attempt 1: failed to reach bbs server on 10.2.1.64, removing from list
attempt 1: trying bbs endpoint /v1/ping on 10.1.1.64
=> nil

First attempt gets the right IP by chance. Second attempt has a fresh client instance, as it'll only switch the IP it's targeting if the last request failed.

will-gant · 2022-11-08T17:23:27Z

@sethboyles @jpalermo Opinions welcome ☺️

lib/diego/client.rb

sethboyles

LGTM. Thanks for revisiting this.

will-gant · 2022-11-11T08:55:33Z

No problem!

cf-gitbot added the unscheduled label Nov 4, 2022

Configure http client to target specific bbs IP

1af3604

will-gant force-pushed the try-each-bbs-ip branch from 5039700 to f786f7a Compare November 8, 2022 13:27

Add some logging to the diego client

a21a76b

will-gant force-pushed the try-each-bbs-ip branch from f786f7a to a21a76b Compare November 8, 2022 14:28

will-gant marked this pull request as ready for review November 8, 2022 15:05

philippthun reviewed Nov 10, 2022

View reviewed changes

Address PR feedback

dce76a9

will-gant requested a review from philippthun November 10, 2022 15:47

sethboyles approved these changes Nov 11, 2022

View reviewed changes

FloThinksPi merged commit 0013b4e into cloudfoundry:main Nov 11, 2022

cf-gitbot removed the unscheduled label Nov 11, 2022

will-gant deleted the try-each-bbs-ip branch December 12, 2022 19:08

thelangley mentioned this pull request Dec 16, 2022

After recent changes in bbs selection process, passive nodes randomly causing errors #3109

Closed

philippthun mentioned this pull request Dec 19, 2022

Revert "Diego client tries all bbs domain IPs" #3113

Merged

This was referenced Feb 1, 2023

Use Net::HTTP::Persistent in Diego client #3170

Merged

Patch HTTPClient to accept socket_connect_timeout #3174

Merged

johha mentioned this pull request Feb 7, 2023

Consolidate HTTP Clients #3178

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Diego client tries all bbs domain IPs #3048

Diego client tries all bbs domain IPs #3048

will-gant commented Nov 4, 2022 •

edited

Loading

will-gant commented Nov 8, 2022 •

edited

Loading

will-gant commented Nov 8, 2022

sethboyles left a comment

will-gant commented Nov 11, 2022

Diego client tries all bbs domain IPs #3048

Diego client tries all bbs domain IPs #3048

Conversation

will-gant commented Nov 4, 2022 • edited Loading

will-gant commented Nov 8, 2022 • edited Loading

will-gant commented Nov 8, 2022

sethboyles left a comment

Choose a reason for hiding this comment

will-gant commented Nov 11, 2022

will-gant commented Nov 4, 2022 •

edited

Loading

will-gant commented Nov 8, 2022 •

edited

Loading