
website/docs/r/lb_target_group: Drop "TLS" from health-check protocols #6866

Merged

Conversation


@wking wking commented Dec 16, 2018

It was added in 10532ba (#5246), which quotes an error message that includes TLS. But neither the AWS docs (the API reference and NLB health-check settings pages linked in the commit message below) nor our internal validation (e.g. see d2e38fd, #5367) includes a TLS protocol.
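
For reference, a minimal sketch of what the corrected documentation describes (resource name, IDs, and ports are illustrative, not from the PR diff):

  resource "aws_lb_target_group" "example" {
    name     = "example"      # illustrative
    port     = 6443
    protocol = "TCP"
    vpc_id   = "vpc-0abc1234" # hypothetical

    health_check {
      # Health checks accept only "TCP", "HTTP", or "HTTPS";
      # "TLS" is the value this PR drops from the docs.
      protocol = "TCP"
    }
  }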

@ghost ghost added size/XS Managed by automation to categorize the size of a PR. documentation Introduces or discusses updates to documentation. service/elbv2 Issues and PRs that pertain to the elbv2 service. labels Dec 16, 2018
It was added in 10532ba (Be explicit about protocol options for
target group, 2018-07-19, hashicorp#5246), which quotes an error message that
includes "TLS".  But neither the AWS docs [1,2] nor our internal
validation (e.g. see d2e38fd, elbv2: Prevent panics from unsafe
dereferences, 2018-07-27, hashicorp#5367) includes a TLS protocol.

[1]: https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/API_CreateTargetGroup.html#API_CreateTargetGroup_RequestParameters
[2]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html#health-check-settings
@wking wking force-pushed the drop-tls-health-check-protocol-from-docs branch from f2edc7c to e8709f4 on December 16, 2018 06:08
wking added a commit to wking/openshift-installer that referenced this pull request Dec 16, 2018
As suggested by Dani Comnea [1].  When we switched to network load
balancers in 16dfbb3 (data/aws: use nlbs instead of elbs,
2018-11-01, openshift#594), we replaced things like:

  resource "aws_elb" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 2
      unhealthy_threshold = 2
      timeout             = 3
      target              = "SSL:6443"
      interval            = 5
    }
    ...
  }

with:

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "TCP"
    }
  }

This resulted in logs like [2]:

  [core@ip-10-0-11-88 ~]$ sudo crictl ps
  CONTAINER ID        IMAGE                                                                                                                                           CREATED             STATE               NAME                    ATTEMPT
  1bf4870ea6eea       registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago       Running             machine-config-server   0
  [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
  I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
  I1215 20:23:07.088554       1 api.go:54] launching server
  I1215 20:23:07.088571       1 api.go:54] launching server
  2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
  ...

when the health check opens a TCP connection (in this case to the
machine-config server on 49500) and then hangs up without completing
the TLS handshake.  Network load balancers [3,4] do not have an analog
to the classic load balancers' SSL protocol [5,6,7], so we're using
HTTPS.
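
In Terraform terms, the switch is roughly the following (a sketch: it mirrors the TCP block above with the protocol swapped, and the /healthz path is the endpoint discussed below, not a quote from the actual diff):

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "HTTPS"    # was "TCP"
      path                = "/healthz" # assumed endpoint, per the discussion below
    }
  }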

There's some discussion in [8] about the best way to perform
unauthenticated liveness checks on the Kubernetes API server.  For
now, I'm assuming that both 200 and 401 responses to /healthz requests
indicate a functional server, and we can evaluate other response
status codes as necessary.  Checking against a recent cluster:

  $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
  HTTP/1.1 200 OK
  Cache-Control: no-store
  Date: Sun, 16 Dec 2018 06:18:23 GMT
  Content-Length: 2
  Content-Type: text/plain; charset=utf-8

I don't know if the network load balancer health checks care about
certificate validity or not.  I guess we'll see how CI testing handles
this.

Ignition is only exposed inside the cluster, and checking that from a
master node:

  [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
  HTTP/1.1 404 Not Found
  Content-Type: text/plain; charset=utf-8
  X-Content-Type-Options: nosniff
  Date: Sun, 16 Dec 2018 06:30:14 GMT
  Content-Length: 19

  404 page not found

So I've allowed 200, 401, and 404 there just to be generous.
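
Expressed as Terraform, that allowance would look roughly like this (a sketch; as a later amendment of this commit message notes, AWS rejects custom matchers for TCP target groups, so this did not survive):

  health_check {
    ...
    protocol = "HTTPS"
    path     = "/healthz"    # assumed endpoint
    matcher  = "200,401,404" # sketch only; rejected for TCP target groups
  }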

[1]: openshift#923
[2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html
[6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target
[7]: hashicorp/terraform-provider-aws#6866
[8]: kubernetes/kubernetes#43784
wking added a commit to wking/openshift-installer that referenced this pull request Dec 16, 2018
As suggested by Dani Comnea [1].  When we switched to network load
balancers in 16dfbb3 (data/aws: use nlbs instead of elbs,
2018-11-01, openshift#594), we replaced things like:

  resource "aws_elb" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 2
      unhealthy_threshold = 2
      timeout             = 3
      target              = "SSL:6443"
      interval            = 5
    }
    ...
  }

with:

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "TCP"
    }
  }

This resulted in logs like [2]:

  [core@ip-10-0-11-88 ~]$ sudo crictl ps
  CONTAINER ID        IMAGE                                                                                                                                           CREATED             STATE               NAME                    ATTEMPT
  1bf4870ea6eea       registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago       Running             machine-config-server   0
  [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
  I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
  I1215 20:23:07.088554       1 api.go:54] launching server
  I1215 20:23:07.088571       1 api.go:54] launching server
  2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
  ...

when the health check opens a TCP connection (in this case to the
machine-config server on 49500) and then hangs up without completing
the TLS handshake.  Network load balancers [3,4] do not have an analog
to the classic load balancers' SSL protocol [5,6,7], so we're using
HTTPS.

There's some discussion in [8] about the best way to perform
unauthenticated liveness checks on the Kubernetes API server.  For
now, I'm assuming that both 200 and 401 responses to /healthz requests
indicate a functional server, and we can evaluate other response
status codes as necessary.  Checking against a recent cluster:

  $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
  HTTP/1.1 200 OK
  Cache-Control: no-store
  Date: Sun, 16 Dec 2018 06:18:23 GMT
  Content-Length: 2
  Content-Type: text/plain; charset=utf-8

I don't know if the network load balancer health checks care about
certificate validity or not.  I guess we'll see how CI testing handles
this.

Ignition is only exposed inside the cluster, and checking that from a
master node:

  [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
  HTTP/1.1 404 Not Found
  Content-Type: text/plain; charset=utf-8
  X-Content-Type-Options: nosniff
  Date: Sun, 16 Dec 2018 06:30:14 GMT
  Content-Length: 19

  404 page not found

Unfortunately, setting matcher [9] is not allowed for network load
balancers (e.g. see [10,11]).  Setting it leads to errors like:

  ERROR  * module.vpc.aws_lb_target_group.api_internal: 1 error occurred:
  ERROR  * aws_lb_target_group.api_internal: Error creating LB Target Group: InvalidConfigurationRequest: Custom health check matchers are not supported for health checks for target groups with the TCP protocol
  ERROR  status code: 400, request id: 25a53d63-00fe-11e9-80c5-59885e191c9c

So I've left it unset here, and we'll just hope the 401s don't start
happening.
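
So the health check stays matcher-less, roughly like this (a sketch assuming the /healthz path):

  health_check {
    healthy_threshold   = 3
    unhealthy_threshold = 3
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/healthz"
    # matcher deliberately unset: custom matchers are rejected for
    # target groups with the TCP protocol (see the error above)
  }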

[1]: openshift#923
[2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html
[6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target
[7]: hashicorp/terraform-provider-aws#6866
[8]: kubernetes/kubernetes#43784
[9]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#matcher
[10]: https://github.com/terraform-providers/terraform-provider-aws/pull/2906/files#diff-375aea487c27a6ada86edfd817ba2401R612
[11]: hashicorp/terraform-provider-aws#2708 (comment)
@bflad bflad added this to the v1.53.0 milestone Dec 18, 2018
@bflad bflad left a comment
Good catch, @wking, thanks! 🚀

@bflad bflad merged commit b3d65c1 into hashicorp:master Dec 18, 2018
wking added a commit to wking/openshift-installer that referenced this pull request Jan 9, 2019
As suggested by Dani Comnea [1].  When we switched to network load
balancers in 16dfbb3 (data/aws: use nlbs instead of elbs,
2018-11-01, openshift#594), we replaced things like:

  resource "aws_elb" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 2
      unhealthy_threshold = 2
      timeout             = 3
      target              = "SSL:6443"
      interval            = 5
    }
    ...
  }

with:

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "TCP"
    }
  }

This resulted in logs like [2]:

  [core@ip-10-0-11-88 ~]$ sudo crictl ps
  CONTAINER ID        IMAGE                                                                                                                                           CREATED             STATE               NAME                    ATTEMPT
  1bf4870ea6eea       registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago       Running             machine-config-server   0
  [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
  I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
  I1215 20:23:07.088554       1 api.go:54] launching server
  I1215 20:23:07.088571       1 api.go:54] launching server
  2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
  ...

when the health check opens a TCP connection (in this case to the
machine-config server on 49500) and then hangs up without completing
the TLS handshake.  Network load balancers [3,4] do not have an analog
to the classic load balancers' SSL protocol [5,6,7], so we're using
HTTPS.

There's some discussion in [8] about the best way to perform
unauthenticated liveness checks on the Kubernetes API server that
suggests 401s are possible in some configurations.  Checking against a
recent cluster:

  $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
  HTTP/1.1 200 OK
  Cache-Control: no-store
  Date: Sun, 16 Dec 2018 06:18:23 GMT
  Content-Length: 2
  Content-Type: text/plain; charset=utf-8

  ok

I don't know if the network load balancer health checks care about
certificate validity or not.  I guess we'll see how CI testing handles
this.

Ignition is only exposed inside the cluster, and checking that from a
master node:

  [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
  HTTP/1.1 404 Not Found
  Content-Type: text/plain; charset=utf-8
  X-Content-Type-Options: nosniff
  Date: Sun, 16 Dec 2018 06:30:14 GMT
  Content-Length: 19

  404 page not found

So we're checking the new /healthz from
openshift/machine-config-operator@d0a7ae21 (server: Add /healthz,
2019-01-04, openshift/machine-config-operator#267) instead.
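
A sketch of what that amounts to for the machine-config-server target group (resource name hypothetical; port and path from the discussion above):

  resource "aws_lb_target_group" "services" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 49500
      protocol            = "HTTPS"
      path                = "/healthz"
    }
  }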

Unfortunately, setting matcher [9] is not allowed for network load
balancers (e.g. see [10,11]).  Setting it leads to errors like:

  ERROR  * module.vpc.aws_lb_target_group.api_internal: 1 error occurred:
  ERROR  * aws_lb_target_group.api_internal: Error creating LB Target Group: InvalidConfigurationRequest: Custom health check matchers are not supported for health checks for target groups with the TCP protocol
  ERROR  status code: 400, request id: 25a53d63-00fe-11e9-80c5-59885e191c9c

So I've left it unset here, and we'll just hope the 401s don't start
happening.

[1]: openshift#923
[2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html
[6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target
[7]: hashicorp/terraform-provider-aws#6866
[8]: kubernetes/kubernetes#43784
[9]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#matcher
[10]: https://github.com/terraform-providers/terraform-provider-aws/pull/2906/files#diff-375aea487c27a6ada86edfd817ba2401R612
[11]: hashicorp/terraform-provider-aws#2708 (comment)
@wking wking deleted the drop-tls-health-check-protocol-from-docs branch March 19, 2019 23:37
@ghost ghost commented Mar 31, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Mar 31, 2020