
Improve identity provider timeout polling #12120

Merged
merged 1 commit into ComplianceAsCode:master on Jul 15, 2024

Conversation

rhmdnd
Collaborator

@rhmdnd rhmdnd commented Jul 2, 2024

One of our tests for OpenShift uses a manual remediation to install an
identity provider before rescanning the environment. Something we've
noticed is that the remediation will time out in e2e runs because the
authentication operator isn't ready yet after configuring the new
identity provider.

The default timeout is only 30 seconds, which likely isn't long enough
for the authentication operator to restart.

We can make this remediation more robust by using `oc adm
wait-for-stable-cluster`, which waits up to an hour for the
authentication operator to come up. It also reduces the number of things
we need to check for by encapsulating the checks into a single command.

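As a rough sketch, this is the general shape such a remediation can take; the provider, secret, and user names below are illustrative and not necessarily what the rule's e2e-remediation.sh uses:

```bash
#!/bin/bash
# Illustrative sketch only: configure an htpasswd identity provider, then
# wait for the cluster (including the authentication operator) to settle.
# All resource names here are examples, not the rule's actual remediation.

# Create an htpasswd file and store it as a secret for the OAuth server
# (assumes the htpasswd utility is available).
htpasswd -c -B -b users.htpasswd testuser testpassword
oc create secret generic htpass-secret \
    --from-file=htpasswd=users.htpasswd -n openshift-config

# Configure the identity provider on the cluster OAuth resource.
oc apply -f - <<EOF
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: test-htpasswd-provider
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpass-secret
EOF

# Instead of polling individual operator conditions with a short timeout,
# wait for all cluster operators to report healthy; this waits up to an
# hour by default.
oc adm wait-for-stable-cluster
```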
@rhmdnd rhmdnd added the OpenShift (OpenShift product related) label Jul 2, 2024
@rhmdnd
Collaborator Author

rhmdnd commented Jul 2, 2024

/test


openshift-ci bot commented Jul 2, 2024

@rhmdnd: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test 4.13-e2e-aws-ocp4-bsi
  • /test 4.13-e2e-aws-ocp4-bsi-node
  • /test 4.13-e2e-aws-ocp4-cis
  • /test 4.13-e2e-aws-ocp4-cis-node
  • /test 4.13-e2e-aws-ocp4-e8
  • /test 4.13-e2e-aws-ocp4-high
  • /test 4.13-e2e-aws-ocp4-high-node
  • /test 4.13-e2e-aws-ocp4-moderate
  • /test 4.13-e2e-aws-ocp4-moderate-node
  • /test 4.13-e2e-aws-ocp4-pci-dss
  • /test 4.13-e2e-aws-ocp4-pci-dss-node
  • /test 4.13-e2e-aws-ocp4-stig
  • /test 4.13-e2e-aws-ocp4-stig-node
  • /test 4.13-e2e-aws-rhcos4-bsi
  • /test 4.13-e2e-aws-rhcos4-e8
  • /test 4.13-e2e-aws-rhcos4-high
  • /test 4.13-e2e-aws-rhcos4-moderate
  • /test 4.13-e2e-aws-rhcos4-stig
  • /test 4.13-images
  • /test 4.14-e2e-aws-ocp4-bsi
  • /test 4.14-e2e-aws-ocp4-bsi-node
  • /test 4.14-e2e-aws-rhcos4-bsi
  • /test 4.14-images
  • /test 4.15-e2e-aws-ocp4-bsi
  • /test 4.15-e2e-aws-ocp4-bsi-node
  • /test 4.15-e2e-aws-ocp4-cis
  • /test 4.15-e2e-aws-ocp4-cis-node
  • /test 4.15-e2e-aws-ocp4-e8
  • /test 4.15-e2e-aws-ocp4-high
  • /test 4.15-e2e-aws-ocp4-high-node
  • /test 4.15-e2e-aws-ocp4-moderate
  • /test 4.15-e2e-aws-ocp4-moderate-node
  • /test 4.15-e2e-aws-ocp4-pci-dss
  • /test 4.15-e2e-aws-ocp4-pci-dss-node
  • /test 4.15-e2e-aws-ocp4-stig
  • /test 4.15-e2e-aws-ocp4-stig-node
  • /test 4.15-e2e-aws-rhcos4-bsi
  • /test 4.15-e2e-aws-rhcos4-e8
  • /test 4.15-e2e-aws-rhcos4-high
  • /test 4.15-e2e-aws-rhcos4-moderate
  • /test 4.15-e2e-aws-rhcos4-stig
  • /test 4.15-e2e-rosa-ocp4-cis-node
  • /test 4.15-e2e-rosa-ocp4-pci-dss-node
  • /test 4.15-images
  • /test 4.16-e2e-aws-ocp4-bsi
  • /test 4.16-e2e-aws-ocp4-bsi-node
  • /test 4.16-e2e-aws-ocp4-cis
  • /test 4.16-e2e-aws-ocp4-cis-node
  • /test 4.16-e2e-aws-ocp4-e8
  • /test 4.16-e2e-aws-ocp4-high
  • /test 4.16-e2e-aws-ocp4-high-node
  • /test 4.16-e2e-aws-ocp4-moderate
  • /test 4.16-e2e-aws-ocp4-moderate-node
  • /test 4.16-e2e-aws-ocp4-pci-dss
  • /test 4.16-e2e-aws-ocp4-pci-dss-node
  • /test 4.16-e2e-aws-ocp4-stig
  • /test 4.16-e2e-aws-ocp4-stig-node
  • /test 4.16-e2e-aws-rhcos4-bsi
  • /test 4.16-e2e-aws-rhcos4-e8
  • /test 4.16-e2e-aws-rhcos4-high
  • /test 4.16-e2e-aws-rhcos4-moderate
  • /test 4.16-e2e-aws-rhcos4-stig
  • /test 4.16-images
  • /test e2e-aws-ocp4-bsi
  • /test e2e-aws-ocp4-bsi-node
  • /test e2e-aws-ocp4-cis
  • /test e2e-aws-ocp4-cis-node
  • /test e2e-aws-ocp4-e8
  • /test e2e-aws-ocp4-high
  • /test e2e-aws-ocp4-high-node
  • /test e2e-aws-ocp4-moderate
  • /test e2e-aws-ocp4-moderate-node
  • /test e2e-aws-ocp4-pci-dss
  • /test e2e-aws-ocp4-pci-dss-node
  • /test e2e-aws-ocp4-stig
  • /test e2e-aws-ocp4-stig-node
  • /test e2e-aws-rhcos4-bsi
  • /test e2e-aws-rhcos4-e8
  • /test e2e-aws-rhcos4-high
  • /test e2e-aws-rhcos4-moderate
  • /test e2e-aws-rhcos4-stig
  • /test images

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-ComplianceAsCode-content-master-4.13-images
  • pull-ci-ComplianceAsCode-content-master-4.14-images
  • pull-ci-ComplianceAsCode-content-master-4.15-images
  • pull-ci-ComplianceAsCode-content-master-4.16-images
  • pull-ci-ComplianceAsCode-content-master-images

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rhmdnd
Collaborator Author

rhmdnd commented Jul 2, 2024

/test e2e-aws-ocp4-pci-dss
/test e2e-aws-ocp4-pci-dss-node


github-actions bot commented Jul 2, 2024

Start a new ephemeral environment with changes proposed in this pull request:

Fedora Environment
Open in Gitpod

Oracle Linux 8 Environment
Open in Gitpod


github-actions bot commented Jul 2, 2024

🤖 A k8s content image for this PR is available at:
ghcr.io/complianceascode/k8scontent:12120
This image was built from commit: 5936a3c

How to deploy it:

If you already have the Compliance Operator deployed:
utils/build_ds_container.py -i ghcr.io/complianceascode/k8scontent:12120

Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
CONTENT_IMAGE=ghcr.io/complianceascode/k8scontent:12120 make deploy-local


codeclimate bot commented Jul 2, 2024

Code Climate has analyzed commit 5936a3c and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (50% is the threshold).

This pull request will bring the total coverage in the repository to 59.4% (0.0% change).

View more on Code Climate.


openshift-ci bot commented Jul 2, 2024

@rhmdnd: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-ocp4-pci-dss
Commit: 5936a3c
Details: link
Required: true
Rerun command: /test e2e-aws-ocp4-pci-dss

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@yuumasato yuumasato self-assigned this Jul 15, 2024
@yuumasato yuumasato added this to the 0.1.74 milestone Jul 15, 2024
Member

@yuumasato yuumasato left a comment

/lgtm

The test timed out, but this time on ocp4e2e:
https://github.com/ComplianceAsCode/ocp4e2e/blob/main/helpers.go#L56

Maybe we should extend it to 1h to align with oc adm wait-for-stable-cluster.
But I wonder if it is reasonable that it takes more than 30 minutes to get a stable cluster.

@yuumasato yuumasato merged commit 664055b into ComplianceAsCode:master Jul 15, 2024
94 of 96 checks passed
@rhmdnd
Collaborator Author

rhmdnd commented Jul 15, 2024

/lgtm

The test timed out, but this time on ocp4e2e: https://github.com/ComplianceAsCode/ocp4e2e/blob/main/helpers.go#L56

Maybe we should extend it to 1h to align with oc adm wait-for-stable-cluster. But I wonder if it is reasonable that it takes more than 30 minutes to get a stable cluster.

I'd be surprised if it took more than an hour for the idp configs to apply consistently. Let's see if we keep experiencing a timeout at 30 minutes and reassess before we bump it up to an hour. Maybe we can find a more efficient/stable way to apply the remediation.
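For illustration only, one alternative to raising the harness timeout would be to bound the wait inside the remediation itself, assuming GNU coreutils timeout is available where the script runs:

```bash
# Illustrative: cap the stability wait below the harness's 30-minute limit
# so the remediation fails on its own with a clearer error instead of being
# killed by the e2e timeout.
timeout 25m oc adm wait-for-stable-cluster
```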

rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 26, 2024
Lately, we've been experiencing issues with manual remediations timing
out during functional testing. This manifests in the following error:

   === RUN   TestE2e/Apply_manual_remediations
    <snip>
    helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh'
    helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/general/file_integrity_notification_enabled/tests/ocp4/e2e-remediation.sh'
    helpers.go:1231: Command '/tmp/content-3345141771/applications/openshift/authentication/idp_is_configured/tests/ocp4/e2e-remediation.sh' timed out

In this particular case, it looks like the remediation to add an
Identity Provider to the cluster failed, but this is actually an
unintended side-effect of another change that updated the
idp_is_configured remediation to use a more robust technique for
determining if the cluster applied the remediation successfully:

  ComplianceAsCode#12120
  ComplianceAsCode#12184

Because we updated the remediation to use `oc adm
wait-for-stable-cluster`, we're effectively checking all cluster
operators to ensure they're healthy.

This started causing timeouts because a separate, unrelated remediation
that updated the default CA was also being applied in our testing, but it
didn't include a ConfigMap that contained the CA bundle. As a result,
one of the operators didn't come up because it was looking for a
ConfigMap that didn't exist. The `oc adm wait-for-stable-cluster`
command was hanging on a legitimate issue in a separate remediation.

This commit attempts to fix that issue by updating the trusted CA
remediation to generate a certificate for testing purposes and create a
ConfigMap called `trusted-ca-bundle` before updating the trusted CA.
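A minimal sketch of that fix follows; the ConfigMap name comes from this commit message, while the namespace, file names, and the proxy patch are assumptions added for illustration:

```bash
# Generate a throwaway CA certificate purely for testing purposes.
openssl req -x509 -newkey rsa:4096 -nodes -days 1 \
    -subj "/CN=e2e-test-ca" -keyout ca.key -out ca.crt

# Create the ConfigMap the cluster will look for *before* updating the
# trusted CA (namespace is an assumption).
oc create configmap trusted-ca-bundle \
    --from-file=ca-bundle.crt=ca.crt -n openshift-config

# Then point the trusted CA at the bundle, e.g. via the cluster-wide proxy
# configuration (assumed mechanism, not confirmed by this commit message).
oc patch proxy/cluster --type=merge \
    -p '{"spec":{"trustedCA":{"name":"trusted-ca-bundle"}}}'
```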
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 26, 2024
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 30, 2024
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 31, 2024
vojtapolasek pushed a commit to vojtapolasek/content that referenced this pull request Aug 1, 2024