
Improve identity provider timeout polling #12120

Merged
merged 1 commit into ComplianceAsCode:master on Jul 15, 2024

Conversation

rhmdnd
Collaborator

@rhmdnd rhmdnd commented Jul 2, 2024

One of our tests for OpenShift uses a manual remediation to install an
identity provider before rescanning the environment. Something we've
noticed is that the remediation will time out in e2e runs because the
authentication operator isn't ready yet after configuring the new
identity provider.

The default timeout is only 30 seconds, which likely isn't long enough
for the authentication operator to restart.

We can make this remediation more robust by using `oc adm
wait-for-stable-cluster`, which waits up to an hour for the
authentication operator to come up. It also reduces the number of things
we need to check for by encapsulating the checks into a single command.

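As a rough sketch, this is the general shape such a remediation can take; the provider, secret, and user names below are illustrative and not necessarily what the rule's e2e-remediation.sh uses:

```bash
#!/bin/bash
# Illustrative sketch only: configure an htpasswd identity provider, then
# wait for the cluster (including the authentication operator) to settle.
# All resource names here are examples, not the rule's actual remediation.

# Create an htpasswd file and store it as a secret for the OAuth server
# (assumes the htpasswd utility is available).
htpasswd -c -B -b users.htpasswd testuser testpassword
oc create secret generic htpass-secret \
    --from-file=htpasswd=users.htpasswd -n openshift-config

# Configure the identity provider on the cluster OAuth resource.
oc apply -f - <<EOF
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: test-htpasswd-provider
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpass-secret
EOF

# Instead of polling individual operator conditions with a short timeout,
# wait for all cluster operators to report healthy; this waits up to an
# hour by default.
oc adm wait-for-stable-cluster
```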
@rhmdnd rhmdnd added the OpenShift (OpenShift product related) label Jul 2, 2024
@rhmdnd
Collaborator Author

rhmdnd commented Jul 2, 2024

/test


openshift-ci bot commented Jul 2, 2024

@rhmdnd: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test 4.13-e2e-aws-ocp4-bsi
  • /test 4.13-e2e-aws-ocp4-bsi-node
  • /test 4.13-e2e-aws-ocp4-cis
  • /test 4.13-e2e-aws-ocp4-cis-node
  • /test 4.13-e2e-aws-ocp4-e8
  • /test 4.13-e2e-aws-ocp4-high
  • /test 4.13-e2e-aws-ocp4-high-node
  • /test 4.13-e2e-aws-ocp4-moderate
  • /test 4.13-e2e-aws-ocp4-moderate-node
  • /test 4.13-e2e-aws-ocp4-pci-dss
  • /test 4.13-e2e-aws-ocp4-pci-dss-node
  • /test 4.13-e2e-aws-ocp4-stig
  • /test 4.13-e2e-aws-ocp4-stig-node
  • /test 4.13-e2e-aws-rhcos4-bsi
  • /test 4.13-e2e-aws-rhcos4-e8
  • /test 4.13-e2e-aws-rhcos4-high
  • /test 4.13-e2e-aws-rhcos4-moderate
  • /test 4.13-e2e-aws-rhcos4-stig
  • /test 4.13-images
  • /test 4.14-e2e-aws-ocp4-bsi
  • /test 4.14-e2e-aws-ocp4-bsi-node
  • /test 4.14-e2e-aws-rhcos4-bsi
  • /test 4.14-images
  • /test 4.15-e2e-aws-ocp4-bsi
  • /test 4.15-e2e-aws-ocp4-bsi-node
  • /test 4.15-e2e-aws-ocp4-cis
  • /test 4.15-e2e-aws-ocp4-cis-node
  • /test 4.15-e2e-aws-ocp4-e8
  • /test 4.15-e2e-aws-ocp4-high
  • /test 4.15-e2e-aws-ocp4-high-node
  • /test 4.15-e2e-aws-ocp4-moderate
  • /test 4.15-e2e-aws-ocp4-moderate-node
  • /test 4.15-e2e-aws-ocp4-pci-dss
  • /test 4.15-e2e-aws-ocp4-pci-dss-node
  • /test 4.15-e2e-aws-ocp4-stig
  • /test 4.15-e2e-aws-ocp4-stig-node
  • /test 4.15-e2e-aws-rhcos4-bsi
  • /test 4.15-e2e-aws-rhcos4-e8
  • /test 4.15-e2e-aws-rhcos4-high
  • /test 4.15-e2e-aws-rhcos4-moderate
  • /test 4.15-e2e-aws-rhcos4-stig
  • /test 4.15-e2e-rosa-ocp4-cis-node
  • /test 4.15-e2e-rosa-ocp4-pci-dss-node
  • /test 4.15-images
  • /test 4.16-e2e-aws-ocp4-bsi
  • /test 4.16-e2e-aws-ocp4-bsi-node
  • /test 4.16-e2e-aws-ocp4-cis
  • /test 4.16-e2e-aws-ocp4-cis-node
  • /test 4.16-e2e-aws-ocp4-e8
  • /test 4.16-e2e-aws-ocp4-high
  • /test 4.16-e2e-aws-ocp4-high-node
  • /test 4.16-e2e-aws-ocp4-moderate
  • /test 4.16-e2e-aws-ocp4-moderate-node
  • /test 4.16-e2e-aws-ocp4-pci-dss
  • /test 4.16-e2e-aws-ocp4-pci-dss-node
  • /test 4.16-e2e-aws-ocp4-stig
  • /test 4.16-e2e-aws-ocp4-stig-node
  • /test 4.16-e2e-aws-rhcos4-bsi
  • /test 4.16-e2e-aws-rhcos4-e8
  • /test 4.16-e2e-aws-rhcos4-high
  • /test 4.16-e2e-aws-rhcos4-moderate
  • /test 4.16-e2e-aws-rhcos4-stig
  • /test 4.16-images
  • /test e2e-aws-ocp4-bsi
  • /test e2e-aws-ocp4-bsi-node
  • /test e2e-aws-ocp4-cis
  • /test e2e-aws-ocp4-cis-node
  • /test e2e-aws-ocp4-e8
  • /test e2e-aws-ocp4-high
  • /test e2e-aws-ocp4-high-node
  • /test e2e-aws-ocp4-moderate
  • /test e2e-aws-ocp4-moderate-node
  • /test e2e-aws-ocp4-pci-dss
  • /test e2e-aws-ocp4-pci-dss-node
  • /test e2e-aws-ocp4-stig
  • /test e2e-aws-ocp4-stig-node
  • /test e2e-aws-rhcos4-bsi
  • /test e2e-aws-rhcos4-e8
  • /test e2e-aws-rhcos4-high
  • /test e2e-aws-rhcos4-moderate
  • /test e2e-aws-rhcos4-stig
  • /test images

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-ComplianceAsCode-content-master-4.13-images
  • pull-ci-ComplianceAsCode-content-master-4.14-images
  • pull-ci-ComplianceAsCode-content-master-4.15-images
  • pull-ci-ComplianceAsCode-content-master-4.16-images
  • pull-ci-ComplianceAsCode-content-master-images

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rhmdnd
Collaborator Author

rhmdnd commented Jul 2, 2024

/test e2e-aws-ocp4-pci-dss
/test e2e-aws-ocp4-pci-dss-node


github-actions bot commented Jul 2, 2024

Start a new ephemeral environment with changes proposed in this pull request:

Fedora Environment
Open in Gitpod

Oracle Linux 8 Environment
Open in Gitpod


github-actions bot commented Jul 2, 2024

🤖 A k8s content image for this PR is available at:
ghcr.io/complianceascode/k8scontent:12120
This image was built from commit: 5936a3c

How to deploy it:

If you already have the Compliance Operator deployed:
utils/build_ds_container.py -i ghcr.io/complianceascode/k8scontent:12120

Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
CONTENT_IMAGE=ghcr.io/complianceascode/k8scontent:12120 make deploy-local


codeclimate bot commented Jul 2, 2024

Code Climate has analyzed commit 5936a3c and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (50% is the threshold).

This pull request will bring the total coverage in the repository to 59.4% (0.0% change).

View more on Code Climate.


openshift-ci bot commented Jul 2, 2024

@rhmdnd: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-ocp4-pci-dss
Commit: 5936a3c
Details: link
Required: true
Rerun command: /test e2e-aws-ocp4-pci-dss

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@yuumasato yuumasato self-assigned this Jul 15, 2024
@yuumasato yuumasato added this to the 0.1.74 milestone Jul 15, 2024
Member

@yuumasato yuumasato left a comment

/lgtm

The test timed out, but this time on ocp4e2e:
https://github.com/ComplianceAsCode/ocp4e2e/blob/main/helpers.go#L56

Maybe we should extend it to 1h to align with oc adm wait-for-stable-cluster.
But I wonder if it is reasonable that it takes more than 30 minutes to get a stable cluster.

@yuumasato yuumasato merged commit 664055b into ComplianceAsCode:master Jul 15, 2024
94 of 96 checks passed
@rhmdnd
Collaborator Author

rhmdnd commented Jul 15, 2024

/lgtm

The test timed out, but this time on ocp4e2e: https://github.com/ComplianceAsCode/ocp4e2e/blob/main/helpers.go#L56

Maybe we should extend it to 1h to align with oc adm wait-for-stable-cluster. But I wonder if it is reasonable that it takes more than 30 minutes to get a stable cluster.

I'd be surprised if it took more than an hour for the idp configs to apply consistently. Let's see if we keep experiencing a timeout at 30 minutes and reassess before we bump it up to an hour. Maybe we can find a more efficient/stable way to apply the remediation.
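For illustration only, one alternative to raising the harness timeout would be to bound the wait inside the remediation itself, assuming GNU coreutils timeout is available where the script runs:

```bash
# Illustrative: cap the stability wait below the harness's 30-minute limit
# so the remediation fails on its own with a clearer error instead of being
# killed by the e2e timeout.
timeout 25m oc adm wait-for-stable-cluster
```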

rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 26, 2024
Lately, we've been experiencing issues with manual remediations timing
out during functional testing. This manifests in the following error:

   === RUN   TestE2e/Apply_manual_remediations
    <snip>
    helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh'
    helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/general/file_integrity_notification_enabled/tests/ocp4/e2e-remediation.sh'
    helpers.go:1231: Command '/tmp/content-3345141771/applications/openshift/authentication/idp_is_configured/tests/ocp4/e2e-remediation.sh' timed out

In this particular case, it looks like the remediation to add an
Identity Provider to the cluster failed, but this is actually an
unintended side-effect of another change that updated the
idp_is_configured remediation to use a more robust technique for
determining if the cluster applied the remediation successfully:

  ComplianceAsCode#12120
  ComplianceAsCode#12184

Because we updated the remediation to use `oc adm
wait-for-stable-cluster`, we're effectively checking all cluster
operators to ensure they're healthy.

This started causing timeouts because a separate, unrelated remediation
that updated the default CA was also being applied in our testing, but it
didn't include a ConfigMap that contained the CA bundle. As a result,
one of the operators didn't come up because it was looking for a
ConfigMap that didn't exist. The `oc adm wait-for-stable-cluster`
command was hanging on a legitimate issue in a separate remediation.

This commit attempts to fix that issue by updating the trusted CA
remediation to generate a certificate for testing purposes and create a
ConfigMap called `trusted-ca-bundle` before updating the trusted CA.
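A minimal sketch of that fix follows; the ConfigMap name comes from this commit message, while the namespace, file names, and the proxy patch are assumptions added for illustration:

```bash
# Generate a throwaway CA certificate purely for testing purposes.
openssl req -x509 -newkey rsa:4096 -nodes -days 1 \
    -subj "/CN=e2e-test-ca" -keyout ca.key -out ca.crt

# Create the ConfigMap the cluster will look for *before* updating the
# trusted CA (namespace is an assumption).
oc create configmap trusted-ca-bundle \
    --from-file=ca-bundle.crt=ca.crt -n openshift-config

# Then point the trusted CA at the bundle, e.g. via the cluster-wide proxy
# configuration (assumed mechanism, not confirmed by this commit message).
oc patch proxy/cluster --type=merge \
    -p '{"spec":{"trustedCA":{"name":"trusted-ca-bundle"}}}'
```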
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 26, 2024
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 30, 2024
rhmdnd added a commit to rhmdnd/content that referenced this pull request Jul 31, 2024
vojtapolasek pushed a commit to vojtapolasek/content that referenced this pull request Aug 1, 2024