Prow tests are timing out #1551

Closed
jlebon opened this issue Jul 16, 2024 · 9 comments
jlebon (Member) commented Jul 16, 2024

We're seeing this on all branches, so it's likely related to something on the infra side. E.g. see #1550, which purely reduces the number of tests run and yet still hit:

 INFO[2024-07-16T05:35:40Z] Executing test rhcos-92-build-test-qemu      
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2024-07-16T09:35:38Z"}
INFO[2024-07-16T09:35:38Z] Received signal.                              signal=interrupt
INFO[2024-07-16T09:35:38Z] error: Process interrupted with signal interrupt, cancelling execution... 
INFO[2024-07-16T09:35:38Z] cleanup: Deleting test pod rhcos-92-build-test-qemu 
INFO[2024-07-16T09:35:38Z] Ran for 4h0m0s                               

The archived logs don't really show at what point of the test we timed out.
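Since the entrypoint emits structured JSON log lines interleaved with plain-text ones, one way to narrow down where a run stopped is to pull out the timeout records programmatically. A minimal sketch (the sample line below is copied from the output above; a real run would read the archived build log instead):

```python
import json

# Sample structured log line from the Prow entrypoint, taken from the
# output above; real runs would iterate over the archived build-log lines.
line = ('{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169",'
        '"func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess",'
        '"level":"error","msg":"Process did not finish before 4h0m0s timeout",'
        '"severity":"error","time":"2024-07-16T09:35:38Z"}')

def find_timeouts(lines):
    """Yield (time, msg) for entrypoint records that report a timeout."""
    for raw in lines:
        try:
            rec = json.loads(raw)
        except ValueError:
            continue  # skip the plain-text log lines mixed in with JSON ones
        if rec.get("component") == "entrypoint" and "timeout" in rec.get("msg", ""):
            yield rec["time"], rec["msg"]

for when, msg in find_timeouts([line]):
    print(when, msg)
```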

jlebon (Member, Author) commented Jul 18, 2024

A neat thing about Prow is that you can access the temporary namespace in which the tests run. Right now, it's not clear what exactly the job is timing out on, and the logs appear incomplete. So to debug this, one can start a test (e.g. on an existing PR or a dummy PR), then log into the cluster it's running on and attach to the test pods as they launch to follow what's happening more closely.

Our jobs run on build02: https://github.com/openshift/release/blob/e4b98d1804ed4cd88854a5977065209f9d1ebc83/ci-operator/config/openshift/os/openshift-os-master.yaml#L92-L119. More info on available clusters: https://docs.ci.openshift.org/docs/getting-started/useful-links/
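A rough sketch of the attach step, assuming you are already logged in to the build cluster with `oc`. The namespace and pod names here are placeholders: the real namespace is the temporary `ci-op-*` one Prow creates for the job, and the pod name comes from `oc get pods -n <namespace>`.

```python
import shutil
import subprocess

def follow_test_pod_cmd(namespace, pod, container=None):
    """Build the `oc` argv to stream logs from a Prow test pod as it runs."""
    cmd = ["oc", "logs", "-f", "-n", namespace, pod]
    if container:
        cmd += ["-c", container]
    return cmd

# Placeholder names; substitute the real ci-op-* namespace and test pod.
cmd = follow_test_pod_cmd("ci-op-example", "rhcos-92-build-test-qemu")
if shutil.which("oc"):  # only attach when the oc CLI is available
    subprocess.run(cmd, check=False)
else:
    print(" ".join(cmd))
```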

marmijo (Contributor) commented Aug 1, 2024

Here's what the Event log shows in the pod:

0/39 nodes are available: 1 node(s) were unschedulable, 14 node(s) had untolerated taint {node-role.kubernetes.io/ci-tests-worker: }, 
2 node(s) had untolerated taint {node-role.kubernetes.io/ci-longtests-worker: ci-longtests-worker}, 
3 node(s) had untolerated taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}, 
3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 
3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 
5 Insufficient memory, 6 Insufficient devices.kubevirt.io/kvm, 
7 node(s) had untolerated taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}. 
no new claims to deallocate, 
preemption: 0/39 nodes are available: 33 Preemption is not helpful for scheduling, 
6 No preemption victims found for incoming pod.
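Messages like this pack every rejection reason into one line, so it can help to tally them. A small sketch that splits a FailedScheduling event message into per-reason counts (the sample below is an abridged copy of the event above):

```python
import re

# Abridged copy of the FailedScheduling event message shown above.
event = (
    "0/39 nodes are available: 1 node(s) were unschedulable, "
    "14 node(s) had untolerated taint {node-role.kubernetes.io/ci-tests-worker: }, "
    "5 Insufficient memory, 6 Insufficient devices.kubevirt.io/kvm, "
    "7 node(s) had untolerated taint {node-role.kubernetes.io/ci-builds-worker: ci-builds-worker}."
)

def tally_reasons(msg):
    """Map each 'N reason' clause of a FailedScheduling message to its count."""
    counts = {}
    # Drop the leading "0/39 nodes are available:" prefix, then split clauses.
    for clause in re.split(r",\s*", msg.split(": ", 1)[1]):
        m = re.match(r"(\d+)\s+(.*)", clause.strip().rstrip("."))
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts

reasons = tally_reasons(event)
print(reasons["Insufficient devices.kubevirt.io/kvm"])  # nodes short on kvm devices
```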

xref: #1560 (comment)

I've asked the OCP Test Platform team for assistance in Slack.

marmijo (Contributor) commented Aug 6, 2024

The build02 cluster was updated to 4.17.0-ec.2 on July 22, 2024. These failures started sometime between July 19 and July 22, so this upgrade may have affected our kvm hook. This could be something similar to #1028.

The DPTP team is going to work to deploy OpenShift Virt Operator/kvm-device-plugin to all Prow clusters so we aren't confined to build02. Build04 also has kvm-device-plugin installed, but it's out of rotation right now. https://issues.redhat.com/browse/DPTP-4126

I'll also continue to investigate what might have changed in 4.17.0-ec.2 that affected our kvm hook.

marmijo (Contributor) commented Aug 8, 2024

The DPTP team removed the kvm-device-plugin deployment on all Prow clusters in openshift/release#55365. OpenShift Virt Operator was also deployed to all Prow clusters in openshift/release#55366 and a HyperConverged custom resource was added in openshift/release#55369. Build02 was updated manually with all of these changes and now nodes are available with kvm and our jobs are being scheduled.

We should no longer be limited to building only on Build02 now that OpenShift Virt is available on all clusters.
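To confirm that a cluster's nodes actually advertise the kvm device after the rollout, one could inspect the output of `oc get nodes -o json`. A hedged sketch over a hand-built sample payload (the node names and counts here are illustrative, not from a real cluster):

```python
import json

# Hand-built payload shaped like `oc get nodes -o json`; real output
# would come from the cluster being checked.
nodes_json = json.dumps({
    "items": [
        {"metadata": {"name": "worker-a"},
         "status": {"allocatable": {"devices.kubevirt.io/kvm": "110"}}},
        {"metadata": {"name": "worker-b"},
         "status": {"allocatable": {}}},
    ]
})

def kvm_capable_nodes(raw):
    """Return {node: count} for nodes advertising allocatable kvm devices."""
    out = {}
    for item in json.loads(raw)["items"]:
        alloc = item.get("status", {}).get("allocatable", {})
        n = int(alloc.get("devices.kubevirt.io/kvm", "0"))
        if n > 0:
            out[item["metadata"]["name"]] = n
    return out

print(kvm_capable_nodes(nodes_json))  # only worker-a exposes kvm devices
```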

jlebon (Member, Author) commented Aug 8, 2024

> We should no longer be limited to building only on Build02 now that OpenShift Virt is available on all clusters.

I think for that, we need to remove the build02 references in https://github.com/openshift/release/blob/b53658ecaa1f193022ddf2a174548b819be9d14c/ci-operator/config/openshift/os/openshift-os-master.yaml#L92-L119. Want to tackle that?

(Edit: and similarly for the coreos/coreos-assembler version of that file.)

marmijo (Contributor) commented Aug 8, 2024

Sure, I can do that! So would that just allow our jobs to be scheduled on any available cluster?

marmijo (Contributor) commented Aug 8, 2024

/close

openshift-ci bot closed this as completed Aug 8, 2024

openshift-ci bot commented Aug 8, 2024

@marmijo: Closing this issue.

In response to this:

/close


jlebon (Member, Author) commented Aug 9, 2024

> So would that just allow our jobs to be scheduled on any available cluster?

Yeah exactly.
