
failing e2e job kubeadm-kinder-kubelet-1-25-on-1-26 #2896

Closed
neolit123 opened this issue Jun 17, 2023 · 18 comments
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
Milestone
v1.28

Comments

@neolit123
Member

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-kubelet-1-25-on-1-26

i am seeing this in the kubelet logs:

kubelet.go:1380] "Failed to start cAdvisor" err="inotify_init: too many open files"

started on 16.06

other kubelet jobs seem ok
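
For reference, a quick way to inspect a node's inotify limits when chasing this class of error; a minimal sketch using standard Linux sysctls, nothing specific to this job:

    # show the per-user limits for inotify instances and watches
    sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
    # count inotify instances currently open across all processes
    find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l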

neolit123 added the priority/important-soon, kind/failing-test and sig/node labels on Jun 17, 2023
neolit123 added this to the v1.28 milestone on Jun 17, 2023
@SataQiu
Member

SataQiu commented Jun 17, 2023

The sandboxImage of containerd is set to registry.k8s.io/pause:3.7, which is not preloaded by kinder.
If the bandwidth is slow, the control plane may not start within the expected time.
But it's strange that only this CI failed.
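
For context, containerd's pause image is set by the sandbox_image key of its CRI plugin config, and it can be checked and preloaded manually on a node; a sketch, assuming containerd's default config path and that ctr is available:

    # confirm which pause image containerd is configured to use
    grep sandbox_image /etc/containerd/config.toml
    # pull it into the k8s.io namespace ahead of time so the kubelet never waits on it
    ctr -n k8s.io images pull registry.k8s.io/pause:3.7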

@chendave
Member

kubelet.go:1380] "Failed to start cAdvisor" err="inotify_init: too many open files"

this error has changed now (see https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-kubelet-1-25-on-1-26/1670022676342640640/build-log.txt); the earliest error I can see now is:

[WARNING SystemVerification]: failed to parse kernel config: unable to load kernel module: "configs", output: "modprobe: FATAL: Module configs not found in directory /lib/modules/5.15.0-1030-aws\n", err: exit status 1
I0617 11:02:12.361345 313 checks.go:401] checking whether the given node name is valid and reachable using net.LookupHost
[preflight] The system verification failed. Printing the output from the verification:

this might be caused by the missing boot config file.

Since this job started failing after the migration to the eks cluster, I'd suggest reverting the change for this job to confirm this error is caused by the cluster migration.
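
For anyone reproducing the warning: the preflight looks for the kernel config in a few standard places, roughly like this; a sketch of checking a node by hand, the exact lookup order lives in the system-validators library:

    # the check first tries to expose the config via the "configs" kernel module
    modprobe configs && ls /proc/config.gz
    # and otherwise falls back to the boot config file for the running kernel
    ls /boot/config-$(uname -r)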

@neolit123
Member Author

neolit123 commented Jun 17, 2023

The sandboxImage of containerd is set to registry.k8s.io/pause:3.7, which is not preloaded by kinder. If the bandwidth is slow, the control plane may not start within the expected time. But it's strange that only this CI failed.

is it a problem (preloading) due to the k8s.gcr.io -> registry.k8s.io migration?

edit: never mind, we never preloaded the sandbox image based on the containerd config. it should be really fast to pull too, as it's <1mb

it would be strange if the flakes are because of that.

@neolit123
Member Author

this might be caused by the missing boot config file.

Since this job started failing after the migration to the eks cluster, I'd suggest reverting the change for this job to confirm this error is caused by the cluster migration.

it seems we have to notify #sig-k8s-infra about the kernel config problem on eks nodes. @dims do you know who can help us?

i am not so sure this is the cause of the failure, though. it is just a warning, and the nodes are the same for other test jobs too.

but we can revert this test job to check for general eks node / kubelet incompatibility as you suggest.

@SataQiu
Member

SataQiu commented Jun 17, 2023

The revert PR was sent; feel free to review/merge it if we want to check whether it's the reason for the CI failure.

@SataQiu
Member

SataQiu commented Jun 17, 2023

I didn't experience similar problems when testing locally.

@neolit123
Member Author

The revert PR was sent; feel free to review/merge it if we want to check whether it's the reason for the CI failure.

we do have more flakes after moving to eks, but perhaps we should only revert this problematic kubelet job?

the flakes can be investigated separately.

@dims
Member

dims commented Jun 18, 2023

@SataQiu could we please have focused reverts? looking at this specific failure now.

@dims
Member

dims commented Jun 18, 2023

$ rg inotify kinder-xony-control-plane-1/kubelet.log | cut -f 3- -d ']' | sort | uniq -c
     94  "Failed to start cAdvisor" err="inotify_init: too many open files"
     94  "Unable to read config path" err="unable to create inotify: too many open files" path="/etc/kubernetes/manifests"
     94  Registration of the raw container factory failed: inotify_init: too many open files
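
All three messages point at exhausted inotify resources on the node. The usual mitigation is to raise the host sysctls; values here are illustrative only, the actual fix for the CI nodes is discussed further down in this thread:

    # raise the per-user inotify limits (values illustrative)
    sysctl -w fs.inotify.max_user_instances=8192
    sysctl -w fs.inotify.max_user_watches=1048576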

@dims
Member

dims commented Jun 18, 2023

@dims
Member

dims commented Jun 18, 2023

@dims
Member

dims commented Jun 18, 2023

ok got one green: https://prow.k8s.io/?job=ci-kubernetes-e2e-kubeadm-kinder-kubelet-1-25-on-1-26. let's watch these jobs over the weekend and see if we spot other issues

@dims
Member

dims commented Jun 18, 2023

ok 3 greens in a row: https://prow.k8s.io/?job=ci-kubernetes-e2e-kubeadm-kinder-kubelet-1-25-on-1-26

@neolit123
Member Author

@neolit123
Member Author

closing this; we can close the revert prs too, @SataQiu

@chendave
Member

ok 3 greens in a row

Great to hear that!

For the record only: to see the err msg "inotify_init: too many open files", we need to open the kubelet.log from the gcp buckets.
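
The path follows the same pattern as the artifact URLs earlier in this thread, i.e. something like this, with the node name varying per run:

    https://storage.googleapis.com/kubernetes-jenkins/logs/<job-name>/<run-id>/artifacts/<node-name>/kubelet.log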

@pacoxu
Member

pacoxu commented Jun 18, 2023

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-1-25-on-1-24
failed for the same reason.

Welcome to Ubuntu 22.04.1 LTS!

Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-1-25-on-1-24/1670258734548389888/artifacts/kinder-xony-control-plane-1/serial.log

This is from the last run; not sure if it will be fixed by kubernetes/k8s.io#5438.

We may need to wait for the next run. The start time of this CI run is close to the time kubernetes/k8s.io#5438 was merged.

@dims
Member

dims commented Jun 18, 2023

@pacoxu see kubernetes/k8s.io#5439
