
failing e2e job kubeadm-kinder-kubelet-1-25-on-1-26 #2896

Closed
neolit123 opened this issue Jun 17, 2023 · 18 comments
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
Milestone
v1.28

Comments

@neolit123
Member

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-kubelet-1-25-on-1-26

i am seeing this in the kubelet logs:

kubelet.go:1380] "Failed to start cAdvisor" err="inotify_init: too many open files"

started on 16.06

other kubelet jobs seem ok
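
For reference, a quick way to inspect a node's inotify limits when chasing this class of error; a minimal sketch using standard Linux sysctls, nothing specific to this job:

    # show the per-user limits for inotify instances and watches
    sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
    # count inotify instances currently open across all processes
    find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l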

neolit123 added the priority/important-soon, kind/failing-test and sig/node labels on Jun 17, 2023
neolit123 added this to the v1.28 milestone on Jun 17, 2023
@SataQiu
Member

SataQiu commented Jun 17, 2023

The sandboxImage of containerd is set to registry.k8s.io/pause:3.7, which is not preloaded by kinder.
If the bandwidth is slow, the control plane may not start within the expected time.
But it's strange that only this CI failed.
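
For context, containerd's pause image is set by the sandbox_image key of its CRI plugin config, and it can be checked and preloaded manually on a node; a sketch, assuming containerd's default config path and that ctr is available:

    # confirm which pause image containerd is configured to use
    grep sandbox_image /etc/containerd/config.toml
    # pull it into the k8s.io namespace ahead of time so the kubelet never waits on it
    ctr -n k8s.io images pull registry.k8s.io/pause:3.7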

@chendave
Member

kubelet.go:1380] "Failed to start cAdvisor" err="inotify_init: too many open files"

this error has changed now (see https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-kubelet-1-25-on-1-26/1670022676342640640/build-log.txt); the earliest error I can see now is:

[WARNING SystemVerification]: failed to parse kernel config: unable to load kernel module: "configs", output: "modprobe: FATAL: Module configs not found in directory /lib/modules/5.15.0-1030-aws\n", err: exit status 1
I0617 11:02:12.361345 313 checks.go:401] checking whether the given node name is valid and reachable using net.LookupHost
[preflight] The system verification failed. Printing the output from the verification:

this might be caused by the missing boot config file.

Since this job started failing after the migration to the eks cluster, I'd suggest reverting the change for this job to confirm this error is caused by the cluster migration.
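
For anyone reproducing the warning: the preflight looks for the kernel config in a few standard places, roughly like this; a sketch of checking a node by hand, the exact lookup order lives in the system-validators library:

    # the check first tries to expose the config via the "configs" kernel module
    modprobe configs && ls /proc/config.gz
    # and otherwise falls back to the boot config file for the running kernel
    ls /boot/config-$(uname -r)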

@neolit123
Member Author

neolit123 commented Jun 17, 2023

The sandboxImage of containerd is set to registry.k8s.io/pause:3.7, which is not preloaded by kinder. If the bandwidth is slow, the control plane may not start within the expected time. But it's strange that only this CI failed.

is it a problem (preloading) due to the k8s.gcr.io -> registry.k8s.io migration?

edit: never mind, we never preloaded the sandbox image based on the containerd config. it should be really fast to pull too, as it's <1mb

it would be strange if the flakes are because of that.

@neolit123
Member Author

this might be caused by the missing boot config file.

Since this job started failing after the migration to the eks cluster, I'd suggest reverting the change for this job to confirm this error is caused by the cluster migration.

it seems we have to notify #sig-k8s-infra about the kernel config problem on eks nodes. @dims do you know who can help us?

i am not so sure this is the cause of the failure, though. it is just a warning, and the nodes are the same for other test jobs too.

but we can revert this test job to check for general eks node / kubelet incompatibility as you suggest.

@SataQiu
Member

SataQiu commented Jun 17, 2023

The revert PR was sent; feel free to review/merge it if we want to check whether it's the reason for the CI failure.

@SataQiu
Member

SataQiu commented Jun 17, 2023

I didn't experience similar problems when testing locally.

@neolit123
Member Author

The revert PR was sent; feel free to review/merge it if we want to check whether it's the reason for the CI failure.

we do have more flakes after moving to eks, but perhaps we should only revert this problematic kubelet job?

the flakes can be investigated separately.

@dims
Member

dims commented Jun 18, 2023

@SataQiu could we please have focused reverts? looking at this specific failure now.

@dims
Member

dims commented Jun 18, 2023

$ rg inotify kinder-xony-control-plane-1/kubelet.log | cut -f 3- -d ']' | sort | uniq -c
     94  "Failed to start cAdvisor" err="inotify_init: too many open files"
     94  "Unable to read config path" err="unable to create inotify: too many open files" path="/etc/kubernetes/manifests"
     94  Registration of the raw container factory failed: inotify_init: too many open files
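
All three messages point at exhausted inotify resources on the node. The usual mitigation is to raise the host sysctls; values here are illustrative only, the actual fix for the CI nodes is discussed further down in this thread:

    # raise the per-user inotify limits (values illustrative)
    sysctl -w fs.inotify.max_user_instances=8192
    sysctl -w fs.inotify.max_user_watches=1048576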

@dims
Member

dims commented Jun 18, 2023

@dims
Member

dims commented Jun 18, 2023

@dims
Member

dims commented Jun 18, 2023

ok got one green: https://prow.k8s.io/?job=ci-kubernetes-e2e-kubeadm-kinder-kubelet-1-25-on-1-26. let's watch these jobs over the weekend and see if we spot other issues

@dims
Member

dims commented Jun 18, 2023

ok 3 greens in a row: https://prow.k8s.io/?job=ci-kubernetes-e2e-kubeadm-kinder-kubelet-1-25-on-1-26

@neolit123
Member Author

@neolit123
Member Author

closing this; we can close the revert prs too, @SataQiu

@chendave
Member

ok 3 greens in a row

Great to hear that!

For the record only: to see the err msg "inotify_init: too many open files", we need to open the kubelet.log from the gcp buckets.
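
The path follows the same pattern as the artifact URLs earlier in this thread, i.e. something like this, with the node name varying per run:

    https://storage.googleapis.com/kubernetes-jenkins/logs/<job-name>/<run-id>/artifacts/<node-name>/kubelet.log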

@pacoxu
Member

pacoxu commented Jun 18, 2023

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-1-25-on-1-24
failed for the same reason.

Welcome to Ubuntu 22.04.1 LTS!

Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-1-25-on-1-24/1670258734548389888/artifacts/kinder-xony-control-plane-1/serial.log

This is from the last run; not sure if it will be fixed by kubernetes/k8s.io#5438.

We may need to wait for the next run. The start time of this CI run is close to the time kubernetes/k8s.io#5438 was merged.

@dims
Member

dims commented Jun 18, 2023

@pacoxu see kubernetes/k8s.io#5439
