
kola/kubeadm: add kubernetes 1.22 test #196

Merged: 2 commits merged into flatcar-master from tormath1/kubernetes-1-22 on Aug 13, 2021

Conversation

@tormath1 (Contributor) commented Jul 30, 2021

This commit brings a new Kubernetes release to test, and it also fixes a TODO: we are now able to test multiple Kubernetes releases. To add one, we just need to create a new map[string]interface{} holding the params of the release we want to test (a minimal sketch follows below).

$ ./bin/kola list
kubeadm.v1.21.0.calico.base
kubeadm.v1.21.0.cilium.base
kubeadm.v1.21.0.flannel.base
kubeadm.v1.22.0.calico.base
kubeadm.v1.22.0.cilium.base
kubeadm.v1.22.0.flannel.base

Signed-off-by: Mathieu Tortuyaux mathieu@kinvolk.io
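
To make that mechanism concrete, here is a minimal standalone sketch (not the actual kola code; the map keys are assumptions) of deriving one test per release and per CNI from such params maps:

```go
package main

import "fmt"

func main() {
	// One params map per Kubernetes release under test; the keys are illustrative.
	releases := map[string]map[string]interface{}{
		"v1.21.0": {"KubeadmVersion": "v1.21.0"},
		"v1.22.0": {"KubeadmVersion": "v1.22.0"},
	}
	cnis := []string{"calico", "cilium", "flannel"}

	// In mantle this loop would call register.Register once per combination,
	// handing each test its own copy of releases[version]; here we only print
	// the resulting test names (unsorted, since Go map iteration is random).
	for version := range releases {
		for _, cni := range cnis {
			fmt.Printf("kubeadm.%s.%s.base\n", version, cni)
		}
	}
}
```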

note for reviewers:

  • I'm a bit concerned by the number of tests we now provide for Kubernetes: versions + CNIs create a big test matrix, which can slow down the overall test run... but I guess that's the price to pay to catch potential failures with Kubernetes workloads

@tormath1 tormath1 self-assigned this Jul 30, 2021
@tormath1 tormath1 marked this pull request as ready for review August 5, 2021 09:19
@tormath1 tormath1 requested a review from a team August 5, 2021 09:19
@jepio (Member) commented Aug 9, 2021

The v1.22 tests fail on the current release (2955) because kubeadm defaults to the systemd cgroup driver whereas the system Docker uses cgroupfs.

@jepio (Member) commented Aug 9, 2021

I was testing this PR with the Docker 20 / cgroup v2 images and both flannel tests were failing, even though the original test on main was passing for the same image. It turns out the original test registration had the same flaw as the "ugly..." bit, and the flannel tests were using cilium instead. With this fix the tests fail for flannel on the current Alpha:

diff --git a/kola/tests/kubeadm/kubeadm.go b/kola/tests/kubeadm/kubeadm.go
index 44b47387..5b81da19 100644
--- a/kola/tests/kubeadm/kubeadm.go
+++ b/kola/tests/kubeadm/kubeadm.go
@@ -76,12 +76,13 @@ systemd:

 func init() {
        for _, CNI := range CNIs {
+               cni := CNI
                register.Register(&register.Test{
-                       Name:             fmt.Sprintf("kubeadm.%s.base", CNI),
+                       Name:             fmt.Sprintf("kubeadm.%s.base", cni),
                        Distros:          []string{"cl"},
                        ExcludePlatforms: []string{"esx"},
                        Run: func(c cluster.TestCluster) {
-                               kubeadmBaseTest(c, CNI)
+                               kubeadmBaseTest(c, cni)
                        },
                })
        }

But I'm able to deploy working flannel manually so we need to look into the setup code.
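
For context, the diff above addresses the classic Go range-variable capture: every registered closure shared the single loop variable, so by the time the tests ran they all saw whatever value it held last. A minimal standalone illustration of the pitfall (not mantle code):

```go
package main

import "fmt"

func main() {
	var tests []func()
	for _, cni := range []string{"calico", "cilium", "flannel"} {
		// Before Go 1.22, `cni` is a single variable reused across iterations,
		// so every closure below captures that same variable.
		tests = append(tests, func() { fmt.Println("running with", cni) })
	}
	for _, run := range tests {
		run() // prints "flannel" three times; a per-iteration copy (cni := cni) fixes it
	}
}
```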

@tormath1 (Contributor, Author) commented:

@jepio thanks for testing and your feedback

the flannel tests were using cilium instead

This PR actually holds the "fix" to avoid passing a map reference to the test itself. I'll try to reproduce and fix the flannel issue; if you still have the traceback of the failing test, could you share it?

The v1.22 tests fail on the current release (2955) because kubeadm defaults to the systemd cgroup driver whereas the system Docker uses cgroupfs.

OK, then I guess we should add a minimum release version, like you did here: 6d21aef.
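
For reference, a hedged sketch of that gating idea, reusing the registration from the diff above with a minimum-version line added (this assumes kola's register.Test exposes a MinVersion field, as in the referenced commit; the cut-off value is illustrative and imports are omitted):

```go
// Same registration as in the diff above, with a hypothetical minimum-version
// gate so the test only runs on releases newer than the current one (2955).
register.Register(&register.Test{
	Name:             fmt.Sprintf("kubeadm.%s.base", cni),
	Distros:          []string{"cl"},
	ExcludePlatforms: []string{"esx"},
	MinVersion:       semver.Version{Major: 2956}, // illustrative cut-off
	Run: func(c cluster.TestCluster) {
		kubeadmBaseTest(c, cni)
	},
})
```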

@jepio (Member) commented Aug 10, 2021

I don't think this is very helpful (the nodes don't come up), but here's the failure on main:

1..1 not ok - kubeadm.flannel.base --- Error: "cluster.go:117: I0809 16:11:16.758111 1238 version.go:254] remote version is much newer: v1.22.0; falling back to: stable-1.21
 cluster.go:117: [config/images] Pulled k8s.gcr.io/kube-apiserver:v1.21.3
 cluster.go:117: [config/images] Pulled k8s.gcr.io/kube-controller-manager:v1.21.3
 cluster.go:117: [config/images] Pulled k8s.gcr.io/kube-scheduler:v1.21.3
 cluster.go:117: [config/images] Pulled k8s.gcr.io/kube-proxy:v1.21.3
 cluster.go:117: [config/images] Pulled k8s.gcr.io/pause:3.4.1
 cluster.go:117: [config/images] Pulled k8s.gcr.io/etcd:3.4.13-0
 cluster.go:117: [config/images] Pulled k8s.gcr.io/coredns/coredns:v1.8.0
 cluster.go:117: I0809 16:11:43.137632 1627 version.go:254] remote version is much newer: v1.22.0; falling back to: stable-1.21
 cluster.go:117: [init] Using Kubernetes version: v1.21.3
 cluster.go:117: [preflight] Running pre-flight checks
 cluster.go:117: 	[WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
 cluster.go:117: 	[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/
 cluster.go:117: [preflight] Pulling images required for setting up a Kubernetes cluster
 cluster.go:117: [preflight] This might take a minute or two, depending on the speed of your internet connection
 cluster.go:117: [preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
 cluster.go:117: [certs] Using certificateDir folder \"/etc/kubernetes/pki\"
 cluster.go:117: [certs] Generating \"ca\" certificate and key
 cluster.go:117: [certs] Generating \"apiserver\" certificate and key
 cluster.go:117: [certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 10.0.0.3]
 cluster.go:117: [certs] Generating \"apiserver-kubelet-client\" certificate and key
 cluster.go:117: [certs] Generating \"front-proxy-ca\" certificate and key
 cluster.go:117: [certs] Generating \"front-proxy-client\" certificate and key
 cluster.go:117: [certs] External etcd mode: Skipping etcd/ca certificate authority generation
 cluster.go:117: [certs] External etcd mode: Skipping etcd/server certificate generation
 cluster.go:117: [certs] External etcd mode: Skipping etcd/peer certificate generation
 cluster.go:117: [certs] External etcd mode: Skipping etcd/healthcheck-client certificate generation
 cluster.go:117: [certs] External etcd mode: Skipping apiserver-etcd-client certificate generation
 cluster.go:117: [certs] Generating \"sa\" key and public key
 cluster.go:117: [kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"
 cluster.go:117: [kubeconfig] Writing \"admin.conf\" kubeconfig file
 cluster.go:117: [kubeconfig] Writing \"kubelet.conf\" kubeconfig file
 cluster.go:117: [kubeconfig] Writing \"controller-manager.conf\" kubeconfig file
 cluster.go:117: [kubeconfig] Writing \"scheduler.conf\" kubeconfig file
 cluster.go:117: [kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"
 cluster.go:117: [kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"
 cluster.go:117: [kubelet-start] Starting the kubelet
 cluster.go:117: [control-plane] Using manifest folder \"/etc/kubernetes/manifests\"
 cluster.go:117: [control-plane] Creating static Pod manifest for \"kube-apiserver\"
 cluster.go:117: [control-plane] Creating static Pod manifest for \"kube-controller-manager\"
 cluster.go:117: [control-plane] Creating static Pod manifest for \"kube-scheduler\"
 cluster.go:117: [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory \"/etc/kubernetes/manifests\". This can take up to 4m0s
 cluster.go:117: [kubelet-check] Initial timeout of 40s passed.
 cluster.go:117: [apiclient] All control plane components are healthy after 48.002759 seconds
 cluster.go:117: [upload-config] Storing the configuration used in ConfigMap \"kubeadm-config\" in the \"kube-system\" Namespace
 cluster.go:117: [kubelet] Creating a ConfigMap \"kubelet-config-1.21\" in namespace kube-system with the configuration for the kubelets in the cluster
 cluster.go:117: [upload-certs] Skipping phase. Please see --upload-certs
 cluster.go:117: [mark-control-plane] Marking the node localhost as control-plane by adding the labels: [node-role.kubernetes.io/master(deprecated) node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
 cluster.go:117: [mark-control-plane] Marking the node localhost as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
 cluster.go:117: [bootstrap-token] Using token: czrbu0.wy9pvo5ww1lptemq
 cluster.go:117: [bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
 cluster.go:117: [bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to get nodes
 cluster.go:117: [bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
 cluster.go:117: [bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
 cluster.go:117: [bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
 cluster.go:117: [bootstrap-token] Creating the \"cluster-info\" ConfigMap in the \"kube-public\" namespace
 cluster.go:117: [kubelet-finalize] Updating \"/etc/kubernetes/kubelet.conf\" to point to a rotatable kubelet client certificate and key
 cluster.go:117: [addons] Applied essential addon: CoreDNS
 cluster.go:117: [addons] Applied essential addon: kube-proxy
 cluster.go:117: 
 cluster.go:117: Your Kubernetes control-plane has initialized successfully!
 cluster.go:117: 
 cluster.go:117: To start using your cluster, you need to run the following as a regular user:
 cluster.go:117: 
 cluster.go:117: mkdir -p $HOME/.kube
 cluster.go:117: sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
 cluster.go:117: sudo chown $(id -u):$(id -g) $HOME/.kube/config
 cluster.go:117: 
 cluster.go:117: Alternatively, if you are the root user, you can run:
 cluster.go:117: 
 cluster.go:117: export KUBECONFIG=/etc/kubernetes/admin.conf
 cluster.go:117: 
 cluster.go:117: You should now deploy a pod network to the cluster.
 cluster.go:117: Run \"kubectl apply -f [podnetwork].yaml\" with one of the options listed at:
 cluster.go:117: https://kubernetes.io/docs/concepts/cluster-administration/addons/
 cluster.go:117: 
 cluster.go:117: Then you can join any number of worker nodes by running the following on each as root:
 cluster.go:117: 
 cluster.go:117: kubeadm join 10.0.0.3:6443 --token czrbu0.wy9pvo5ww1lptemq \
 cluster.go:117: 	--discovery-token-ca-cert-hash sha256:6c1b36d528297ac123ce8ddf85c0423dc95fac4ffc42082e47afba1cd2b0d31c 
 cluster.go:117: Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
 cluster.go:117: podsecuritypolicy.policy/psp.flannel.unprivileged created
 cluster.go:117: clusterrole.rbac.authorization.k8s.io/flannel created
 cluster.go:117: clusterrolebinding.rbac.authorization.k8s.io/flannel created
 cluster.go:117: serviceaccount/flannel created
 cluster.go:117: configmap/kube-flannel-cfg created
 cluster.go:117: daemonset.apps/kube-flannel-ds created
 cluster.go:117: Created symlink /etc/systemd/system/multi-user.target.wants/kubelet.service → /etc/systemd/system/kubelet.service.
 cluster.go:117: 	[WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
 cluster.go:117: 	[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/
 --- FAIL: kubeadm.flannel.base/node_readiness (92.85s)
 kubeadm.go:114: nodes are not ready: ready nodes should be equal to 2: 0
 --- FAIL: kubeadm.flannel.base/nginx_deployment (92.05s)
 kubeadm.go:132: nginx is not deployed: ready replicas should be equal to 1: null" ...

Here are the logs (I see some suspicious "avc: denied" entries):
kola_logs.tar.gz

@tormath1 (Contributor, Author) commented Aug 10, 2021

@jepio thanks for the logs. SELinux could be (again) the issue: for the tests, SELinux is always set to enforcing mode, but we don't have a fully labelled system, so I'm not sure yet how it behaves. We could keep SELinux in permissive mode with this register flag: https://github.com/kinvolk/mantle/blob/flatcar-master/kola/register/register.go#L33.

But I'm able to deploy working flannel manually so we need to look into the setup code.

SELinux seems to be the only difference between things done manually and things done with kola.

Aug  9 16:13:05.620588 kubelet[2776]: E0809 16:13:05.620536    2776 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"install-cni\" with CrashLoopBackOff: \"back-off 10s restarting failed container=install-cni pod=kube-flannel-ds-48krl_kube-system(8cf54eca-e2dc-4e6a-82ec-c6d5abaa37f9)\"" pod="kube-system/kube-flannel-ds-48krl" podUID=8cf54eca-e2dc-4e6a-82ec-c6d5abaa37f9

That would make sense then ⬆️

EDIT:

core@localhost ~ $ docker logs 6a774c560ff1 (the flannel init container)
cp: can't create '/etc/cni/net.d/10-flannel.conflist': Permission denied

It confirms the SELinux issue; I'll provide a patch as soon as possible.

(see also: flatcar/Flatcar#476, flatcar-archive/coreos-overlay#1181)

@tormath1 (Contributor, Author) commented:

@jepio alright, now that the intermission is almost over, let's get back to the PR 😂

The v1.22 tests fail on the current release (2955) because kubeadm defaults to the systemd cgroup driver whereas the system Docker uses cgroupfs.

How do you think we should proceed? The only way I see is to split the tests into two groups, so that we keep running the v1.21 tests on releases < 2955 and run the v1.22 tests on releases > 2955.
What do you think?

@jepio (Member) commented Aug 11, 2021

@jepio alright, now that the intermission is almost over, let's get back to the PR 😂

The v1.22 tests fail on the current release (2955) because kubeadm defaults to the systemd cgroup driver whereas the system Docker uses cgroupfs.

How do you think we should proceed? The only way I see is to split the tests into two groups, so that we keep running the v1.21 tests on releases < 2955 and run the v1.22 tests on releases > 2955.
What do you think?

I think that as long as the test can work, we should make it work. We need to pass a bit more configuration to kubeadm (for master and workers):

---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: cgroupfs

The best approach would be to detect what Docker is using on the node (docker info | grep cgroup) and pass in either cgroupfs or systemd.
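
A rough Go sketch of that detection idea (illustrative only: in kola the command would run on the test node over SSH, and `docker info --format '{{.CgroupDriver}}'` is just one way to read the value instead of grepping):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// cgroupDriver asks the local Docker daemon which cgroup driver it uses
// ("systemd" or "cgroupfs").
func cgroupDriver() (string, error) {
	out, err := exec.Command("docker", "info", "--format", "{{.CgroupDriver}}").Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	driver, err := cgroupDriver()
	if err != nil {
		fmt.Println("could not detect the cgroup driver:", err)
		return
	}
	// The detected value would then be templated into the KubeletConfiguration
	// shown above: cgroupDriver: <driver>.
	fmt.Println("cgroupDriver:", driver)
}
```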

@tormath1 (Contributor, Author) commented:

@jepio done: the cgroup driver is now fetched using docker info and the configuration is passed to the kubelet via kubeadm.

Mathieu Tortuyaux added 2 commits August 12, 2021 21:17
  • kola/kubeadm: add kubernetes 1.22 test
  • we now handle the `cgroup` driver
@tormath1 (Contributor, Author) commented:

commits squashed and rebased onto flatcar-master

@jepio (Member) left a comment:

Love what you did with the test - it previously mutated the global params struct for each of these tests... Much better now 🥇

@tormath1 tormath1 merged commit 62656e0 into flatcar-master Aug 13, 2021
@tormath1 tormath1 deleted the tormath1/kubernetes-1-22 branch August 13, 2021 08:45
@tormath1 (Contributor, Author) commented:

@jepio still not perfect! I wish we had a proper Params struct rather than a big map[string]interface{}... let's iterate. Anyway, thanks for your testing and your review :)
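
For the record, a tiny sketch of the typed alternative wished for here; the field names are purely illustrative, not an agreed design:

```go
// Params would replace the big map[string]interface{}, one value per
// Kubernetes release under test.
type Params struct {
	KubernetesVersion string // e.g. "v1.22.0"
	CNIVersion        string
	HelmVersion       string
}
```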
