
flake: TestMultimasterSetup() gets wrong number of workers #241

Open
bboreham opened this issue Jun 25, 2020 · 4 comments
Labels
chore Related to fix/refinement/improvement of end user or new/existing developer functionality

Comments

@bboreham (Contributor)

I've seen this a few times, e.g.:
https://circleci.com/gh/weaveworks/wksctl/6191

The cluster gets to 4 nodes, which should be 3 control-plane nodes and 1 worker, but the test reports:
should have 3 item(s), but has 2
should have 1 item(s), but has 2

@bboreham added the chore label on Jun 25, 2020
@bboreham (Contributor, Author) commented Jul 2, 2020

Log file from a run where this happened: https://circleci.com/gh/weaveworks/wksctl/6498 (attachment: wks-controller.txt)

Looks like the key part is on line 3215; kubeadm hit an error:

[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
{"level":"warn","ts":"2020-07-02T12:38:58.486Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://172.17.0.6:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
error execution phase control-plane-join/update-status: error uploading configuration: unable to create ConfigMap: etcdserver: request timed out, possibly due to connection lost
To see the stack trace of this error execute with --v=5 or higher
time="2020-07-02T12:39:18Z" level=error msg="failed to join cluster" stdouterr="[preflight] Running pre-flight checks\n\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/\n\t[WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist\n[preflight] The system verification failed. Printing the output from the verification:\n\x1b[0;37mKERNEL_VERSION\x1b[0m: \x1b[0;32m4.4.0-96-generic\x1b[0m\n\x1b[0;37mDOCKER_VERSION\x1b[0m: \x1b[0;32m19.03.8\x1b[0m\n\x1b[0;37mOS\x1b[0m: \x1b[0;32mLinux\x1b[0m\n\x1b[0;37mCGROUPS_CPU\x1b[0m: \x1b[0;32menabled\x1b[0m\n\x1b[0;37mCGROUPS_CPUACCT\x1b[0m: \x1b[0;32menabled\x1b[0m\n\x1b[0;37mCGROUPS_CPUSET\x1b[0m: \x1b[0;32menabled\x1b[0m\n\x1b[0;37mCGROUPS_DEVICES\x1b[0m: \x1b[0;32menabled\x1b[0m\n\x1b[0;37mCGROUPS_FREEZER\x1b[0m: \x1b[0;32menabled\x1b[0m\n\x1b[0;37mCGROUPS_MEMORY\x1b[0m: \x1b[0;32menabled\x1b[0m\n\t[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 19.03.8. Latest validated version: 18.09\n\t[WARNING SystemVerification]: failed to parse kernel config: unable to load kernel module: \"configs\", output: \"\", err: exit status 1\n[preflight] Reading configuration from the cluster...\n[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'\n[preflight] Running pre-flight checks before initializing the new control plane instance\n[preflight] Pulling images required for setting up a Kubernetes cluster\n[preflight] This might take a minute or two, depending on the speed of your internet connection\n[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'\n[download-certs] Downloading the certificates in Secret \"kubeadm-certs\" in the \"kube-system\" Namespace\n[certs] Using certificateDir folder \"/etc/kubernetes/pki\"\n[certs] Generating \"apiserver-etcd-client\" certificate and key\n[certs] Generating \"etcd/peer\" certificate and key\n[certs] etcd/peer serving cert is signed for DNS names [node2 localhost] and IPs [172.17.0.6 127.0.0.1 ::1]\n[certs] Generating \"etcd/server\" certificate and key\n[certs] etcd/server serving cert is signed for DNS names [node2 localhost] and IPs [172.17.0.6 127.0.0.1 ::1]\n[certs] Generating \"etcd/healthcheck-client\" certificate and key\n[certs] Generating \"front-proxy-client\" certificate and key\n[certs] Generating \"apiserver\" certificate and key\n[certs] apiserver serving cert is signed for DNS names [node2 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.17.0.6 172.17.0.4 127.0.0.1 172.17.0.4]\n[certs] Generating \"apiserver-kubelet-client\" certificate and key\n[certs] Valid certificates and keys now exist in \"/etc/kubernetes/pki\"\n[certs] Using the existing \"sa\" key\n[kubeconfig] Generating kubeconfig files\n[kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"\n[kubeconfig] Writing \"admin.conf\" kubeconfig file\n[kubeconfig] Writing \"controller-manager.conf\" kubeconfig file\n[kubeconfig] Writing \"scheduler.conf\" kubeconfig file\n[control-plane] Using manifest folder \"/etc/kubernetes/manifests\"\n[control-plane] Creating static Pod manifest for \"kube-apiserver\"\n[control-plane] Creating static Pod manifest for \"kube-controller-manager\"\n[control-plane] Creating static Pod manifest for 
\"kube-scheduler\"\n[check-etcd] Checking that the etcd cluster is healthy\n[kubelet-start] Downloading configuration for the kubelet from the \"kubelet-config-1.16\" ConfigMap in the kube-system namespace\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Activating the kubelet service\n[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...\n[etcd] Announced new etcd member joining to the existing etcd cluster\n[etcd] Creating static Pod manifest for \"etcd\"\n[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s\n{\"level\":\"warn\",\"ts\":\"2020-07-02T12:38:58.486Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"passthrough:///https://172.17.0.6:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\n[upload-config] Storing the configuration used in ConfigMap \"kubeadm-config\" in the \"kube-system\" Namespace\nerror execution phase control-plane-join/update-status: error uploading configuration: unable to create ConfigMap: etcdserver: request timed out, possibly due to connection lost\nTo see the stack trace of this error execute with --v=5 or higher\n"
time="2020-07-02T12:39:18Z" level=error msg=Failed resource="kubeadm:join"
failed to join cluster: command exited with 1
time="2020-07-02T12:39:18Z" level=info msg="State of Resource 'top' is Invalid.\nExplanation:\n{\n \"resource\": \"top\",\n \"status\": \"Invalid\"\n}\n"
time="2020-07-02T12:39:18Z" level=error msg="Apply of Plan failed:\nApply failed because a child failed\n"
time="2020-07-02T12:39:18Z" level=error msg="failed to update machine: failed to set up machine master-3: Apply failed because a child failed" cluster=test-multimaster machine=master-3 name=weavek8sops/master-3

@luxas (Contributor) commented Jul 2, 2020

etcd problems again :(

I think we can/should retry and hope for eventual consistency here. It seems like the upstream kubeadm bootstrapper retries 5 times: https://github.com/kubernetes-sigs/cluster-api/blob/master/bootstrap/kubeadm/internal/cloudinit/kubeadm-bootstrap-script.sh#L85
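
A minimal sketch of that idea in Go (hypothetical, not wksctl's actual code: the function name, parameters, and the direct exec of kubeadm are all assumptions), roughly mirroring the 5-attempt loop in the cluster-api bootstrap script linked above:

package bootstrap

import (
	"fmt"
	"os/exec"
	"time"
)

// retryKubeadmJoin runs `kubeadm join <args>` up to attempts times,
// sleeping backoff between failures so a briefly overloaded etcd has
// a chance to settle before the next try.
func retryKubeadmJoin(args []string, attempts int, backoff time.Duration) error {
	var lastErr error
	for i := 1; i <= attempts; i++ {
		cmd := exec.Command("kubeadm", append([]string{"join"}, args...)...)
		out, err := cmd.CombinedOutput()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("kubeadm join attempt %d/%d failed: %w\n%s", i, attempts, err, out)
		time.Sleep(backoff)
	}
	return lastErr
}

Note that a failed join can leave partial state behind, so a real retry would probably need cleanup between attempts (or retry only the phase that failed, here control-plane-join/update-status) rather than blindly re-running the whole join.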

@bboreham (Contributor, Author) commented Jul 2, 2020

Thanks @luxas - maybe this is an incentive to integrate the upstream bootstrapper?

@luxas (Contributor) commented Jul 2, 2020

I would say it is. Maybe next week I'll have time for a PoC with that...
