Troubleshooting ovs-ovn Pod Crashes in a Performance Test Environment

The Chinese documentation in this Wiki is no longer maintained. Please visit our latest Chinese documentation site for up-to-date docs.

In the performance test environment, a large number of Pods (about 10k) were running. Checking the Pod status showed many ovs-ovn Pods crashing, while on some Nodes the ovs-ovn Pods remained in the normal Running state.

[root@node10 ~]# kubectl get pod -n kube-system -o wide
ovs-ovn-572sv                          1/1     Running             0          16m     10.0.128.224     10.0.128.224        <none>           <none>
ovs-ovn-5hbkf                          1/1     Running             0          29m     10.0.128.93      10.0.128.93         <none>           <none>
ovs-ovn-7dzsz                          0/1     CrashLoopBackOff    205        12d     10.0.128.83      10.0.128.83         <none>           <none>
ovs-ovn-7tzzm                          0/1     CrashLoopBackOff    203        12d     10.0.128.201     10.0.128.201        <none>           <none>
ovs-ovn-8tvwk                          0/1     Running             123        12d     10.0.128.243     10.0.128.243        <none>           <none>
ovs-ovn-cb88d                          0/1     Running             136        8d      10.0.128.198     10.0.128.198        <none>           <none>
ovs-ovn-ck8s9                          0/1     CrashLoopBackOff    199        12d     10.0.128.103     10.0.128.103        <none>           <none>
ovs-ovn-cm65q                          0/1     CrashLoopBackOff    127        12d     10.0.129.29      10.0.129.29         <none>           <none>
ovs-ovn-fqq86                          0/1     CrashLoopBackOff    90         7d23h   10.0.128.113     10.0.128.113        <none>           <none>
ovs-ovn-kpn6h                          1/1     Running             221        12d     10.0.129.12      10.0.129.12         <none>           <none>
ovs-ovn-mclpn                          0/1     Running             124        12d     10.0.128.35      10.0.128.35         <none>           <none>
ovs-ovn-nnwbd                          1/1     Running             275        12d     10.0.129.20      10.0.129.20         <none>           <none>
ovs-ovn-stxxc                          0/1     CrashLoopBackOff    234        12d     10.0.129.77      10.0.129.77         <none>           <none>
ovs-ovn-v4sz9                          1/1     Running             120        12d     10.0.129.154     10.0.129.154        <none>           <none>
ovs-ovn-vhnpn                          0/1     Running             225        12d     10.0.129.4       10.0.129.4          <none>           <none>
ovs-ovn-xksl6                          1/1     Running             0          20m     10.0.129.148     10.0.129.148        <none>           <none>

Pick one of the crashing Pods and check its logs; they indicate a problem connecting to ovn-sb:

[root@node10 ~]# kubectl logs ovs-ovn-7dzsz -n kube-system
sleep 10 seconds, waiting for ovn-sb 192.170.173.26:6642 ready
sleep 10 seconds, waiting for ovn-sb 192.170.173.26:6642 ready
sleep 10 seconds, waiting for ovn-sb 192.170.173.26:6642 ready
sleep 10 seconds, waiting for ovn-sb 192.170.173.26:6642 ready
[root@node10 ~]#
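
To rule out simple network reachability problems, the TCP connection the startup script is waiting for can be probed by hand from the affected Node. A minimal check using bash's built-in /dev/tcp, so no extra tools are assumed:

timeout 3 bash -c '</dev/tcp/192.170.173.26/6642' && echo reachable || echo unreachable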

The address 192.170.173.26:6642 is the ClusterIP of the ovn-sb Service:

[root@node10 ~]# kubectl get svc -n kube-system
NAME                                               TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)                        AGE
kube-dns                                           ClusterIP   192.170.0.10      <none>        53/UDP,53/TCP,9153/TCP         12d
kube-ovn-cni                                       ClusterIP   192.170.119.10    <none>        10665/TCP                      12d
kube-ovn-controller                                ClusterIP   192.170.164.145   <none>        10660/TCP                      12d
kube-ovn-pinger                                    ClusterIP   192.170.22.21     <none>        8080/TCP                       12d
kube-prometheus-exporter-coredns                   ClusterIP   None              <none>        9153/TCP                       12d
kube-prometheus-exporter-kube-controller-manager   ClusterIP   None              <none>        10252/TCP                      12d
kube-prometheus-exporter-kube-etcd                 ClusterIP   None              <none>        2379/TCP                       12d
kube-prometheus-exporter-kube-proxy                ClusterIP   None              <none>        10249/TCP                      12d
kube-prometheus-exporter-kube-scheduler            ClusterIP   None              <none>        10251/TCP                      12d
kubelet                                            ClusterIP   None              <none>        10250/TCP,10255/TCP,4194/TCP   12d
ovn-nb                                             ClusterIP   192.170.235.211   <none>        6641/TCP                       12d
ovn-sb                                             ClusterIP   192.170.173.26    <none>        6642/TCP                       12d
[root@node10 ~]#
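
It is also worth confirming that the Service actually has healthy backends; if the endpoints list were empty, the connection failures would be expected:

kubectl get endpoints ovn-sb -n kube-system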

kubectl describe on the Pod shows a large number of probe failures:
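
The events below were gathered with a command along the lines of:

kubectl describe pod ovs-ovn-7dzsz -n kube-system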

2021-01-07T02:22:01Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/kube-ovn/ovs-healthcheck.sh: line 6:  8310 Alarm clock             ovn-sbctl --db=tcp:["${OVN_SB_SERVICE_HOST}"]:"${OVN_SB_SERVICE_PORT}" --timeout=3 show
  Warning  Unhealthy  63m  kubelet, 10.0.128.83  Liveness probe failed: Connecting OVN SB 192.170.173.26:6642
2021-01-07T02:22:01Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/kube-ovn/ovs-healthcheck.sh: line 6:  8271 Alarm clock             ovn-sbctl --db=tcp:["${OVN_SB_SERVICE_HOST}"]:"${OVN_SB_SERVICE_PORT}" --timeout=3 show
  Warning  Unhealthy  63m  kubelet, 10.0.128.83  Readiness probe failed: Connecting OVN SB 192.170.173.26:6642
2021-01-07T02:22:06Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/kube-ovn/ovs-healthcheck.sh: line 6:  8697 Alarm clock             ovn-sbctl --db=tcp:["${OVN_SB_SERVICE_HOST}"]:"${OVN_SB_SERVICE_PORT}" --timeout=3 show
  Warning  Unhealthy  63m  kubelet, 10.0.128.83  Liveness probe failed: Connecting OVN SB 192.170.173.26:6642
2021-01-07T02:22:06Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/kube-ovn/ovs-healthcheck.sh: line 6:  8664 Alarm clock             ovn-sbctl --db=tcp:["${OVN_SB_SERVICE_HOST}"]:"${OVN_SB_SERVICE_PORT}" --timeout=3 show
  Warning  Unhealthy  62m (x17385 over 12d)  kubelet, 10.0.128.83  (combined from similar events): Readiness probe failed: Connecting OVN SB 192.170.173.26:6642
2021-01-07T02:23:36Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
/kube-ovn/ovs-healthcheck.sh: line 6: 16888 Alarm clock             ovn-sbctl --db=tcp:["${OVN_SB_SERVICE_HOST}"]:"${OVN_SB_SERVICE_PORT}" --timeout=3 show
  Normal   Pulled   36m (x11 over 63m)  kubelet, 10.0.128.83  Container image "platforma-225.alauda.cn:60080/acp/kube-ovn:v1.3.0" already present on machine
  Normal   Killing  16m (x19 over 63m)  kubelet, 10.0.128.83  Container openvswitch failed liveness probe, will be restarted
  Warning  BackOff  2m (x229 over 59m)  kubelet, 10.0.128.83  Back-off restarting failed container
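
The failing probe can be reproduced by hand. The command below mirrors line 6 of /kube-ovn/ovs-healthcheck.sh as quoted in the events (brackets around the IPv4 address dropped for clarity); run it from anywhere ovn-sbctl is available:

ovn-sbctl --db=tcp:192.170.173.26:6642 --timeout=3 show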

In this situation, it is worth checking whether the problem is caused by kube-proxy failing to process Service/Endpoint updates in time.
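
Since the graceful_termination entries in the kube-proxy log further below indicate IPVS mode, one quick check is whether the real servers behind the ovn-sb ClusterIP match the currently healthy backends; stale entries point at kube-proxy lagging behind Endpoint updates. Assuming ipvsadm is installed on the Node:

ipvsadm -Ln -t 192.170.173.26:6642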

Filter the Pods on the affected Node:

[root@node10 ~]# kubectl get pod -n kube-system -o wide| grep 10.0.128.83
kube-ovn-cni-snv5m                     1/1     Running             0          12d     10.0.128.83      10.0.128.83         <none>           <none>
kube-ovn-pinger-f27q5                  1/1     Running             0          17h     192.172.0.82     10.0.128.83         <none>           <none>
kube-proxy-9qdbm                       1/1     Running             0          12d     10.0.128.83      10.0.128.83         <none>           <none>
ovs-ovn-7dzsz                          0/1     CrashLoopBackOff    207        12d     10.0.128.83      10.0.128.83         <none>           <none>
[root@node10 ~]#

Check the kube-proxy Pod logs; they contain many entries triggered by the ovs-ovn Pod restarts:
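
The excerpt below can be pulled with, for example:

kubectl logs kube-proxy-9qdbm -n kube-system --tail=100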

I0107 02:05:57.018826       1 graceful_termination.go:93] lw: remote out of the list: 192.170.173.26:6642/TCP/10.0.129.77:6642
W0107 02:18:20.362225       1 reflector.go:301] k8s.io/client-go/informers/factory.go:134: watch of *v1.Service ended with: too old resource version: 99775911 (100430249)
I0107 02:25:57.022763       1 graceful_termination.go:93] lw: remote out of the list: 192.170.173.26:6642/TCP/10.0.129.20:6642
I0107 02:25:57.022815       1 graceful_termination.go:93] lw: remote out of the list: 192.170.173.26:6642/TCP/10.0.129.12:6642

Manually delete the kube-proxy and ovs-ovn Pods, wait for them to be recreated, and check their status:
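
Using the Pod names from the listing above, the deletion step looks like:

kubectl delete pod kube-proxy-9qdbm ovs-ovn-7dzsz -n kube-system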

[root@node10 ~]# kubectl get pod -n kube-system -o wide|egrep '10\.0\.12[8-9]\.[0-9]{1,3}' | grep 10.0.128.83
kube-ovn-cni-snv5m                     1/1     Running             0          12d     10.0.128.83      10.0.128.83         <none>           <none>
kube-ovn-pinger-f27q5                  1/1     Running             0          17h     192.172.0.82     10.0.128.83         <none>           <none>
kube-proxy-r9zqj                       1/1     Running             0          33s     10.0.128.83      10.0.128.83         <none>           <none>
ovs-ovn-99gbv                          1/1     Running             2          111s    10.0.128.83      10.0.128.83         <none>           <none>
[root@node10 ~]#

Since some ovs-ovn Pods are healthy while others crash, and every Node runs the same image version, a bug in ovs-ovn itself is unlikely; kube-proxy is the more probable culprit and should be investigated first.
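
If many Nodes are affected, the same workaround can be scripted. A rough sketch, assuming the kubectl get pod -o wide column layout shown above (NODE is column 7):

# For every Node whose ovs-ovn Pod is not Ready, delete the local
# kube-proxy and ovs-ovn Pods so they are recreated.
kubectl get pod -n kube-system -o wide --no-headers \
  | awk '$1 ~ /^ovs-ovn-/ && $2 != "1/1" {print $7}' \
  | while read -r node; do
      kubectl get pod -n kube-system -o wide --no-headers \
        | awk -v n="$node" '$7 == n && $1 ~ /^(kube-proxy|ovs-ovn)-/ {print $1}' \
        | xargs -r kubectl delete pod -n kube-system
    done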
