
cilium v1.14.3, kubernetes v1.28.2 Endpoint unreachability for only one node #28984

Closed

Rammurthy5 opened this issue Nov 4, 2023 · 5 comments
Labels:
  area/clustermesh: Relates to multi-cluster routing functionality in Cilium.
  kind/bug: This is a bug in the Cilium logic.
  kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
  needs/triage: This issue requires triaging to establish severity and next steps.
  sig/agent: Cilium agent related.

Rammurthy5 commented Nov 4, 2023

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I have two EKS clusters set up in two different regions, connected via VPC peering.
Both clusters have 2 nodes each, running cilium v1.14.3 on kubernetes v1.28.2.
CoreDNS pods are running; restarting them did not help.
clustermesh status and cilium status on both clusters look perfectly fine.

On cluster1, all 4 endpoints are reachable. On cluster2, however, only 3 endpoints are reachable: the unreachable endpoint belongs to one of the two nodes in the other region, while the endpoint on that region's other node is reachable.

I had also tried this with cilium v1.14.2 on kubernetes v1.27 and v1.28, and with cilium v1.14.3 on kubernetes v1.27 and v1.28.
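
For reference, the health summaries below were collected from inside the Cilium agent pods. A minimal sketch of the commands (assuming the agent DaemonSet runs in kube-system with the usual k8s-app=cilium label, and kubectl contexts $CONTEXT1/$CONTEXT2):

# pick one agent pod per cluster (pod name is illustrative)
kubectl --context $CONTEXT2 -n kube-system get pods -l k8s-app=cilium

# cluster-wide health summary as reported by that agent
kubectl --context $CONTEXT2 -n kube-system exec <cilium-agent-pod> -- cilium status --verbose

# per-node / per-endpoint probe results (the cilium-health output shown further below)
kubectl --context $CONTEXT2 -n kube-system exec <cilium-agent-pod> -- cilium-health status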

cluster2

Cluster health:                                                3/4 reachable   (2023-11-04T23:13:37Z)
  Name                                                         IP              Node        Endpoints
  clus2/ip-10-0-1-114.eu-west-2.compute.internal (localhost)   10.0.1.114      reachable   reachable
  clus2/ip-10-0-2-8.eu-west-2.compute.internal                 10.0.2.8        reachable   reachable
  clustr1/ip-10-1-129-173.eu-west-1.compute.internal           10.1.129.173    reachable   reachable
  clustr1/ip-10-1-159-37.eu-west-1.compute.internal            10.1.159.37     reachable   unreachable

cluster1

Cluster health:                                                    4/4 reachable   (2023-11-04T23:13:55Z)
  Name                                                             IP              Node        Endpoints
  clustr1/ip-10-1-129-173.eu-west-1.compute.internal (localhost)   10.1.129.173    reachable   reachable
  clus2/ip-10-0-1-114.eu-west-2.compute.internal                   10.0.1.114      reachable   reachable
  clus2/ip-10-0-2-8.eu-west-2.compute.internal                     10.0.2.8        reachable   reachable
  clustr1/ip-10-1-159-37.eu-west-1.compute.internal                10.1.159.37     reachable   reachable

cilium-health status on cluster2 cilium agent:

# cilium-health status
Probe time:   2023-11-04T23:25:37Z
Nodes:
  clus2/ip-10-0-1-114.eu-west-2.compute.internal (localhost):
    Host connectivity to 10.0.1.114:
      ICMP to stack:   OK, RTT=272.479µs
      HTTP to agent:   OK, RTT=134.233µs
    Endpoint connectivity to 10.0.1.181:
      ICMP to stack:   OK, RTT=301.788µs
      HTTP to agent:   OK, RTT=277.739µs
  clus2/ip-10-0-2-8.eu-west-2.compute.internal:
    Host connectivity to 10.0.2.8:
      ICMP to stack:   OK, RTT=1.044299ms
      HTTP to agent:   OK, RTT=1.026904ms
    Endpoint connectivity to 10.0.2.117:
      ICMP to stack:   OK, RTT=1.053062ms
      HTTP to agent:   OK, RTT=992.025µs
  clustr1/ip-10-1-129-173.eu-west-1.compute.internal:
    Host connectivity to 10.1.129.173:
      ICMP to stack:   OK, RTT=13.268929ms
      HTTP to agent:   OK, RTT=12.361398ms
    Endpoint connectivity to 10.1.142.89:
      ICMP to stack:   OK, RTT=10.573817ms
      HTTP to agent:   OK, RTT=11.697483ms
  clustr1/ip-10-1-159-37.eu-west-1.compute.internal:
    Host connectivity to 10.1.159.37:
      ICMP to stack:   OK, RTT=10.896389ms
      HTTP to agent:   OK, RTT=11.620258ms
    Endpoint connectivity to 10.1.153.86:
      ICMP to stack:   Connection timed out
      HTTP to agent:   Get "http://10.1.153.86:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
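
A quick way to cross-check the failing probe by hand (a sketch; it assumes curl is available in the agent image and reuses the 4240/hello health-listener address from the error above):

# probe the unreachable health endpoint directly from the cluster2 agent
kubectl --context $CONTEXT2 -n kube-system exec <cilium-agent-pod> -- \
  curl -sv --max-time 5 http://10.1.153.86:4240/hello

# compare against the reachable endpoint on the other remote node
kubectl --context $CONTEXT2 -n kube-system exec <cilium-agent-pod> -- \
  curl -sv --max-time 5 http://10.1.142.89:4240/hello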

Cilium Version

1.14.3

Kernel Version

aarch64 aarch64 aarch64 GNU/Linux

Kubernetes Version

1.28.2

Sysdump

cluster2 sysdump
cilium-sysdump-20231104-232847.zip

cluster1 sysdump
cilium-sysdump-20231104-233044.zip

Relevant log output

No response

Anything else?

Helm Command used to install cilium

helm install cilium cilium/cilium --kube-context $CONTEXT1 \
                                  --namespace kube-system \
                                  --set eni.enabled=true \
                                  --set ipam.mode=eni \
                                  --set egressMasqueradeInterfaces=eth0 \
                                  --set kubeProxyReplacement=strict \
                                  --set tunnel=disabled \
                                  --set k8sServiceHost=${API_SERVER_IP} \
                                  --set k8sServicePort=${API_SERVER_PORT} \
                                  --set cluster.name=clustr1 \
                                  --set cluster.id=1 \
                                  --set encryption.enabled=true \
                                  --set encryption.type=wireguard \
                                  --set l7Proxy=false
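
The clustermesh enable/connect steps are not shown here; with the cilium CLI they would typically look something like the following sketch (context names and the service type are assumptions):

cilium clustermesh enable --context $CONTEXT1 --service-type LoadBalancer
cilium clustermesh enable --context $CONTEXT2 --service-type LoadBalancer
cilium clustermesh connect --context $CONTEXT1 --destination-context $CONTEXT2
cilium clustermesh status --context $CONTEXT1 --wait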


Code of Conduct

  • I agree to follow this project's Code of Conduct
dalekurt commented Nov 6, 2023

I'm experiencing something similar.

Kubernetes: v1.28.2+k3s1
cilium: v1.14.3

Cilium pod log output

level=info msg="attempting to acquire leader lease kube-system/cilium-l2announce-cilium-test-echo-same-node..." subsys=klog
level=info msg="successfully acquired lease kube-system/cilium-l2announce-cilium-test-echo-same-node" subsys=klog
level=info msg="attempting to acquire leader lease kube-system/cilium-l2announce-cilium-test-echo-other-node..." subsys=klog
level=info msg="successfully acquired lease kube-system/cilium-l2announce-cilium-test-echo-other-node" subsys=klog
level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
level=info msg="regenerating all endpoints" reason= subsys=endpoint-manager
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.42.0.0/24 Src: <nil> Gw: 172.16.30.20 Flags: [] Table: 0 Realm: 0}" error="route to destination 172.16.30.20 contains gateway 172.16.30.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.42.1.0/24 Src: <nil> Gw: 172.16.30.21 Flags: [] Table: 0 Realm: 0}" error="route to destination 172.16.30.21 contains gateway 172.16.30.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.42.3.0/24 Src: <nil> Gw: 172.16.30.23 Flags: [] Table: 0 Realm: 0}" error="route to destination 172.16.30.23 contains gateway 172.16.30.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.42.0.0/24 Src: <nil> Gw: 172.16.30.20 Flags: [] Table: 0 Realm: 0}" error="route to destination 172.16.30.20 contains gateway 172.16.30.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.42.1.0/24 Src: <nil> Gw: 172.16.30.21 Flags: [] Table: 0 Realm: 0}" error="route to destination 172.16.30.21 contains gateway 172.16.30.1, must be directly reachable" subsys=linux-datapath
level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.42.3.0/24 Src: <nil> Gw: 172.16.30.23 Flags: [] Table: 0 Realm: 0}" error="route to destination 172.16.30.23 contains gateway 172.16.30.1, must be directly reachable" subsys=linux-datapath

Connectivity test

cilium connectivity test
ℹ️  Monitor aggregation detected, will skip some flow validation steps
✨ [default] Creating namespace cilium-test for connectivity check...
✨ [default] Deploying echo-same-node service...
✨ [default] Deploying DNS test server configmap...
✨ [default] Deploying same-node deployment...
✨ [default] Deploying client deployment...
✨ [default] Deploying client2 deployment...
✨ [default] Deploying echo-other-node service...
✨ [default] Deploying other-node deployment...
✨ [host-netns] Deploying default daemonset...
✨ [host-netns-non-cilium] Deploying default daemonset...
ℹ️  Skipping tests that require a node Without Cilium
⌛ [default] Waiting for deployment cilium-test/client to become ready...
⌛ [default] Waiting for deployment cilium-test/client2 to become ready...
⌛ [default] Waiting for deployment cilium-test/echo-same-node to become ready...
⌛ [default] Waiting for deployment cilium-test/echo-other-node to become ready...
⌛ [default] Waiting for CiliumEndpoint for pod cilium-test/client-78f9dffc84-ntwrv to appear...
⌛ [default] Waiting for CiliumEndpoint for pod cilium-test/client2-59b578d4bb-74zlm to appear...
⌛ [default] Waiting for pod cilium-test/client-78f9dffc84-ntwrv to reach DNS server on cilium-test/echo-same-node-7c4889767f-dvlpl pod...
⌛ [default] Waiting for pod cilium-test/client2-59b578d4bb-74zlm to reach DNS server on cilium-test/echo-same-node-7c4889767f-dvlpl pod...
⌛ [default] Waiting for pod cilium-test/client-78f9dffc84-ntwrv to reach DNS server on cilium-test/echo-other-node-79bdc8d8df-jtqh5 pod...
connectivity test failed: timeout reached waiting for lookup for localhost from pod cilium-test/client-78f9dffc84-ntwrv to server on pod cilium-test/echo-other-node-79bdc8d8df-jtqh5 to succeed (last error: context deadline exceeded)

Rammurthy5 (Author) commented

@dalekurt, what's the output of cilium-health status from the cilium agent pod?

dalekurt commented

@Rammurthy5 This is the output from a Cilium agent pod:

# cilium-health status
Probe time:   2023-11-13T17:42:12Z
Nodes:
  home-cluster/worker-1 (localhost):
    Host connectivity to 172.16.30.21:
      ICMP to stack:   OK, RTT=299.734µs
      HTTP to agent:   OK, RTT=341.395µs
    Endpoint connectivity to 10.42.1.243:
      ICMP to stack:   OK, RTT=269.358µs
      HTTP to agent:   OK, RTT=316.849µs
  home-cluster/control-plane:
    Host connectivity to 172.16.30.20:
      ICMP to stack:   OK, RTT=598.958µs
      HTTP to agent:   OK, RTT=1.880238ms
    Endpoint connectivity to 10.42.0.41:
      ICMP to stack:   Connection timed out
      HTTP to agent:   Get "http://10.42.0.41:4240/hello": dial tcp 10.42.0.41:4240: connect: network is unreachable
  home-cluster/worker-2:
    Host connectivity to 172.16.30.22:
      ICMP to stack:   OK, RTT=716.995µs
      HTTP to agent:   OK, RTT=1.968185ms
    Endpoint connectivity to 10.42.2.170:
      ICMP to stack:   Connection timed out
      HTTP to agent:   Get "http://10.42.2.170:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  home-cluster/worker-3:
    Host connectivity to 172.16.30.23:
      ICMP to stack:   OK, RTT=721.465µs
      HTTP to agent:   OK, RTT=2.327739ms
    Endpoint connectivity to 10.42.3.228:
      ICMP to stack:   Connection timed out
      HTTP to agent:   Get "http://10.42.3.228:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

bradmorgtile commented
helm install cilium cilium/cilium --kube-context $CONTEXT1 \
                                  --namespace kube-system \
                                  --set eni.enabled=true \
                                  --set ipam.mode=eni \
                                  --set egressMasqueradeInterfaces=eth0 \
                                  --set kubeProxyReplacement=strict \
                                  --set tunnel=disabled \
                                  --set k8sServiceHost=${API_SERVER_IP} \
                                  --set k8sServicePort=${API_SERVER_PORT} \
                                  --set cluster.name=clustr1 \
                                  --set cluster.id=1 \
                                  --set encryption.enabled=true \
                                  --set encryption.type=wireguard \
                                  --set l7Proxy=false

Not sure if this is 100% of your issue, but kubeProxyReplacement=strict should be kubeProxyReplacement=true in 1.14.3. I noticed my cilium pods kept crashing until I fixed this.

# Valid options are "true", "false", "disabled" (deprecated), "partial" (deprecated), "strict" (deprecated).
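
A minimal sketch of applying and verifying the corrected value (flag name per the 1.14 Helm chart; other values left untouched via --reuse-values):

helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set kubeProxyReplacement=true

# verify the agent picked it up
kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement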

giorio94 (Member) commented Dec 7, 2023

Hey @Rammurthy5, it seems that you hit #25804. TL;DR: you need to clean up the stale AWS VPC CNI iptables chains on your nodes, i.e. flush the iptables rules added by the VPC CNI:

iptables -t nat -F AWS-SNAT-CHAIN-0 \
  && iptables -t nat -F AWS-SNAT-CHAIN-1 \
  && iptables -t nat -F AWS-CONNMARK-CHAIN-0 \
  && iptables -t nat -F AWS-CONNMARK-CHAIN-1
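
(A quick check for whether those leftover chains are actually present on a node before flushing, assuming shell access to the node:)

# list any leftover VPC CNI NAT chains/rules; prints nothing if they are already gone
sudo iptables -t nat -S | grep -E 'AWS-(SNAT|CONNMARK)-CHAIN'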

@dalekurt According to the logs, you configured Cilium in direct routing mode with autoDirectNodeRoutes, but some of the nodes are not directly reachable:

level=warning msg="Unable to install direct node route {Ifindex: 0 Dst: 10.42.0.0/24 Src: <nil> Gw: 172.16.30.20 Flags: [] Table: 0 Realm: 0}" error="route to destination 172.16.30.20 contains gateway 172.16.30.1, must be directly reachable" subsys=linux-datapath

That cannot work: autoDirectNodeRoutes requires the node IPs to be directly reachable on a shared L2 segment, whereas here they are only reachable through the gateway 172.16.30.1. Your error is also unrelated to clustermesh (and to the original report), since all nodes in your health status output belong to the same cluster.
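
(A sketch of the usual alternative when nodes are only reachable through a gateway: disable autoDirectNodeRoutes and fall back to tunnelling. Value names follow the 1.14 Helm chart; whether tunnelling is acceptable in that environment is an assumption:)

helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set autoDirectNodeRoutes=false \
  --set routingMode=tunnel \
  --set tunnelProtocol=vxlan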

I'm closing this issue as #25804 has been fixed (#29448 is also following up on another related issue).

giorio94 closed this as not planned on Dec 7, 2023.