
Pod CrashLoopBackOff when deploying on OCP 4.10 on top of IBM ROKS #34

Closed
andresmareca-ibm opened this issue Jun 20, 2022 · 16 comments


@andresmareca-ibm

I'm deploying this agent in an OpenShift cluster managed by IBM Cloud. During the deploy phase I have an issue with some of the pods.

First, I deployed the network-observability-operator following the instructions in the README file.

After that I tried to deploy what is in netobserv-ebpf-agent, and here it failed for the first time due to missing serviceAccount permissions. After adding the RBAC below for the serviceAccount, I was able to start the actual pods:

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: netobserv-clusterrole
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "watch", "list"]
  - apiGroups:
      - security.openshift.io
    resourceNames:
      - hostmount-anyuid
      - privileged
    resources:
      - securitycontextconstraints
    verbs:
      - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: netobserv-rolebinding
subjects:
  - kind: ServiceAccount
    name: netobserv-account
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: netobserv-clusterrole
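
For reference, an equivalent shortcut on OpenShift for the SCC part (assuming the netobserv-account service account lives in the current project) would be:

# grant SCC usage to the service account directly, instead of hand-writing the RBAC
oc adm policy add-scc-to-user privileged -z netobserv-account
oc adm policy add-scc-to-user hostmount-anyuid -z netobserv-account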

This is where it fails a second time. The output of the pods is:

Starting flowlogs-pipeline:
=====
Build Version: -c418a01
Build Date: 2022-06-20 07:55
Using configuration:
{
"PipeLine": "[{\"name\":\"ingest\"},{\"follows\":\"ingest\",\"name\":\"decode\"},{\"follows\":\"decode\",\"name\":\"enrich\"},{\"follows\":\"enrich\",\"name\":\"loki\"}]",
"Parameters": "[{\"ingest\":{\"grpc\":{\"port\":9999},\"type\":\"grpc\"},\"name\":\"ingest\"},{\"decode\":{\"type\":\"protobuf\"},\"name\":\"decode\"},{\"name\":\"enrich\",\"transform\":{\"network\":{\"rules\":[{\"input\":\"SrcAddr\",\"output\":\"SrcK8S\",\"type\":\"add_kubernetes\"},{\"input\":\"DstAddr\",\"output\":\"DstK8S\",\"type\":\"add_kubernetes\"}]},\"type\":\"network\"}},{\"name\":\"loki\",\"write\":{\"loki\":{\"labels\":[\"SrcK8S_Namespace\",\"SrcK8S_OwnerName\",\"DstK8S_Namespace\",\"DstK8S_OwnerName\",\"FlowDirection\"],\"staticLabels\":{\"app\":\"netobserv-flowcollector\"},\"timestampLabel\":\"TimeFlowEndMs\",\"timestampScale\":\"1ms\",\"type\":\"loki\",\"url\":\"http://loki:3100/\"},\"type\":\"loki\"}}]",
"Health": {
"Port": "8080"
}
}
time=2022-06-20T11:14:34Z level=debug msg=config.Opt.PipeLine = [{"name":"ingest"},{"follows":"ingest","name":"decode"},{"follows":"decode","name":"enrich"},{"follows":"enrich","name":"loki"}]
time=2022-06-20T11:14:34Z level=debug msg=stages = [{ingest } {decode ingest} {enrich decode} {loki enrich}]
time=2022-06-20T11:14:34Z level=debug msg=params = [{ingest 0xc0001ebe00 <nil> <nil> <nil> <nil>} {decode <nil> <nil> <nil> <nil> <nil>} {enrich <nil> 0xc0001ebe30 <nil> <nil> <nil>} {loki <nil> <nil> <nil> <nil> 0xc00004f800}]
time=2022-06-20T11:14:34Z level=debug msg=entering SetupElegantExit
time=2022-06-20T11:14:34Z level=debug msg=registered exit signal channel
time=2022-06-20T11:14:34Z level=debug msg=exiting SetupElegantExit
time=2022-06-20T11:14:34Z level=debug msg=entering NewPipeline
time=2022-06-20T11:14:34Z level=debug msg=stages = [{ingest } {decode ingest} {enrich decode} {loki enrich}]
time=2022-06-20T11:14:34Z level=debug msg=configParams = [{ingest 0xc0001ebe00 <nil> <nil> <nil> <nil>} {decode <nil> <nil> <nil> <nil> <nil>} {enrich <nil> 0xc0001ebe30 <nil> <nil> <nil>} {loki <nil> <nil> <nil> <nil> 0xc00004f800}]
time=2022-06-20T11:14:34Z level=debug msg=stage = ingest
time=2022-06-20T11:14:34Z level=debug msg=findStageType: stage = ingest
time=2022-06-20T11:14:34Z level=debug msg=pipeline = [0xc0002a42a0]
time=2022-06-20T11:14:34Z level=debug msg=stage = decode
time=2022-06-20T11:14:34Z level=debug msg=findStageType: stage = decode
time=2022-06-20T11:14:34Z level=fatal msg=failed to initialize pipeline invalid stage type: unknown

Any ideas on how to fix it?

@eranra
Collaborator

eranra commented Jun 20, 2022

@jotak can you take a look? Is this connected to the latest PR that removed the decode stage?

@eranra
Collaborator

eranra commented Jun 20, 2022

@andresmareca-ibm thanks for looking into this. I think this might be more connected to the FLP and NOO repos, but having the issue here is also OK.

@jotak
Member

jotak commented Jun 20, 2022

Hi @andresmareca-ibm,
What are the image versions of the operator and of flowlogs-pipeline? Are they both main?
This is likely due to a version mismatch; as @eranra mentioned, there was a recent breaking change with a corresponding update on the operator side. I guess you don't have the operator patch? If you built and deployed the operator from source, maybe you weren't up to date? Or maybe you still have an old image in your cluster, in which case you should double-check that the operator's image pull policy is set to Always.
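
For instance, to check the pull policy (the deployment name here is the one mentioned later in this thread; the namespace is a placeholder for wherever you installed the operator):

oc -n <operator-namespace> get deployment netobserv-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[*].imagePullPolicy}'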

Another option is to use released versions rather than main (but of course you won't get the very latest updates).

@jotak
Member

jotak commented Jun 20, 2022

By the way,

> After that I tried to deploy what is in netobserv-ebpf-agent, and here it failed for the first time due to missing serviceAccount permissions. After adding the RBAC for the serviceAccount, I was able to start the actual pods.

@mariomac , any idea about that?

@mariomac

mariomac commented Jun 21, 2022

@andresmareca-ibm could I see your FlowCollector deployment file?

Also, if possible, can I see the netobserv-controller-manager pod logs?

@mariomac

I just deployed the main version and it worked. Are you using any other version?
[screenshot of the deployed pods]

@mariomac

I've been digging into the default permissions that we grant to the eBPF agent. The netobserv-manager-role grants security context constraints for the host network, but there isn't any rolebinding that assigns the permissions for the netobserv agent to use security context constraints. For some reason, in our installations it's granted by default but not in yours.

In order to try to reproduce the issue, and verify that we provide a patch that will actually work, what version of OpenShift are you using?
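
For illustration, the kind of binding that appears to be missing would look roughly like this (the metadata name, service account name, and namespace below are assumptions for the sketch, not the actual operator manifests):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: netobserv-ebpf-agent-scc    # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: netobserv-manager-role      # the role mentioned above
subjects:
  - kind: ServiceAccount
    name: netobserv-ebpf-agent      # assumed agent service account
    namespace: network-observability  # assumed agent namespace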

@andresmareca-ibm
Author

andresmareca-ibm commented Jun 21, 2022

> I just deployed the main version and it worked. Are you using any other version?

I just pulled both repos yesterday at 7:30 UTC and ran the make commands on the main branch.

@andresmareca-ibm
Author

> I've been digging into the default permissions that we grant to the eBPF agent. The netobserv-manager-role grants security context constraints for the host network, but there isn't any rolebinding that assigns the permissions for the netobserv agent to use security context constraints. For some reason, in our installations it's granted by default but not in yours.
>
> In order to try to reproduce the issue, and verify that we provide a patch that will actually work, what version of OpenShift are you using?

I deployed the cluster using the IBM Cloud procedure. The version is 4.10.16_1521.

@andresmareca-ibm
Author

I'm going to create a new cluster and reapply the scripts in the following order:

  1. Clone https://github.com/netobserv/network-observability-operator and run make ocp-deploy
  2. Clone https://github.com/netobserv/netobserv-ebpf-agent and run make ocp-deploy

At the end I should have the same pods as you have in the picture above, right?

@mariomac

@andresmareca-ibm you don't need to clone the netobserv-ebpf-agent repo, as the network-observability-operator will directly refer to the latest deployed image in Quay.

In the NOO repo, you should do:

make deploy ocp-deploy

Then you can deploy the example flowcollector:

oc apply -f config/samples/flows_v1alpha1_flowcollector.yaml

If you want the eBPF agent to be deployed, you should set the agent: ebpf property in the descriptor, e.g.:
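
A minimal sketch of the relevant part of the sample descriptor (v1alpha1, only the field in question shown):

apiVersion: flows.netobserv.io/v1alpha1
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent: ebpf   # switches the agent from the default (ipfix) to the eBPF agent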

@eranra
Collaborator

eranra commented Jun 22, 2022

@mariomac BTW: I use make ocp-run, which does all of those, including deployment of a sample workload, etc. Then I just change the CR to use eBPF. I think this ends up with the same thing.

@eranra
Collaborator

eranra commented Jun 22, 2022

FYI: @ctrath ^^^

@andresmareca-ibm
Author

I'm getting this error during the eBPF container creation:
Error: unknown capability "CAP_BPF" to add

@mariomac
Copy link

@andresmareca-ibm this means that the kernel does not support this capability. Out of curiosity, which Linux distribution and kernel version are you using?

Anyway, you can work around this issue by adding the privileged: true property to the ebpf: section of the FlowCollector: https://github.com/netobserv/network-observability-operator/blob/main/config/samples/flows_v1alpha1_flowcollector.yaml#L21
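
Based on the linked sample, that section would look something like:

spec:
  agent: ebpf
  ebpf:
    privileged: true   # run the agent privileged instead of requesting the CAP_BPF capability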

@andresmareca-ibm
Author

> @andresmareca-ibm this means that the kernel does not support this capability. Out of curiosity, which Linux distribution and kernel version are you using?
>
> Anyway, you can work around this issue by adding the privileged: true property to the ebpf: section of the FlowCollector: https://github.com/netobserv/network-observability-operator/blob/main/config/samples/flows_v1alpha1_flowcollector.yaml#L21

We have already solved the error where the pods did not start: we had to create a service account with the necessary permissions, as shown below.

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: netobserv-ebpf-agent-test
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: netobserv-clusterrole
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "watch", "list"]
  - apiGroups:
      - security.openshift.io
    resourceNames:
      - hostmount-anyuid
      - privileged
      - cgroup
    resources:
      - securitycontextconstraints
    verbs:
      - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: netobserv-rolebinding
subjects:
  - kind: ServiceAccount
    name: netobserv-ebpf-agent-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: netobserv-clusterrole
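
These need to be applied in the namespace where the agent pods run, since the RoleBinding is namespaced, e.g. (file name and namespace are illustrative):

oc apply -n <agent-namespace> -f netobserv-ebpf-rbac.yaml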

I'm closing this issue, but there will be another one because a different error has come up. Thank you very much!
