
Pod CrashLoopBackOff when deploying on OCP 4.10 on top of IBM ROKS #34

Closed
andresmareca-ibm opened this issue Jun 20, 2022 · 16 comments


@andresmareca-ibm

I'm deploying this agent in an OpenShift cluster managed by IBM Cloud. During the deploy phase I have an issue with some of the pods.

First, I deployed the network-observability-operator following the instructions in the README file.

After that I tried to deploy what is in netobserv-ebpf-agent, and here it failed for the first time due to missing serviceAccount permissions. After adding the RBAC below for the serviceAccount, I was able to start the actual pods:

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: netobserv-clusterrole
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "watch", "list"]
  - apiGroups:
      - security.openshift.io
    resourceNames:
      - hostmount-anyuid
      - privileged
    resources:
      - securitycontextconstraints
    verbs:
      - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: netobserv-rolebinding
subjects:
  - kind: ServiceAccount
    name: netobserv-account
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: netobserv-clusterrole
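
For reference, an equivalent shortcut on OpenShift for the SCC part (assuming the netobserv-account service account lives in the current project) would be:

# grant SCC usage to the service account directly, instead of hand-writing the RBAC
oc adm policy add-scc-to-user privileged -z netobserv-account
oc adm policy add-scc-to-user hostmount-anyuid -z netobserv-account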

This is where it fails a second time. The output of the pods is:

Starting flowlogs-pipeline:
=====
Build Version: -c418a01
Build Date: 2022-06-20 07:55
Using configuration:
{
"PipeLine": "[{\"name\":\"ingest\"},{\"follows\":\"ingest\",\"name\":\"decode\"},{\"follows\":\"decode\",\"name\":\"enrich\"},{\"follows\":\"enrich\",\"name\":\"loki\"}]",
"Parameters": "[{\"ingest\":{\"grpc\":{\"port\":9999},\"type\":\"grpc\"},\"name\":\"ingest\"},{\"decode\":{\"type\":\"protobuf\"},\"name\":\"decode\"},{\"name\":\"enrich\",\"transform\":{\"network\":{\"rules\":[{\"input\":\"SrcAddr\",\"output\":\"SrcK8S\",\"type\":\"add_kubernetes\"},{\"input\":\"DstAddr\",\"output\":\"DstK8S\",\"type\":\"add_kubernetes\"}]},\"type\":\"network\"}},{\"name\":\"loki\",\"write\":{\"loki\":{\"labels\":[\"SrcK8S_Namespace\",\"SrcK8S_OwnerName\",\"DstK8S_Namespace\",\"DstK8S_OwnerName\",\"FlowDirection\"],\"staticLabels\":{\"app\":\"netobserv-flowcollector\"},\"timestampLabel\":\"TimeFlowEndMs\",\"timestampScale\":\"1ms\",\"type\":\"loki\",\"url\":\"http://loki:3100/\"},\"type\":\"loki\"}}]",
"Health": {
"Port": "8080"
}
}
time=2022-06-20T11:14:34Z level=debug msg=config.Opt.PipeLine = [{"name":"ingest"},{"follows":"ingest","name":"decode"},{"follows":"decode","name":"enrich"},{"follows":"enrich","name":"loki"}]
time=2022-06-20T11:14:34Z level=debug msg=stages = [{ingest } {decode ingest} {enrich decode} {loki enrich}]
time=2022-06-20T11:14:34Z level=debug msg=params = [{ingest 0xc0001ebe00 <nil> <nil> <nil> <nil>} {decode <nil> <nil> <nil> <nil> <nil>} {enrich <nil> 0xc0001ebe30 <nil> <nil> <nil>} {loki <nil> <nil> <nil> <nil> 0xc00004f800}]
time=2022-06-20T11:14:34Z level=debug msg=entering SetupElegantExit
time=2022-06-20T11:14:34Z level=debug msg=registered exit signal channel
time=2022-06-20T11:14:34Z level=debug msg=exiting SetupElegantExit
time=2022-06-20T11:14:34Z level=debug msg=entering NewPipeline
time=2022-06-20T11:14:34Z level=debug msg=stages = [{ingest } {decode ingest} {enrich decode} {loki enrich}]
time=2022-06-20T11:14:34Z level=debug msg=configParams = [{ingest 0xc0001ebe00 <nil> <nil> <nil> <nil>} {decode <nil> <nil> <nil> <nil> <nil>} {enrich <nil> 0xc0001ebe30 <nil> <nil> <nil>} {loki <nil> <nil> <nil> <nil> 0xc00004f800}]
time=2022-06-20T11:14:34Z level=debug msg=stage = ingest
time=2022-06-20T11:14:34Z level=debug msg=findStageType: stage = ingest
time=2022-06-20T11:14:34Z level=debug msg=pipeline = [0xc0002a42a0]
time=2022-06-20T11:14:34Z level=debug msg=stage = decode
time=2022-06-20T11:14:34Z level=debug msg=findStageType: stage = decode
time=2022-06-20T11:14:34Z level=fatal msg=failed to initialize pipeline invalid stage type: unknown

Any ideas on how to fix it?

@eranra
Collaborator

eranra commented Jun 20, 2022

@jotak can you take a look? Is this connected to the latest PR that removed the decode stage?

@eranra
Collaborator

eranra commented Jun 20, 2022

@andresmareca-ibm thanks for looking into this. I think this might be more connected to the FLP and NOO repos, but having the issue here is also OK.

@jotak
Member

jotak commented Jun 20, 2022

Hi @andresmareca-ibm,
What are the image versions of the operator and of flowlogs-pipeline? Are they both main?
This is likely due to a version mismatch; as @eranra mentioned, there was a recent breaking change with a corresponding update on the operator side. I guess you don't have the operator patch? If you built and deployed the operator from source, maybe you weren't up to date? Or maybe you still have an old image in your cluster, in which case you should double-check that the operator's image pull policy is set to Always.
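
For instance, to check the pull policy (the deployment name here is the one mentioned later in this thread; the namespace is a placeholder for wherever you installed the operator):

oc -n <operator-namespace> get deployment netobserv-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[*].imagePullPolicy}'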

Another option is to use released versions rather than main (but of course you won't get the very latest updates).

@jotak
Member

jotak commented Jun 20, 2022

By the way,

> After that I tried to deploy what is in netobserv-ebpf-agent, and here it failed for the first time due to missing serviceAccount permissions. After adding the RBAC for the serviceAccount, I was able to start the actual pods.

@mariomac , any idea about that?

@mariomac

mariomac commented Jun 21, 2022

@andresmareca-ibm could I see your FlowCollector deployment file?

Also, if possible, can I see the netobserv-controller-manager pod logs?

@mariomac

I just deployed the main version and it worked. Are you using any other version?
[screenshot of the deployed pods]

@mariomac

I've been digging into the default permissions that we grant to the eBPF agent. The netobserv-manager-role grants security context constraints for the host network, but there isn't any rolebinding that assigns the permissions for the netobserv agent to use security context constraints. For some reason, in our installations it's granted by default but not in yours.

In order to try to reproduce the issue, and verify that we provide a patch that will actually work, what version of OpenShift are you using?
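
For illustration, the kind of binding that appears to be missing would look roughly like this (the metadata name, service account name, and namespace below are assumptions for the sketch, not the actual operator manifests):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: netobserv-ebpf-agent-scc    # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: netobserv-manager-role      # the role mentioned above
subjects:
  - kind: ServiceAccount
    name: netobserv-ebpf-agent      # assumed agent service account
    namespace: network-observability  # assumed agent namespace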

@andresmareca-ibm
Author

andresmareca-ibm commented Jun 21, 2022

> I just deployed the main version and it worked. Are you using any other version?

I just pulled both repos yesterday at 7:30 UTC and ran the make commands on the main branch.

@andresmareca-ibm
Author

> I've been digging into the default permissions that we grant to the eBPF agent. The netobserv-manager-role grants security context constraints for the host network, but there isn't any rolebinding that assigns the permissions for the netobserv agent to use security context constraints. For some reason, in our installations it's granted by default but not in yours.
>
> In order to try to reproduce the issue, and verify that we provide a patch that will actually work, what version of OpenShift are you using?

I deployed the cluster using the IBM Cloud procedure. The version is 4.10.16_1521.

@andresmareca-ibm
Author

I'm going to create a new cluster and reapply the scripts in the following order:

  1. Clone https://github.com/netobserv/network-observability-operator and run make ocp-deploy
  2. Clone https://github.com/netobserv/netobserv-ebpf-agent and run make ocp-deploy

At the end I should have the same pods as you have in the picture above, right?

@mariomac

@andresmareca-ibm you don't need to clone the netobserv-ebpf-agent repo, as the network-observability-operator will directly refer to the latest deployed image in Quay.

In the NOO repo, you should do:

make deploy ocp-deploy

Then you can deploy the example flowcollector:

oc apply -f config/samples/flows_v1alpha1_flowcollector.yaml

If you want the eBPF agent to be deployed, you should set the agent: ebpf property in the descriptor, e.g.:
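
A minimal sketch of the relevant part of the sample descriptor (v1alpha1, only the field in question shown):

apiVersion: flows.netobserv.io/v1alpha1
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent: ebpf   # switches the agent from the default (ipfix) to the eBPF agent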

@eranra
Collaborator

eranra commented Jun 22, 2022

@mariomac BTW: I use make ocp-run, which does all of those, including deployment of a sample workload, etc. Then I just change the CR to use eBPF. I think this ends up with the same thing.

@eranra
Collaborator

eranra commented Jun 22, 2022

FYI: @ctrath ^^^

@andresmareca-ibm
Author

I'm getting this error during the eBPF container creation:
Error: unknown capability "CAP_BPF" to add

@mariomac
Copy link

@andresmareca-ibm this means that the kernel does not support this capability. Out of curiosity, which Linux distribution and kernel version are you using?

Anyway, you can work around this issue by adding the privileged: true property to the ebpf: section of the FlowCollector: https://github.com/netobserv/network-observability-operator/blob/main/config/samples/flows_v1alpha1_flowcollector.yaml#L21
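
Based on the linked sample, that section would look something like:

spec:
  agent: ebpf
  ebpf:
    privileged: true   # run the agent privileged instead of requesting the CAP_BPF capability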

@andresmareca-ibm
Author

> @andresmareca-ibm this means that the kernel does not support this capability. Out of curiosity, which Linux distribution and kernel version are you using?
>
> Anyway, you can work around this issue by adding the privileged: true property to the ebpf: section of the FlowCollector: https://github.com/netobserv/network-observability-operator/blob/main/config/samples/flows_v1alpha1_flowcollector.yaml#L21

We have already solved the error where the pods did not start: we had to create a service account with the necessary permissions, as shown below.

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: netobserv-ebpf-agent-test
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: netobserv-clusterrole
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "watch", "list"]
  - apiGroups:
      - security.openshift.io
    resourceNames:
      - hostmount-anyuid
      - privileged
      - cgroup
    resources:
      - securitycontextconstraints
    verbs:
      - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: netobserv-rolebinding
subjects:
  - kind: ServiceAccount
    name: netobserv-ebpf-agent-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: netobserv-clusterrole
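
These need to be applied in the namespace where the agent pods run, since the RoleBinding is namespaced, e.g. (file name and namespace are illustrative):

oc apply -n <agent-namespace> -f netobserv-ebpf-rbac.yaml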

I'm closing this issue, but there will be another one because a different error has come up. Thank you very much!
