Node CPU/Memory metrics differ from kubectl describe #27262

Closed
jefchien opened this issue Sep 28, 2023 · 3 comments
@jefchien
Contributor

Component(s)

receiver/awscontainerinsight

What happened?

Description

In certain scenarios where terminated pods are not cleaned up, the node_cpu_request and node_cpu_reserved_capacity metrics do not match the output of kubectl describe node <node_name>. The root cause is a difference in how node_cpu_request is calculated (node_cpu_reserved_capacity is derived from node_cpu_request). In the receiver, the CPU request is aggregated across all pods on the node, regardless of pod phase:

// Summed for every pod on the node, including pods that have already
// terminated (Succeeded/Failed).
tmpCPUReq, _ := getResourceSettingForPod(&pod, p.nodeInfo.getCPUCapacity(), cpuKey, getRequestForContainer)
cpuRequest += tmpCPUReq
tmpMemReq, _ := getResourceSettingForPod(&pod, p.nodeInfo.getMemCapacity(), memoryKey, getRequestForContainer)
memRequest += tmpMemReq

kubectl describe node, by contrast, filters out terminated pods:

fieldSelector, err := fields.ParseSelector("spec.nodeName=" + name + ",status.phase!=" + string(corev1.PodSucceeded) + ",status.phase!=" + string(corev1.PodFailed))

https://github.com/kubernetes/kubectl/blob/302f330c8712e717ee45bbeff27e1d3008da9f00/pkg/describe/describe.go#L3624

The same behavior applies to the memory request metric.
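
For comparison, the filtered pod set that kubectl describe node works from can be reproduced with client-go using the same field selector. This is a minimal sketch, not receiver code; the in-cluster config and the node name my-node are assumptions for illustration:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs inside the cluster; use clientcmd for out-of-cluster access.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Same selector kubectl describe node builds: pods scheduled on the node
	// whose phase is neither Succeeded nor Failed.
	selector := fields.ParseSelectorOrDie(
		"spec.nodeName=my-node" +
			",status.phase!=" + string(corev1.PodSucceeded) +
			",status.phase!=" + string(corev1.PodFailed))

	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: selector.String(),
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("non-terminated pods on the node: %d\n", len(pods.Items))
}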

Steps to Reproduce

Create a k8s cluster and run kubectl apply -f cpu-test.yaml where cpu-test.yaml is

apiVersion: v1
kind: Pod
metadata:
  name: cpu-test
  namespace: default
spec:
  containers:
  - name: cpu-test
    image: progrium/stress
    resources:
      requests:
        cpu: "0.5"
      limits:
        cpu: "1"
    args: ["--cpu", "2", "--timeout", "60"]
  restartPolicy: Never

This creates a pod that requests 500m of CPU, runs the stress workload for 60 seconds, and then terminates.

With kubectl describe node <node-name>:

While the cpu-test is Running

Non-terminated Pods:          (7 in total)
  Namespace                   Name                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                             ------------  ----------  ---------------  -------------  ---
  amazon-cloudwatch           cloudwatch-agent-kqdcl           200m (10%)    200m (10%)  200Mi (6%)       200Mi (6%)     3h38m
  default                     busybox                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         135m
  default                     cpu-test                         500m (25%)    1 (51%)     0 (0%)           0 (0%)         2m4s
  default                     memory-test                      0 (0%)        0 (0%)      100Mi (3%)       200Mi (6%)     11m
  kube-system                 aws-node-m58l6                   25m (1%)      0 (0%)      0 (0%)           0 (0%)         4h34m
  kube-system                 kube-proxy-c7p5v                 100m (5%)     0 (0%)      0 (0%)           0 (0%)         4h34m
  kube-system                 metrics-server-5b4fc487-25ckg    100m (5%)     0 (0%)      200Mi (6%)       0 (0%)         157m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         925m (47%)   1200m (62%)
  memory                      500Mi (15%)  400Mi (12%)

After the cpu-test has Completed

Non-terminated Pods:          (6 in total)
  Namespace                   Name                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                             ------------  ----------  ---------------  -------------  ---
  amazon-cloudwatch           cloudwatch-agent-kqdcl           200m (10%)    200m (10%)  200Mi (6%)       200Mi (6%)     3h39m
  default                     busybox                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         136m
  default                     memory-test                      0 (0%)        0 (0%)      100Mi (3%)       200Mi (6%)     12m
  kube-system                 aws-node-m58l6                   25m (1%)      0 (0%)      0 (0%)           0 (0%)         4h35m
  kube-system                 kube-proxy-c7p5v                 100m (5%)     0 (0%)      0 (0%)           0 (0%)         4h35m
  kube-system                 metrics-server-5b4fc487-25ckg    100m (5%)     0 (0%)      200Mi (6%)       0 (0%)         157m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         425m (22%)   200m (10%)
  memory                      500Mi (15%)  400Mi (12%)

Expected Result

The expectation is that the metric would match kubectl and drop back down to 425/22% after the pod has completed.

Actual Result

Instead, the metric remains at 925m (46.25% reserved capacity):

    "node_cpu_request": 925,
    "node_cpu_reserved_capacity": 46.25,


Collector version

v0.77.0

Environment information

Environment

OS: AL2
Compiler (if manually compiled): go 1.20

Basic EKS cluster created with eksctl.

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@jefchien jefchien added bug Something isn't working needs triage New item requiring triage labels Sep 28, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jefchien
Contributor Author

Please assign this to me.

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 28, 2023
TylerHelmuth pushed a commit that referenced this issue Dec 8, 2023
…st metrics. (#27299)

**Description:** The `node_<cpu|memory>_request` metrics and metrics
derived from them (`node_<cpu|memory>_reserved_capacity`) differ from
the output of `kubectl describe node <node_name>`. This is because
kubectl [filters out terminated
pods](https://github.com/kubernetes/kubectl/blob/302f330c8712e717ee45bbeff27e1d3008da9f00/pkg/describe/describe.go#L3624).
See linked issue for more details.

Adds a filter for terminated (succeeded/failed state) pods. 
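
A minimal sketch of what such a filter could look like in the aggregation loop (the loop variable and any identifiers beyond those quoted in the issue are illustrative, not the receiver's actual code):

for _, pod := range podList { // podList is an illustrative name for the receiver's pod collection
	// Skip terminated pods so the aggregate matches kubectl describe node,
	// which excludes Succeeded and Failed pods via its field selector.
	if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
		continue
	}
	tmpCPUReq, _ := getResourceSettingForPod(&pod, p.nodeInfo.getCPUCapacity(), cpuKey, getRequestForContainer)
	cpuRequest += tmpCPUReq
	tmpMemReq, _ := getResourceSettingForPod(&pod, p.nodeInfo.getMemCapacity(), memoryKey, getRequestForContainer)
	memRequest += tmpMemReq
}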

**Link to tracking Issue:**
#27262

**Testing:** Added unit test to validate pod state filtering. Built and
deployed changes to cluster. Deployed `cpu-test` pod.


![image](https://github.com/amazon-contributing/opentelemetry-collector-contrib/assets/84729962/b557be2d-e14e-428a-895a-761f7724d9bd)


The gap is when the change was deployed. The metric drops after the
deployment due to the filter. The metric can be seen spiking up while
the `cpu-test` pod is running (~19:15) and then returns to the previous
request size after it has terminated.

**Documentation:** N/A