Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support collecting FsUsageMetrics for containerd #2872

Closed
wants to merge 1 commit into from

Conversation

yyrdl
Copy link

@yyrdl yyrdl commented May 19, 2021

What's added?

Collect container's disk usage metrics for containerd .

Related Issue: Disk usage metrics for containerd #2785

Note:

This pr only report the usage and inode information powered by containerd's standard API. Some disk usage information is missing until there is way to get the id of the snapshot or containerd enhances it's API.But for now, we can report what we can get .

We urgently need the disk usage of the container,these information is ok for our scenario in Tencent ,although it's not complete.

For k8s users, cadvisor is integrated in kubelet , you need to apply these changes to your kubelet (modify cadvisor's code in vendor ), and check the New method in kubernetes/pkg/kubelet/cadvisor/cadvisor_linux.go, if there is an conditional statement like below:

if usingLegacyStats  {
    includedMetrics[cadvisormetrics.DiskUsageMetrics] = struct{}{}
}

please add an new condition here,Related discussion

if usingLegacyStats || utilfeature.DefaultFeatureGate.Enabled(kubefeatures.LocalStorageCapacityIsolation) {
    includedMetrics[cadvisormetrics.DiskUsageMetrics] = struct{}{}
}

@google-cla
Copy link

google-cla bot commented May 19, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added the cla: no label May 19, 2021
@k8s-ci-robot
Copy link
Collaborator

Hi @yyrdl. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yyrdl
Copy link
Author

yyrdl commented May 19, 2021

@googlebot I signed it!

@google-cla
Copy link

google-cla bot commented May 19, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@yyrdl
Copy link
Author

yyrdl commented May 19, 2021

@googlebot I fixed it.

@Creatone
Copy link
Collaborator

/ok-to-test

@bobbypage
Copy link
Collaborator

bobbypage commented May 21, 2021

Thanks for PR!

Can you clarify what cAdvisor prometheus metrics this PR aims to add to for containerd? Is it just container_fs_usage_bytes? What other filesystem metrics did you mention are still missing on containerd?

@@ -37,15 +36,18 @@ type FsUsage struct {
InodeUsage uint64
}

type FsUsageProvider interface {
Usage() (*FsUsage, error)
Targets() []string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add godoc on what "Targets" is referring to

Copy link
Author

@yyrdl yyrdl May 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added .

this method is used to get where the fs usage is collected.

}

// Wait to handle errors until after all operartions are run.
// Wait to handle errors until after all operations are run.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment still relevant here? I think it needs to be moved to Usage() in container/common/fsHandler.go

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -125,6 +130,21 @@ func (c *client) Version(ctx context.Context) (string, error) {
return response.Version, nil
}

func (c *client) ContainerFsUsage(ctx context.Context, snapshotter, snapshotkey string) (*common.FsUsage, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does ContainerFsUsage need to be exported?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes , this client is delivered as interface, if this method is not exported, we can't get the fs usage.

func newContainerdContainerHandler(
	client ContainerdClient,//  Here
	name string,
	machineInfoFactory info.MachineInfoFactory,
	fsInfo fs.FsInfo,
	cgroupSubsystems *containerlibcontainer.CgroupSubsystems,
	inHostNamespace bool,
	metadataEnvs []string,
	includedMetrics container.MetricSet,
) (container.ContainerHandler, error)

@@ -125,6 +130,21 @@ func (c *client) Version(ctx context.Context) (string, error) {
return response.Version, nil
}

func (c *client) ContainerFsUsage(ctx context.Context, snapshotter, snapshotkey string) (*common.FsUsage, error) {
usage, err := c.snapshotsService.Usage(ctx, &snaptshotapi.UsageRequest{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how expensive is this?

will containerd snapshotter have to recaculate the usage every time this is called or does it cache the usage internally?

Copy link
Author

@yyrdl yyrdl Jun 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is a cache in it's CRI implemention . And Usage method will recaculate the usage every time, I have change it to ContainerStats,use CRI instead.

}
return &common.FsUsage{
BaseUsageBytes: uint64(usage.Size_),
TotalUsageBytes: uint64(usage.Size_),
Copy link
Collaborator

@bobbypage bobbypage May 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it correct that these are both set to usage.Size?

BaseUsageBytes is "Number of bytes consumed by a container through its root filesystem." vs TotalusageBytes is "Total Number of bytes consumed by container." ?

Copy link
Author

@yyrdl yyrdl May 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes ,for containerd ,there is no log file(used to store logs from container's stdout or stderr) ( At least ,I didn't find it (T_T) ) . But for docker and crio , there are log files, so when collect fs usage, the usage of log file is counted.

By the way, kubelet storge container's logs in another place (/var/log/pods), and it is counted by kubelet when it caculate the usage. see makeContainerStats in kubelet/stats/cri_stats_provider.go

// NOTE: This doesn't support the old pod log path, `/var/log/pods/UID`. For containers
	// using old log path, empty log stats are returned. This is fine, because we don't
	// officially support in-place upgrade anyway.
	var (
		containerLogPath = kuberuntime.BuildContainerLogsDirectory(meta.GetNamespace(),
			meta.GetName(), types.UID(meta.GetUid()), container.GetMetadata().GetName())
		err error
	)
	result.Logs, err = p.getPathFsStats(containerLogPath, rootFsInfo)
	if err != nil {
		klog.Errorf("Unable to fetch container log stats for path %s: %v ", containerLogPath, err)
	}
	return result

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't cadvisor take the log size of /var/log/pods in account when counting TotalusageBytes?

@bobbypage
Copy link
Collaborator

/cc @crosbymichael

There was earlier slack thread about of FS Usage containerd metrics on cadvisor xref

@yyrdl
Copy link
Author

yyrdl commented May 26, 2021

Thanks for PR!

Can you clarify what cAdvisor prometheus metrics this PR aims to add to for containerd? Is it just container_fs_usage_bytes? What other filesystem metrics did you mention are still missing on containerd?

There are many metrics defined in fs.FsStats,at current ,we only get InodeBaseUsage and Usage.

For prometheus metrics defined in container.DiskUsageMetrics ,we only lost container_fs_inodes_free and container_fs_limit_bytes .

@yyrdl
Copy link
Author

yyrdl commented May 26, 2021

And thanks for your reply :)

@yyrdl
Copy link
Author

yyrdl commented Jun 17, 2021

/test pull-cadvisor-e2e

@bobbypage
Copy link
Collaborator

Can you please rebase? There appears to be some go.mod changes which is causing pull-cadvisor-e2e to fail

"google.golang.org/grpc"
"google.golang.org/grpc/backoff"
runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
Copy link
Collaborator

@bobbypage bobbypage Jun 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we can depend on k8s repos since cadvisor itself is imported into k8s which can result in an import loop.

See #2726 and #2726 (comment) for example where this broke us before.

Do we need to import criapi here? Can we just talk to containerd directly?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok , this package is just pb files.

And there are two way to get the diskusage metrics :

  • call snapshots.Usage()
  • call cri.ContainerStats()

They are both grpc interface implemented by containerd , the difference is that snapshots.Usage didn‘t cache any thing, while containerd cri server will sync the disk usage metrics periodically, see snapshotsSyncer.sync ,in order to avoid collecting disk usage metric repeatedly , we should use the cri method.

Copy link
Collaborator

@bobbypage bobbypage Jun 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the background. Unfortunately, I don't think the fact that this package is just pb files matters; I think it is general issue with dependency management in k8s =>

as @dims mentioned here

"third party software can use staging modules IFF the third party software is not vendored back into k/k :)"

"k8s.io/cri-api/" is staging module and cadvisor itself is vendored back into k/k.

@dims, please keep me honest here, but I don't think anything has changed here.

Copy link
Collaborator

@bobbypage bobbypage Jun 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we be able to switch to snapshots.Usage() instead? Can we maybe cache the results internally / only collect them during some periodic house keeping period?

Copy link
Author

@yyrdl yyrdl Jun 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we just consider this feature in cadvisor independently , we can use snapshots.Usage, but for the sake of performance ,we have to use cri.ContainerStats instead. there is a picture which shows the relationship between cadvisor ,kubelet and containerd in this situation :

diskmetrics

If we use snapshots.Usage in cadvisor, it will collect the metrics from filesystem repeatedly,which will cause unnecessary performance loss.

Copy link
Collaborator

@bobbypage bobbypage Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yyrdl!

Would you be able to draft a PR of the kubelet changes required prior to merging this PR?

I would just like to ensure we know exactly what changes are required to kubelet first, prior to merging this PR. That way it's clear if we need to make any other cAdvisor/kubelet changes to fully integrate these changes since most important consumer of cAdvisor will be kubelet.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It takes time to redesign the StatsProvider in kubelet , I am a little busy , but I will do it as soon as possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yyrdl FYI, I work with @bobbypage and we are solving another cadvisor-containerd issue which needs CRI as well. To work around the circular dependency issue, current plan is integrating cadvisor with CRI in a separate branch. #2923

Copy link
Author

@yyrdl yyrdl Aug 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiutongs Good job! So I should submit this PR to this branch too ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yyrdl I sent you a message on the slack and we can discuss more.

extraDir string
fsInfo fs.FsInfo
lastUpdate time.Time
usage FsUsage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is thisusage the cache?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@vsxen
Copy link

vsxen commented Aug 30, 2021

ping...

@qiutongs
Copy link
Contributor

@yyrdl On top of your PR, the fix was merged in containerd-cri branch. And #2956 will merge the fix to master.

Feel free to close this one.

@yyrdl yyrdl closed this Oct 12, 2021
@yyrdl
Copy link
Author

yyrdl commented Oct 12, 2021

@yyrdl On top of your PR, the fix was merged in containerd-cri branch. And #2956 will merge the fix to master.

Feel free to close this one.

good job

@qingguee
Copy link

qingguee commented Dec 3, 2021

Hi @yyrdl and @qiutongs ,
Has the fix merged in v0.43.0? I deployed v0.43.0, but container_fs_use_bytes still not work on containerd POD.

@vsxen
Copy link

vsxen commented Dec 3, 2021

Do you mean sanbox container(pause:3.2)?

@qingguee
Copy link

qingguee commented Dec 3, 2021

@vsxen No. I was cloned cadvisor code, ,change to tag v0.43.0, then docker build the cadvisor image. I deployed that image as daemonset on our K8s to check if the container_fs_use_bytes metrics works with containerd.
The result I see, it still not works. But I thought this PR will fix this issue on containerd.
We are trying to monitor disk usage for container, but this currently not work for containerd

@vsxen
Copy link

vsxen commented Dec 3, 2021

what is you containerd version?
I use containerd 1.4.8 & cadvisor 0.43.0, it works.

@qingguee
Copy link

qingguee commented Dec 3, 2021

it's containerd://1.4.4
Maybe this is the problem. Thanks a lot for the info.

@yyrdl
Copy link
Author

yyrdl commented Dec 3, 2021

Does your prometheus pull metrics from the casvisor you deployed or still from kubelet ?@qingguee

@qingguee
Copy link

qingguee commented Dec 6, 2021

Hi @yyrdl , I have both. My v0.43.0 cadvisor was deployed as daemonset, and I use endpoints to collect those metrics. I use {job="cadvisor"} to filter out the kubelet advisor data.
Anyway, I will check with my IT team, see if we can update to containerd 1.4.8, then I will double check. This may take time, but when it ready, I will update.

@qingguee
Copy link

qingguee commented Mar 3, 2022

Hi @yyrdl
After upgrade our cluster with containerd://1.4.12, I still can't see the fs usage for each container, I can only see some usage for node device.

cadvisor:v0.43.0

$ kubectl  get node -o wide
NAME          STATUS   ROLES       AGE   VERSION   KERNEL-VERSION      CONTAINER-RUNTIME
node-demo_0   Ready    edge,node   9d    v1.22.4   5.3.18-57-default   containerd://1.4.12
node-demo_1   Ready    edge,node   9d    v1.22.4   5.3.18-57-default   containerd://1.4.12
node-demo_2   Ready    node        9d    v1.22.4   5.3.18-57-default   containerd://1.4.12
node-demo_3   Ready    node        9d    v1.22.4   5.3.18-57-default   containerd://1.4.12

I deployed cadvisor as daemonset, my Prometheus collect metrics from those endpoint/POD.

apiVersion: apps/v1 # for Kubernetes versions before 1.9.0 use apps/v1beta2
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: cadvisor
spec:
  template:
    spec:
      containers:
      - name: cadvisor
        args:
          - --housekeeping_interval=10s                           # kubernetes default args
          - --max_housekeeping_interval=15s
          - --event_storage_event_limit=default=0
          - --event_storage_age_limit=default=0
          - --enable_metrics=disk,diskIO,cpu                                   # only show stats for docker containers
          - --store_container_labels=true
          # - --whitelisted_container_labels=io.kubernetes.container.name, io.kubernetes.pod.name,io.kubernetes.pod.namespace
          # - --enable_metrics=app,cpu,disk,diskIO,memory,network,process
# HELP container_fs_usage_bytes Number of bytes that are consumed by the container on this filesystem.
# TYPE container_fs_usage_bytes gauge
container_fs_usage_bytes{container_label_adpbrlabelkey="",container_label_app="",container_label_app_kubernetes_io_component="",container_label_app_kubernetes_io_instance="",container_label_app_kubernetes_io_managed_by="",container_label_app_kubernetes_io_name="",container_label_app_kubernetes_io_version="",container_label_controller_revision_hash="",container_label_controller_uid="",container_label_demo_cm_mediator_access="",container_label_demo_data_message_bus_kf_client_access="",container_label_demo_sec_key_management_access="",container_label_helm_sh_chart="",container_label_io_cri_containerd_kind="",container_label_io_kubernetes_container_name="",container_label_io_kubernetes_pod_name="",container_label_io_kubernetes_pod_namespace="",container_label_io_kubernetes_pod_uid="",container_label_job_name="",container_label_k8s_app="",container_label_name="",container_label_pod_template_generation="",container_label_pod_template_hash="",container_label_statefulset_kubernetes_io_pod_name="",device="/dev",id="/",image="",name=""} 0 1646271995262
container_fs_usage_bytes{container_label_adpbrlabelkey="",container_label_app="",container_label_app_kubernetes_io_component="",container_label_app_kubernetes_io_instance="",container_label_app_kubernetes_io_managed_by="",container_label_app_kubernetes_io_name="",container_label_app_kubernetes_io_version="",container_label_controller_revision_hash="",container_label_controller_uid="",container_label_demo_cm_mediator_access="",container_label_demo_data_message_bus_kf_client_access="",container_label_demo_sec_key_management_access="",container_label_helm_sh_chart="",container_label_io_cri_containerd_kind="",container_label_io_kubernetes_container_name="",container_label_io_kubernetes_pod_name="",container_label_io_kubernetes_pod_namespace="",container_label_io_kubernetes_pod_uid="",container_label_job_name="",container_label_k8s_app="",container_label_name="",container_label_pod_template_generation="",container_label_pod_template_hash="",container_label_statefulset_kubernetes_io_pod_name="",device="/dev/rbd0",id="/",image="",name=""} 4.2053632e+07 1646271995262
container_fs_usage_bytes{container_label_adpbrlabelkey="",container_label_app="",container_label_app_kubernetes_io_component="",container_label_app_kubernetes_io_instance="",container_label_app_kubernetes_io_managed_by="",container_label_app_kubernetes_io_name="",container_label_app_kubernetes_io_version="",container_label_controller_revision_hash="",container_label_controller_uid="",container_label_demo_cm_mediator_access="",container_label_demo_data_message_bus_kf_client_access="",container_label_demo_sec_key_management_access="",container_label_helm_sh_chart="",container_label_io_cri_containerd_kind="",container_label_io_kubernetes_container_name="",container_label_io_kubernetes_pod_name="",container_label_io_kubernetes_pod_namespace="",container_label_io_kubernetes_pod_uid="",container_label_job_name="",container_label_k8s_app="",container_label_name="",container_label_pod_template_generation="",container_label_pod_template_hash="",container_label_statefulset_kubernetes_io_pod_name="",device="/dev/shm",id="/",image="",name=""} 0 1646271995262

So, is this issue fixed in 0.43.0? Or my configuration has issue?
This #2785 (comment) , mentioned that it's not working. But as there is a lot different issue reporting this topic, it's quite confuse for me. Could you help to provide the latest info?

@yyrdl
Copy link
Author

yyrdl commented Mar 3, 2022

@qingguee cadvisor is integrated in kubelet ,at current ,the fix of this PR is not merged for some package dependency problem , you can copy the code changed in this pr , and build a custom version of kubelet or cadvisor to solve your problem.

@qingguee
Copy link

qingguee commented Mar 3, 2022

@yyrdl Thank you so much for the info.
So, I get more clear for me, we need both kubelet fix and cadvisor fix. But I don't think we could update our kubelet as we rent the cluster.

Could you help to provide the link for package dependency problem, I'd like to follow up. When that solved, we will do upgrade accordingly.

@yyrdl
Copy link
Author

yyrdl commented Mar 3, 2022

@qingguee #2872 (comment)

KeyOfSpectator added a commit to AliyunContainerService/cadvisor that referenced this pull request Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants