add metrics

kubernetes · Jul 19, 2024 · 9a26894 · 9a26894
1 parent f1b13c9
commit 9a26894

Showing 3 changed files with 64 additions and 179 deletions.
diff --git a/keps/prod-readiness/sig-node/4727.yaml b/keps/prod-readiness/sig-node/4727.yaml
@@ -0,0 +1,3 @@
+kep-number: 4727
+alpha:
+  approver: "@wojtek-t"
diff --git a/keps/sig-node/4727-reasonable-image-gc/README.md b/keps/sig-node/4727-reasonable-image-gc/README.md
@@ -1,7 +1,7 @@
 
 # KEP-4727: reasonable --image-gc-high-threshold according to imagefs.available hard evict option
 
-
+<!-- toc -->
 - [Release Signoff Checklist](#release-signoff-checklist)
 - [Summary](#summary)
 - [Motivation](#motivation)
@@ -20,6 +20,9 @@
       - [Integration tests](#integration-tests)
       - [e2e tests](#e2e-tests)
   - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
   - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
   - [Version Skew Strategy](#version-skew-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -66,17 +69,21 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 
 ## Summary
 
-the default value of `--image-gc-high-threshold` option is 85%, and the default value of `imagefs.available` option is 
-less than 15%, this will result in image garbage collection not taking effect until node gets disk pressure 
+Add a feature gate, `ImageGCBeforeStorageEviction`, which indicates that image garbage collection must occur before kubelet eviction.
+When `ImageGCBeforeStorageEviction` is false or not configured, the current behavior is kept.
 
 
 ## Motivation
 
+The default value of the `--image-gc-high-threshold` option is 85%, and the default value of the `imagefs.available` option is
+less than 15%. As a result, image garbage collection does not take effect until the node is under disk pressure.
+
 There is no standard to judge whether the value of `--image-gc-high-threshold` is reasonable in different scenarios
 
 ### Goals
 
-discuss reasonable values of `--image-gc-high-threshold` for different scenarios, and constrain them by some means.
+Discuss reasonable values of `--image-gc-high-threshold` for different scenarios, and constrain them by some means.
+Eventually, protect users from inopportune configurations, and fix the defaults of `--image-gc-high-threshold` and `imagefs.available` so that they make more sense.
 
 
 ### Non-Goals
@@ -91,29 +98,28 @@ discuss reasonable values of `--image-gc-high-threshold` for different scenarios
 
 #### Story 1
 
-In big data computing scenarios, we often run some computing tasks. 
+In big data computing scenarios, users often run computing tasks.
 When these tasks are completed, a large number of images are stored on the node. 
-At this time, we want to perform image garbage collection before the node disk pressure occurs.
+At this point, users want image garbage collection to run before node disk pressure occurs.
+There should be validation to protect that expectation.
 
 #### Story 2
 
-Keep the current usage that turn off image garbage collection by setting `--image-gc-high-threshold` 100%.
+A cluster administrator misconfigures `--image-gc-high-threshold` and `imagefs.available`.
 
 
 ### Notes/Constraints/Caveats (Optional)
 
 
-
 ### Risks and Mitigations
 
 
 ## Design Details
 
-Add `ImageGCHighThresholdAccurate` feature gate to kubelet. 
+Add `ImageGCBeforeStorageEviction` feature gate to kubelet. 
 When the feature is turned on, the value of `--image-gc-high-threshold` must be smaller than the value of `100 - imagefs.available`.
 When the feature is turned off, keep the previous usage.
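
The constraint above can be sketched in Go as follows. This is a minimal illustration, not the kubelet's actual validation code: `validateImageGCThreshold` and its parameters are hypothetical names, and the sketch assumes `imagefs.available` is expressed as a percentage (it can also be configured as an absolute quantity, which is not handled here).

```go
package main

import "fmt"

// validateImageGCThreshold is a hypothetical helper (not a kubelet API).
// With the gate enabled it enforces
//   image-gc-high-threshold < 100 - imagefs.available
// where both values are percentages. With the gate disabled it accepts
// any combination, which keeps the previous behavior.
func validateImageGCThreshold(gateEnabled bool, highThresholdPercent, imagefsAvailablePercent int) error {
	if !gateEnabled {
		return nil
	}
	limit := 100 - imagefsAvailablePercent
	if highThresholdPercent >= limit {
		return fmt.Errorf(
			"image-gc-high-threshold (%d%%) must be smaller than 100 - imagefs.available (%d%%)",
			highThresholdPercent, limit)
	}
	return nil
}

func main() {
	// Defaults named in the Motivation section: 85 is not smaller than 100 - 15,
	// so this combination is rejected when the gate is on.
	fmt.Println(validateImageGCThreshold(true, 85, 15))
	// A lower threshold leaves room for image GC to run before eviction.
	fmt.Println(validateImageGCThreshold(true, 80, 15))
	// Gate off: no constraint is applied.
	fmt.Println(validateImageGCThreshold(false, 100, 15))
}
```

Note that under this rule the current defaults (85% and 15%) do not pass, which is consistent with the motivation that image garbage collection does not take effect before disk pressure.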
 
-If `ImageGCHighThresholdAccurate` is turned on, 
 
 ### Test Plan
 
@@ -159,6 +165,8 @@ to implement this enhancement.
 
 ### Upgrade / Downgrade Strategy
 
+This option is contained entirely within the kubelet, so the only concern is the case where the flag is added to the configuration of a newer
+kubelet and the kubelet is then downgraded.
 
 ### Version Skew Strategy
 
@@ -168,82 +176,64 @@ to implement this enhancement.
 
 ### Feature Enablement and Rollback
 
-<!--
-This section must be completed when targeting alpha to a release.
--->
-
 ###### How can this feature be enabled / disabled in a live cluster?
 
-<!--
-Pick one of these and delete the rest.
-
-Documentation is available on [feature gate lifecycle] and expectations, as
-well as the [existing list] of feature gates.
-
-[feature gate lifecycle]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
-[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
--->
-
 - [x] Feature gate (also fill in values in `kep.yaml`)
-  - Feature gate name: ImageGCHighThresholdAccurate
+  - Feature gate name: ImageGCBeforeStorageEviction
   - Components depending on the feature gate: kubelet
 
 
 ###### Does enabling the feature change any default behavior?
 
-Yes.
+Yes
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
 Yes, we can just disable the feature gate.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
 
-The constrains are respected again.
+The constraints are respected again.
 
 ###### Are there any tests for feature enablement/disablement?
 
-No.
+No
 
 ### Rollout, Upgrade and Rollback Planning
 
-
-
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
 It's an opt-in feature for end-users and will maintain current behaviors if not set, so
 it will not impact the running workloads.
 
 ###### What specific metrics should inform a rollback?
 
-No.
+No
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-No.
+No
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
-No.
+No
 
 ### Monitoring Requirements
 
-No.
+Monitor the following metric:
+- `kubelet_image_gc_before_storage_eviction`, which carries the `image-gc-threshold` and `imagefs-available` labels
+
 
 ###### How can an operator determine if the feature is in use by workloads?
 
-No.
+- Verify the Kubelet Configuration with the Kubelet's configz endpoint
+- Monitor the `kubelet_image_gc_before_storage_eviction` metric, which indicates whether `--image-gc-high-threshold` is smaller than `100 - imagefs.available`
 
-###### How can someone using this feature know that it is working for their instance?
 
+###### How can someone using this feature know that it is working for their instance?
 
-- [ ] Events
-  - Event Reason: 
-- [ ] API .status
-  - Condition name: 
-  - Other field: 
-- [ ] Other (treat as last resort)
-  - Details:
+- [x] Other (treat as last resort)
+  - The `kubelet_image_gc_before_storage_eviction` metric is 1 when `--image-gc-high-threshold` is smaller than `100 - imagefs.available`.
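
As a rough sketch of how the metric value could be derived, the Go snippet below computes a sample for `kubelet_image_gc_before_storage_eviction`. The type `gcOrderingSample` and the function `sampleFor` are hypothetical, the value 0 for the non-matching case is an assumption (the KEP only states when the metric is 1), and both inputs are assumed to be percentages. The label names are copied from this KEP; how they would be registered with the kubelet's metrics machinery is not shown.

```go
package main

import (
	"fmt"
	"strconv"
)

// gcOrderingSample is a hypothetical container for one metric sample.
type gcOrderingSample struct {
	Value  float64
	Labels map[string]string
}

// sampleFor returns 1 when image GC would trigger before storage eviction,
// i.e. when highThresholdPercent < 100 - imagefsAvailablePercent.
// Returning 0 otherwise is an assumption made for this sketch.
func sampleFor(highThresholdPercent, imagefsAvailablePercent int) gcOrderingSample {
	value := 0.0
	if highThresholdPercent < 100-imagefsAvailablePercent {
		value = 1.0
	}
	return gcOrderingSample{
		Value: value,
		Labels: map[string]string{
			// Label names as written in the KEP text.
			"image-gc-threshold": strconv.Itoa(highThresholdPercent),
			"imagefs-available":  strconv.Itoa(imagefsAvailablePercent),
		},
	}
}

func main() {
	s := sampleFor(80, 15)
	fmt.Println(s.Value, s.Labels["image-gc-threshold"], s.Labels["imagefs-available"])
}
```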
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
@@ -252,188 +242,79 @@ No.
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 
-- [ ] Metrics
-  - Metric name:
+- [x] Metrics
+  - Metric name: `kubelet_image_gc_before_storage_eviction`
   - [Optional] Aggregation method:
   - Components exposing the metric:
 - [ ] Other (treat as last resort)
   - Details:
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-
 ### Dependencies
 
-
-
 ###### Does this feature depend on any specific services running in the cluster?
 
-<!--
-Think about both cluster-level services (e.g. metrics-server) as well
-as node-level agents (e.g. specific version of CRI). Focus on external or
-optional services that are needed. For example, if this feature depends on
-a cloud provider API, or upon an external software-defined storage or network
-control plane.
-
-For each of these, fill in the following—thinking about running existing user workloads
-and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
-      - Impact of its outage on the feature:
-      - Impact of its degraded performance or high-error rates on the feature:
--->
+Just Kubelet
 
 ### Scalability
 
-<!--
-For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them.
-
-For beta, this section is required: reviewers must answer these questions.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
--->
-
 ###### Will enabling / using this feature result in any new API calls?
 
-<!--
-Describe them, providing:
-  - API call type (e.g. PATCH pods)
-  - estimated throughput
-  - originating component(s) (e.g. Kubelet, Feature-X-controller)
-Focusing mostly on:
-  - components listing and/or watching resources they didn't before
-  - API calls that may be triggered by changes of some Kubernetes resources
-    (e.g. update of object X triggers new updates of object Y)
-  - periodic API calls to reconcile state (e.g. periodic fetching state,
-    heartbeats, leader election, etc.)
--->
+No
 
 ###### Will enabling / using this feature result in introducing new API types?
 
-<!--
-Describe them, providing:
-  - API type
-  - Supported number of objects per cluster
-  - Supported number of objects per namespace (for namespace-scoped objects)
--->
+No
 
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
 
-<!--
-Describe them, providing:
-  - Which API(s):
-  - Estimated increase:
--->
+No
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
-<!--
-Describe them, providing:
-  - API type(s):
-  - Estimated increase in size: (e.g., new annotation of size 32B)
-  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
--->
+No
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
-<!--
-Look at the [existing SLIs/SLOs].
-
-Think about adding additional work or introducing new steps in between
-(e.g. need to do X to start a container), etc. Please describe the details.
-
-[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
--->
+No
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
-<!--
-Things to keep in mind include: additional in-memory state, additional
-non-trivial computations, excessive access to disks (including increased log
-volume), significant amount of data sent and/or received over network, etc.
-This through this both in small and large cases, again with respect to the
-[supported limits].
-
-[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
--->
+- Potentially. Depending on the chosen value of `--image-gc-high-threshold`, more CPU could be used for image removal.
+  - More frequent image removal is a tradeoff against disk pressure on the node.
 
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
 
-<!--
-Focus not just on happy cases, but primarily on more pathological cases
-(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
-If any of the resources can be exhausted, how this is mitigated with the existing limits
-(e.g. pods per node) or new limits added by this KEP?
-
-Are there any tests that were run/should be run to understand performance characteristics better
-and validate the declared limits?
--->
+- It is intended to prevent the node from coming under disk pressure.
 
 ### Troubleshooting
 
-<!--
-This section must be completed when targeting beta to a release.
-
-For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field.
-
-The Troubleshooting section currently serves the `Playbook` role. We may consider
-splitting it into a dedicated `Playbook` document (potentially with some monitoring
-details). For now, we leave it here.
--->
-
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
+- N/A
+
 ###### What are other known failure modes?
 
-<!--
-For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
--->
+Node gets disk pressure
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
+- N/A
+
 ## Implementation History
 
-<!--
-Major milestones in the lifecycle of a KEP should be tracked in this section.
-Major milestones might include:
-- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
-- the `Proposal` section being merged, signaling agreement on a proposed design
-- the date implementation started
-- the first Kubernetes release where an initial version of the KEP was available
-- the version of Kubernetes where the KEP graduated to general availability
-- when the KEP was retired or superseded
--->
+2024-06-26: KEP opened, targeted at Alpha
 
 ## Drawbacks
 
-<!--
-Why should this KEP _not_ be implemented?
--->
+No
 
 ## Alternatives
 
-<!--
-What other approaches did you consider, and why did you rule them out? These do
-not need to be as detailed as the proposal, but should include enough
-information to express the idea and why it was not acceptable.
--->
+- Add a plugin that tracks unused-image growth trends and dynamically adjusts `--image-gc-high-threshold`
+    - Too complicated and probably not needed; it is difficult to tell whether an unused image will soon be used again
 
 ## Infrastructure Needed (Optional)
 
-<!--
-Use this section if you need things from the project/SIG. Examples include a
-new subproject, repos requested, or GitHub details. Listing these here allows a
-SIG to get the process for these resources started right away.
--->
+N/A