add metrics
olderTaoist authored and mengxiangyong committed Jul 19, 2024
1 parent f1b13c9 commit 9a26894
Showing 3 changed files with 64 additions and 179 deletions.
3 changes: 3 additions & 0 deletions keps/prod-readiness/sig-node/4727.yaml
@@ -0,0 +1,3 @@
kep-number: 4580
alpha:
approver: "@wojtek-t"
227 changes: 54 additions & 173 deletions keps/sig-node/4727-reasonable-image-gc/README.md
@@ -1,7 +1,7 @@

# KEP-4727: reasonable --image-gc-high-threshold according to imagefs.available hard evict option


<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
@@ -20,6 +20,9 @@
- [Integration tests](#integration-tests)
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [GA](#ga)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -66,17 +69,21 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

## Summary

the default value of `--image-gc-high-threshold` option is 85%, and the default value of `imagefs.available` option is
less than 15%, this will result in image garbage collection not taking effect until node gets disk pressure
Add a feature gate, `ImageGCBeforeStorageEviction`, which requires that image GC occurs before the kubelet starts evicting.
When `ImageGCBeforeStorageEviction` is false or not configured, the current behavior is kept.


## Motivation

The default value of the `--image-gc-high-threshold` option is 85%, and the default hard eviction threshold for `imagefs.available` is
less than 15%. As a result, image garbage collection does not take effect until the node is already under disk pressure.

There is no standard for judging whether the value of `--image-gc-high-threshold` is reasonable in different scenarios.

### Goals

discuss reasonable values of `--image-gc-high-threshold` for different scenarios, and constrain them by some means.
Discuss reasonable values of `--image-gc-high-threshold` for different scenarios, and constrain them by some means.
Eventually, protect users from unsuitable configurations and fix the defaults of `--image-gc-high-threshold` and `imagefs.available` so that they make more sense.


### Non-Goals
@@ -91,29 +98,28 @@ discuss reasonable values of `--image-gc-high-threshold` for different scenarios

#### Story 1

In big data computing scenarios, we often run some computing tasks.
In big data computing scenarios, users often run computing tasks.
When these tasks are completed, a large number of images are stored on the node.
At this time, we want to perform image garbage collection before the node disk pressure occurs.
At this point, users want image garbage collection to run before node disk pressure occurs.
There should be validation to protect that expectation.

#### Story 2

Keep the current usage that turn off image garbage collection by setting `--image-gc-high-threshold` 100%.
A cluster administrator misconfigures `--image-gc-high-threshold` and `imagefs.available`.


### Notes/Constraints/Caveats (Optional)



### Risks and Mitigations


## Design Details

Add `ImageGCHighThresholdAccurate` feature gate to kubelet.
Add the `ImageGCBeforeStorageEviction` feature gate to the kubelet.
When the feature is turned on, the value of `--image-gc-high-threshold` must be smaller than the value of `100 - imagefs.available`.
When the feature is turned off, the previous behavior is kept.
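
The constraint above can be sketched as a small validation function. This is only an illustration under stated assumptions, not the kubelet's actual code: the names `parsePercent` and `validateImageGCThreshold` are hypothetical, and the sketch assumes `imagefs.available` is expressed as a percentage (the kubelet also accepts absolute quantities, which are rejected here).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePercent converts a percentage string such as "15%" into its numeric value.
// Hypothetical helper: absolute quantities such as "10Gi" are rejected.
func parsePercent(s string) (float64, error) {
	s = strings.TrimSpace(s)
	if !strings.HasSuffix(s, "%") {
		return 0, fmt.Errorf("%q is not a percentage", s)
	}
	v, err := strconv.ParseFloat(strings.TrimSuffix(s, "%"), 64)
	if err != nil {
		return 0, fmt.Errorf("parsing %q: %w", s, err)
	}
	if v < 0 || v > 100 {
		return 0, fmt.Errorf("%q is outside the range 0%%-100%%", s)
	}
	return v, nil
}

// validateImageGCThreshold returns an error when the gate is enabled and
// highThresholdPercent is not smaller than 100 - imagefs.available.
// When the gate is disabled, no constraint is applied.
func validateImageGCThreshold(gateEnabled bool, highThresholdPercent int, imagefsAvailable string) error {
	if !gateEnabled {
		return nil
	}
	available, err := parsePercent(imagefsAvailable)
	if err != nil {
		return err
	}
	limit := 100 - available
	if float64(highThresholdPercent) >= limit {
		return fmt.Errorf(
			"image-gc-high-threshold (%d) must be smaller than 100 - imagefs.available (%g)",
			highThresholdPercent, limit)
	}
	return nil
}

func main() {
	// The current defaults (85 and 15%) sit exactly on the boundary, so they fail.
	fmt.Println(validateImageGCThreshold(true, 85, "15%"))
	// A lower GC threshold leaves room for GC to run before hard eviction.
	fmt.Println(validateImageGCThreshold(true, 80, "15%"))
	// With the gate off, nothing is checked.
	fmt.Println(validateImageGCThreshold(false, 85, "15%"))
}
```

With the gate on, the current defaults are rejected because 85 is not smaller than 100 - 15, which is the situation the Motivation section describes.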

If `ImageGCHighThresholdAccurate` is turned on,

### Test Plan

@@ -159,6 +165,8 @@ to implement this enhancement.

### Upgrade / Downgrade Strategy

This option is contained entirely within the kubelet, so the only concern is the case where the flag is added to the configuration of a newer
kubelet and the kubelet is then downgraded.

### Version Skew Strategy

@@ -168,82 +176,64 @@ to implement this enhancement.

### Feature Enablement and Rollback

<!--
This section must be completed when targeting alpha to a release.
-->

###### How can this feature be enabled / disabled in a live cluster?

<!--
Pick one of these and delete the rest.
Documentation is available on [feature gate lifecycle] and expectations, as
well as the [existing list] of feature gates.
[feature gate lifecycle]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-->

- [x] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: ImageGCHighThresholdAccurate
- Feature gate name: ImageGCBeforeStorageEviction
- Components depending on the feature gate: kubelet


###### Does enabling the feature change any default behavior?

Yes.
Yes

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, we can just disable the feature gate.

###### What happens if we reenable the feature if it was previously rolled back?

The constrains are respected again.
The constraints are respected again.

###### Are there any tests for feature enablement/disablement?

No.
No

### Rollout, Upgrade and Rollback Planning



###### How can a rollout or rollback fail? Can it impact already running workloads?

It's an opt-in feature for end-users and will maintain current behaviors if not set, so
it will not impact the running workloads.

###### What specific metrics should inform a rollback?

No.
No

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No.
No

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.
No

### Monitoring Requirements

No.
Monitor the following metric:
- `kubelet_image_gc_before_storage_eviction`, which carries the `image-gc-threshold` and `imagefs-available` labels


###### How can an operator determine if the feature is in use by workloads?

No.
- Verify the kubelet configuration through the kubelet's configz endpoint.
- Monitor the `kubelet_image_gc_before_storage_eviction` metric, which indicates whether `--image-gc-high-threshold` is smaller than `100 - imagefs.available`.
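
As a rough illustration of the configz check, the sketch below decodes a configz-style payload and extracts the three settings relevant to this KEP. The payload is a hand-written, trimmed sample rather than real kubelet output, and `extractSettings` and `settings` are hypothetical names. The sketch assumes the payload wraps the kubelet configuration under a `kubeletconfig` key.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// configz mirrors only the fields of the payload that this sketch needs.
type configz struct {
	KubeletConfig struct {
		ImageGCHighThresholdPercent int               `json:"imageGCHighThresholdPercent"`
		EvictionHard                map[string]string `json:"evictionHard"`
		FeatureGates                map[string]bool   `json:"featureGates"`
	} `json:"kubeletconfig"`
}

// sampleConfigz is a hand-written, trimmed payload used for illustration only.
const sampleConfigz = `{
  "kubeletconfig": {
    "imageGCHighThresholdPercent": 80,
    "evictionHard": {"imagefs.available": "15%"},
    "featureGates": {"ImageGCBeforeStorageEviction": true}
  }
}`

// settings holds the values an operator would compare by hand.
type settings struct {
	GateEnabled          bool
	HighThresholdPercent int
	ImagefsAvailable     string
}

// extractSettings decodes the payload and picks out the relevant fields.
// Missing map entries fall back to Go zero values (false and "").
func extractSettings(raw []byte) (settings, error) {
	var c configz
	if err := json.Unmarshal(raw, &c); err != nil {
		return settings{}, fmt.Errorf("decoding configz payload: %w", err)
	}
	return settings{
		GateEnabled:          c.KubeletConfig.FeatureGates["ImageGCBeforeStorageEviction"],
		HighThresholdPercent: c.KubeletConfig.ImageGCHighThresholdPercent,
		ImagefsAvailable:     c.KubeletConfig.EvictionHard["imagefs.available"],
	}, nil
}

func main() {
	s, err := extractSettings([]byte(sampleConfigz))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", s)
}
```

An operator could feed the real endpoint's response into the same function and compare the threshold against `100 - imagefs.available` by hand.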

###### How can someone using this feature know that it is working for their instance?


- [ ] Events
- Event Reason:
- [ ] API .status
- Condition name:
- Other field:
- [ ] Other (treat as last resort)
- Details:
- [x] Other (treat as last resort)
- The `kubelet_image_gc_before_storage_eviction` metric is 1 when `--image-gc-high-threshold` is smaller than `100 - imagefs.available`.
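
The metric semantics described above can be sketched as follows. This is an illustration only: `gaugeValue` and `expositionLine` are hypothetical names, and the label names are written with underscores (`image_gc_threshold`, `imagefs_available`) on the assumption that classic Prometheus naming rules, which do not allow hyphens in label names, apply. The KEP text itself spells the labels with hyphens.

```go
package main

import "fmt"

// gaugeValue returns 1 when image GC would trigger before hard eviction,
// that is, when highThresholdPercent < 100 - imagefsAvailablePercent.
func gaugeValue(highThresholdPercent, imagefsAvailablePercent int) int {
	if highThresholdPercent < 100-imagefsAvailablePercent {
		return 1
	}
	return 0
}

// expositionLine renders one sample in the Prometheus text format.
// The label names are an assumption; see the note above the code block.
func expositionLine(highThresholdPercent, imagefsAvailablePercent int) string {
	return fmt.Sprintf(
		`kubelet_image_gc_before_storage_eviction{image_gc_threshold="%d",imagefs_available="%d%%"} %d`,
		highThresholdPercent, imagefsAvailablePercent,
		gaugeValue(highThresholdPercent, imagefsAvailablePercent))
}

func main() {
	// Current defaults: GC and hard eviction trigger at the same disk usage.
	fmt.Println(expositionLine(85, 15))
	// A lower GC threshold: GC runs first.
	fmt.Println(expositionLine(80, 15))
}
```

A value of 0 for the default pair (85, 15%) signals that the node can reach the hard eviction threshold without image GC having started earlier.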

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

@@ -252,188 +242,79 @@ No.
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?


- [ ] Metrics
- Metric name:
- [x] Metrics
- Metric name: `kubelet_image_gc_before_storage_eviction`
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
- Details:

###### Are there any missing metrics that would be useful to have to improve observability of this feature?


### Dependencies



###### Does this feature depend on any specific services running in the cluster?

<!--
Think about both cluster-level services (e.g. metrics-server) as well
as node-level agents (e.g. specific version of CRI). Focus on external or
optional services that are needed. For example, if this feature depends on
a cloud provider API, or upon an external software-defined storage or network
control plane.
For each of these, fill in the following—thinking about running existing user workloads
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
Just Kubelet

### Scalability

<!--
For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->

###### Will enabling / using this feature result in any new API calls?

<!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
(e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
No

###### Will enabling / using this feature result in introducing new API types?

<!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
No

###### Will enabling / using this feature result in any new calls to the cloud provider?

<!--
Describe them, providing:
- Which API(s):
- Estimated increase:
-->
No

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

<!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
No

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

<!--
Look at the [existing SLIs/SLOs].
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
No

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

<!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.
This through this both in small and large cases, again with respect to the
[supported limits].
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->
- Potentially. Depending on the chosen value of `--image-gc-high-threshold`, more CPU could be used for image removal.
- The frequency of image removal is a tradeoff against disk pressure on the node.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

<!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?
Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->
- It is intended to prevent the node from getting disk pressure.

### Troubleshooting

<!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->

###### How does this feature react if the API server and/or etcd is unavailable?

- N/A

###### What are other known failure modes?

<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
Node gets disk pressure

###### What steps should be taken if SLOs are not being met to determine the problem?

- N/A

## Implementation History

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
- the `Proposal` section being merged, signaling agreement on a proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was available
- the version of Kubernetes where the KEP graduated to general availability
- when the KEP was retired or superseded
-->
2024-06-26: KEP opened, targeted at Alpha

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->
No

## Alternatives

<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
- Add a plugin that tracks the growth trend of unused images and dynamically adjusts `--image-gc-high-threshold`.
- Too complicated and probably not needed; it is difficult to tell whether an unused image will soon be used again.

## Infrastructure Needed (Optional)

<!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
-->
N/A