Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3066: subsecond probes #3067

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

mikebrow
Copy link
Member

Signed-off-by: Mike Brown brownwm@us.ibm.com

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 30, 2021
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Nov 30, 2021
Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some late thoughts on the discussion / early feedback on the draft.

Comment on lines 810 to 850
Allow for 0 second Probe, with setting millisecond.

Cons:
Loosens validation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds similar to tv_sec and tv_nsec from POSIX time structs. If so, we could mention that analogy.

Comment on lines 802 to 843
<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should mention microseconds and nanoseconds too (as these are plausible alternatives to milliseconds).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing microseconds might be useful if we support future healthcheck mechanisms.

Comment on lines 815 to 689
### v2 api for probe.

I think this means a v2 API for Container therefore Pod.
This seems too invasive.

All Seconds fields become resource.Quantity instead of int32. This supports subdivision in a single field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bigger problem is how to support round-tripping to the (still GA) v1 API. Even if we did a v2 API, we'd need a way to represent the subsecond timeout in the v1 API, right?

(I'm personally all for adopting a v2 Pod and v2 PodTemplate API, but lots of clients would expect to still create v1 Deployments, etc).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nod


V2 API for existing objects.
Converting fields from int32 to resource.Quantity.
Subsecond resolution less than one millisecond.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd omit this non-goal; it sounds more like a design choice than a goal we're not pursuing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk

Comment on lines 213 to 216
If the Milliseconds variant of a field is set,
*use it in preference* to the existing Seconds field,
and **completely ignore** the value of the existing field.
This behavior makes it opt-in on setting a non-zero Milliseconds field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is an existing admission control mechanism that enforces constraints on timeoutSeconds, does this (draft) change break any API contract that the admission control mechanism is entitled to assume?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehashman recommended in the sig node call that we forgo the alternate fields idea and move to adding a new flag to indicate seconds should be processed as milliseconds.. etc.

great point on we will need to check for admission control impacts..

keps/sig-node/3066-subsecond-probes/README.md Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/kep.yaml Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/kep.yaml Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
@sftim
Copy link
Contributor

sftim commented Feb 14, 2022

Overall, I think my concerns about these API changes are around changes we might want to make a few years in the future.

If we specify probe timeouts in microseconds, that allows for two things:

  • future health check mechanisms that are really fast (maybe they establish a connection and then the subsequent checks are super low latency)
  • admission restrictions or other mechanisms to require a minimum timeout (eg, any microsecond duration > 1000μs).

If we ever have to support second-granularity, millisecond-granularity and microsecond-granularity probes, I hope the API for that won't look as ugly as I fear it might.

The approach I'd still promote is to have an extra field to specify “early failure”: an integer quantity of milliseconds (or microseconds - we should leave room for that). That early failure offset is something that legacy clients can ignore without a big problem and would fully support round-tripping to a future Pod v2 API that unifies these fields.

Using a negative offset makes the distinction between (1s - 995ms) more obvious to a legacy client that's unaware of the new fields. A positive offset (0s + 5ms, or maybe 0s + 5000μs) could get interpreted quite differently.

If we do that, we should absolutely disallow a negative offset ≥ 1.0s seconds. Otherwise we have a challenge around unambiguous round-tripping (eg from a future Pod v2 API back to Pod v1).

@sftim
Copy link
Contributor

sftim commented Feb 14, 2022

For periodSeconds, do we want to support sub-second periods? If we do then a similar mechanism to what I proposed above (early failure) would work, but it would need a different name. We can bikeshed that name if needed, and we might want admission controls to enforce a minimum check interval (eg 0.1s - just speculating).

New Field: int32 PeriodOffsetMilliseconds

If I want to set to 0.5 seconds, 500 milliseconds,
PeriodSeconds <= 10 & PeriodOffsetmilliseconds <= -9500
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of a pair of values that could be ambiguous for round-tripping, forcing future APIs to store and track both pieces of data.

By contrast, if we had:

....
   periodSeconds: 1
   earlyRecheckOffset:
      - unit: Millisecond 
        value: 500
...

and the v1 Pod API doesn't allow ambiguous values, then a future v2 Pod API can store a single field value behind the scenes
(eg: periodMicroseconds = 500000u64)

Copy link
Contributor

@sftim sftim Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option, still clamped to less than 1s:

...
   periodSeconds: 0
   extraRecheckOffset:
      - unit: Millisecond 
        value: 500
...

keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved

All Seconds fields become resource.Quantity instead of int32. This supports subdivision in a single field.

### OffsetMilliseconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikebrow can remove this section from the alternatives since it's the main proposal now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nod let's focus in on the offset milliseconds option since and fix the naming above..

keps/sig-node/3066-subsecond-probes/README.md Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
keps/sig-node/3066-subsecond-probes/README.md Outdated Show resolved Hide resolved
Allow for 0 second Probe, with setting millisecond. Similar to what is done with `tv_sec` and `tv_nsec` from POSIX time structs.

Cons:
Loosens validation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this proposal would just add a new field for each existing *Seconds field, like PeriodSeconds and PeriodMilliseconds and then the actual time is the sum? This seems like a way more intuitive and understandable API to me. It's also simpler to enter - just one new field instead of a new struct. And follows the existing *Seconds pattern. I am not seeing the downside? Validation can be done on the total duration as well as the individual fields, can't it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clients that don't recognise the new field would be liable to trigger much too early, right? How do we avoid that?

Also, how do we avoid setting a precedent that this riskier approach is the one to take.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's true. The lack of versioning at this granularity is a problem. Of course the "offset" has a similar problem (inaccurate interval from old clients). Here's another (maybe bad) idea:

on write:

  • if *Milliseconds is set, then set *Seconds = ceil(*Milliseconds/1000)
  • otherwise, set *Milliseconds = *Seconds * 1000

Then, on read, both will be set, and old clients will have a slightly longer interval.


Pros:
Cons:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative is to use "Duration", like PeriodDuration: 1.5s or PeriodDuration: 1500ms or PeriodDuration: 1s500ms. But AFAIK we don't do that anywhere else, it would be a new type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing API guidelines say

Duration fields must be represented as integer fields with units being part of the field name

so if for some of those options, if we adopt them I imagine that we also need to change the API guidelines.

```yaml
...
periodSeconds: 1
periodSecondsOffset:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this very unintuitive and verbose as well.

@psschwei
Copy link
Contributor

psschwei commented Mar 9, 2022

Subsecond Probe Options

Thanks @sftim and @johnbelamaric for your reviews! Trying to summarize the options in one place (since it's a bit messy in the KEP at the moment), here's four options that have been discussed for configuring a probe to run on a subsecond interval:

(edit: given that this discussion is still ongoing, probably better to use hackmd rather than having to continue to edit a comment in a PR...)

@psschwei
Copy link
Contributor

psschwei commented Mar 9, 2022

(I've got the above in a hackmd ... if it's easier to edit there vs. doing this in issue comments, let me know and I'll add you both to it)

@sftim
Copy link
Contributor

sftim commented Mar 10, 2022

@psschwei I suggested using early*Offset as an extra field in the current API. I definitely wouldn't call that negative offset periodMilliseconds as that's surely liable to confuse people (early in the field name should be a massive clue).
See #3067 (comment) for more details.

I saw my proposal as distinct because it did fully avoid the inability to round-trip with a combined field.

@psschwei
Copy link
Contributor

I suggested using early*Offset ...

You're right, let me add it to the hackmd now.

Sorry for the oversight; that's why I thought it would be good to put all the options in a single place...

@MadhavJivrajani
Copy link
Contributor

This KEP was read and discussed at this month's session of the KEP reading club, commenting here for discoverability purposes. There is a slack thread on #sig-architecture with questions/comments about this KEP that can be found here, thanks!

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 24, 2022
@psschwei
Copy link
Contributor

/remove-lifecycle stale

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mikebrow
Copy link
Member Author

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Mar 27, 2024
@k8s-ci-robot
Copy link
Contributor

@mikebrow: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

@mikebrow: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-enhancements-verify c1de314 link true /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 30, 2024
@bart0sh
Copy link
Contributor

bart0sh commented Apr 5, 2024

/remove-lifecycle rotten
/cc @kannon92
/easycla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 5, 2024
@bart0sh
Copy link
Contributor

bart0sh commented Apr 5, 2024

@mikebrow Please sign the CLA and fix failing CI tests, thanks.

@bart0sh
Copy link
Contributor

bart0sh commented Apr 5, 2024

/cc

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 8, 2024
@thockin thockin self-assigned this Jun 8, 2024
@thockin
Copy link
Member

thockin commented Jun 8, 2024

Is this KEP still alive?

@Nibelheims
Copy link

Nibelheims commented Jun 25, 2024

Hi.
I see that a check failed because the README's TOC is outdated. Would the following fix it? If I am not mistaken, external contributions to this PR are not allowed. Thanks.

--- a/keps/sig-node/3066-subsecond-probes/README.md
+++ b/keps/sig-node/3066-subsecond-probes/README.md
@@ -48,10 +48,10 @@ Probe timeouts are limited to seconds and that does NOT work well for clients lo
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
-  - [<code>early*Offset</code>](#)
-  - [<code>*Duration</code>](#-1)
-  - [A new struct, <code>ProbeOffset</code>](#a-new-struct-)
-  - [Setting the time units in a different field, <code>ReadSecondsAs</code>](#setting-the-time-units-in-a-different-field-)
+  - [<code>early*Offset</code>](#earlyoffset)
+  - [<code>*Duration</code>](#duration)
+  - [A new struct, <code>ProbeOffset</code>](#a-new-struct-probeoffset)
+  - [Setting the time units in a different field, <code>ReadSecondsAs</code>](#setting-the-time-units-in-a-different-field-readsecondsas)
   - [v2 api for probe.](#v2-api-for-probe)
   - [OffsetMilliseconds](#offsetmilliseconds)
   - [Reconcile seconds field to nearest whole second.](#reconcile-seconds-field-to-nearest-whole-second)

@mikebrow
Copy link
Member Author

mikebrow commented Jul 1, 2024

Is this KEP still alive?

was paused for 1.31 will bring it back to life for 1.32!

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling this may end up in the same place that fast-restart are approaching -- some node-level "budget" for probes, with an HTB to provide some "fairness", such that the probe's period is a request which will be satisfied when possible, but may be delayed when the node is under contention. E.g. if every pod asks for sub-second probes, we could end up doing hundreds of QPS of traffic just on probing.


There are a few corner cases around default (effective period) values:

`periodSeconds: 0` and `periodMilliseconds: 500` would be 10.5s / 10500ms (as 0 => 10 for `periodSeconds` via [defaults.go](https://github.com/kubernetes/kubernetes/blob/3b13e9445a3bf86c94781c898f224e6690399178/pkg/apis/core/v1/defaults.go#L213-L215))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should fail in that case, no? Sorry, you can't use MS unless you also specify S. Any existing pod will still be OK, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nod will update.. should fail, if one wants mills offset must say from any int except 0, because...


`periodSeconds: 2` and `periodMilliseconds: -500` would be 1.5s / 1500ms.

`periodSeconds: 1` and `periodMilliseconds: -500` would be 0.5s / 500ms. (*** To reduce un-necessary resource usage, because periodMilliseconds is used to reduce the period to less than a second the offset is only used until success is reached then no longer becoming 1second after success.***)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we draw out the state-machine for probes and consider how this changes that, specifically in the context of startup, readiness, and liveness actually being different?

In the past we have added fields to probes that don't apply to all modes and it's gross.

@psschwei
Copy link
Contributor

I think a budget makes sense... we had proposed a type of throttling as a mitigation, but should be able to change that to a node-level budget

@thockin
Copy link
Member

thockin commented Jul 12, 2024

I think it makes sense to have both a sane lower-bound (which can depend on probe type) and an overall budget. I think that if we can describe a "budget" semantic that balances requirements, we can use that same pattern in other places. I think requirements include:

  • Protect the node and the kubelet from DoS
  • Bound the resources consumed by probes in a predictable way
  • Provide some notion of fairness across different requestors (definition of "fair" is TBD, but may include priority, aggressiveness, QoS class, size)
  • Service requests as vigorously as possible within the constraints

@lauralorenz we had a very similar discussion wrt restarts.

@lauralorenz
Copy link
Contributor

Hi @psschwei @mikebrow and any other authors -- yes as @thockin mentions, I'm trending towards needing a budget and something that can admit a restart request that is determined in budget for CrashLoopBackOff (#4603) longer term. The restart I'm trying to budget for would be queuing against kubelet directly as the operation is done inline of the kubelet runtime's SyncPod loop. Yours is a little different in that the restart of interest is queueing into to the probe worker's runtime, or perhaps earlier as to whether to allocate a probe worker in the first place (?). Either way it's still taking up resources from the kubelet component so I think we would use somewhat the same kubelet stats to hand out budget and should work together to define it! Let's hang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Assigned
Status: Waiting on Author
Development

Successfully merging this pull request may close these issues.