
RFC: Allow perm-failed deployments #14519

Closed
ironcladlou opened this issue Sep 24, 2015 · 17 comments
Assignees
Labels
area/app-lifecycle, priority/backlog (higher priority than priority/awaiting-more-evidence)

Comments

@ironcladlou
Contributor

There are times when the deployment system can infer that the latest deployment state has no reasonable chance of being realized (e.g. a bad or unpullable image). The current deployment controller design will continue to try reconciling indefinitely regardless of the possibility of success. If, based on inference or user constraints (e.g. timeout conditions specified in an enhancement to the deployment API, #1743), the system is ready to give up its best-effort attempt, the deployment could be marked as "permanently failed" for a given spec hash so that the system won't continue thrashing on a doomed deployment.

There is a bit of functional overlap with inert deployments (#14516) in that both concepts result in the deployment controller "ignoring" a deployment whose state still needs to be realized, but inert deployments as described don't seem to capture all the context users would want in this case (i.e., it's not enough to just mark a doomed deployment inert without more context about why, and without UX safety nets to distinguish the repercussions of re-activating a suddenly inert deployment vs. a permafailed deployment).

bgrant0607 added the priority/important-soon, area/app-lifecycle, and team/ux labels on Sep 24, 2015
@ironcladlou
Contributor Author

Could permafail be represented as a condition on the deployment?
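
A minimal sketch of what a condition-based representation could look like, modeled on the condition pattern used elsewhere in the Kubernetes API; the `Failed` type and the reason string here are illustrative assumptions, not settled names:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative condition type; the field layout mirrors other *Condition types
// in the Kubernetes API, but nothing here is a settled name at this point.
type DeploymentCondition struct {
	Type               string // hypothetical "Failed" type for a permafailed spec hash
	Status             string // "True", "False", or "Unknown"
	LastUpdateTime     time.Time
	LastTransitionTime time.Time
	Reason             string // machine-readable cause
	Message            string // human-readable detail
}

func main() {
	// The controller could record a permafail verdict for the current spec hash like this.
	cond := DeploymentCondition{
		Type:               "Failed",
		Status:             "True",
		LastUpdateTime:     time.Now(),
		LastTransitionTime: time.Now(),
		Reason:             "UnpullableImage", // hypothetical reason string
		Message:            "image for spec hash abc123 cannot be pulled; giving up",
	}
	fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```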

@smarterclayton
Contributor

There are two targets:

  1. The deployment is completely invalid and will never be deployable
  2. The latest state of the deployment cannot be deployed

In 2, the user reasonably expects that changing the deployment would result in a new attempt to deploy. So inert isn't really the right knob - the user should expect to know that "state XYZ did not complete within the constraints indicated", but that's not really the same as inert, which is effectively "I don't want to consider any future states".

There has to be some mechanism whereby the deployment controller can a) record and b) observe that a given state is unreachable. If the deployment changes from a->b, and b is unreachable, and the user changes back to state a, the deployment should be able to proceed. It has to be easy for the "unreachable" state to be cleared - unreachable is something the deployment controller is indicating on the deployment.

It's also reasonable for a user to expect that there be some way to continue to observe "unreachable" states even if the deployment controller has moved on. Our choice in OpenShift was to mark the replication controller with a failed indication, which the deployment controller acknowledged by stopping processing if that was the current state. If the user retries a particular state in OpenShift we create a new RC since we use versioned numbers - so there is no previous unreachable state to clear. In a hash-based state model, we'd need to clear the unreachable/failed flag on the RC somehow to allow the deployment controller to proceed. It's likely we'd want to preserve a record that those states were unreachable.

It's also reasonable for a user to expect that they have to take affirmative action to retry a failed state. If the deployment controller gets restarted and retries a failed state (for some types of deployment), that could be seriously damaging to a production deployment. So for some patterns we need to be very careful not to have a path whereby a previously "failed" state is suddenly reattempted.

@nikhiljindal
Contributor

Yes, we can add a "Failed" condition to indicate permfail.
Agreed that inert and permfail are different. An inert deployment will never be tried, but a failed deployment will be retried if the user updates the deployment.

We need a way for the user to retry a failed deployment without updating it (for example, after adding a missing image).

@ironcladlou
Contributor Author

What if the deployment controller:

  1. Sets a "failed" annotation on the new rc if the controller decides the state is unreachable.
  2. Skips scaling of the current deployment if the new rc exists and has the annotation.
  3. Sets a "failed" condition on the deployment if there exists a new rc which matches the current hash and has the annotation, and clears it otherwise.

Deleting the annotation from the rc would cause a retry. Events could be raised to inform users of the transitions.
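
A rough sketch of that flow, with a hypothetical annotation key and simplified stand-in types (the real objects would be the API types manipulated through a client, not local structs):

```go
package main

import "fmt"

// Simplified stand-ins for the real API objects.
type ReplicationController struct {
	Name        string
	Annotations map[string]string
}

type Deployment struct {
	Conditions map[string]string // condition type -> status, simplified
}

// failedAnnotation is a hypothetical key; the thread does not settle on a name.
const failedAnnotation = "deployment.kubernetes.io/failed"

// syncOnce sketches the three proposed steps: mark the new rc when the state
// is judged unreachable, skip scaling while the mark is present, and mirror
// the mark as a condition on the deployment (clearing it otherwise).
func syncOnce(d *Deployment, newRC *ReplicationController, unreachable bool) {
	if unreachable {
		newRC.Annotations[failedAnnotation] = "true" // step 1
	}
	if newRC.Annotations[failedAnnotation] == "true" { // step 2
		fmt.Println("skipping scale-up; state is marked unreachable")
		d.Conditions["Failed"] = "True" // step 3
		return
	}
	delete(d.Conditions, "Failed")
	fmt.Println("proceeding with rollout")
}

func main() {
	d := &Deployment{Conditions: map[string]string{}}
	rc := &ReplicationController{Name: "app-abc123", Annotations: map[string]string{}}
	syncOnce(d, rc, true) // controller decides the state is unreachable
	// Deleting the annotation is the user's affirmative retry action.
	delete(rc.Annotations, failedAnnotation)
	syncOnce(d, rc, false)
}
```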

@bgrant0607
Member

Need a condition, like Progressing or Stuck.

@nikhiljindal
Contributor

We discussed adding a timeout param in the spec, which controls the failed condition. (DeploymentController sets that condition on the deployment if there has not been any progress for that long.) One of the names suggested was maxProgressThresholdSeconds.

We also discussed adding a subresource, deployments/retry, to clear the failed condition on the deployment.
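
A minimal sketch of the deadline check such a param implies, with hypothetical field and bookkeeping names (the field that eventually landed is called progressDeadlineSeconds):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative types; the real fields would live in the deployment spec/status.
type DeploymentSpec struct {
	// The proposed timeout param; nil means "never time out".
	MaxProgressThresholdSeconds *int64
}

type DeploymentStatus struct {
	LastProgressTime time.Time // hypothetical bookkeeping of last observed progress
	Failed           bool
}

// checkDeadline sets the failed condition when no progress has been observed
// within the configured threshold. A deployments/retry subresource would
// simply clear Failed (and reset the bookkeeping) to trigger a new attempt.
func checkDeadline(spec DeploymentSpec, status *DeploymentStatus, now time.Time) {
	if spec.MaxProgressThresholdSeconds == nil {
		return
	}
	deadline := status.LastProgressTime.Add(time.Duration(*spec.MaxProgressThresholdSeconds) * time.Second)
	if now.After(deadline) {
		status.Failed = true
	}
}

func main() {
	threshold := int64(600) // ten minutes
	spec := DeploymentSpec{MaxProgressThresholdSeconds: &threshold}
	status := &DeploymentStatus{LastProgressTime: time.Now().Add(-15 * time.Minute)}
	checkDeadline(spec, status, time.Now())
	fmt.Println("failed:", status.Failed) // true: 15 minutes without progress exceeds the threshold
}
```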

bgrant0607 modified the milestones: v1.2, v1.2-candidate on Nov 19, 2015
@0xmichalis
Contributor

I was thinking about the timeout param in the spec, but I believe a timeout-based approach here adds more complexity than, say, a retries-based approach, since the user has to know beforehand how long a timeout they must use. The time a deployment takes to come up varies too. We would also have to always make sure that maxProgressThresholdSeconds > minReadySeconds for rolling updates, an error that would most probably be disturbing for users.

How about a retries-based approach? The user would note in the spec how many times they want this deployment retried in case of an error returned while executing the strategy. I would imagine a corresponding status field noting how many times the deployment has been retried. If spec.retries == status.retries and the deployment still isn't successful, then we mark it as failed (another status field or an annotation?).

Then PUT deployments/<name>/retry would clear status.retries and the {other status field, annotation}.

Thoughts?
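
For comparison with the time-based idea, a minimal sketch of the retries-based bookkeeping; spec.retries and status.retries are hypothetical fields from this proposal and were never adopted:

```go
package main

import "fmt"

// Hypothetical fields from the retries-based proposal.
type Spec struct {
	Retries int32 // how many times the user wants a failed rollout reattempted
}

type Status struct {
	Retries int32 // how many attempts the controller has already made
	Failed  bool
}

// recordAttempt bumps the counter after a strategy error and marks the
// deployment failed once the retry budget is exhausted. PUT deployments/<name>/retry
// would reset Status.Retries and clear Failed.
func recordAttempt(spec Spec, status *Status) {
	status.Retries++
	if status.Retries >= spec.Retries {
		status.Failed = true
	}
}

func main() {
	spec := Spec{Retries: 3}
	status := &Status{}
	for i := 0; i < 3; i++ {
		recordAttempt(spec, status)
	}
	fmt.Printf("retries=%d failed=%v\n", status.Retries, status.Failed) // retries=3 failed=true
}
```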

@nikhiljindal
Contributor

Number of retries also seems fine to me.

@ironcladlou
Contributor Author

If, during a given sync, progress can't be made due to some transient error (say, failing to update because of a connectivity issue), I probably want the controller to keep trying to sync until connectivity is restored. Given that, I wonder whether retry counts would apply uniformly to all classes of error. Can we think of examples of specific error types where retry thresholds would be more appropriate than time-based ones?

Progress time thresholds seem useful even given the validation complexity. My gut says deployment users are probably interested in expressing timeouts (either overall or progress-based) in terms of wall time.

@0xmichalis
Contributor

I am pretty sure there are users who would want timeout-based deployments, but a timeout feels more error-prone than a retry count. Or so I think; I don't have a strong case either way. That being said, if we think a timeout is more useful then I will try that approach and see where it goes.

@ironcladlou
Contributor Author

I am pretty sure there are users who would want timeout-based deployments, but a timeout feels more error-prone than a retry count. Or so I think; I don't have a strong case either way. That being said, if we think a timeout is more useful then I will try that approach and see where it goes.

I think both have value. I agree that timeout is prone to undesirable failure due to the unpredictability of timing within the cluster generally. I think that automation in general around this condition is problematic and needs to come with disclaimers.

Maybe this highlights the importance of giving users clear visibility into status, so they can easily tell when deployments aren't progressing and manually intervene.

@krmayankk

We are more interested in automatic rollbacks. Is this on the roadmap at all? If I start deploying and the deployment is not progressing at all, should it simply roll back automatically to the last known good version rather than only allowing manual intervention? Or maybe a policy is needed where the user selects whether they want manual or automatic rollback. Thoughts?

@0xmichalis
Contributor

Automatic rollbacks in case of no progress are definitely something we want.

bgrant0607 removed this from the next-candidate milestone on Mar 18, 2016
bgrant0607 added the priority/backlog label and removed the priority/important-soon label on Mar 18, 2016
@bgrant0607
Member

Automatic rollback is #23211

@livelace

Actually this is a dupe of #14519. Can you post in that issue what you want a permanently failed deployment to do? I think your issue downstream was that you may end up with a replication controller that has no ready pods and wasn't identified by the deployment?

OK. Actually, we want lifecycle hooks to be processed in the context of deployment status. If a hook ends with an error, the deployment should be considered a failure. The main process of the pod may be working well and passing all its tests (liveness/readiness), but if a hook returns an error the deployment status should be failed, because the hook is an important part of the pod and its readiness.

@0xmichalis
Contributor

OK. Actually, we want lifecycle hooks to be processed in the context of deployment status. If a hook ends with an error, the deployment should be considered a failure. The main process of the pod may be working well and passing all its tests (liveness/readiness), but if a hook returns an error the deployment status should be failed, because the hook is an important part of the pod and its readiness.

@livelace hm, so your problem was hooks? It wasn't clear to me from your original issue. I have answered downstream. Hooks in Kube deployments are not there yet: #14512

0xmichalis self-assigned this on May 26, 2016
k8s-github-robot pushed a commit that referenced this issue Oct 28, 2016
Automatic merge from submit-queue

Add perma-failed deployments API

@kubernetes/deployment @smarterclayton 

API for #14519

Docs at kubernetes/website#1337
k8s-github-robot pushed a commit that referenced this issue Nov 4, 2016

Automatic merge from submit-queue

Controller changes for perma failed deployments

This PR adds support for reporting failed deployments based on a timeout
parameter defined in the spec. If there is no progress for the amount
of time defined as progressDeadlineSeconds then the deployment will be
marked as failed by a Progressing condition with a ProgressDeadlineExceeded
reason.

Follow-up to #19343

Docs at kubernetes/website#1337

Fixes #14519

@kubernetes/deployment @smarterclayton
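
The API that landed exposes this as spec.progressDeadlineSeconds, with the failure surfaced through the Progressing condition. A short sketch of how a client might detect the perma-failed state; it assumes the current apps/v1 package layout rather than the extensions/v1beta1 types the original PR targeted:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// hasFailedProgressing reports whether the controller has marked the rollout
// as failed: Progressing=False with reason ProgressDeadlineExceeded.
func hasFailedProgressing(d *appsv1.Deployment) bool {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentProgressing &&
			c.Status == corev1.ConditionFalse &&
			c.Reason == "ProgressDeadlineExceeded" {
			return true
		}
	}
	return false
}

func main() {
	deadline := int32(600) // give the rollout ten minutes to make progress
	d := &appsv1.Deployment{
		Spec: appsv1.DeploymentSpec{ProgressDeadlineSeconds: &deadline},
	}
	// In a real cluster the deployment controller fills in Status; this
	// simulates the condition it writes when the deadline is exceeded.
	d.Status.Conditions = append(d.Status.Conditions, appsv1.DeploymentCondition{
		Type:   appsv1.DeploymentProgressing,
		Status: corev1.ConditionFalse,
		Reason: "ProgressDeadlineExceeded",
	})
	fmt.Println("permanently failed:", hasFailedProgressing(d))
}
```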