
RFC: Allow perm-failed deployments #14519

Closed
ironcladlou opened this issue Sep 24, 2015 · 17 comments
Assignees
Labels
area/app-lifecycle, priority/backlog (higher priority than priority/awaiting-more-evidence)

Comments

@ironcladlou
Contributor

There are times when the deployment system can infer that the latest deployment state has no reasonable chance of being realized (e.g. a bad or unpullable image). The current deployment controller design will continue to try reconciling indefinitely regardless of the possibility of success. If, based on inference or user constraints (e.g. timeout conditions specified in an enhancement to the deployment API, #1743), the system is ready to give up its best-effort attempt, the deployment could be marked as "permanently failed" for a given spec hash so that the system won't continue thrashing on a doomed deployment.

There is a bit of functional overlap with inert deployments (#14516) in that both concepts result in the deployment controller "ignoring" a deployment whose state still needs to be realized, but inert deployments as described don't seem to capture all the context users would want in this case (i.e., it's not enough to just mark a doomed deployment inert without more context about why, and without UX safety nets to distinguish the repercussions of re-activating a suddenly inert deployment vs. a permafailed deployment).

bgrant0607 added the priority/important-soon, area/app-lifecycle, and team/ux labels on Sep 24, 2015
@ironcladlou
Contributor Author

Could permafail be represented as a condition on the deployment?
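
A minimal sketch of what a condition-based representation could look like, modeled on the condition pattern used elsewhere in the Kubernetes API; the `Failed` type and the reason string here are illustrative assumptions, not settled names:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative condition type; the field layout mirrors other *Condition types
// in the Kubernetes API, but nothing here is a settled name at this point.
type DeploymentCondition struct {
	Type               string // hypothetical "Failed" type for a permafailed spec hash
	Status             string // "True", "False", or "Unknown"
	LastUpdateTime     time.Time
	LastTransitionTime time.Time
	Reason             string // machine-readable cause
	Message            string // human-readable detail
}

func main() {
	// The controller could record a permafail verdict for the current spec hash like this.
	cond := DeploymentCondition{
		Type:               "Failed",
		Status:             "True",
		LastUpdateTime:     time.Now(),
		LastTransitionTime: time.Now(),
		Reason:             "UnpullableImage", // hypothetical reason string
		Message:            "image for spec hash abc123 cannot be pulled; giving up",
	}
	fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```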

@smarterclayton
Contributor

There are two targets:

  1. The deployment is completely invalid and will never be deployable
  2. The latest state of the deployment cannot be deployed

In 2, the user reasonably expects that changing the deployment would result in a new attempt to deploy. So inert isn't really the right knob - the user should expect to know that "state XYZ did not complete within the constraints indicated", but that's not really the same as inert, which is effectively "I don't want to consider any future states".

There has to be some mechanism whereby the deployment controller can a) record and b) observe that a given state is unreachable. If the deployment changes from a->b, and b is unreachable, and the user changes back to state a, the deployment should be able to proceed. It has to be easy for the "unreachable" state to be cleared - unreachable is something the deployment controller is indicating on the deployment.

It's also reasonable for a user to expect that there be some way to continue to observe "unreachable" states even if the deployment controller has moved on. Our choice in OpenShift was to mark the replication controller with a failed indication, which the deployment controller acknowledged by stopping processing if that was the current state. If the user retries a particular state in OpenShift we create a new RC since we use versioned numbers - so there is no previous unreachable state to clear. In a hash-based state model, we'd need to clear the unreachable/failed flag on the RC somehow to allow the deployment controller to proceed. It's likely we'd want to preserve a record that those states were unreachable.

It's also reasonable for a user to expect that they have to take affirmative action to retry a failed state. If the deployment controller gets restarted and retries a failed state (for some types of deployment), that could be seriously damaging to a production deployment. So for some patterns we need to be very careful not to have a path whereby a previously "failed" state is suddenly reattempted.

@nikhiljindal
Contributor

Yes, we can add a "Failed" condition to indicate permfail.
Agreed that inert and permfail are different. An inert deployment will never be tried, but a failed deployment will be retried if the user updates the deployment.

We need a way for the user to retry a failed deployment without updating it (for example, after adding a missing image).

@ironcladlou
Contributor Author

What if the deployment controller:

  1. Sets a "failed" annotation on the new rc if the controller decides the state is unreachable.
  2. Skips scaling of the current deployment if the new rc exists and has the annotation.
  3. Sets a "failed" condition on the deployment if there exists a new rc which matches the current hash and has the annotation, and clears it otherwise.

Deleting the annotation from the rc would cause a retry. Events could be raised to inform users of the transitions.
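
A rough sketch of that flow, with a hypothetical annotation key and simplified stand-in types (the real objects would be the API types manipulated through a client, not local structs):

```go
package main

import "fmt"

// Simplified stand-ins for the real API objects.
type ReplicationController struct {
	Name        string
	Annotations map[string]string
}

type Deployment struct {
	Conditions map[string]string // condition type -> status, simplified
}

// failedAnnotation is a hypothetical key; the thread does not settle on a name.
const failedAnnotation = "deployment.kubernetes.io/failed"

// syncOnce sketches the three proposed steps: mark the new rc when the state
// is judged unreachable, skip scaling while the mark is present, and mirror
// the mark as a condition on the deployment (clearing it otherwise).
func syncOnce(d *Deployment, newRC *ReplicationController, unreachable bool) {
	if unreachable {
		newRC.Annotations[failedAnnotation] = "true" // step 1
	}
	if newRC.Annotations[failedAnnotation] == "true" { // step 2
		fmt.Println("skipping scale-up; state is marked unreachable")
		d.Conditions["Failed"] = "True" // step 3
		return
	}
	delete(d.Conditions, "Failed")
	fmt.Println("proceeding with rollout")
}

func main() {
	d := &Deployment{Conditions: map[string]string{}}
	rc := &ReplicationController{Name: "app-abc123", Annotations: map[string]string{}}
	syncOnce(d, rc, true) // controller decides the state is unreachable
	// Deleting the annotation is the user's affirmative retry action.
	delete(rc.Annotations, failedAnnotation)
	syncOnce(d, rc, false)
}
```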

@bgrant0607
Member

Need a condition, like Progressing or Stuck.

@nikhiljindal
Contributor

We discussed adding a timeout param in the spec, which controls the failed condition. (DeploymentController sets that condition on the deployment if there has not been any progress for that long.) One of the names suggested was maxProgressThresholdSeconds.

We also discussed adding a subresource, deployments/retry, to clear the failed condition on the deployment.
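
A minimal sketch of the deadline check such a param implies, with hypothetical field and bookkeeping names (the field that eventually landed is called progressDeadlineSeconds):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative types; the real fields would live in the deployment spec/status.
type DeploymentSpec struct {
	// The proposed timeout param; nil means "never time out".
	MaxProgressThresholdSeconds *int64
}

type DeploymentStatus struct {
	LastProgressTime time.Time // hypothetical bookkeeping of last observed progress
	Failed           bool
}

// checkDeadline sets the failed condition when no progress has been observed
// within the configured threshold. A deployments/retry subresource would
// simply clear Failed (and reset the bookkeeping) to trigger a new attempt.
func checkDeadline(spec DeploymentSpec, status *DeploymentStatus, now time.Time) {
	if spec.MaxProgressThresholdSeconds == nil {
		return
	}
	deadline := status.LastProgressTime.Add(time.Duration(*spec.MaxProgressThresholdSeconds) * time.Second)
	if now.After(deadline) {
		status.Failed = true
	}
}

func main() {
	threshold := int64(600) // ten minutes
	spec := DeploymentSpec{MaxProgressThresholdSeconds: &threshold}
	status := &DeploymentStatus{LastProgressTime: time.Now().Add(-15 * time.Minute)}
	checkDeadline(spec, status, time.Now())
	fmt.Println("failed:", status.Failed) // true: 15 minutes without progress exceeds the threshold
}
```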

bgrant0607 modified the milestones: v1.2, v1.2-candidate on Nov 19, 2015
@0xmichalis
Contributor

I was thinking about the timeout param in the spec, but I believe a timeout-based approach here adds more complexity than, say, a retries-based approach, since the user has to know beforehand how long a timeout they must use. The time a deployment takes to come up varies too. We would also have to always make sure that maxProgressThresholdSeconds > minReadySeconds for rolling updates, an error that would most probably be disturbing for users.

How about a retries-based approach? The user would note in the spec how many times they want this deployment retried in case of an error returned while executing the strategy. I would imagine a corresponding status field noting how many times the deployment has been retried. If spec.retries == status.retries and the deployment still isn't successful, then we mark it as failed (another status field or an annotation?).

Then PUT deployments/<name>/retry would clear status.retries and the {other status field, annotation}.

Thoughts?
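
For comparison with the time-based idea, a minimal sketch of the retries-based bookkeeping; spec.retries and status.retries are hypothetical fields from this proposal and were never adopted:

```go
package main

import "fmt"

// Hypothetical fields from the retries-based proposal.
type Spec struct {
	Retries int32 // how many times the user wants a failed rollout reattempted
}

type Status struct {
	Retries int32 // how many attempts the controller has already made
	Failed  bool
}

// recordAttempt bumps the counter after a strategy error and marks the
// deployment failed once the retry budget is exhausted. PUT deployments/<name>/retry
// would reset Status.Retries and clear Failed.
func recordAttempt(spec Spec, status *Status) {
	status.Retries++
	if status.Retries >= spec.Retries {
		status.Failed = true
	}
}

func main() {
	spec := Spec{Retries: 3}
	status := &Status{}
	for i := 0; i < 3; i++ {
		recordAttempt(spec, status)
	}
	fmt.Printf("retries=%d failed=%v\n", status.Retries, status.Failed) // retries=3 failed=true
}
```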

@nikhiljindal
Contributor

Number of retries also seems fine to me.

@ironcladlou
Contributor Author

If, during a given sync, progress can't be made due to some transient error (say, failing to update because of a connectivity issue), I probably want the controller to keep trying to sync until connectivity is restored. Given that, I wonder whether retry counts would apply uniformly to all classes of error. Can we think of examples of specific error types where retry thresholds would be more appropriate than time-based ones?

Progress time thresholds seem useful even given the validation complexity. My gut says deployment users are probably interested in expressing timeouts (either overall or progress-based) in terms of wall time.

@0xmichalis
Contributor

I am pretty sure there are users who would want timeout-based deployments, but a timeout feels more error-prone than a retry count. Or so I think; I don't have a strong case either way. That being said, if we think a timeout is more useful then I will try that approach and see where it goes.

@ironcladlou
Contributor Author

I am pretty sure there are users who would want timeout-based deployments, but a timeout feels more error-prone than a retry count. Or so I think; I don't have a strong case either way. That being said, if we think a timeout is more useful then I will try that approach and see where it goes.

I think both have value. I agree that timeout is prone to undesirable failure due to the unpredictability of timing within the cluster generally. I think that automation in general around this condition is problematic and needs to come with disclaimers.

Maybe this highlights the importance of giving users clear visibility into status, so they can easily tell when deployments aren't progressing and manually intervene.

@krmayankk

We are more interested in automatic rollbacks. Is this on the roadmap at all? If I start deploying and the deployment is not progressing at all, should it simply roll back automatically to the last known good version rather than only allowing manual intervention? Or maybe a policy is needed where the user selects whether they want manual or automatic rollback. Thoughts?

@0xmichalis
Contributor

Automatic rollbacks in case of no progress are definitely something we want.

bgrant0607 removed this from the next-candidate milestone on Mar 18, 2016
bgrant0607 added the priority/backlog label and removed the priority/important-soon label on Mar 18, 2016
@bgrant0607
Member

Automatic rollback is #23211

@livelace

Actually this is a dupe of #14519. Can you post in that issue what you want a permanently failed deployment to do? I think your issue downstream was that you may end up with a replication controller that has no ready pods and wasn't identified by the deployment?

OK. Actually, we want lifecycle hooks to be processed in the context of deployment status. If a hook ends with an error, the deployment should be considered a failure. The main process of the pod may be working well and passing all its tests (liveness/readiness), but if a hook returns an error the deployment status should be failed, because the hook is an important part of the pod and its readiness.

@0xmichalis
Contributor

OK. Actually, we want lifecycle hooks to be processed in the context of deployment status. If a hook ends with an error, the deployment should be considered a failure. The main process of the pod may be working well and passing all its tests (liveness/readiness), but if a hook returns an error the deployment status should be failed, because the hook is an important part of the pod and its readiness.

@livelace hm, so your problem was hooks? It wasn't clear to me from your original issue. I have answered downstream. Hooks in Kube deployments are not there yet: #14512

0xmichalis self-assigned this on May 26, 2016
k8s-github-robot pushed a commit that referenced this issue Oct 28, 2016
Automatic merge from submit-queue

Add perma-failed deployments API

@kubernetes/deployment @smarterclayton 

API for #14519

Docs at kubernetes/website#1337
k8s-github-robot pushed a commit that referenced this issue Nov 4, 2016

Automatic merge from submit-queue

Controller changes for perma failed deployments

This PR adds support for reporting failed deployments based on a timeout
parameter defined in the spec. If there is no progress for the amount
of time defined as progressDeadlineSeconds then the deployment will be
marked as failed by a Progressing condition with a ProgressDeadlineExceeded
reason.

Follow-up to #19343

Docs at kubernetes/website#1337

Fixes #14519

@kubernetes/deployment @smarterclayton
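
The API that landed exposes this as spec.progressDeadlineSeconds, with the failure surfaced through the Progressing condition. A short sketch of how a client might detect the perma-failed state; it assumes the current apps/v1 package layout rather than the extensions/v1beta1 types the original PR targeted:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// hasFailedProgressing reports whether the controller has marked the rollout
// as failed: Progressing=False with reason ProgressDeadlineExceeded.
func hasFailedProgressing(d *appsv1.Deployment) bool {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentProgressing &&
			c.Status == corev1.ConditionFalse &&
			c.Reason == "ProgressDeadlineExceeded" {
			return true
		}
	}
	return false
}

func main() {
	deadline := int32(600) // give the rollout ten minutes to make progress
	d := &appsv1.Deployment{
		Spec: appsv1.DeploymentSpec{ProgressDeadlineSeconds: &deadline},
	}
	// In a real cluster the deployment controller fills in Status; this
	// simulates the condition it writes when the deadline is exceeded.
	d.Status.Conditions = append(d.Status.Conditions, appsv1.DeploymentCondition{
		Type:   appsv1.DeploymentProgressing,
		Status: corev1.ConditionFalse,
		Reason: "ProgressDeadlineExceeded",
	})
	fmt.Println("permanently failed:", hasFailedProgressing(d))
}
```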