Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow lifecycle events longer than one hour #69

Merged
merged 1 commit into from
May 11, 2021

Conversation

mrak
Copy link
Contributor

@mrak mrak commented May 10, 2021

This adds --max-time-to-process to allow wait states longer than the one hour default. AWS allows up to 48 hour wait states (or 100 times the heartbeat timeout, whichever is shorter). This was previously a hardcoded value of 3600 seconds, which remains the default if this flag is not specified.

https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html#lifecycle-hook-wait-state

@mrak mrak requested a review from a team as a code owner May 10, 2021 21:51
@CLAassistant
Copy link

CLAassistant commented May 10, 2021

CLA assistant check
All committers have signed the CLA.

This adds --max-time-to-process to allow wait states longer than the one
hour default. AWS allows up to 48 hour wait states (or 100 times the
heartbeat timeouts, whichever is shorter).

https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html#lifecycle-hook-wait-state
Signed-off-by: mrak and cale <emrak@paypal.com>
@codecov
Copy link

codecov bot commented May 10, 2021

Codecov Report

Merging #69 (b08f879) into master (d9d1df9) will decrease coverage by 0.07%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
- Coverage   69.82%   69.74%   -0.08%     
==========================================
  Files          12       12              
  Lines         898      899       +1     
==========================================
  Hits          627      627              
- Misses        211      212       +1     
  Partials       60       60              
Impacted Files Coverage Δ
pkg/service/manager.go 70.58% <ø> (ø)
pkg/service/server.go 59.03% <50.00%> (-0.17%) ⬇️
pkg/service/autoscaling.go 76.47% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d9d1df9...b08f879. Read the comment docs.

@eytan-avisror
Copy link
Collaborator

Thanks @mrak this looks good.
Did you get a chance to test this?
Also another question is, the doc states The maximum amount of time that you can keep an instance in a wait state is 48 hours or 100 times the heartbeat timeout, whichever is smaller. , the heartbeat timeout is defined by the lifecycle hook, so it's possible that we will abandon the instance before the iterations expire, could you test what happens in this case?
e.g. you can easily reproduce by setting your heartbeat timeout to 5 seconds, and running a hook with a long --max-time-to-process - this should take a few minutes to use up the max allowed 100 * heartbeat timeout.

@mrak
Copy link
Contributor Author

mrak commented May 11, 2021

We set the heartbeat timeout to 30 (the minimum) and waited roughly 50 minutes (30s*100/60=50m) while setting the --max-time-to-process=170000 and --drain-timeout=170000.

The expectation was that the 100*heartbeat limit would be hit before the max time to process, which we verified to be the case. At that point the ASG moved on with TERMINATING Proceed, lifecycle-manager attempted to post another heartbeat to a now-non-existing queue item (which failed) and then lifecycle-manager determined that that node was complete and moved on to other things.

Copy link
Collaborator

@eytan-avisror eytan-avisror left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@eytan-avisror eytan-avisror merged commit 5a0d2e2 into keikoproj:master May 11, 2021
@mrak mrak deleted the max-event-time branch May 11, 2021 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants