
Bucket drain events using semaphore #49

Merged: 5 commits into keikoproj:master on Mar 9, 2020

Conversation

@eytan-avisror (Collaborator) commented Mar 5, 2020:

Fixes #48

The limit is currently set to 32 concurrent terminations; once a drain completes, its semaphore slot is released.

Example:
80 instances are terminated > 32 begin draining while the others are blocked (a heartbeat is sent for them) > as each drain completes, its semaphore slot is released > the next batch starts draining.

This causes a delay in handling lifecycle events of up to (TerminatingInstances / 32 * MaxDrainTime); see the sketch below.
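
For illustration, here is a minimal, self-contained sketch of the gating described above using golang.org/x/sync/semaphore; drainNode, the simulated sleep, and the node names are placeholders, not the actual lifecycle-manager code.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"

	"golang.org/x/sync/semaphore"
)

const maxDrainConcurrency = 32 // default proposed in this PR

// drainNode stands in for the real "kubectl drain" call; it just sleeps.
func drainNode(node string) {
	time.Sleep(2 * time.Second)
	log.Printf("drained %s", node)
}

func main() {
	sem := semaphore.NewWeighted(maxDrainConcurrency)
	ctx := context.Background()

	// Simulate 80 terminating instances, as in the example above.
	nodes := make([]string, 80)
	for i := range nodes {
		nodes[i] = fmt.Sprintf("node-%d", i)
	}

	var wg sync.WaitGroup
	for _, n := range nodes {
		// Blocks once 32 drains are in flight; in lifecycle-manager the
		// lifecycle-hook heartbeat keeps blocked events alive meanwhile.
		if err := sem.Acquire(ctx, 1); err != nil {
			log.Fatal(err)
		}
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			defer sem.Release(1) // free a slot so the next event can start draining
			drainNode(node)
		}(n)
	}
	wg.Wait()
}
```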

TBD:

  • Make the value controllable by a flag, with a default of 32

Testing:

  • Terminate 100 instances running pods with a preStop hook that sleeps 90 seconds; evaluate delays
  • Verify that memory utilization with the default value does not exceed recommended values

@eytan-avisror requested a review from a team as a code owner on March 5, 2020 18:50
@codecov bot commented Mar 5, 2020:

Codecov Report

Merging #49 into master will decrease coverage by 0.01%.
The diff coverage is 71.42%.


@@            Coverage Diff             @@
##           master      #49      +/-   ##
==========================================
- Coverage   73.06%   73.05%   -0.02%     
==========================================
  Files          12       12              
  Lines         995     1002       +7     
==========================================
+ Hits          727      732       +5     
- Misses        206      207       +1     
- Partials       62       63       +1     


@eytan-avisror (Collaborator, Author) commented Mar 6, 2020:

Test

  • Run 100 nodes with a deployment of 100 pod replicas, each with a pre-terminate step of sleep 170s
  • Terminate 97 nodes

Results

Batching:
[Screenshot: batching graph, Screen Shot 2020-03-06 at 12 10 03 PM]

The gradually increasing delay in processing time (blue) is clearly visible; the gaps are the 170s of sleep time, as expected.

Memory consumption:
[Screenshot: memory consumption graph, Screen Shot 2020-03-06 at 12 08 34 PM]

Memory consumption seems to hover around 500 MB per batch of 32 concurrent kubectl drain processes (the default value).

I have also bumped up the examples to request 2048Mi.

Edge cases to consider

In scenarios where the max-drain-timeout is set to a large value, drains are taking a very long time (e.g. due to PDBs or scale-up), and a large number of nodes is terminating, it might take a long time to start deregistration for the last batches. However, this scenario is highly unlikely, since it would mean all 32 kubectl threads are being delayed; it can also be mitigated by setting a lower timeout or allowing more parallel drains (and more memory).
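
For a rough sense of scale, applying the (TerminatingInstances / 32 * MaxDrainTime) estimate above to this test: 97 terminating nodes at the default concurrency of 32 form 4 batches, so with ~170s per drain the last batch starts roughly 3 x 170s ≈ 510s after the first. A larger drain timeout or node count grows that tail proportionally, which is what the concurrency flag is there to tune.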

@kevdowney (Contributor) left a comment:

LG

@@ -101,6 +104,7 @@ func init() {
serveCmd.Flags().StringVar(&queueName, "queue-name", "", "the name of the SQS queue to consume lifecycle hooks from")
serveCmd.Flags().StringVar(&kubectlLocalPath, "kubectl-path", "/usr/local/bin/kubectl", "the path to kubectl binary")
serveCmd.Flags().StringVar(&logLevel, "log-level", "info", "the logging level (info, warning, debug)")
serveCmd.Flags().Int64Var(&maxDrainConcurrency, "max-drain-concurrency", 32, "maximum number of node drains to process in parallel")
Contributor commented:

You might just want the unsigned int (uint) instead; then you just convert to int64 for the semaphore: semaphore.NewWeighted(int64(maxDrainConcurrency)).

@eytan-avisror (Collaborator, Author) commented Mar 6, 2020:

What is the difference with using int and converting? uint simply doesn't accept negative numbers; it would also be possible to use int and convert to int64, no?

@eytan-avisror (Collaborator, Author) commented:

I guess my point is: if we know we only use this variable as an int64, why do we need an extra conversion?
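
For reference, here is a minimal standalone sketch of the two options discussed here, using spf13/pflag directly rather than the cobra command from the diff above; the flag name matches the PR, everything else is illustrative.

```go
package main

import (
	"fmt"

	flag "github.com/spf13/pflag"
	"golang.org/x/sync/semaphore"
)

func main() {
	// Merged approach: register the flag as int64 so no conversion is needed
	// when constructing the semaphore.
	var maxDrainConcurrency int64
	flag.Int64Var(&maxDrainConcurrency, "max-drain-concurrency", 32,
		"maximum number of node drains to process in parallel")

	// Suggested alternative: a uint rejects negative values at parse time,
	// at the cost of one conversion at the call site:
	//   var maxDrainConcurrency uint
	//   flag.UintVar(&maxDrainConcurrency, "max-drain-concurrency", 32, "...")
	//   sem := semaphore.NewWeighted(int64(maxDrainConcurrency))

	flag.Parse()
	sem := semaphore.NewWeighted(maxDrainConcurrency)
	fmt.Printf("created %T with capacity %d\n", sem, maxDrainConcurrency)
}
```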

@eytan-avisror merged commit eabed53 into keikoproj:master on Mar 9, 2020
Development

Successfully merging this pull request may close these issues:

Massive downscales can have scalability issues with kubectl (#48)

3 participants