
host-ctr: Implement exponential backoff for image pulls #433

Merged: 1 commit into develop from host-ctr-retry-pull, Oct 24, 2019

Conversation

etungsten
Contributor

@etungsten etungsten commented Oct 17, 2019

Issue #, if available: Implements #431

Description of changes:

  • Implements exponential backoff for image pulls.
    • The current configuration allows a maximum of 5 attempts with a retry interval multiplier of 2.
    • A random jitter of 2-6 seconds is added to each retry interval.
  • Moves signal notification channel setup to after the image pull, since it's mostly relevant only when we're starting a container task.
  • Ran go fmt
  • Minor modification to an info log message

Testing done:
Locally tested host-ctr with a bad/invalid image:

$ sudo ./host-ctr -source registry.hub.docker.com/library/busybox:asdasd -pull-image-only -containerd-socket /run/containerd/containerd.sock
WARN[0000] failed to pull image. Waiting 4.291s before retrying...  error="failed to resolve reference \"registry.hub.docker.com/library/busybox:asdasd\": registry.hub.docker.com/library/busybox:asdasd not found"
WARN[0005] failed to pull image. Waiting 7.232s before retrying...  error="failed to resolve reference \"registry.hub.docker.com/library/busybox:asdasd\": registry.hub.docker.com/library/busybox:asdasd not found"
WARN[0012] failed to pull image. Waiting 8.961s before retrying...  error="failed to resolve reference \"registry.hub.docker.com/library/busybox:asdasd\": registry.hub.docker.com/library/busybox:asdasd not found"
WARN[0022] failed to pull image. Waiting 10.513s before retrying...  error="failed to resolve reference \"registry.hub.docker.com/library/busybox:asdasd\": registry.hub.docker.com/library/busybox:asdasd not found"
WARN[0032] failed to pull image. Waiting 18.934s before retrying...  error="failed to resolve reference \"registry.hub.docker.com/library/busybox:asdasd\": registry.hub.docker.com/library/busybox:asdasd not found"
ERRO[0052] retries exhausted: failed to resolve reference "registry.hub.docker.com/library/busybox:asdasd": registry.hub.docker.com/library/busybox:asdasd not found  ref="registry.hub.docker.com/library/busybox:asdasd"

With a good image:

✦ $ sudo ./host-ctr -source registry.hub.docker.com/library/busybox:latest -pull-image-only -containerd-socket /run/containerd/containerd.sock
INFO[0000] Pulled successfully                           img="registry.hub.docker.com/library/busybox:latest"
INFO[0000] Unpacking...                                  img="registry.hub.docker.com/library/busybox:latest"

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Review thread on workspaces/host-ctr/cmd/host-ctr/main.go (outdated, resolved):
containerd.WithSchema1Conversion)
if err != nil {
return nil, errors.Wrap(err, "Failed to pull ctr image")
// Retry with exponential backoff when failures occur, with 10 attempts
Member

Most folks would expect to see a fairly common implementation reused such as https://github.com/cenkalti/backoff, but this looks small enough to me to stand alone perfectly well in this use case.

Review threads on workspaces/host-ctr/cmd/host-ctr/main.go (outdated, resolved)
Contributor

@zmrow zmrow left a comment


🎃

@etungsten
Contributor Author

etungsten commented Oct 18, 2019

Addresses @jahkeup's comments.
Replaces time.Sleep with a select { } over the channel returned by time.NewTimer and ctx.Done().

Member

@jahkeup jahkeup left a comment


I think we might want to handle the contexts throughout main to respond to signals thoroughly.

Review threads on workspaces/host-ctr/cmd/host-ctr/main.go (outdated, resolved)
@etungsten
Contributor Author

Addresses @jahkeup's comment regarding the signal channel.
Interrupting during image pull retries:

✦ $ sudo ./host-ctr -source registry.hub.docker.com/library/busybox -pull-image-only -containerd-socket /run/containerd/containerd.sock
WARN[0000] Failed to pull image. Waiting 1 seconds before retrying...  error="failed to resolve reference \"registry.hub.docker.com/library/busybox\": object required"
WARN[0001] Failed to pull image. Waiting 2 seconds before retrying...  error="failed to resolve reference \"registry.hub.docker.com/library/busybox\": object required"
WARN[0003] Failed to pull image. Waiting 4 seconds before retrying...  error="failed to resolve reference \"registry.hub.docker.com/library/busybox\": object required"
WARN[0007] Failed to pull image. Waiting 8 seconds before retrying...  error="failed to resolve reference \"registry.hub.docker.com/library/busybox\": object required"
WARN[0015] Failed to pull image. Waiting 16 seconds before retrying...  error="failed to resolve reference \"registry.hub.docker.com/library/busybox\": object required"
^CINFO[0015] Received signal: interrupt                   
ERRO[0015] context ended while retrying: failed to resolve reference "registry.hub.docker.com/library/busybox": object required  ref=registry.hub.docker.com/library/busybox

@etungsten
Contributor Author

Updates maximum retry interval from 32 seconds to 30 seconds.

@etungsten
Contributor Author

Sorry, I misunderstood the comment about the duration. Updated to guarantee the maximum retry duration won't exceed 31 seconds (1+2+4+8+16).

Review threads on workspaces/host-ctr/cmd/host-ctr/main.go (outdated, resolved)
@etungsten
Contributor Author

Addresses @jahkeup's comments:

  • Ran go fmt
  • Use empty contexts for defer statements that are concerned with cleanup
  • Create a new cleanCtx for logging and for deriving contexts used to send signals to the container task

Member

@jahkeup jahkeup left a comment


Couple nits, but otherwise looks great!! 🐻

Review threads on workspaces/host-ctr/cmd/host-ctr/main.go (outdated, resolved)
@etungsten
Contributor Author

Addresses @jahkeup's comments.

@bcressey
Contributor

bcressey commented Oct 19, 2019

Some thoughts:

  • we should add some jitter to the retry attempts, so we don't have a large number of nodes retrying exactly in lockstep
  • we should make sure that the system converges on a good state even in the face of a multi-day outage for the upstream registry

It's useful for host-ctr to periodically error out and restart from the top rather than retrying indefinitely; that can help if the underlying cause was transient (e.g. bad DNS resolvers) and would be fixed by an application restart. In that case the service needs to be configured to always restart.

In the context of a prolonged outage, a retry pattern like 1 -> 2 -> 4 -> 8 ... <restart> ... 1 -> 2 -> 4 -> 8 ... isn't ideal since we don't need the rapid bursts of retries, and that drives load on an upstream that may be struggling. Maybe a single try at start, then a random delay for 45-75 seconds between attempts. The service should also delay between restarts so we don't end up with two requests with no intervening delay (last request -> exit -> first request).

@etungsten
Contributor Author

Rebase develop.

Member

@jahkeup jahkeup left a comment


@bcressey I think we should lift some of the handling into the unit; we can modify the RestartSec directive to specify a larger period of time to allow for retries. @etungsten we can also use a well-known returnable error (with var retriesError = ... or similar) to detect a retry error and sleep before exiting to add that jitter.

Either way, I think to make this reliable we'll also want to further classify errors so we can distinguish transient failures (where possible) from permanent failures (e.g. the user specified a bad image). This can be a fast-follow item IMO. I think this already gets us to a better standing in terms of transient errors.

Review thread on workspaces/host-ctr/cmd/host-ctr/main.go (outdated, resolved)
@etungsten
Contributor Author

etungsten commented Oct 21, 2019

It seems to me that there isn't a good way to match on types of errors to determine whether a failure was transient or not.
I tried reading the source for Client.Pull to see what error message suffix or prefix I should match on. I'm not sure suffix and prefix matching is something we'd like to do; it doesn't seem reliable, since the error messages could change depending on how the underlying Resolver is implemented.
Please let me know if there's a preferred way of doing this.
For now, I've created an issue to track this #443

@etungsten
Contributor Author

@jahkeup
Member

jahkeup commented Oct 21, 2019

> RestartSec is already set to 10 in host-containers@.service:

Yeah, but I think to address @bcressey's concerns, we'd want to increase that and/or add some delay before the erroring exit so that host-ctr spreads out its restarts (maybe with a flag to control the behavior).

@etungsten
Contributor Author

etungsten commented Oct 21, 2019

Adds a random jitter of 2-6 seconds to each retry attempt to prevent multiple clients from retrying in synchronization.

  • This configuration allows for a minimum total retry duration of (2*5 + 31) = 41 seconds and a maximum of (6*5 + 31) = 61 seconds.

Addresses @jahkeup's comment.

@etungsten etungsten force-pushed the host-ctr-retry-pull branch 2 times, most recently from b352492 to 03eb767 Compare October 21, 2019 21:49
@etungsten
Contributor Author

Ran go fmt and stopped the jitter from accumulating across retries.

@etungsten
Contributor Author

Increases RestartSec from 10 seconds to 45 seconds.
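
In unit terms, the change might look like this sketch of host-containers@.service (a fragment under assumptions, not a verbatim excerpt from the repo):

```ini
[Service]
# Always restart so host-ctr re-enters its retry loop after an erroring
# exit, and wait 45s between restarts to avoid back-to-back registry
# requests (last request -> exit -> first request).
Restart=always
RestartSec=45
```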

@tjkirch tjkirch requested a review from bcressey October 23, 2019 17:27
Member

@jahkeup jahkeup left a comment


I think the current code paths are okay even with the small comments I have, if you have a chance to make these changes now please do. If we need to move on, please track em to fix! 👍

Review threads on workspaces/host-ctr/cmd/host-ctr/main.go (outdated, resolved)
Implements exponential backoff for image pulls.

Also moves signal handling channel setup to after the image pull
happens, since it's mostly relevant only when we're starting a
container task.
@etungsten
Contributor Author

Addresses @jahkeup's comments.

Member

@jahkeup jahkeup left a comment


⛵ 💾 👍

return nil, errors.Wrap(err, "retries exhausted")
}
// Add a random jitter between 2 - 6 seconds to the retry interval
retryIntervalWithJitter := retryInterval + time.Duration(rand.Int31n(jitterPeakAmplitude))*time.Millisecond + jitterLowerBound*time.Millisecond
Member

nit: this should use types in the const where possible, but is small enough in scope that this isn't needed at the moment. Let's carry on, let's get this on a host!

@etungsten etungsten merged commit 07ebe30 into develop Oct 24, 2019
@etungsten etungsten deleted the host-ctr-retry-pull branch October 24, 2019 21:30
@iliana iliana added this to the v0.2.0 milestone Nov 19, 2019
