
KazooRetry: Do uniform jitter over the whole backoff interval #521

Merged: 2 commits merged into python-zk:master on Sep 26, 2018

Conversation

andreasgudmundsson (Contributor):

The previous implementation would add a fixed amount of jitter around
the calculated backoff time. Retry attempts were thus clustered
around the exponentially spaced backoff points.

This patch keeps the exponential backoff but spreads the retries
uniformly over the interval [0, backoff**attempt].
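To make the difference concrete, here is a minimal sketch contrasting the two strategies (illustrative only, not the kazoo source; the function names are made up, and the old code's max_jitter is expressed in hundredths of a second as in the diff below):

```python
import random

def old_sleeptime(cur_delay, max_jitter=100):
    # Old behaviour: the exponentially growing delay plus at most one
    # second of jitter, so retries from many clients cluster around
    # the same instants.
    return cur_delay + random.randint(0, max_jitter) / 100.0

def new_sleeptime(cur_delay):
    # New behaviour: a uniform draw over the whole interval
    # [0, cur_delay], so retries spread out evenly.
    return random.randint(0, 1 + int(cur_delay))
```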

The following graphs show the connection rate per second when trying to open 20k new connections
to ZK. Each client was configured with backoff=2, delay=10, and max_delay=180. The ZK ensemble in question is rate-limited, which explains the ceiling visible in the second graph.

Old Jitter
The connection attempts come in herds after each delay interval has passed.
After more than 30 minutes only about 10k connections had been established, which is where I cut off the graph.

(graph: zk_connections_no_jitter)

Uniform Jitter
The connection attempts hit the rate limit; after 8 minutes the connection rate
falls below the limit, and the remaining clients connect over a period of 5 minutes. The whole
process takes 13 minutes.

(graph: zk_connections_patched)

@jeffwidman (Member) left a comment:


This looks like an important PR--those graphs are very compelling.

I'd like to merge this.

However, can you update the git commit messages to match this formatting? https://github.com/python-zk/kazoo/blob/master/CONTRIBUTING.md#git-commit-guidelines

Sorry to be pedantic, but we use an automated tool to generate changelogs from commit messages, hence the formatting requirement.

kazoo/retry.py (Outdated)

```diff
@@ -129,8 +125,7 @@ def __call__(self, func, *args, **kwargs):
             if self._attempts == self.max_tries:
                 raise RetryFailedError("Too many retry attempts")
             self._attempts += 1
-            sleeptime = self._cur_delay + (
-                random.randint(0, self.max_jitter) / 100.0)
+            sleeptime = random.randint(0, 1+int(self._cur_delay))
```
@jeffwidman (Member):

style: missing spaces around the +

@andreasgudmundsson (Contributor, Author):

I pushed a style fix along with an updated commit message.
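For context, the changed line sits inside KazooRetry's retry loop. The sketch below shows roughly how the pieces interact; it is a simplified reconstruction under assumptions, not the exact kazoo source:

```python
import random
import time

def retry_call(func, delay=10, backoff=2, max_delay=180, max_tries=10,
               sleep_func=time.sleep):
    # Simplified reconstruction of the retry loop: cur_delay is the
    # *ceiling* that grows exponentially, while the actual sleep is a
    # uniform draw over [0, cur_delay].
    cur_delay = delay
    attempts = 0
    while True:
        try:
            return func()
        except ConnectionError:
            if attempts == max_tries:
                raise RuntimeError("Too many retry attempts")
            attempts += 1
            sleeptime = random.randint(0, 1 + int(cur_delay))
            sleep_func(sleeptime)
            # the ceiling keeps growing exponentially, capped at max_delay
            cur_delay = min(cur_delay * backoff, max_delay)
```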

```diff
     retry_max_delay=3600,
 )

 _RETRY_COMPAT_MAPPING = dict(
     max_retries='max_tries',
     retry_delay='delay',
     retry_backoff='backoff',
-    retry_jitter='max_jitter',
```
@jeffwidman (Member):

Note: Technically this is a backwards-incompatible change as it will stop warning that this is a deprecated param. However, it looks like this has been deprecated for over 5 years since version 1.2: 3d577e4#diff-7d5ec76d88673978d0b9b679863f349dR228

So I'm in agreement with simply removing this altogether.
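For readers unfamiliar with the pattern, here is a hypothetical illustration of how such a compat mapping is typically consumed; this is not kazoo's actual code, and everything apart from the mapping's own keys is made up:

```python
import warnings

_RETRY_COMPAT_MAPPING = dict(
    max_retries='max_tries',
    retry_delay='delay',
    retry_backoff='backoff',
)

def translate_retry_kwargs(kwargs):
    # Translate deprecated keyword names to their new equivalents,
    # emitting a DeprecationWarning for each deprecated name used.
    out = dict(kwargs)
    for old, new in _RETRY_COMPAT_MAPPING.items():
        if old in out:
            warnings.warn("%s is deprecated, use %s" % (old, new),
                          DeprecationWarning)
            out[new] = out.pop(old)
    return out
```

With retry_jitter dropped from the mapping, a caller still passing it would presumably now get a TypeError (or have it silently ignored) rather than a deprecation warning, which is the backwards-incompatibility noted above.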

@andreasgudmundsson (Contributor, Author):
I've changed the commit message so it should pass the commit message guidelines.

…ff interval

(commit message body as in the PR description above)
@jeffwidman merged commit 60366d2 into python-zk:master on Sep 26, 2018
@jeffwidman (Member):

Awesome, merged! Thank you again @andreasgudmundsson for demonstrating the impact with those graphs.

@andreasgudmundsson (Contributor, Author):

Thanks for merging the request @jeffwidman. Do you have plans for making a new release?

@jeffwidman (Member):

> Do you have plans for making a new release?

There are a couple of submitted PRs where I'm waiting on the owners to rebase or address minor review points, and it'd be nice to land those first. So I'll wait a week; then I'm traveling the following week for work. That means I'd probably be willing to cut a release the week after that (three weeks from now), because if there's a bug in one of the recently submitted PRs, I'll then have time to merge a fix and cut a second bugfix release. If I forget, please ping me on here.

@andreasgudmundsson (Contributor, Author):

> If I forget, please ping me on here.

Ping

@StephenSorriaux (Member):

Hi @andreasgudmundsson, the 2.6.0 release of Kazoo is now live!

ofek pushed a commit to DataDog/integrations-core that referenced this pull request Dec 12, 2018
* Bump kafka-python to 1.4.4

This bumps to 1.4.4 which added the new `KafkaAdminClient` which we need
in order to make `monitor_unlisted_consumer_groups` work for consumer
offsets stored in Kafka.

It also comes in handy since it will let us remove a bunch of code from
the existing `KafkaCheck` and just rely on upstream.

* Bump vendored kazoo to 2.6.0

This fixes a number of issues, including handling backoff/jitter better:
python-zk/kazoo#521
nmuesch pushed a commit to DataDog/integrations-core that referenced this pull request Dec 14, 2018 (same commit message as above)