
Remove hard coded connect handshake timeouts #4176

Merged
8 commits merged on Nov 3, 2020

Conversation

fjetter
Member

@fjetter fjetter commented Oct 21, 2020

This code section, in particular the handshake (#4019), is causing issues for me in "high load" scenarios. I see broken connections popping up once I reach about ~200 workers, and these errors tear down the entire system. From the exceptions I cannot infer whether the handshake runs into a timeout or whether it is completely broken (mostly because our internal user reports are not as thorough as I'd like them to be, but I'm investigating).
Whenever the handshake fails, it is raised as a CommClosedError, which is, unfortunately, not an EnvironmentError. Therefore, what I wanted to change was the set of exception types on which we retry, to make it more inclusive.
Then I had a look at the code, was rather confused by the retry behaviour and the individual timeouts, and started to (subjectively) simplify this piece of code and write a test for it. The test is a bit messy; it works, but I'd appreciate suggestions on how it could be implemented more cleanly.

W.r.t. the implementation, I am open to suggestions and can revert anything/everything/nothing depending on the feedback here.

Here is a quick dump of my thoughts on this:

  • We should retry not only on EnvironmentError but on more inclusive error classes (at least the CommClosedError we raise ourselves)
  • We should retry with (an exponential) backoff
  • Ideally with a jitter
  • I don't have a strong opinion about individual timeouts for the read/write/connect steps. Therefore I chose this approach, where every step may take time until the deadline is reached. I guess one could argue that this should somehow be split up, but I went for the reduced-complexity approach instead.
  • I removed the backoff cap since I figured we wouldn't really need it here. This is just a gut feeling; happy to reintroduce something if needed.
  • I based the implementation on https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ and chose the FullJitter approach. Not sure if it applies 100%, but it made me feel as if my choices were "data driven", and the situation described there sounded similar enough to ours :)

In case anybody wonders, with the chosen base of 0.01 this results in the following (non-randomised) backoffs:

In [1]: backoff_base = 0.01

In [2]: [backoff_base * 2 ** ix for ix in range(10)]
Out[2]: [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12]

In [3]: sum([backoff_base * 2 ** ix for ix in range(10)])
Out[3]: 10.23

With the jitter, this likely adds up to a significantly larger maximum number of retries.
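For readers wanting to see the overall shape this produces, below is a minimal sketch of a retry loop with one shared deadline and a FullJitter backoff capped by the remaining time. The connect_once coroutine, the time_left helper, and the caught exception types are illustrative assumptions, not the actual code in distributed/comm/core.py.

import asyncio
import random
import time


async def connect_with_retry(connect_once, timeout=10, backoff_base=0.01):
    """Retry ``connect_once`` until ``timeout`` seconds have passed,
    sleeping a FullJitter backoff (capped by the time remaining)
    between attempts."""
    deadline = time.monotonic() + timeout

    def time_left():
        return max(deadline - time.monotonic(), 0)

    attempt = 0
    active_exception = None
    while time_left() > 0:
        try:
            return await asyncio.wait_for(connect_once(), timeout=time_left())
        except (OSError, asyncio.TimeoutError) as exc:
            active_exception = exc
            # FullJitter: uniform in [0, base * 2**attempt], but never longer
            # than the time remaining until the overall deadline.
            upper_cap = min(time_left(), backoff_base * (2 ** attempt))
            await asyncio.sleep(random.uniform(0, upper_cap))
            attempt += 1
    raise OSError("Connect timed out") from active_exception

As discussed below, the real implementation also applies a per-attempt intermediate cap and does not retry the handshake; this sketch only shows the deadline and backoff mechanics.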

There is currently an alternative fix open for this section; see #4167. cc @jcrist

Member

@jcrist jcrist left a comment

Thanks @fjetter, this looks better than my attempt (apologies for never finishing that up). A few comments.

Two review threads on distributed/comm/core.py (outdated, resolved)
@fjetter
Member Author

fjetter commented Oct 21, 2020

Tests fail because I changed the way exceptions are re-raised. I will need to change this logic again.

@quasiben
Member

@pentschev and I have been looking at these issues as well. @pentschev, are you interested in testing out this PR for our UCX case? If not, I can check it out.

@pentschev
Member

@quasiben I tested this now and unfortunately it doesn't resolve our issues, but this particular piece of code seems to be causing trouble in different situations, which is a bit worrying. I'm still not certain where the problem is on our side, but it doesn't seem to be related to a timeout for us: we mostly see connections getting closed during read/write even if we increase the wait_for timeout to a very large number, so perhaps we're missing an await or something analogous there.

@quasiben
Member

Thank you for testing @pentschev

@fjetter
Member Author

fjetter commented Oct 26, 2020

After implementing @jcrist's suggestion to not include the handshakes in the retries, I added a test checking for "slow handshakes" and stumbled over the listener timeouts as well. I completely removed the timeouts for the handshake in the listener since I figured it is fine to enforce timeouts on the connector side, but I'm not entirely sure about this. If we require a timeout there as well, we'll need to make it configurable somehow, since 1s is not enough.

@pentschev
Member

With the most recent changes I've also seen the CommClosedErrors go away in our Dask-CUDA+UCX use case. However, I see some cleanup issues that I'm not yet sure are related to the handshake connection. Regardless, this seems like great progress and we can continue looking into the cleanup issues down the road. I'm definitely +1 on this PR.

@fjetter
Member Author

fjetter commented Oct 28, 2020

I still had an issue in the code where I encountered negative backoffs. That might've been the cause of the failing builds. The retry logic itself didn't otherwise change. I think the important change in the last commits was removing the timeout from the listeners, but I'm not sure it is safe not to have any there.

Other than this, I'm also wondering what to do with the comm in case the handshake fails. I am now trying to close it, but what happens if the close fails or gets stuck for whatever reason?
@jcrist you added a comm.abort for this case. Any reason why comm.close() isn't sufficient? Shall I add a comm.abort as well?

@jcrist
Member

jcrist commented Oct 28, 2020

added a comm.abort for this case. Any reason why comm.close() isn't sufficient? Shall I add a comm.abort as well?

What you have here is fine. I added a comm.abort in that location since it was being called from a synchronous context (comm.abort isn't an async method).

active_exception = exc
# FullJitter see https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

upper_cap = min(time_left(), backoff_base * (2 ** attempt))
Member

We should also bump the intermediate_cap by some fraction, in case the initial size is too small. As it is right now, no connect attempt can last longer than timeout/5. Perhaps 1.5x it every attempt?

Member Author

I don't have a good feeling about this intermediate cap, but increasing it by a factor each attempt should be a safe default. I'll add the 1.5x.

Member Author

@fjetter fjetter Oct 28, 2020

Maybe I'll add a smaller factor? Otherwise, we'd effectively limit ourselves to three tries:

In [1]: sum([0.2 * 1.5 ** attempt for attempt in range(3)])
Out[1]: 0.95

As I said, I don't have a good feeling about how important these intermediate caps really are.

Member

@jcrist jcrist Oct 28, 2020

I agree on this; running into DNS race conditions feels like a sign of larger problems elsewhere, but for now we should at least match the existing behavior. My goal with increasing the value here is that, depending on the value of distributed.comm.timeouts.connect, no intermediate_cap may be large enough to complete if it's set at 1/5 the timeout. I'm not too worried about limiting to 3 attempts (note this is only true if the timeout is hit each time, not some other error), as more attempts than that are likely a sign of deeper issues. 1.5 or 1.25 both seem fine; I wouldn't want to go lower than that.
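To make the suggested growth concrete, here is a small illustrative loop (not the PR's code) showing a per-attempt cap that starts at a fifth of the total timeout and is multiplied by 1.5 after each failed attempt, while never exceeding the total budget:

# Illustrative only: how the per-attempt cap could grow across retries.
timeout = 10                    # e.g. distributed.comm.timeouts.connect
intermediate_cap = timeout / 5  # first attempt limited to timeout / 5

caps = []
for attempt in range(5):
    caps.append(min(intermediate_cap, timeout))  # never exceed the total budget
    intermediate_cap *= 1.5

print(caps)  # [2.0, 3.0, 4.5, 6.75, 10]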

# FullJitter see https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

upper_cap = min(time_left(), backoff_base * (2 ** attempt))
backoff = random.uniform(0, upper_cap)
Member

@jcrist jcrist Oct 28, 2020

The backoff should get progressively larger each attempt; as is, this is adding a random sleep with progressively larger ranges (but 0 is still valid). I liked the old algorithm better, which roughly 1.5x'd the previous backoff with some jitter. Why did you make this change? (Edit: I followed the link to the AWS post; the algorithm(s) there look fine, but there are some bugs in this implementation of them.)

Member Author

@fjetter fjetter Oct 28, 2020

The AWS blog post does not include timeouts, which is why I slightly changed the logic, but I wouldn't classify that as a bug.

I based this on the FullJitter approach, which is defined by

  1. sleep = random_between(0, min(cap, base * 2 ** attempt))

where I chose the remaining time left as the cap. The reason I chose this cap is that I don't want the coroutine to block unnecessarily long if there is no chance for it to complete anyhow.

In a previous iteration I was doing it the other way round, namely

  2. sleep = min(cap, random_between(0, base * 2 ** attempt))

which is slightly different, but what they both have in common is that they do not deterministically produce progressively larger backoffs. Only the expected backoff increases progressively; 0 is theoretically still valid for all attempts. Only the EqualJitter algorithm guarantees that zero is never chosen, but it performs slightly worse than the others.

Would you prefer 1. over 2. or do I have an error in my logic?
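For illustration, the two variants side by side; random_between is just random.uniform, cap stands for the time remaining, and the function names are made up for this comparison:

import random

backoff_base = 0.01


def backoff_option_1(attempt, cap):
    # Option 1 (this PR): cap the range first, then draw the jitter.
    return random.uniform(0, min(cap, backoff_base * 2 ** attempt))


def backoff_option_2(attempt, cap):
    # Option 2 (previous iteration): draw the jitter first, then clip it.
    return min(cap, random.uniform(0, backoff_base * 2 ** attempt))

In both variants a draw close to zero remains possible; only the expected backoff grows with the attempt number.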

Member

The AWS blog post does not include timeouts, which is why I slightly changed the logic

But the calculation here is still for a backoff in attempts, not timeouts (so it seems the blog post should apply as written?).

I missed the FullJitter option, which does look like what you've implemented here. For connect failures, though, I think we do want to ensure some amount of backoff is used in case the server is still starting up or unavailable for other reasons. Both the "Equal Jitter" and "Decorrelated Jitter" options should (IIUC) provide a guarantee of non-zero backoff times (Decorrelated looking slightly better), but we could also use what you have here with a non-zero minimum (perhaps 0.05 or something small). On further thought, what you have here seems fine too; thanks for the explanation.

Member Author

(so it seems the blog post should apply as written?)

There is one scenario where the new backoff would breach the timeout

(numbers only for clarity)

  • Initial timeout of, say, 5s
  • We're in attempt 10 with about 100ms remaining
  • The new backoff would come out to 200ms
  • We'd wait and try to connect again, but the 11th attempt would then have a timeout of zero, i.e. it is guaranteed to fail

-> We'd have waited 5.1s in total, i.e. longer than the configured timeout

I introduced the cap since it at least gives a marginal chance of another try with a 100ms timeout, and it does not breach the total amount of time waited. The analysis in the blog post assumes that we retry indefinitely (or up to a fixed maximum number of attempts) until we're successful, but we're in a slightly different scenario here.
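Spelling out the arithmetic of that scenario with the illustrative numbers above (nothing here comes from the actual code):

timeout_ms = 5000      # configured connect timeout
remaining_ms = 100     # time left when attempt 10 fails
backoff_ms = 200       # uncapped backoff drawn for the next attempt

waited_ms = (timeout_ms - remaining_ms) + backoff_ms
print(waited_ms)       # 5100, i.e. 100 ms past the configured timeout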

"Decorrelated Jitter" options should (IIUC) provide a guarantee of non-zero backoff times (Decorrelated looking slightly better)

Correct, that was a false statement of mine. The decorrelated jitter is always bounded below by the base in this example.

@fjetter
Member Author

fjetter commented Oct 28, 2020

Finally, I'm wondering if the default connect timeout should be increased. We currently have 10s as the default connect timeout, but I guess this was set at a time when we had a simpler retry mechanism (without intermediate capping) and no handshakes. Considering that multiple people have encountered the CommClosed exceptions since the handshake was hard coded to 1s, this might indicate that a more conservative value should be set for the overall timeout. (My issues did not reappear with the default, so I'd be fine either way, just asking the question.)
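(For anyone who wants a more conservative overall timeout without waiting for a change of the default: the value can already be raised through Dask's configuration. The 30s below is only an example value.)

import dask

# Raise the overall connect timeout (defaults to "10s" at the time of this PR).
dask.config.set({"distributed.comm.timeouts.connect": "30s"})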

@jcrist
Member

jcrist commented Oct 28, 2020

I'd be fine increasing the default to something higher (perhaps 30s?), but don't think that necessarily needs to be done here. Unless others think otherwise, I think we should get this fix in but leave the timeout the same.

@fjetter fjetter changed the title from "Allow connect retries if handshake fails" to "Remove hard coded connect handshake timeouts" on Oct 28, 2020
@jcrist
Member

jcrist commented Oct 29, 2020

This generally looks good to me, but there's a test failure at test_worker_who_has_clears_after_failed_connection that looks related.

@fjetter
Member Author

fjetter commented Oct 29, 2020

Yes, I'm looking into the tests. I currently suspect the intermediate_cap is too small and am trying to increase it. I'm also considering removing the cap entirely after the initial failure, but am waiting for the build.

However, I don't think the failure is actually caused by this change, merely amplified by it. I noticed an awful lot of errors in the logs which are apparently retried and may be responsible for an overall flakiness of the system at the moment (#4199).

@TomAugspurger
Member

How's this looking @fjetter? Does #4200 / #4199 need to be resolved before this can be merged?

For reference, we're planning to backport this fix and issue a release, ideally today, but we can push it back if this isn't ready. Does this build in any way on #4200, so that it would need to be backported too, or are they likely independent?

@fjetter
Member Author

fjetter commented Oct 30, 2020

I'm pretty certain #4200 / #4199 was introduced by #4107, which is not part of distributed==2.30.0, therefore the backport is not necessary. I'm struggling to find a reason for all of the failures and merely suspect they are connected to #4199, but I cannot confirm it.

@TomAugspurger
Member

Thanks. #4204 is testing out this diff against 2.30.x. If CI passes there then we can be confident that the test failures here are unrelated.

We plan to merge this to master, cherry-pick & backport it to 2.30.x, and then release 2.30.1.

@jennakwon06

Hello - I see that 2.30.1 is in the changelog (https://distributed.dask.org/en/latest/changelog.html) but not available on PyPI yet.

We are blocked on the connect timeout fix and were wondering when it would be available on PyPI.

Thanks!

@jcrist
Member

jcrist commented Nov 2, 2020

The 2.30.1 release isn't out yet, see dask/community#105 for more info.

Member

@jrbourbeau jrbourbeau left a comment

Tests have passed over in #4204, so I'm going to merge this PR and include it in the 2.30.1 release. Thanks @fjetter (and @jcrist @quasiben @pentschev for reviewing)!
