
Remove hard coded connect handshake timeouts #4176

Merged
8 commits merged on Nov 3, 2020

Conversation

fjetter
Member

@fjetter fjetter commented Oct 21, 2020

This code section, in particular the handshake (#4019), is causing issues for me in "high load" scenarios. I see broken connections popping up once I reach about ~200 workers, and these errors tear down the entire system. From the exceptions I cannot infer whether the handshake runs into a timeout or whether it is completely broken (mostly because our internal user reports are not as thorough as I'd like them to be, but I'm investigating).
Whenever the handshake fails, it is raised as a CommClosedError, which is, unfortunately, not an EnvironmentError. Therefore, what I wanted to change was the set of exception types on which we retry, to make it more inclusive.
Then I had a look at the code, was rather confused by the retry behaviour and the individual timeouts, and started to (subjectively) simplify this piece of code and write a test for it. The test is a bit messy; it works, but I'd appreciate suggestions on how it could be implemented more cleanly.

W.r.t. the implementation, I am open to suggestions and can revert anything/everything/nothing depending on the feedback here.

Here is a quick dump of my thoughts on this:

  • We should retry not only on EnvironmentError but on more inclusive error classes (at least the CommClosedError we raise ourselves)
  • We should retry with (an exponential) backoff
  • Ideally with a jitter
  • I don't have a strong opinion about individual timeouts for the read/write/connect steps. Therefore I chose this approach, where every step may take time until the deadline is reached. I guess one could argue that this should somehow be split up, but I went for the reduced-complexity approach instead.
  • I removed the backoff cap since I figured we wouldn't really need it here. This is just a gut feeling; happy to reintroduce something if needed.
  • I based the implementation on https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ and chose the FullJitter approach. Not sure if it applies 100%, but it made me feel as if my choices were "data driven", and the situation described there sounded similar enough to ours :)

In case anybody wonders, with the chosen base of 0.01 this results in the following (non-randomised) backoffs:

In [1]: backoff_base = 0.01

In [2]: [backoff_base * 2 ** ix for ix in range(10)]
Out[2]: [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12]

In [3]: sum([backoff_base * 2 ** ix for ix in range(10)])
Out[3]: 10.23

With the jitter, this likely adds up to a significantly larger maximum number of retries.
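For readers wanting to see the overall shape this produces, below is a minimal sketch of a retry loop with one shared deadline and a FullJitter backoff capped by the remaining time. The connect_once coroutine, the time_left helper, and the caught exception types are illustrative assumptions, not the actual code in distributed/comm/core.py.

import asyncio
import random
import time


async def connect_with_retry(connect_once, timeout=10, backoff_base=0.01):
    """Retry ``connect_once`` until ``timeout`` seconds have passed,
    sleeping a FullJitter backoff (capped by the time remaining)
    between attempts."""
    deadline = time.monotonic() + timeout

    def time_left():
        return max(deadline - time.monotonic(), 0)

    attempt = 0
    active_exception = None
    while time_left() > 0:
        try:
            return await asyncio.wait_for(connect_once(), timeout=time_left())
        except (OSError, asyncio.TimeoutError) as exc:
            active_exception = exc
            # FullJitter: uniform in [0, base * 2**attempt], but never longer
            # than the time remaining until the overall deadline.
            upper_cap = min(time_left(), backoff_base * (2 ** attempt))
            await asyncio.sleep(random.uniform(0, upper_cap))
            attempt += 1
    raise OSError("Connect timed out") from active_exception

As discussed below, the real implementation also applies a per-attempt intermediate cap and does not retry the handshake; this sketch only shows the deadline and backoff mechanics.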

There is currently an alternative fix open for this section; see #4167. cc @jcrist

Member

@jcrist jcrist left a comment

Thanks @fjetter, this looks better than my attempt (apologies for never finishing that up). A few comments.

Two review threads on distributed/comm/core.py (outdated, resolved)
@fjetter
Member Author

fjetter commented Oct 21, 2020

Tests fail because I changed the way exceptions are re-raised. I will need to change this logic again.

@quasiben
Member

@pentschev and I have been looking at these issues as well. @pentschev, are you interested in testing out this PR for our UCX case? If not, I can check it out.

@pentschev
Member

@quasiben I tested this now and unfortunately it doesn't resolve our issues, but this particular piece of code seems to be causing trouble in different situations, which is a bit worrying. I'm still not certain where the problem is on our side, but it doesn't seem to be related to a timeout for us: we mostly see connections getting closed during read/write even if we increase the wait_for timeout to a very large number, so perhaps we're missing an await or something analogous there.

@quasiben
Member

Thank you for testing @pentschev

@fjetter
Member Author

fjetter commented Oct 26, 2020

After implementing @jcrist's suggestion to not include the handshakes in the retries, I added a test checking for "slow handshakes" and stumbled over the listener timeouts as well. I completely removed the timeouts for the handshake in the listener since I figured it is fine to enforce timeouts on the connector side, but I'm not entirely sure about this. If we require a timeout there as well, we'll need to make it configurable somehow, since 1s is not enough.

@pentschev
Member

With the most recent changes I've also seen the CommClosedErrors go away in our Dask-CUDA+UCX use case. However, I see some cleanup issues that I'm not yet sure are related to the handshake connection. Regardless, this seems like great progress and we can continue looking into the cleanup issues down the road. I'm definitely +1 on this PR.

@fjetter
Member Author

fjetter commented Oct 28, 2020

I still had an issue in the code where I encountered negative backoffs. That might've been the cause of the failing builds. The retry logic itself didn't otherwise change. I think the important change in the last commits was removing the timeout from the listeners, but I'm not sure it is safe not to have any there.

Other than this, I'm also wondering what to do with the comm in case the handshake fails. I am now trying to close it, but what happens if the close fails or gets stuck for whatever reason?
@jcrist you added a comm.abort for this case. Any reason why comm.close() isn't sufficient? Shall I add a comm.abort as well?

@jcrist
Member

jcrist commented Oct 28, 2020

added a comm.abort for this case. Any reason why comm.close() isn't sufficient? Shall I add a comm.abort as well?

What you have here is fine. I added a comm.abort in that location since it was being called from a synchronous context (comm.abort isn't an async method).

active_exception = exc
# FullJitter see https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

upper_cap = min(time_left(), backoff_base * (2 ** attempt))
Member

We should also bump the intermediate_cap by some fraction, in case the initial size is too small. As it is right now, no connect attempt can last longer than timeout/5. Perhaps 1.5x it every attempt?

Member Author

I don't have a good feeling about this intermediate cap, but increasing it by a factor each attempt should be a safe default. I'll add the 1.5x.

Member Author

@fjetter fjetter Oct 28, 2020

Maybe I'll add a smaller factor? Otherwise, we'd effectively limit ourselves to three tries:

In [1]: sum([0.2 * 1.5 ** attempt for attempt in range(3)])
Out[1]: 0.95

As I said, I don't have a good feeling about how important these intermediate caps really are.

Member

@jcrist jcrist Oct 28, 2020

I agree on this; running into DNS race conditions feels like a sign of larger problems elsewhere, but for now we should at least match the existing behavior. My goal with increasing the value here is that, depending on the value of distributed.comm.timeouts.connect, no intermediate_cap may be large enough to complete if it's set at 1/5 the timeout. I'm not too worried about limiting to 3 attempts (note this is only true if the timeout is hit each time, not some other error), as more attempts than that are likely a sign of deeper issues. 1.5 or 1.25 both seem fine; I wouldn't want to go lower than that.
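To make the suggested growth concrete, here is a small illustrative loop (not the PR's code) showing a per-attempt cap that starts at a fifth of the total timeout and is multiplied by 1.5 after each failed attempt, while never exceeding the total budget:

# Illustrative only: how the per-attempt cap could grow across retries.
timeout = 10                    # e.g. distributed.comm.timeouts.connect
intermediate_cap = timeout / 5  # first attempt limited to timeout / 5

caps = []
for attempt in range(5):
    caps.append(min(intermediate_cap, timeout))  # never exceed the total budget
    intermediate_cap *= 1.5

print(caps)  # [2.0, 3.0, 4.5, 6.75, 10]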

# FullJitter see https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

upper_cap = min(time_left(), backoff_base * (2 ** attempt))
backoff = random.uniform(0, upper_cap)
Member

@jcrist jcrist Oct 28, 2020

The backoff should get progressively larger each attempt; as is, this is adding a random sleep with progressively larger ranges (but 0 is still valid). I liked the old algorithm better, which roughly 1.5x'd the previous backoff with some jitter. Why did you make this change? (Edit: I followed the link to the AWS post; the algorithm(s) there look fine, but there are some bugs in this implementation of them.)

Member Author

@fjetter fjetter Oct 28, 2020

The AWS blog post does not include timeouts, which is why I slightly changed the logic, but I wouldn't classify that as a bug.

I based this on the FullJitter approach, which is defined by

  1. sleep = random_between(0, min(cap, base * 2 ** attempt))

where I chose the remaining time left as the cap. The reason I chose this cap is that I don't want the coroutine to block unnecessarily long if there is no chance for it to complete anyhow.

In a previous iteration I was doing it the other way round, namely

  2. sleep = min(cap, random_between(0, base * 2 ** attempt))

which is slightly different, but what they both have in common is that they do not deterministically produce progressively larger backoffs. Only the expected backoff increases progressively; 0 is theoretically still valid for all attempts. Only the EqualJitter algorithm guarantees that zero is never chosen, but it performs slightly worse than the others.

Would you prefer 1. over 2. or do I have an error in my logic?
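For illustration, the two variants side by side; random_between is just random.uniform, cap stands for the time remaining, and the function names are made up for this comparison:

import random

backoff_base = 0.01


def backoff_option_1(attempt, cap):
    # Option 1 (this PR): cap the range first, then draw the jitter.
    return random.uniform(0, min(cap, backoff_base * 2 ** attempt))


def backoff_option_2(attempt, cap):
    # Option 2 (previous iteration): draw the jitter first, then clip it.
    return min(cap, random.uniform(0, backoff_base * 2 ** attempt))

In both variants a draw close to zero remains possible; only the expected backoff grows with the attempt number.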

Member

The AWS blog post does not include timeouts, which is why I slightly changed the logic

But the calculation here is still for a backoff in attempts, not timeouts (so it seems the blog post should apply as written?).

I missed the FullJitter option, which does look like what you've implemented here. For connect failures, though, I think we do want to ensure some amount of backoff is used in case the server is still starting up or unavailable for other reasons. Both the "Equal Jitter" and "Decorrelated Jitter" options should (IIUC) provide a guarantee of non-zero backoff times (Decorrelated looking slightly better), but we could also use what you have here with a non-zero minimum (perhaps 0.05 or something small). On further thought, what you have here seems fine too; thanks for the explanation.

Member Author

(so it seems the blog post should apply as written?)

There is one scenario where the new backoff would breach the timeout

(numbers only for clarity)

  • Initial timeout of, say, 5s
  • We're in attempt 10 with about 100ms remaining
  • The new backoff would come out to 200ms
  • We'd wait and try to connect again, but the 11th attempt would then have a timeout of zero, i.e. it is guaranteed to fail

-> We'd have waited 5.1s in total, i.e. longer than the configured timeout

I introduced the cap since it at least gives a marginal chance of another try with a 100ms timeout, and it does not breach the total amount of time waited. The analysis in the blog post assumes that we retry indefinitely (or up to a fixed maximum number of attempts) until we're successful, but we're in a slightly different scenario here.
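Spelling out the arithmetic of that scenario with the illustrative numbers above (nothing here comes from the actual code):

timeout_ms = 5000      # configured connect timeout
remaining_ms = 100     # time left when attempt 10 fails
backoff_ms = 200       # uncapped backoff drawn for the next attempt

waited_ms = (timeout_ms - remaining_ms) + backoff_ms
print(waited_ms)       # 5100, i.e. 100 ms past the configured timeout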

"Decorrelated Jitter" options should (IIUC) provide a guarantee of non-zero backoff times (Decorrelated looking slightly better)

Correct, that was a false statement of mine. The decorrelated jitter is always bounded below by the base in this example.

@fjetter
Member Author

fjetter commented Oct 28, 2020

Finally, I'm wondering if the default connect timeout should be increased. We currently have 10s as the default connect timeout, but I guess this was set at a time when we had a simpler retry mechanism (without intermediate capping) and no handshakes. Considering that multiple people have encountered the CommClosed exceptions since the handshake was hard coded to 1s, this might indicate that a more conservative value should be set for the overall timeout. (My issues did not reappear with the default, so I'd be fine either way, just asking the question.)
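(For anyone who wants a more conservative overall timeout without waiting for a change of the default: the value can already be raised through Dask's configuration. The 30s below is only an example value.)

import dask

# Raise the overall connect timeout (defaults to "10s" at the time of this PR).
dask.config.set({"distributed.comm.timeouts.connect": "30s"})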

@jcrist
Member

jcrist commented Oct 28, 2020

I'd be fine increasing the default to something higher (perhaps 30s?), but don't think that necessarily needs to be done here. Unless others think otherwise, I think we should get this fix in but leave the timeout the same.

@fjetter fjetter changed the title from "Allow connect retries if handshake fails" to "Remove hard coded connect handshake timeouts" on Oct 28, 2020
@jcrist
Member

jcrist commented Oct 29, 2020

This generally looks good to me, but there's a test failure at test_worker_who_has_clears_after_failed_connection that looks related.

@fjetter
Member Author

fjetter commented Oct 29, 2020

Yes, I'm looking into the tests. I currently suspect the intermediate_cap is too small and am trying to increase it. I'm also considering removing the cap entirely after the initial failure, but am waiting for the build.

However, I don't think the failure is actually caused by this change, merely amplified by it. I noticed an awful lot of errors in the logs which are apparently retried and may be responsible for an overall flakiness of the system at the moment (#4199).

@TomAugspurger
Member

How's this looking @fjetter? Does #4200 / #4199 need to be resolved before this can be merged?

For reference, we're planning to backport this fix and issue a release, ideally today, but we can push it back if this isn't ready. Does this build in any way on #4200, so that it would need to be backported too, or are they likely independent?

@fjetter
Member Author

fjetter commented Oct 30, 2020

I'm pretty certain #4200 / #4199 was introduced by #4107, which is not part of distributed==2.30.0, therefore the backport is not necessary. I'm struggling to find a reason for all of the failures and merely suspect they are connected to #4199, but I cannot confirm it.

@TomAugspurger
Member

Thanks. #4204 is testing out this diff against 2.30.x. If CI passes there then we can be confident that the test failures here are unrelated.

We plan to merge this to master, cherry-pick & backport it to 2.30.x, and then release 2.30.1.

@jennakwon06

Hello - I see that 2.30.1 is in the changelog (https://distributed.dask.org/en/latest/changelog.html) but not available on PyPI yet.

We are blocked on the connect timeout fix and were wondering when it would be available on PyPI.

Thanks!

@jcrist
Member

jcrist commented Nov 2, 2020

The 2.30.1 release isn't out yet, see dask/community#105 for more info.

Member

@jrbourbeau jrbourbeau left a comment

Tests have passed over in #4204, so I'm going to merge this PR and include it in the 2.30.1 release. Thanks @fjetter (and @jcrist @quasiben @pentschev for reviewing)!
