Retry on transient error codes in preload #5982

Merged: 2 commits merged into dask:main on Mar 30, 2022

Conversation

mrocklin (Member)

Closes #5981

cc @jacobtomlinson if you have a moment

also cc @shughes-uk

github-actions bot (Contributor) commented Mar 23, 2022

Unit Test Results

    12 files ±0    12 suites ±0    6h 11m 37s ⏱️ -1h 8m 59s
    2 669 tests +3:    2 588 ✔️ passed +3,    81 💤 skipped +1,    0 failed -1
    15 942 runs +34:    15 100 ✔️ passed +50,    842 💤 skipped -15,    0 failed -1

Results for commit 6bc4415. ± Comparison against base commit 4866615.

♻️ This comment has been updated with latest results.

The following review comments are on the retry loop added in this PR's diff:

    response = urllib.request.urlopen(request)
    source = response.read().decode()
    duration = 0.2
    for i in range(10):
Contributor
This would probably work, but I'd really like to adjust how persistent it would be. For our application we'd probably want about 2-3 minutes of retrying if needed.

Member

Perhaps this could be made configurable?

Member Author

I'm happy to do whatever. I'm curious, though: under what conditions would a web application not respond within 3 minutes? That seems like a long time to me. If that application is Coiled, then I'd love to see higher standards applied there if possible, rather than making Dask more lenient than is typical.

Contributor

I don't have hard data, but I suspect there's a long tail of events not-entirely-under-our-control that would justify retrying.

E.g. AWS networking has a hiccup for a little while, etc.

I'm not sure if that really justifies long retries, though.

fjetter (Member) left a comment

We might want to reuse existing logic (at the cost of a new dependency): https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#urllib3.util.Retry

urllib3 is a very common library (e.g. requests, boto, jupyterlab, and conda depend on it), so chances are high it is already installed regardless. We'd need to promote it to a proper dependency, of course.
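As a rough sketch, fetching a preload script through urllib3 with such a Retry object might look like the following; the URL, status codes, and retry counts here are illustrative placeholders rather than anything this PR settles on:

    import urllib3
    from urllib3.util import Retry

    # Sketch only: retry a handful of transient HTTP status codes with
    # exponential backoff before giving up and raising.
    retry = Retry(
        total=10,
        status_forcelist=[429, 502, 503, 504],
        backoff_factor=0.2,
    )
    http = urllib3.PoolManager()
    # The URL below is a placeholder for whatever preload script is configured.
    response = http.request("GET", "https://example.com/preload.py", retries=retry)
    source = response.data.decode()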

mrocklin (Member Author)

Cool. Should I still take this on or is someone else able to get this out quickly?

I'm at a conference today but I can squeeze this in. I'm not particularly better at this than the rest of the team though, so I'm inclined to offload to someone else with more experience if they're available. If this is going to take another person 24 hours to spin up though then I'll just handle it.

mrocklin (Member Author)

I'm totally happy to use urllib3 if it's useful. My sense is that the Retry object that @fjetter points to would give us ...

  1. Retry logic
  2. Redirect handling

It wouldn't give us some of the other exceptions that @shughes-uk mentioned. That's fine, we can keep the custom logic.

I'm fine going either way. Folks seem excited about urllib3. For the small scope that we need, I don't currently see a large difference, but possibly there's more in urllib3 that would help us handle this situation.

If I were to go ahead solo right now, I would probably just continue down the current path in this PR, extending it and making it configurable. I'm not an expert here, though, and am happy to go in another direction.

fjetter (Member) commented Mar 24, 2022

It wouldn't give us some of the other exceptions that @shughes-uk mentioned. That's fine, we can keep the custom logic.

I might've missed something but what custom logic would be required? The object below should cover everything.

    urllib3.util.Retry(
        status_forcelist=[429, 504, 503, 502],
        backoff_factor=0.2,
    )

My sense is also that this should be configurable. Probably only a few people actually care, but I've been in enough arguments about which error codes to retry that I know some will. For instance, should 500 be included in the retry? A similar argument applies to the duration. With this default and a backoff factor of 0.2, this may retry for up to ~200 s, which is relatively long:

In [1]: def backoff_factor(attempt, base=0.1):
   ...:     return base * 2**attempt

In [2]: [backoff_factor(attempt, 0.2) for attempt in range(10)]
Out[2]: [0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4]

In [3]: sum([backoff_factor(attempt, 0.2) for attempt in range(10)])
Out[3]: 204.60000000000002

In [4]: sum([backoff_factor(attempt, 0.2) for attempt in range(10)]) / 60
Out[4]: 3.4100000000000006

I'm also OK with the current version of the code, although I don't have any concerns about adding urllib3 as a dependency.

mrocklin (Member Author)

I threw up a version with urllib3 so that we could see it. It isn't yet configurable. Tests seem to be unhappy about a lingering thread. That might be unrelated though. If this works well then I'll go ahead with this and add configuration. If tests are unhappy for some reason then I'll revert to using urllib and custom logic.

mrocklin (Member Author)

Configuration here is an interesting question. We can either call out specific options in our config, or we can say "we'll pass through any options to urllib3.util.Retry". I don't have a strong preference either way; currently I'm inclined towards the latter. It marries us a little bit to this library, but I'd like to avoid owning this configuration space if we can.
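As a rough sketch of that pass-through approach; the config key name below is hypothetical, not something this PR defines:

    import dask.config
    from urllib3.util import Retry

    # Sketch only: read a dict of keyword arguments from config and pass it
    # straight through to urllib3.util.Retry, so we avoid owning the option
    # names ourselves. "distributed.preload.retry" is a made-up key here.
    retry_kwargs = dask.config.get("distributed.preload.retry", default={})
    retry = Retry(**retry_kwargs)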

@shughes-uk is there someone who should try this out on your end to make sure that things are happy?

dchudz (Contributor) commented Mar 26, 2022

A Coiled-internal issue for details of testing on our end: https://gitlab.com/coiled/cloud/-/issues/4628

ntabris (Contributor) commented Mar 30, 2022

I'm happy with this PR/solution. I've used almost exactly the same code for a different "phone home" and validated that similar code under load testing (which had previously been failing on account of transient errors).

mrocklin (Member Author)

OK, I'm planning to merge this as-is if there are no further objections today.

If folks end up wanting configuration then this will be easy to add in the future.

mrocklin merged commit b9902d7 into dask:main on Mar 30, 2022
mrocklin deleted the preload-transient-error-codes branch on March 30, 2022 at 22:50
mrocklin added a commit to mrocklin/distributed that referenced this pull request Mar 31, 2022
Successfully merging this pull request may close: Preload fetching should be robust to transient error codes (#5981)