Retry on transient error codes in preload #5982

Merged: 2 commits merged into dask:main on Mar 30, 2022

Conversation

mrocklin (Member)

Closes #5981

cc @jacobtomlinson if you have a moment

also cc @shughes-uk

github-actions bot (Contributor) commented Mar 23, 2022

Unit Test Results

    12 files ±0    12 suites ±0    6h 11m 37s ⏱️ -1h 8m 59s
    2 669 tests +3:    2 588 ✔️ passed +3,    81 💤 skipped +1,    0 failed -1
    15 942 runs +34:    15 100 ✔️ passed +50,    842 💤 skipped -15,    0 failed -1

Results for commit 6bc4415. ± Comparison against base commit 4866615.

♻️ This comment has been updated with latest results.

The following review comments are on the retry loop added in this PR's diff:

    response = urllib.request.urlopen(request)
    source = response.read().decode()
    duration = 0.2
    for i in range(10):
Contributor
This would probably work, but I'd really like to adjust how persistent it would be. For our application we'd probably want about 2-3 minutes of retrying if needed.

Member

Perhaps this could be made configurable?

Member Author

I'm happy to do whatever. I'm curious, though: under what conditions would a web application not respond within 3 minutes? That seems like a long time to me. If that application is Coiled, then I'd love to see higher standards applied there if possible, rather than making Dask more lenient than is typical.

Contributor

I don't have hard data, but I suspect there's a long tail of events not-entirely-under-our-control that would justify retrying.

E.g. AWS networking has a hiccup for a little while, etc.

I'm not sure if that really justifies long retries, though.

fjetter (Member) left a comment

We might want to reuse existing logic (at the cost of a new dependency): https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#urllib3.util.Retry

urllib3 is a very common library (e.g. requests, boto, jupyterlab, and conda depend on it), so chances are high it is already installed regardless. We'd need to promote it to a proper dependency, of course.
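As a rough sketch, fetching a preload script through urllib3 with such a Retry object might look like the following; the URL, status codes, and retry counts here are illustrative placeholders rather than anything this PR settles on:

    import urllib3
    from urllib3.util import Retry

    # Sketch only: retry a handful of transient HTTP status codes with
    # exponential backoff before giving up and raising.
    retry = Retry(
        total=10,
        status_forcelist=[429, 502, 503, 504],
        backoff_factor=0.2,
    )
    http = urllib3.PoolManager()
    # The URL below is a placeholder for whatever preload script is configured.
    response = http.request("GET", "https://example.com/preload.py", retries=retry)
    source = response.data.decode()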

mrocklin (Member Author)

Cool. Should I still take this on or is someone else able to get this out quickly?

I'm at a conference today but I can squeeze this in. I'm not particularly better at this than the rest of the team though, so I'm inclined to offload to someone else with more experience if they're available. If this is going to take another person 24 hours to spin up though then I'll just handle it.

mrocklin (Member Author)

I'm totally happy to use urllib3 if it's useful. My sense is that the Retry object that @fjetter points to would give us ...

  1. Retry logic
  2. Redirect handling

It wouldn't give us some of the other exceptions that @shughes-uk mentioned. That's fine, we can keep the custom logic.

I'm fine going either way. Folks seem excited about urllib3. For the small scope that we need, I don't currently see a large difference, but possibly there's more in urllib3 that would help us handle this situation.

If I were to go ahead solo right now, I would probably just continue down the current path in this PR, extending it and making it configurable. I'm not an expert here, though, and am happy to go in another direction.

fjetter (Member) commented Mar 24, 2022

It wouldn't give us some of the other exceptions that @shughes-uk mentioned. That's fine, we can keep the custom logic.

I might've missed something but what custom logic would be required? The object below should cover everything.

    urllib3.util.Retry(
        status_forcelist=[429, 504, 503, 502],
        backoff_factor=0.2,
    )

My sense is also that this should be configurable. Probably only a few people actually care, but I've been in enough arguments about which error codes to retry that I know some will. For instance, should 500 be included in the retry? A similar argument applies to the duration. With this default and a backoff factor of 0.2, this may retry for up to ~200 s, which is relatively long:

In [1]: def backoff_factor(attempt, base=0.1):
   ...:     return base * 2**attempt

In [2]: [backoff_factor(attempt, 0.2) for attempt in range(10)]
Out[2]: [0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4]

In [3]: sum([backoff_factor(attempt, 0.2) for attempt in range(10)])
Out[3]: 204.60000000000002

In [4]: sum([backoff_factor(attempt, 0.2) for attempt in range(10)]) / 60
Out[4]: 3.4100000000000006

I'm also OK with the current version of the code, although I don't have any concerns about adding urllib3 as a dependency.

mrocklin (Member Author)

I threw up a version with urllib3 so that we could see it. It isn't yet configurable. Tests seem to be unhappy about a lingering thread. That might be unrelated though. If this works well then I'll go ahead with this and add configuration. If tests are unhappy for some reason then I'll revert to using urllib and custom logic.

mrocklin (Member Author)

Configuration here is an interesting question. We can either call out specific options in our config, or we can say "we'll pass through any options to urllib3.util.Retry". I don't have a strong preference either way; currently I'm inclined towards the latter. It marries us a little bit to this library, but I'd like to avoid owning this configuration space if we can.
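As a rough sketch of that pass-through approach; the config key name below is hypothetical, not something this PR defines:

    import dask.config
    from urllib3.util import Retry

    # Sketch only: read a dict of keyword arguments from config and pass it
    # straight through to urllib3.util.Retry, so we avoid owning the option
    # names ourselves. "distributed.preload.retry" is a made-up key here.
    retry_kwargs = dask.config.get("distributed.preload.retry", default={})
    retry = Retry(**retry_kwargs)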

@shughes-uk is there someone who should try this out on your end to make sure that things are happy?

dchudz (Contributor) commented Mar 26, 2022

A Coiled-internal issue for details of testing on our end: https://gitlab.com/coiled/cloud/-/issues/4628

ntabris (Contributor) commented Mar 30, 2022

I'm happy with this PR/solution. I've used almost exactly the same code for a different "phone home" and validated that similar code under load testing (which had previously been failing on account of transient errors).

mrocklin (Member Author)

OK, I'm planning to merge this as-is if there are no further objections today.

If folks end up wanting configuration then this will be easy to add in the future.

mrocklin merged commit b9902d7 into dask:main on Mar 30, 2022
mrocklin deleted the preload-transient-error-codes branch on March 30, 2022 at 22:50
mrocklin added a commit to mrocklin/distributed that referenced this pull request Mar 31, 2022
Successfully merging this pull request may close: Preload fetching should be robust to transient error codes (#5981)