Use forkserver on Unix and Python 3 #687

Merged (4 commits, Nov 21, 2016)
Conversation

@pitrou (Member) commented Nov 17, 2016

The "forkserver" method is a multiprocessing feature on Python 3. In this mode of operation, a middleman process is spawned that will fork children on behalf of the parent. This avoids inheriting the parent's system resources (file descriptors, mutexes, etc.).
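In code, opting in looks something like this (a minimal sketch, not the patch itself; the function names are illustrative):

```python
import multiprocessing

def child_task(x):
    # Runs in a process forked from the clean forkserver middleman,
    # not directly from the (possibly resource-laden) parent.
    return x + 1

if __name__ == '__main__':
    # The "forkserver" start method is only available on Unix under Python 3.
    ctx = multiprocessing.get_context('forkserver')
    with ctx.Pool(2) as pool:
        print(pool.map(child_task, [1, 2, 3]))  # prints [2, 3, 4]
```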

This fixes the test_hdfs hangs here (on Python 3, that is).

I'm not terribly sold on mp_context as a name. Suggestions welcome.

@pitrou (Member Author) commented Nov 17, 2016

btw, I wonder if a similar change should be made in Dask too.

@pitrou (Member Author) commented Nov 17, 2016

This makes the test suite slower on Python 3 (442 s vs. 178 s). I'm not terribly surprised, since launching a process is now a bit more expensive (though still cheaper than with the "spawn" method). In steady-state operation this probably doesn't matter, but the test suite launches tons of child processes.

@mrocklin (Member) commented:

Thoughts on how to resolve the HDFS issue on Python 2?

Is forkserver still the right decision in Python 3 if we don't care about HDFS?

@pitrou (Member Author) commented Nov 17, 2016

Is forkserver still the right decision in Python 3 if we don't care about HDFS?

Probably. There are other possible issues with fork(). Most third-party libraries (Python or C) are not fork-safe, so we may run into similar issues (in my previous job we had broken SSL connections until we stopped sharing resources).
Also, the fact that file descriptors are inherited in the child (without the child necessarily noticing) means some resources can be lingering even if the client closes them, for example a network connection could remain open for some time.

The one functional downside is that there are a couple of things to be aware of when not using the "fork" method. These are the same guidelines as on Windows: https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods
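Concretely, the main caveat is that children may re-import the main module, so process creation has to live under a `__main__` guard, exactly as on Windows. A minimal sketch (illustrative names):

```python
import multiprocessing

def compute():
    return 42

def work(q):
    q.put(compute())

if __name__ == '__main__':
    # Process creation must live under a __main__ guard, because the
    # child re-imports this module (same rule as "spawn" on Windows).
    ctx = multiprocessing.get_context('forkserver')
    q = ctx.SimpleQueue()
    p = ctx.Process(target=work, args=(q,))
    p.start()
    print(q.get())  # prints 42
    p.join()
```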

@pitrou (Member Author) commented Nov 17, 2016

Thoughts on how to resolve the HDFS issue on Python 2?

Not sure. One possibility would be to force all HDFS operations in a dedicated process, but since this would typically deal with large-ish data, the marshalling overhead may be a problem.

@@ -29,6 +30,12 @@
logger = logging.getLogger(__name__)


if PY3 and not sys.platform.startswith('win'):
mp_context = multiprocessing.get_context('forkserver')

A Member commented on the diff:

It might be nice to get this from the config file so that we can set this to fork for faster tests during local development.

from .config import config
context = config.get('multiprocessing_context', 'forkserver')
mp_context = multiprocessing.get_context(context)

There is probably a better name for this.

@pitrou (Member Author) replied:

There is probably a better name for this.

Hmm, do you want to suggest one?

A Member replied:

I was more concerned about adding a config option when there was a significant delay to using forkserver. Now that this is as fast as using fork I'll retract my comment.

@jakirkham (Member) commented:

Have you guys looked at billiard? It's effectively a backported multiprocessing for Python 2.7. That would let you do this for Python 2 and 3.

@pitrou (Member Author) commented Nov 18, 2016

Have you guys looked at billiard? It's effectively a backported multiprocessing for Python 2.7. That would let you do this for Python 2 and 3.

That's interesting. I had never heard about it before. Is there any documentation? I couldn't find any.

@pitrou (Member Author) commented Nov 21, 2016

Ok, I looked at billiard, and the "forkserver" method isn't available on Python 2:

>>> billiard.get_context('forkserver')
Traceback (most recent call last):
  File "<ipython-input-4-93eda6a9a76c>", line 1, in <module>
    billiard.get_context('forkserver')
  File "/home/antoine/miniconda3/envs/dask27/lib/python2.7/site-packages/billiard/context.py", line 292, in get_context
    return super(DefaultContext, self).get_context(method)
  File "/home/antoine/miniconda3/envs/dask27/lib/python2.7/site-packages/billiard/context.py", line 241, in get_context
    ctx._check_available()
  File "/home/antoine/miniconda3/envs/dask27/lib/python2.7/site-packages/billiard/context.py", line 366, in _check_available
    raise ValueError('forkserver start method not available')
ValueError: forkserver start method not available

@pitrou (Member Author) commented Nov 21, 2016

It turns out one can speed up the spawning by defining some modules to preload in the forkserver process. This makes the test suite as fast on 3.x as on 2.x.
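That optimization is exposed as `multiprocessing.set_forkserver_preload()`. The module names below are illustrative stand-ins for whatever heavy imports the workers actually need:

```python
import multiprocessing

# Ask the forkserver process to import these modules once at startup;
# every child it forks then starts with them already loaded, instead
# of paying the import cost in each child. (Illustrative module list.)
multiprocessing.set_forkserver_preload(['json', 'logging'])

mp_context = multiprocessing.get_context('forkserver')
```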

@pitrou (Member Author) commented Nov 21, 2016

For some unknown reason, the hangs in test_hdfs have reappeared on Travis. I'm a bit baffled, since I can't reproduce them anymore at home. See https://travis-ci.org/dask/distributed/jobs/177681409

@pitrou (Member Author) commented Nov 21, 2016

Ok, I managed to reproduce the hangs locally with hdfs3 master, but not with hdfs3 0.1.2. After bisecting, the first bad revision appears to be dask/hdfs3@bd76002.

@mrocklin (Member) commented:

Feel free to revert the locket stuff in hdfs3. That appeared to resolve
things on my local machine, but ended up making things somehow worse on
travis-ci.


@jakirkham (Member) commented:

Which version of billiard are you using?

@pitrou (Member Author) commented Nov 21, 2016

3.5.0.2, installed using pip.

@jakirkham (Member) commented:

Yeah, I see the same thing. Opened issue ( celery/billiard#200 ) to get more info.

@jakirkham (Member) commented:

After a cursory look, it appears one would need to backport the socket library from Python 3 to Python 2 in order to get forkserver to work. I'm unaware of any existing backport like this.

@pitrou (Member Author) commented Nov 21, 2016

@jakirkham, I'm not sure billiard is very well-tested (see e.g. celery/billiard#201)

@pitrou (Member Author) commented Nov 21, 2016

Ok, the latest changes to hdfs3 suppressed the hangs.

@mrocklin (Member) commented:

I'm curious about the state of Python 2.

@pitrou (Member Author) commented Nov 21, 2016

Python 2 can still hang occasionally: https://travis-ci.org/dask/distributed/jobs/177681408

@mrocklin (Member) commented:

Any recommendations on how to resolve this? I see a few options:

  1. Previously we would lock reading from HDFS with a file-based lock.
    This did seem to resolve the issue. It became harder to do with recent
    refactoring (hence the current situation) but I could probably throw
    together a hack to recreate the old solution.
  2. Nannies do have the ability to create worker processes using the
    subprocess module rather than fork. This is slow and painful, but could
    work.
  3. We just don't support HDFS on Python 2 for the near future until
    someone complains. This is somewhat expensive politically.


@pitrou (Member Author) commented Nov 21, 2016

  1. Previously we would lock reading from HDFS with a file-based lock

Judging by the problems on Travis, the file-based locks didn't solve the issue, right?

  2. Nannies do have the ability to create worker processes using the subprocess module rather than fork

As long as the parent doesn't try to use hdfs3, forking isn't an issue. That's why forkserver works. So if the nanny spawns a first process and then forks children from it, it should work.

  3. We just don't support HDFS on Python 2 for the near future until someone complains. This is somewhat expensive politically.

And cheap technically :-)

A possible solution would be to get forkserver to work on Python 2, using billiard, but billiard doesn't seem tremendously well-tested and I wonder if it's meant for outside use or just Celery's internal use.
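The middleman idea could be sketched like this: spawn one clean process up front (which never touches hdfs3), then fork the actual workers from it. The names here are illustrative, not the nanny's real API:

```python
import multiprocessing

def worker(x):
    return x * 2

def middleman(n, q):
    # Started with "spawn", so this process inherited none of the
    # parent's state (threads, hdfs3 handles); forking from it is safe.
    fork_ctx = multiprocessing.get_context('fork')
    with fork_ctx.Pool(2) as pool:
        q.put(pool.map(worker, range(n)))

if __name__ == '__main__':
    spawn_ctx = multiprocessing.get_context('spawn')
    q = spawn_ctx.SimpleQueue()
    p = spawn_ctx.Process(target=middleman, args=(4, q))
    p.start()
    print(q.get())  # prints [0, 2, 4, 6]
    p.join()
```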

@mrocklin (Member) commented:

Judging by the problems on Travis, the file-based locks didn't solve the issue, right?

Correct: the current solution didn't resolve the issue. An older solution that locked just around reading did appear to resolve the issue in practice. I'll pull up a reference commit in a second.

As long as the parent doesn't try to use hdfs3, forking isn't an issue. That's why forkserver works. So if nanny spawns a first process and then forks children from it, it should work.

If that's the case then it may be that it's only our tests that fail and that the current solution would work in practice.

@mrocklin (Member) commented:

This was the previous solution:

def read_block_from_hdfs(filename, offset, length, host=None, port=None,
                         delimiter=None):
    from locket import lock_file
    with lock_file('.lock'):
        hdfs = HDFileSystem(host=host, port=port)
        bytes = hdfs.read_block(filename, offset, length, delimiter)
    return bytes

For whatever reason (perhaps even unrelated to that code) we didn't run into concurrency issues.

@pitrou (Member Author) commented Nov 21, 2016

Well, concurrency issues tend to crop up randomly, so perhaps we were lucky that the tests didn't stress concurrency enough? I'm skeptical that any solution based on locks would actually solve the issue, since the problem lies elsewhere.

If that's the case then it may be that it's only our tests that fail and that the current solution would work in practice.

Except if people use the Client and Worker APIs directly?

@mrocklin (Member) commented:

Using Client APIs directly is pretty common. I know of only a few groups that create workers manually, and even they create the worker as the first thing they do.

@pitrou (Member Author) commented Nov 21, 2016

Hmm, so how about skipping the parts of test_hdfs that fork on Python 2?

@mrocklin (Member) commented:

Hmm, so how about skipping the parts of test_hdfs that fork on Python 2?

I would be OK with this.

@pitrou (Member Author) commented Nov 21, 2016

I can confirm that skipping the tests that use utils_test.cluster() seems to suppress the hangs on 2.7.
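The skip could look something like this in the test module (the test name and reason string are hypothetical; the real candidates are the test_hdfs tests that use utils_test.cluster()):

```python
import sys
import pytest

@pytest.mark.skipif(sys.version_info[0] == 2,
                    reason="forked workers hang with hdfs3 on Python 2")
def test_hdfs_with_cluster():
    ...  # body elided; would start a cluster of forked workers
```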

@mrocklin (Member) commented:

This looks good to me. Merging soon if no comments.
