Use forkserver on Unix and Python 3 #687

Merged (4 commits, Nov 21, 2016)
Conversation

@pitrou (Member) commented Nov 17, 2016

The "forkserver" method is a multiprocessing feature on Python 3. In this mode of operation, a middleman process is spawned that will fork children on behalf of the parent. This avoids inheriting the parent's system resources (file descriptors, mutexes, etc.).
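In code, opting in looks something like this (a minimal sketch, not the patch itself; the function names are illustrative):

```python
import multiprocessing

def child_task(x):
    # Runs in a process forked from the clean forkserver middleman,
    # not directly from the (possibly resource-laden) parent.
    return x + 1

if __name__ == '__main__':
    # The "forkserver" start method is only available on Unix under Python 3.
    ctx = multiprocessing.get_context('forkserver')
    with ctx.Pool(2) as pool:
        print(pool.map(child_task, [1, 2, 3]))  # prints [2, 3, 4]
```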

This fixes the test_hdfs hangs here (on Python 3, that is).

I'm not terribly sold on mp_context as a name. Suggestions welcome.

@pitrou (Member Author) commented Nov 17, 2016

btw, I wonder if a similar change should be made in Dask too.

@pitrou (Member Author) commented Nov 17, 2016

This makes the test suite slower on Python 3 (442 s vs. 178 s). I'm not terribly surprised, since launching a process is now a bit more expensive (though still cheaper than with the "spawn" method). In steady-state operation this probably doesn't matter, but the test suite launches tons of child processes.

@mrocklin (Member) commented:

Thoughts on how to resolve the HDFS issue on Python 2?

Is forkserver still the right decision in Python 3 if we don't care about HDFS?

@pitrou (Member Author) commented Nov 17, 2016

Is forkserver still the right decision in Python 3 if we don't care about HDFS?

Probably. There are other possible issues with fork(). Most third-party libraries (Python or C) are not fork-safe, so we may run into similar issues (in my previous job we had broken SSL connections until we stopped sharing resources).
Also, the fact that file descriptors are inherited in the child (without the child necessarily noticing) means some resources can be lingering even if the client closes them, for example a network connection could remain open for some time.

The one functional downside is that there are a couple of things to be aware of when not using the "fork" method. These are the same guidelines as on Windows: https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods
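Concretely, the main caveat is that children may re-import the main module, so process creation has to live under a `__main__` guard, exactly as on Windows. A minimal sketch (illustrative names):

```python
import multiprocessing

def compute():
    return 42

def work(q):
    q.put(compute())

if __name__ == '__main__':
    # Process creation must live under a __main__ guard, because the
    # child re-imports this module (same rule as "spawn" on Windows).
    ctx = multiprocessing.get_context('forkserver')
    q = ctx.SimpleQueue()
    p = ctx.Process(target=work, args=(q,))
    p.start()
    print(q.get())  # prints 42
    p.join()
```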

@pitrou (Member Author) commented Nov 17, 2016

Thoughts on how to resolve the HDFS issue on Python 2?

Not sure. One possibility would be to force all HDFS operations in a dedicated process, but since this would typically deal with large-ish data, the marshalling overhead may be a problem.

@@ -29,6 +30,12 @@
logger = logging.getLogger(__name__)


if PY3 and not sys.platform.startswith('win'):
mp_context = multiprocessing.get_context('forkserver')

A Member commented on the diff:

It might be nice to get this from the config file so that we can set this to fork for faster tests during local development.

from .config import config
context = config.get('multiprocessing_context', 'forkserver')
mp_context = multiprocessing.get_context(context)

There is probably a better name for this.

@pitrou (Member Author) replied:

There is probably a better name for this.

Hmm, do you want to suggest one?

A Member replied:

I was more concerned about adding a config option when there was a significant delay to using forkserver. Now that this is as fast as using fork I'll retract my comment.

@jakirkham (Member) commented:

Have you guys looked at billiard? It's effectively a backported multiprocessing for Python 2.7. That would let you do this for Python 2 and 3.

@pitrou (Member Author) commented Nov 18, 2016

Have you guys looked at billiard? It's effectively a backported multiprocessing for Python 2.7. That would let you do this for Python 2 and 3.

That's interesting. I had never heard about it before. Is there any documentation? I couldn't find any.

@pitrou (Member Author) commented Nov 21, 2016

Ok, I looked at billiard, and the "forkserver" method isn't available on Python 2:

>>> billiard.get_context('forkserver')
Traceback (most recent call last):
  File "<ipython-input-4-93eda6a9a76c>", line 1, in <module>
    billiard.get_context('forkserver')
  File "/home/antoine/miniconda3/envs/dask27/lib/python2.7/site-packages/billiard/context.py", line 292, in get_context
    return super(DefaultContext, self).get_context(method)
  File "/home/antoine/miniconda3/envs/dask27/lib/python2.7/site-packages/billiard/context.py", line 241, in get_context
    ctx._check_available()
  File "/home/antoine/miniconda3/envs/dask27/lib/python2.7/site-packages/billiard/context.py", line 366, in _check_available
    raise ValueError('forkserver start method not available')
ValueError: forkserver start method not available

@pitrou (Member Author) commented Nov 21, 2016

It turns out one can speed up the spawning by defining some modules to preload in the forkserver process. This makes the test suite as fast on 3.x as on 2.x.
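That optimization is exposed as `multiprocessing.set_forkserver_preload()`. The module names below are illustrative stand-ins for whatever heavy imports the workers actually need:

```python
import multiprocessing

# Ask the forkserver process to import these modules once at startup;
# every child it forks then starts with them already loaded, instead
# of paying the import cost in each child. (Illustrative module list.)
multiprocessing.set_forkserver_preload(['json', 'logging'])

mp_context = multiprocessing.get_context('forkserver')
```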

@pitrou (Member Author) commented Nov 21, 2016

For some unknown reason, the hangs in test_hdfs have reappeared on Travis. I'm a bit baffled, since I can't reproduce them anymore at home. See https://travis-ci.org/dask/distributed/jobs/177681409

@pitrou (Member Author) commented Nov 21, 2016

Ok, I managed to reproduce the hangs locally with hdfs3 master, but not with hdfs3 0.1.2. After bisecting, the first bad revision appears to be dask/hdfs3@bd76002.

@mrocklin (Member) commented:

Feel free to revert the locket stuff in hdfs3. That appeared to resolve
things on my local machine, but ended up making things somehow worse on
travis-ci.


@jakirkham (Member) commented:

Which version of billiard are you using?

@pitrou (Member Author) commented Nov 21, 2016

3.5.0.2, installed using pip.

@jakirkham (Member) commented:

Yeah, I see the same thing. Opened issue ( celery/billiard#200 ) to get more info.

@jakirkham (Member) commented:

After a cursory look, it appears one would need to backport the socket library from Python 3 to Python 2 in order to get forkserver to work. I'm unaware of any existing backport like this.

@pitrou (Member Author) commented Nov 21, 2016

@jakirkham, I'm not sure billiard is very well-tested (see e.g. celery/billiard#201)

@pitrou (Member Author) commented Nov 21, 2016

Ok, the latest changes to hdfs3 suppressed the hangs.

@mrocklin (Member) commented:

I'm curious about the state of Python 2.

@pitrou (Member Author) commented Nov 21, 2016

Python 2 can still hang occasionally: https://travis-ci.org/dask/distributed/jobs/177681408

@mrocklin (Member) commented:

Any recommendations on how to resolve this? I see a few options:

  1. Previously we would lock reading from HDFS with a file-based lock.
    This did seem to resolve the issue. It became harder to do with recent
    refactoring (hence the current situation) but I could probably throw
    together a hack to recreate the old solution.
  2. Nannies do have the ability to create worker processes using the
    subprocess module rather than fork. This is slow and painful, but could
    work.
  3. We just don't support HDFS on Python 2 for the near future until
    someone complains. This is somewhat expensive politically.


@pitrou (Member Author) commented Nov 21, 2016

  1. Previously we would lock reading from HDFS with a file-based lock

Judging by the problems on Travis, the file-based locks didn't solve the issue, right?

  2. Nannies do have the ability to create worker processes using the subprocess module rather than fork

As long as the parent doesn't try to use hdfs3, forking isn't an issue. That's why forkserver works. So if the nanny spawns a first process and then forks children from it, it should work.

  3. We just don't support HDFS on Python 2 for the near future until someone complains. This is somewhat expensive politically.

And cheap technically :-)

A possible solution would be to get forkserver to work on Python 2, using billiard, but billiard doesn't seem tremendously well-tested and I wonder if it's meant for outside use or just Celery's internal use.
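The middleman idea could be sketched like this: spawn one clean process up front (which never touches hdfs3), then fork the actual workers from it. The names here are illustrative, not the nanny's real API:

```python
import multiprocessing

def worker(x):
    return x * 2

def middleman(n, q):
    # Started with "spawn", so this process inherited none of the
    # parent's state (threads, hdfs3 handles); forking from it is safe.
    fork_ctx = multiprocessing.get_context('fork')
    with fork_ctx.Pool(2) as pool:
        q.put(pool.map(worker, range(n)))

if __name__ == '__main__':
    spawn_ctx = multiprocessing.get_context('spawn')
    q = spawn_ctx.SimpleQueue()
    p = spawn_ctx.Process(target=middleman, args=(4, q))
    p.start()
    print(q.get())  # prints [0, 2, 4, 6]
    p.join()
```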

@mrocklin (Member) commented:

Judging by the problems on Travis, the file-based locks didn't solve the issue, right?

Correct: the current solution didn't resolve the issue. An older solution that locked just around reading did appear to resolve the issue in practice. I'll pull up a reference commit in a second.

As long as the parent doesn't try to use hdfs3, forking isn't an issue. That's why forkserver works. So if nanny spawns a first process and then forks children from it, it should work.

If that's the case then it may be that it's only our tests that fail and that the current solution would work in practice.

@mrocklin (Member) commented:

This was the previous solution:

def read_block_from_hdfs(filename, offset, length, host=None, port=None,
                         delimiter=None):
    from locket import lock_file
    with lock_file('.lock'):
        hdfs = HDFileSystem(host=host, port=port)
        bytes = hdfs.read_block(filename, offset, length, delimiter)
    return bytes

For whatever reason (perhaps even unrelated to that code) we didn't run into concurrency issues.

@pitrou (Member Author) commented Nov 21, 2016

Well, concurrency issues tend to crop up randomly, so perhaps we were lucky that the tests didn't stress concurrency enough? I'm skeptical that any solution based on locks would actually solve the issue, since the problem lies elsewhere.

If that's the case then it may be that it's only our tests that fail and that the current solution would work in practice.

Except if people use the Client and Worker APIs directly?

@mrocklin (Member) commented:

Using Client APIs directly is pretty common. I know of only a few groups that create workers manually, and even they create the worker as the first thing they do.

@pitrou (Member Author) commented Nov 21, 2016

Hmm, so how about skipping the parts of test_hdfs that fork on Python 2?

@mrocklin (Member) commented:

Hmm, so how about skipping the parts of test_hdfs that fork on Python 2?

I would be OK with this.

@pitrou (Member Author) commented Nov 21, 2016

I can confirm that skipping the tests that use utils_test.cluster() seems to suppress the hangs on 2.7.
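The skip could look something like this in the test module (the test name and reason string are hypothetical; the real candidates are the test_hdfs tests that use utils_test.cluster()):

```python
import sys
import pytest

@pytest.mark.skipif(sys.version_info[0] == 2,
                    reason="forked workers hang with hdfs3 on Python 2")
def test_hdfs_with_cluster():
    ...  # body elided; would start a cluster of forked workers
```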

@mrocklin (Member) commented:

This looks good to me. Merging soon if no comments.
