
Dask Collection Interface #2748

Merged
merged 2 commits into from
Oct 15, 2017

Conversation

jcrist
Member

@jcrist jcrist commented Oct 5, 2017

This adds a standard protocol for the dask collections implemented via a set of methods (and no required base class). The methods are similar to the existing non-public api, but not exactly the same. I recommend the new doc page (docs/source/custom-collections.rst) as a starting point.

Supersedes #1068.
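As a hypothetical illustration of the protocol being proposed (method names as in this PR; the `Inc` class and its behavior are invented for this sketch, not part of dask):

```python
# Hypothetical skeleton of a custom dask collection under the proposed
# protocol -- no base class required, just these methods.
class Inc(object):
    """Wraps a number; its one-task graph increments it."""

    def __init__(self, x):
        self._key = 'inc-' + str(id(self))
        self._dsk = {self._key: (lambda v: v + 1, x)}

    def __dask_graph__(self):
        # The dask graph: a dict mapping keys to tasks.
        return self._dsk

    def __dask_keys__(self):
        # The output keys (possibly nested) within the graph.
        return [self._key]

    @staticmethod
    def __dask_optimize__(dsk, keys, **kwargs):
        # Hook for collection-specific graph optimizations; a no-op here.
        return dsk

    # In real code this would point at a scheduler, e.g.:
    # __dask_default_get__ = staticmethod(dask.threaded.get)

    def __dask_postcompute__(self):
        # Finalizer called on the computed results, plus extra arguments.
        return (lambda results: results[0]), ()


# Walking the single task by hand shows the flow compute would follow:
inc = Inc(41)
func, arg = inc.__dask_graph__()[inc.__dask_keys__()[0]]
finalize, extra = inc.__dask_postcompute__()
result = finalize([func(arg)], *extra)  # 42
```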

Member

@mrocklin mrocklin left a comment


Haven't started reviewing code yet, but I had a few small comments on the documentation (which is great by the way).

>>> from dask.threaded import get as threaded_get
>>> class MyCollection(object):
...     # Use the threaded scheduler by default
...     __dask_default_get__ = staticmethod(threaded_get)
Member

dask.threaded.get


.. method:: __dask_postcompute__(self)

Finalizer and (optional) extra arguments.
Member

I'm unable to understand the intent of this function from this description


.. note:: It's also recommended to define ``__dask_tokenize__``,
see :ref:`deterministic-hashing`.

Member

It might be helpful to see all of these methods laid out in a mock compute function, just to get a context of how they are likely to be used.

Member

Strong 👍

Member Author

Added.
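Such a mock compute might be shaped roughly like this (a sketch only; dask's real `compute` also merges graphs across multiple, possibly mixed, collections — `simple_get` and `Scalar` below are invented for illustration):

```python
def mock_compute(collection, **kwargs):
    """Sketch of how the protocol methods fit together for one collection."""
    dsk = collection.__dask_graph__()
    keys = collection.__dask_keys__()
    dsk = collection.__dask_optimize__(dsk, keys, **kwargs)
    finalize, extra_args = collection.__dask_postcompute__()
    results = collection.__dask_default_get__(dsk, keys)
    return finalize(results, *extra_args)


def simple_get(dsk, keys):
    """Toy synchronous scheduler: recursively evaluate each key."""
    def run(key):
        task = dsk[key]
        if isinstance(task, tuple) and callable(task[0]):
            func, args = task[0], task[1:]
            return func(*(run(a) if a in dsk else a for a in args))
        return task
    return [run(k) for k in keys]


class Scalar(object):
    """Minimal stand-in collection wrapping one output key."""
    def __init__(self, dsk, key):
        self._dsk, self._key = dsk, key

    def __dask_graph__(self):
        return self._dsk

    def __dask_keys__(self):
        return [self._key]

    @staticmethod
    def __dask_optimize__(dsk, keys, **kwargs):
        return dsk

    __dask_default_get__ = staticmethod(simple_get)

    def __dask_postcompute__(self):
        return (lambda results: results[0]), ()


s = Scalar({'x': 4, 'y': (lambda a: a * 10, 'x')}, 'y')
out = mock_compute(s)  # 40
```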

Member

We may eventually choose to rename this function from postpersist to something like collection_from_graph. I can imagine it being used in other contexts. I'm not suggesting any action now. Just rambling.

@jcrist
Member Author

jcrist commented Oct 5, 2017

Note that with the new design for __dask_postcompute__ (which adds support for extra arguments to finalizers), this also partly fixes #1398.

@mrocklin
Member

mrocklin commented Oct 5, 2017

cc @shoyer @rabernat @jhamman

XArray seems like the obvious first candidate for this.

@jhamman
Member

jhamman commented Oct 6, 2017

I'll be interested to hear from @shoyer on his take on the benefits and effort required to factor this into xarray. It does seem like a much more streamlined interface than what we have now.

@mrocklin
Member

mrocklin commented Oct 6, 2017

This would give us things like persist and asynchronous behavior, both of which are particularly useful on distributed systems.

        result = [(name,) + args + (i,) for i in range(numblocks[ind])]
    else:
        result = [keys(*(args + (i,))) for i in range(numblocks[ind])]
    return result
Member

Why the nested function here?

Member

We may want a deprecation cycle around _keys. I would not be surprised if this has leaked out into semi-public API.

Member

Indeed, from the way it's documented it's not entirely clear: http://dask.pydata.org/en/latest/array-design.html#keys-of-the-dask-graph

Member

Yeah, I have even mucked with `_keys` some myself to optimize things (#2472). That said, the general rule of thumb is that a leading `_` means not public.

Member Author

Why the nested function here

Array keys are implemented recursively. Since `__dask_keys__` takes no parameters, I moved the recursive function to be internal. It could also be moved out to an external helper function, but this seemed cleaner and keeps things located together.

We may want a deprecation cycle around _keys:

All the old methods still exist and work, and will issue a deprecation warning on use. _keys calls __dask_keys__.
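For context, the recursive nesting might be sketched like this (illustrative only; `array_keys` is an invented helper, not dask.array's actual code):

```python
def array_keys(name, numblocks):
    """Build the nested list-of-lists of block keys for a blocked array."""
    def keys(*args):
        if len(args) == len(numblocks):
            return (name,) + args  # a single block key
        ind = len(args)
        # Recurse over the blocks along the next dimension.
        return [keys(*(args + (i,))) for i in range(numblocks[ind])]
    return keys()


array_keys('x', (2, 2))
# [[('x', 0, 0), ('x', 0, 1)], [('x', 1, 0), ('x', 1, 1)]]
```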

dask/base.py Outdated
"""Base class for dask collections"""

def is_dask_collection(x):
"""Returns if ``x`` is a dask collection"""
Member

Return True if ...

dask/base.py Outdated
def is_dask_collection(x):
"""Returns if ``x`` is a dask collection"""
dask_graph = getattr(x, '__dask_graph__', None)
return False if dask_graph is None else (dask_graph() is not None)
Member

Style nit for clarity

try:
    return x.__dask_graph__() is not None
except AttributeError:
    return False

dask/base.py Outdated
class Base(DaskMethodsMixin):
"""DEPRECATED. The recommended way to create a custom dask object now is to
implement the dask collection interface (see the docs), and optionally
subclass from ``DaskMethodsMixin`` if desired."""
Member

Can we point to docs here?

@@ -265,3 +265,20 @@ class to receive bound method
setattr(cls, name, types.MethodType(func, None, cls))
else:
setattr(cls, name, func)


# Borrowed from six
Member

Do we need to include a license?

Member

Depends on your reading of substantial portions. 😜 Though probably best to be safe and include it.

Member Author

Replaced this code completely with something simpler; it's no longer needed.

dask/base.py Outdated

if name in ('eq', 'gt', 'ge', 'lt', 'le', 'ne', 'getitem'):
return
Allows for applying the same optimizations and default scheduler."""
Member

This name didn't make sense to me at first. I thought it meant something more like "compute this but then return a dask collection". I don't yet have an idea for a better name though.

Member

Since the name doesn't show up in the quoted context but is in the file diff, I'm adding it here to make this easier to follow: the name is `compute_as_collection`.

Member Author

Perhaps compute_as_if_collection?

Member

@shoyer shoyer left a comment

I reviewed the docs. Generally looks great -- this seems like it should be a great fit for xarray.

keys : list
\*\*kwargs
Extra keyword arguments forwarded from the call to ``compute`` or
``persist``. Can be used or ignored as needed.
Member

For xarray, I suppose we should simply forward to the optimize method from dask.array? e.g., da.Array.__dask_optimize__(dsk, **kwargs)?

Is there a guarantee that keys refers to only a single dask object of this type, or could they be taken from a collection of objects of this type or even objects of different dask types? (I think the last one, but that's not immediately obvious.)

Some pseudo-code for how this is called by the scheduler could be helpful in thinking this all through.

Member Author

For xarray, I suppose we should simply forward to the optimize method from dask.array? e.g., da.Array.__dask_optimize__(dsk, **kwargs). Could also do something like:

class YourClass(...):
    @property
    def __dask_optimize__(self):
        return da.Array.__dask_optimize__

Is there a guarantee that keys refers to only a single dask object of this type, or could they be taken from a collection of objects of this type or even objects of different dask types? (I think the last one, but that's not immediately obvious.)

The latter. I added a section documenting how these methods are used in the dask core methods (e.g. compute) that hopefully clarifies this. Also clarified the method docs.


.. note:: It's also recommended to define ``__dask_tokenize__``,
see :ref:`deterministic-hashing`.

Member

Strong 👍

Returns
-------
finalize : callable
A function with the signature ``finalize(results, *extra_args)``.
Member

What exactly is the structure of results? I'm guessing it matches the (nested) structure of __dask_keys__?

Member Author

Yes. I've clarified this in the docs, let me know if it's not sufficient.

A function with the signature ``finalize(results, *extra_args)``.
Called with the computed keys from ``__dask_keys__`` and any extra
arguments as specified in ``extra_args``. Should perform any necessary
finalization before returning from ``compute``. For example, the
Member

Would be nice to say something like "Should return an equivalent in-memory collection."

To create your own dask collection, you need to fulfill the following
interface. Note that there is no required base class:

.. method:: __dask_graph__(self)
Member

Would it be better to name things like _dask_graph_ instead? Double underscore invokes name mangling in some cases, and I'm not sure that's actually helpful here.

Also, although obviously Python is never going to name methods with the name "dask" in them, the double underscore prefix+postfix is technically reserved for language (I've gotten occasional push-back about this before, e.g., when I proposed an ad-hoc protocol for pandas on python-ideas).

Member

I'm guessing this is done to match what NumPy does with its array interface.

Double underscore before and after is different from just double underscore before (without the double underscore after). Name mangling only applies to the latter case.

Member Author

Indeed, this was done to mirror how numpy and other libraries implement interfaces. I see no issue with using __foo__ methods here, but happy to change if there's considerable pushback.

Personally I feel that private methods feel wrong here (usually indicate private interfaces that shouldn't be used), while dunder methods look more intentionally thought out.

Member

Am more than happy for us to mirror NumPy. :)

Dask implements its own deterministic hash function to generate keys based on
the value of arguments. This function is available as ``dask.base.tokenize``.
Many common types already have implementations of ``tokenize``, which can be
found in ``dask/base.py``.
Member

Would it make sense for tokenize to be available from the dask namespace directly if it is being advertised as a public API function?

Member Author

I don't think tokenize should be top-level, as most users won't use it at all. It is public api though.
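To make the idea concrete, here is a minimal sketch of what deterministic tokenization involves (illustrative only, not `dask.base.tokenize`'s actual implementation; `tokenize_sketch` is an invented name):

```python
import hashlib


def tokenize_sketch(*args):
    """Deterministically hash arguments into a stable token string."""
    def normalize(x):
        # Recursively reduce arguments to hashable, order-stable forms.
        if isinstance(x, (list, tuple)):
            return ('seq', tuple(normalize(i) for i in x))
        if isinstance(x, dict):
            return ('dict', tuple(sorted((k, normalize(v)) for k, v in x.items())))
        if hasattr(x, '__dask_tokenize__'):
            return x.__dask_tokenize__()
        return (type(x).__name__, repr(x))

    data = repr(tuple(normalize(a) for a in args)).encode()
    return hashlib.md5(data).hexdigest()
```

The key property is that equal inputs always produce equal tokens, so graph keys built from them are deterministic across runs.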


Where possible, it's recommended to define the ``__dask_tokenize__`` method.
This method takes no arguments and should return a value fully
representative of the object.
Member

Does this mean everything that was handled by dask.base.tokenize will now get a __dask_tokenize__ method instead?

Member Author

No. This is just a potentially cleaner way to make tokenize work with classes you control. Either adding a method to the dispatch or implementing __dask_tokenize__ works. For classes you don't control (e.g. numpy.ndarray) the dispatch will still be needed.

This is similar to implementing the pickle methods or registering in copyreg.

Member

Sorry by "everything" I meant all objects in Dask proper or is this a no for that case too?

Member Author

If you look at the diff, classes internal to dask now use the method. The dispatch is still used elsewhere. I'd say this is a case-by-case decision, similar to using the pickle interface methods vs the copyreg dispatch.
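To make the two options concrete (a hypothetical sketch; `Point` and the commented-out registration are invented for illustration):

```python
# Option 1: classes you control can define ``__dask_tokenize__``.
class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __dask_tokenize__(self):
        # Return a value fully representative of the object.
        return ('Point', self.x, self.y)


# Option 2: classes you don't control get registered with the
# dispatch instead, e.g. (not run here):
#
#     from dask.base import normalize_token
#
#     @normalize_token.register(SomeThirdPartyType)
#     def normalize_third_party(obj):
#         return ('SomeThirdPartyType', obj.relevant_state)
```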

@jakirkham
Member

Related to this, it would be very useful to have an interface method to check whether two Dask objects point to the same thing.

@mrocklin
Member

mrocklin commented Oct 6, 2017

@jakirkham can you provide some motivation for that? Perhaps in a different issue?

@mrocklin
Member

mrocklin commented Oct 6, 2017

I am pleasantly surprised to learn that this doesn't seem to affect the dask/distributed test suite (other than deprecation warnings, which are themselves welcome).

@jcrist
Member Author

jcrist commented Oct 10, 2017

I believe all comments have been addressed.

@jakirkham
Member

can you provide some motivation for that? Perhaps in a different issue?

I think I have worked around the initial problem that raised this issue. Though it is still nice periodically to be able to check what things share identity. Could give some more thought to other use cases if you would like.

@mrocklin
Member

@jhamman it might be interesting to try to apply this to XArray. Are you still game for this? Regardless, it would be useful to have your perspective on the docs here. You are probably the target audience for them so it would be useful to hear any points that you find less than clear.

@jhamman
Member

jhamman commented Oct 11, 2017

@mrocklin - yes. I'm still game. I'd like to wait until after we release xarray 0.10. I'll give the docs another review, then let's loop back on this next week.

Member

@shoyer shoyer left a comment

Thanks for the pseudo-code and description of how all these methods work. I'm feeling pretty good about this.

- If ``optimize_graph`` is ``True`` (default) then the collections are first
grouped by their ``__dask_optimize__`` methods. All collections with the
same ``__dask_optimize__`` method have their graphs merged and keys
concatenated, and then a single call to ``__dask_optimize__`` is made with
Member

If there are multiple types of collections, which __dask_optimize__ method is called once? An arbitrary one?

Member

I also don't see this single call in the pseudo-code implementation.

Member

I believe that we group our graphs by their optimization function, merge each group's graphs, and call that optimization function once on the merged graph. This happens separately for each optimization function.

Member Author

Matt is correct. I've updated this section to hopefully clarify this.
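The grouping described above might be sketched as follows (illustrative only, not dask's implementation; `optimize_all` and the demo classes `A` and `B` are invented):

```python
def optimize_all(collections, **kwargs):
    """Group collections by optimize function; one optimize call per group."""
    groups = {}
    for c in collections:
        groups.setdefault(c.__dask_optimize__, []).append(c)
    dsk = {}
    for optimize, group in groups.items():
        merged, keys = {}, []
        for c in group:
            merged.update(c.__dask_graph__())
            keys.extend(c.__dask_keys__())
        # One call per distinct __dask_optimize__, on the merged graph.
        dsk.update(optimize(merged, keys, **kwargs))
    return dsk


calls = []  # record (class, keys) per optimize call


class A(object):
    def __init__(self, dsk, keys):
        self._dsk, self._keys = dsk, keys

    def __dask_graph__(self):
        return self._dsk

    def __dask_keys__(self):
        return self._keys

    @staticmethod
    def __dask_optimize__(dsk, keys, **kwargs):
        calls.append(('A', sorted(keys)))
        return dsk


class B(A):
    @staticmethod
    def __dask_optimize__(dsk, keys, **kwargs):
        calls.append(('B', sorted(keys)))
        return dsk


out = optimize_all([A({'a1': 1}, ['a1']), A({'a2': 2}, ['a2']),
                    B({'b1': 3}, ['b1'])])
# A's optimize runs once for both A collections; B's runs once.
```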

@mrocklin
Member

I think that we should add appropriate disclaimers to the docs that this might change without notice, merge, and then try implementing this in XArray, providing feedback to Dask.

This adds a standard protocol for the dask collections implemented via a
set of methods (and no required base class). The methods are similar to
the existing non-public api, but not exactly the same.

- Specified collection interface
- Switched all internals to use new interface
- Deprecated previous interface
- Broke out the core methods (e.g. `.compute` into a mixin)
- Deprecated `Base` class
- Documented interface
@jcrist
Member Author

jcrist commented Oct 13, 2017

I had to fix some hairy merge conflicts, which was easiest to do by squashing this PR into one large commit beforehand. Apologies for any lost comments :/.

At this point I think this is good to go in. I've added a warning to the docs that parts of this protocol might change without deprecations, as well as clarified a few points in the docs.

@shoyer
Member

shoyer commented Oct 13, 2017

Looks good to me, at least from a docs perspective!

@mrocklin
Member

I'll take another look through this shortly, but last time I went through it I had only minor comments so I'm fairly optimistic.

The optimized dask graph.


.. staticmethod:: __dask_default_get__(dsk, keys, \*\*kwargs)
Member

We may want to change the name of get in the future to something more informative. Is there a different name that we can use here that would be more future-proof? __dask_scheduler__ ?

Member Author

Good idea, done.

@jcrist
Member Author

jcrist commented Oct 15, 2017

I'm going to merge this now, we can add touch-ups later as needed. Thanks everyone for the reviews.

@jcrist jcrist merged commit 9e9fa10 into dask:master Oct 15, 2017
@jcrist jcrist deleted the dask-interface-redux branch October 15, 2017 01:07