Dask Array graph optimization functions #2472

jakirkham · 2017-06-19T20:33:13Z

Sometimes in constructing a computation with Dask Array's dead ends will show up. While it is true that these get removed at computation time, it is nice to be able to some cleanup periodically to keep the Dask Graph size reasonable. Particularly this cleanup is nice if we know a dead end will show up (e.g. slicing).

Previously this happened automatically with slicing, but it proved problematic in general ( #1732 ). A reasonable alternative would be to provide these optimization operations directly to the user. That way it is up to them to make the appropriate decision.

While it is true that there are optimization functions for Dask graphs, it remains unclear (at least to me) how one applies these to an array in general outside of Dask. To get the relevant keys, one must call _keys, which appears to be part of the Private API. Trying to get at this from the Public API does not appear to be straightforward. Even once one performs this sort of optimization, there remains the question of how to get the resulting Dask Graph back into a Dask Array.

Below is what I found works to get cull to act on a Dask Array. However this seems to require using the private API to get the job done. It seems a reasonable solution to this problem would be to create a wrapper function using a workflow like the one below (with any other things I may have missed) and add the wrapper to the private API. Then every function in dask.optimize can be wrapped with this wrapper function and added to the public API.

import dask
import dask.array
import dask.array.core
import dask.sharedict

d = dask.array.ones((10, 12), chunks=(5, 6))

sd = dask.sharedict.ShareDict()
sd.update(dask.optimize.cull(d.dask, d._keys())[0])

do = dask.array.core.Array(sd, d.name, d.chunks, d.dtype)

The text was updated successfully, but these errors were encountered:

jakirkham · 2017-09-27T20:05:45Z

Any thoughts on this?

jakirkham · 2017-11-01T15:09:10Z

Now that there is a proper interface, this should be more approachable.

jakirkham · 2018-01-22T06:11:43Z

The situation here has gotten substantially better with PRs ( #2748 ) and ( #3071 ). The former providing an API to work on Dask Collections (Arrays included) generally. The latter providing an API for optimizations of Dask Collections (again Arrays included). These make it much easier to get at the Dask graphs and keys that underlie Dask Arrays, optimize the Dask graphs, and rebuild Dask Arrays from the optimized graphs and keys. Given these nice improvements, will close out this issue.

jakirkham mentioned this issue Oct 6, 2017

Dask Collection Interface #2748

Merged

jakirkham added the array label Nov 2, 2017

jcrist mentioned this issue Jan 16, 2018

Add dask.base.optimize #3071

Merged

jakirkham closed this as completed Jan 22, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dask Array graph optimization functions #2472

Dask Array graph optimization functions #2472

jakirkham commented Jun 19, 2017

jakirkham commented Sep 27, 2017

jakirkham commented Nov 1, 2017

jakirkham commented Jan 22, 2018

Dask Array graph optimization functions #2472

Dask Array graph optimization functions #2472

Comments

jakirkham commented Jun 19, 2017

jakirkham commented Sep 27, 2017

jakirkham commented Nov 1, 2017

jakirkham commented Jan 22, 2018