Dask Array graph optimization functions #2472

Closed
jakirkham opened this issue Jun 19, 2017 · 3 comments

@jakirkham
Member

Sometimes, when constructing a computation with Dask Arrays, dead ends will show up. While it is true that these get removed at computation time, it is nice to be able to do some cleanup periodically to keep the Dask graph size reasonable. This cleanup is particularly useful when we know a dead end will show up (e.g. after slicing).

Previously this happened automatically with slicing, but it proved problematic in general ( #1732 ). A reasonable alternative would be to provide these optimization operations directly to the user. That way it is up to them to make the appropriate decision.

While it is true that there are optimization functions for Dask graphs, it remains unclear (at least to me) how one applies them to an array from outside of Dask itself. To get the relevant keys, one must call _keys, which appears to be part of the private API. Getting at the keys through the public API does not appear to be straightforward. Even once one performs this sort of optimization, there remains the question of how to get the resulting Dask graph back into a Dask Array.

Below is what I found works to get cull to act on a Dask Array. However, it requires using the private API to get the job done. A reasonable solution to this problem would be to create a wrapper function that follows a workflow like the one below (with any other things I may have missed) and add that wrapper to the private API. Then every function in dask.optimize could be wrapped with this wrapper function and added to the public API.

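import dask
import dask.array
import dask.array.core
import dask.optimize  # imported explicitly, since dask.optimize.cull is called below
import dask.sharedict

# Build a small array split into 2x2 chunks to experiment with.
d = dask.array.ones((10, 12), chunks=(5, 6))

# Cull the graph down to the tasks needed for the array's keys
# (d._keys() is private API), then store the result in a ShareDict.
sd = dask.sharedict.ShareDict()
sd.update(dask.optimize.cull(d.dask, d._keys())[0])

# Rebuild a Dask Array from the culled graph.
do = dask.array.core.Array(sd, d.name, d.chunks, d.dtype)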
@jakirkham
Member Author

Any thoughts on this?

@jakirkham
Member Author

Now that there is a proper interface, this should be more approachable.

@jakirkham
Member Author

The situation here has improved substantially with PRs ( #2748 ) and ( #3071 ). The former provides an API for working with Dask Collections (Arrays included) generally. The latter provides an API for optimizing Dask Collections (again, Arrays included). Together these make it much easier to get at the Dask graphs and keys that underlie Dask Arrays, optimize the Dask graphs, and rebuild Dask Arrays from the optimized graphs and keys. Given these improvements, I will close out this issue.
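
For readers arriving later, here is a minimal sketch of what that workflow can look like. It assumes a Dask release recent enough to expose the collection protocol methods __dask_graph__ and __dask_keys__ and the top-level dask.optimize function; that these are exactly the APIs the two PRs introduced is my assumption, not something stated in this thread. The array shape and slice are arbitrary illustrations.

```python
import numpy as np
import dask
import dask.array as da

# A small chunked array; slicing leaves tasks in the graph that the
# result no longer depends on (the "dead ends" discussed above).
x = da.ones((10, 12), chunks=(5, 6))
y = x[:5, :6]

# Public access to the graph and keys, with no private _keys() call.
graph = y.__dask_graph__()
keys = y.__dask_keys__()
n_before = len(dict(graph))

# dask.optimize returns a tuple of optimized collections, one per input.
(y_opt,) = dask.optimize(y)
n_after = len(dict(y_opt.__dask_graph__()))

print("tasks before:", n_before, "tasks after:", n_after)

# The optimized collection is still a Dask Array with the same values.
same = np.array_equal(y.compute(), y_opt.compute())
print("results match:", same)
```

The optimized array can be used anywhere the original could, so this can be called periodically while building up a larger computation.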
