Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize handling of chunked array types #7019

Merged
merged 189 commits into from
May 18, 2023

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Sep 10, 2022

Initial attempt to get cubed working within xarray, as an alternative to dask.

I've added a manager kwarg to the .chunk methods so you can do da.chunk(manager="cubed") to convert to a chunked cubed.CoreArray, with the default still being da.chunk(manager="dask"). (I couldn't think of a better name than "manager", as "backend" and "executor" are already taken.)

At the moment it should work except for an import error that I don't understand, see below.

Fro cubed to work at all with this PR we would also need:

To-dos for me on this PR:

  • Re-route xarray.apply_ufunc through cubed.apply_gufunc instead of dask's apply_gufunc when appropriate,
  • Add from_array_kwargs to opening functions, e.g. open_zarr, and open_dataset,
  • Add from_array_kwargs to creation functions, such as full_like,
  • Add store_kwargs as a way to propagate cubed-specific kwargs when saving to_zarr.

To complete this project more generally we should also:

cc @tomwhite

xarray/core/pycompat.py Outdated Show resolved Hide resolved
@TomNicholas TomNicholas added topic-arrays related to flexible array support and removed topic-typing labels Sep 10, 2022
@github-actions github-actions bot added topic-typing and removed topic-arrays related to flexible array support labels Sep 10, 2022
xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/types.py Outdated Show resolved Hide resolved
Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/variable.py Outdated Show resolved Hide resolved
xarray/core/variable.py Outdated Show resolved Hide resolved
xarray/core/variable.py Outdated Show resolved Hide resolved
@TomNicholas
Copy link
Member Author

@Illviljan thanks for all your comments!

Would you (or @keewis?) be willing to approve this PR now? I would really like to merge this so that I can release a version of xarray that I can use as a dependency for cubed-xarray.

@dcherian
Copy link
Contributor

Thanks @TomNicholas Big change!

@dcherian dcherian merged commit 6cd6122 into pydata:main May 18, 2023
@TomNicholas
Copy link
Member Author

Woooo thanks @dcherian !

@jhamman
Copy link
Member

jhamman commented May 18, 2023

👏 Congrats @TomNicholas on getting this in! Such an important contribution. 👏

@TomNicholas TomNicholas mentioned this pull request May 18, 2023
4 tasks
@tomwhite
Copy link
Contributor

Thanks for all your hard work on this @TomNicholas!

@TomNicholas TomNicholas mentioned this pull request Jun 8, 2023
4 tasks
dstansby pushed a commit to dstansby/xarray that referenced this pull request Jun 28, 2023
* generalise chunk methods to allow cubed

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fic typing typo

* fixed circular import

* fix some mypy errors

* added cubed to mypy ignore list

* simplify __array_ufunc__ check

* Revert "simplify __array_ufunc__ check" as I pushed to wrong branch

This reverts commit cdcb3fb.

* update cubed array type

* fix missed conflict

* sketch for ChunkManager adapter class

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove erroneous docstring about usage of map_blocks

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

* apply_ufunc -> apply_gufunc

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

* chunk -> from_array

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

* remove staticmethods

* attempt to type methods of ABC

* from_array

* attempt to specify types

* method for checking array type

* Update pyproject.toml

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed import errors

* generalize .chunk method kwargs

* used dask functions in dask chunkmanager

* define signatures for apply_gufunc, blockwise, map_blocks

* prototype function to detect which parallel backend to use

* add cubed.apply_gufunc

* ruffify

* add rechunk and compute methods for cubed

* xr.apply_ufunc now dispatches to chunkmanager.apply_gufunc

* CubedManager.chunks

* attempt to keep dask and cubed imports lazy

* generalize idxmax

* move unify_chunks import to ChunkManager

* generalize Dataset.load()

* check explicitly for chunks attribute instead of hard-coding cubed

* better function names

* add cubed version of unify_chunks

* recognize wrapped duck dask arrays (e.g. pint wrapping dask)

* add some tests for fetching ChunkManagers

* add from_array_kwargs to open_dataset

* add from_array_kwargs to open_zarr

* pipe constructors through chunkmanager

* generalize map_blocks inside coding

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed full_like

* add from_array_kwargs to open_zarr

* don't import dask.tokenize

* fix bugs with passing from_array_kwargs down

* generalise reductions by adding to chunkmanager

* moved nanfirst/nanlast to duck_array_ops from dask_array_ops

* generalize interp

* generalized chunk_hint function inside indexing

* DaskIndexingAdapter->ChunkedIndexingAdapter

* Revert "DaskIndexingAdapter->ChunkedIndexingAdapter"

This reverts commit 4ca044b.

* pass cubed-related kwargs down through to_zarr by adding .store to ChunkManager

* fix typing_extensions on py3.9

* fix ImportError with cubed array type

* give up trying to import TypeAlias in CI

* fix import of T_Chunks

* fix no_implicit_optional warnings

* don't define CubedManager if cubed can't be imported

* fix local mypy errors

* don't explicitly pass enforce_ndim into dask.array.map_blocks

* fix drop_axis default

* use indexing adapter on cubed arrays too

* use array API-compatible version of astype function

* whatsnew

* document new kwargs

* add chunkmanager entrypoint

* move CubedManager to a separate package

* guess chunkmanager based on whats available

* fix bug with tokenizing

* adapt tests to emulate existence of entrypoint

* use fixture to setup/teardown dummy entrypoint

* refactor to make DaskManager unavailable if dask not installed

* typing

* move whatsnew to latest xarray version

* remove superfluous lines from whatsnew

* fix bug where zarr backend attempted to use dask when not installed

* Remove rogue print statement

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

* Clarify what's new

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

* use monkeypatch to mock registering of dummy chunkmanager

* more tests for guessing chunkmanager correctly

* raise TypeError if no chunkmanager found for array types

* Correct is_chunked_array check

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

* vendor dask.array.core.normalize_chunks

* add default implementation of rechunk in ABC

* remove cubed-specific type check in daskmanager

* nanfirst->chunked_nanfirst

* revert adding cubed to NON_NUMPY_SUPPORTED_ARRAY_TYPES

* licensing to vendor functions from dask

* fix bug

* ignore mypy error

* separate chunk_manager kwarg from from_array_kwargs dict

* rename kwarg to chunked_array_type

* refactor from_array_kwargs in .chunk ready for deprecation

* print statements in test so I can comment on them

* remove print statements now I've commented on them in PR

* should fix dask naming tests

* make dask-specific kwargs explicit in from_array

* debugging print statements

* Revert "debugging print statements"

This reverts commit 7dc6581.

* fix gnarly bug with auto-determining chunksizes caused by not referring to dask.config

* hopefully fix broken docstring

* Revert "make dask-specific kwargs explicit in from_array"

This reverts commit 53d6094.

* show chunksize limit used in failing tests

* move lazy indexing adapter up out of chunkmanager code

* try upgrading minimum version of dask

* Revert "try upgrading minimum version of dask"

This reverts commit 796a577.

* un-vendor dask.array.core.normalize_chunks

* refactored to all passing ChunkManagerEntrypoint objects directly

* Remove redundant Nones from types

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

* From future import annotations

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* From functools import annotations

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* From future import annotations

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* defined type for NormalizedChunks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* standardized capitalization of ChunkManagerEntrypoint

* ensure ruff doesn't remove import

* ignore remaining typing errors stemming from unclear dask typing for chunks arguments

* rename store_kwargs->chunkmanager_store_kwargs

* missed return value

* array API fixes for astype

* Revert "array API fixes for astype"

This reverts commit 9cd9078.

* Apply suggestions from code review

* Update xarray/tests/test_parallelcompat.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* overridden -> subclassed

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* from_array_kwargs is optional

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* ensured all compute calls go through chunkmanager

* Raise if multiple chunkmanagers recognize array type

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* from_array_kwargs is optional

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* from_array_kwargs is optional

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* from_array_kwargs is optional

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* from_array_kwargs is optional

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* from_array_kwargs is optional

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* fixes for chunk methods

* correct readme to reflect fact we aren't vendoring dask in this PR any more

* update whatsnew

* more docstring corrections

* remove comment

* Raise NotImplementedErrors in all abstract methods

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

* type hints for every arg in ChunkManagerEntryPOint methods

* more explicit typing + fixes for mypy errors revealed

* Keyword-only arguments in full_like etc.

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* None as default instead of {}

* fix bug apparently introduced by changing default type of drop_axis kwarg to map_blocks

* Removed hopefully-unnecessary mypy ignore

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removed unnecessary mypy ignores

* change default value of drop_axis kwarg in map_blocks and catch when dask version < 2022.9.1

* fix checking of dask version in map_blocks

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
@TomNicholas TomNicholas mentioned this pull request Jun 28, 2023
4 tasks
@TomNicholas TomNicholas deleted the cubed_integration branch July 24, 2023 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design CI Continuous Integration tools dependencies Pull requests that update a dependency file enhancement io needs review run-benchmark Run the ASV benchmark workflow topic-arrays related to flexible array support topic-backends topic-CF conventions topic-dask topic-indexing topic-rolling topic-typing topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alternative parallel execution frameworks in xarray
9 participants