Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why are da.chunks and ds.chunks properties inconsistent? #5843

Closed
TomNicholas opened this issue Oct 7, 2021 · 6 comments · Fixed by #5900
Closed

Why are da.chunks and ds.chunks properties inconsistent? #5843

TomNicholas opened this issue Oct 7, 2021 · 6 comments · Fixed by #5900
Labels
API design design question topic-arrays related to flexible array support

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Oct 7, 2021

Basically the title, but what I'm referring to is this:

In [2]: da = xr.DataArray([[0, 1], [2, 3]], name='foo').chunk(1)

In [3]: ds = da.to_dataset()

In [4]: da.chunks
Out[4]: ((1, 1), (1, 1))

In [5]: ds.chunks
Out[5]: Frozen({'dim_0': (1, 1), 'dim_1': (1, 1)})

Why does DataArray.chunks return a tuple and Dataset.chunks return a frozen dictionary?

This seems a bit silly, for a few reasons:

  1. it means that some perfectly reasonable code might fail unnecessarily if passed a DataArray instead of a Dataset or vice versa, such as

    def is_core_dim_chunked(obj, core_dim):
        return len(obj.chunks[core_dim]) > 1

    which will work as intended for a dataset but raises a TypeError for a dataarray.

  2. it breaks the pattern we use for .sizes, where

    In [14]: da.sizes
    Out[14]: Frozen({'dim_0': 2, 'dim_1': 2})
    
    In [15]: ds.sizes
    Out[15]: Frozen({'dim_0': 2, 'dim_1': 2})
  3. if you want the chunks as a tuple they are always accessible via da.data.chunks, which is a more sensible place to look to find the chunks without dimension names.

  4. It's an undocumented difference, as the docstrings for ds.chunks and da.chunks both only say

    """Block dimensions for this dataset’s data or None if it’s not a dask array."""

    which doesn't tell me anything about the return type, or warn me that the return types are different.

    EDIT: In fact DataArray.chunk doesn't even appear to be listed on the API docs page at all.

In our codebase this difference is mostly washed out by us using ._to_temp_dataset() all the time, and also by the way that the .chunk() method accepts both the tuple and dict form, so both of these invariants hold (but in different ways):

ds == ds.chunk(ds.chunks)
da == da.chunk(da.chunks)

I'm not sure whether making this consistent is worth the effort of a significant breaking change though 😕

(Sort of related to #2103)

@TomNicholas
Copy link
Member Author

Variable.chunks also returns a tuple, which again I feel is weird given that variables have named dimensions.

There is another difference between ds.chunks and da.chunks - the former checks for inconsistent chunking between different variables when called (and will raise ValueError Object has inconsistent chunks along dimension {dim}. This can be fixed by calling unify_chunks()."). In contrast da.chunks doesn't check, and so it's possible to have a DataArray whose data variable is chunked inconsistently with its coordinate variables and not be warned about it.

@shoyer
Copy link
Member

shoyer commented Oct 7, 2021

The honest answer is that I didn't think too carefully about this when originally implementing Xarray's Dask wrapper back in 2015.

DataArray.chunks forwards to chunks on Dask arrays (a tuple), but that didn't make sense for Dataset.chunks due to the lack of a dimension ordering.

@TomNicholas
Copy link
Member Author

The honest answer is that I didn't think too carefully about this when originally implementing Xarray's Dask wrapper back in 2015.

I guessed that might be the case!

I'm not sure whether making this consistent is worth the effort of a significant breaking change though

Still leaves this question though ^ . I made a draft PR in #5846.

@dcherian
Copy link
Contributor

For DataArrays there is an underlying chunks property so it makes sense to forward it (like shape and dtype). Though perhaps we should only forward those properties that are common to all duck arrays.

It seems better to introduce a new property on both DataArrays and Datasets that always returns a dict (Like sizes vs shape). I came up with two names but don't like either of them: chunksizes seems too similar to chunks; dims_chunks doesn't really seem great either.

There is a similar problem for dtype as @crusaderky points out here

@TomNicholas
Copy link
Member Author

It seems better to introduce a new property on both DataArrays and Datasets that always returns a dict

That's a good suggestion - then we can have backwards compatibility whilst also allowing intuitive code that treats dataarrays and datasets similarly, e.g:

def is_core_dim_chunked(obj, core_dim):
    return len(obj.chunksizes[core_dim]) > 1

chunksizes seems too similar to chunks

I think chunksizes is quite good: it is in keeping with sizes, and auto-complete would also show both chunks and chunksizes when a user types .ch[tab] which I think is helpful.

@dcherian dcherian added the topic-arrays related to flexible array support label Oct 13, 2021
@max-sixty
Copy link
Collaborator

Agree! Now we just need to decide between chunksizes and chunk_sizes...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design design question topic-arrays related to flexible array support
Projects
None yet
4 participants