Why are `da.chunks` and `ds.chunks` properties inconsistent? #5843

TomNicholas · 2021-10-07T17:21:01Z

Basically the title, but what I'm referring to is this:

In [2]: da = xr.DataArray([[0, 1], [2, 3]], name='foo').chunk(1)

In [3]: ds = da.to_dataset()

In [4]: da.chunks
Out[4]: ((1, 1), (1, 1))

In [5]: ds.chunks
Out[5]: Frozen({'dim_0': (1, 1), 'dim_1': (1, 1)})

Why does DataArray.chunks return a tuple and Dataset.chunks return a frozen dictionary?

This seems a bit silly, for a few reasons:

it means that some perfectly reasonable code might fail unnecessarily if passed a DataArray instead of a Dataset or vice versa, such as
```
def is_core_dim_chunked(obj, core_dim):
    return len(obj.chunks[core_dim]) > 1
```
which will work as intended for a dataset but raises a TypeError for a dataarray, because a tuple cannot be indexed with a dimension name.

it breaks the pattern we use for .sizes, where

In [14]: da.sizes
Out[14]: Frozen({'dim_0': 2, 'dim_1': 2})

In [15]: ds.sizes
Out[15]: Frozen({'dim_0': 2, 'dim_1': 2})

if you want the chunks as a tuple they are always accessible via da.data.chunks, which is a more sensible place to look to find the chunks without dimension names.
It's an undocumented difference, as the docstrings for ds.chunks and da.chunks both only say

"""Block dimensions for this dataset’s data or None if it’s not a dask array."""

which doesn't tell me anything about the return type, or warn me that the return types are different.
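As a stopgap for the first point above, the two return types can be normalised by hand. The following is only a sketch: `chunks_by_dim` is a hypothetical helper name, not part of xarray, and it assumes dask is installed so that `.chunk()` works.

```python
from collections.abc import Mapping

import xarray as xr


def chunks_by_dim(obj):
    """Return {dim: chunk sizes} for a DataArray or Dataset (hypothetical helper)."""
    chunks = obj.chunks
    if chunks is None:
        # DataArray that is not backed by a dask array: nothing is chunked.
        return {}
    if isinstance(chunks, Mapping):
        # Dataset.chunks is already keyed by dimension name.
        return dict(chunks)
    # DataArray.chunks is a tuple ordered like obj.dims.
    return dict(zip(obj.dims, chunks))


def is_core_dim_chunked(obj, core_dim):
    # A dimension missing from the mapping is treated as unchunked.
    return len(chunks_by_dim(obj).get(core_dim, ())) > 1


da = xr.DataArray([[0, 1], [2, 3]], name="foo").chunk(1)
ds = da.to_dataset()

print(chunks_by_dim(da))
print(chunks_by_dim(ds))
print(is_core_dim_chunked(da, "dim_0"), is_core_dim_chunked(ds, "dim_0"))
```

With this wrapper the same `is_core_dim_chunked` call accepts either object, at the cost of every caller having to know about the helper.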

EDIT: In fact DataArray.chunk doesn't even appear to be listed on the API docs page at all.

In our codebase this difference is mostly washed out by us using ._to_temp_dataset() all the time, and also by the way that the .chunk() method accepts both the tuple and dict form, so both of these invariants hold (but in different ways):

ds == ds.chunk(ds.chunks)
da == da.chunk(da.chunks)

I'm not sure whether making this consistent is worth the effort of a significant breaking change though 😕

(Sort of related to #2103)


TomNicholas · 2021-10-07T22:00:55Z

Variable.chunks also returns a tuple, which again I feel is weird given that variables have named dimensions.

There is another difference between ds.chunks and da.chunks: the former checks for inconsistent chunking between different variables when called (and raises ValueError("Object has inconsistent chunks along dimension {dim}. This can be fixed by calling unify_chunks().") if it finds any). In contrast, da.chunks doesn't check, so it's possible to have a DataArray whose data variable is chunked inconsistently with its coordinate variables and not be warned about it.
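The Dataset-side check can be reproduced with a small sketch. The variable names `a` and `b` and the sizes are made up for illustration, and dask is assumed to be installed.

```python
import numpy as np
import xarray as xr

# Two variables that share dimension "x" but are chunked differently along it.
a = xr.DataArray(np.arange(4), dims="x").chunk(2)
b = xr.DataArray(np.arange(4), dims="x").chunk(1)
ds = xr.Dataset({"a": a, "b": b})

try:
    ds.chunks
    inconsistent = False
except ValueError as err:
    # Dataset.chunks refuses to report a single chunking for "x".
    inconsistent = True
    print(err)

# unify_chunks() rechunks the variables so they agree along shared dimensions.
unified = ds.unify_chunks()
unified_chunks = dict(unified.chunks)
print(unified_chunks)
```

Each DataArray on its own reports its chunks without complaint; the error only appears once the two are compared inside the Dataset.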

shoyer · 2021-10-07T23:56:54Z

The honest answer is that I didn't think too carefully about this when originally implementing Xarray's Dask wrapper back in 2015.

DataArray.chunks forwards to chunks on Dask arrays (a tuple), but that didn't make sense for Dataset.chunks due to the lack of a dimension ordering.
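A quick way to see this forwarding, assuming dask is installed, is to compare the property with the one on the wrapped array:

```python
import xarray as xr

da = xr.DataArray([[0, 1], [2, 3]], name="foo").chunk(1)

# The DataArray property matches the tuple reported by the underlying dask array.
same_as_dask = da.chunks == da.data.chunks
print(da.chunks, same_as_dask)

# For data that is not a dask array there is nothing to forward.
unchunked = xr.DataArray([[0, 1], [2, 3]]).chunks
print(unchunked)
```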

TomNicholas · 2021-10-08T00:03:36Z

> The honest answer is that I didn't think too carefully about this when originally implementing Xarray's Dask wrapper back in 2015.

I guessed that might be the case!

> I'm not sure whether making this consistent is worth the effort of a significant breaking change though

That still leaves this question, though. I made a draft PR in #5846.

dcherian · 2021-10-11T06:22:24Z

For DataArrays there is an underlying chunks property so it makes sense to forward it (like shape and dtype). Though perhaps we should only forward those properties that are common to all duck arrays.

It seems better to introduce a new property on both DataArrays and Datasets that always returns a dict (Like sizes vs shape). I came up with two names but don't like either of them: chunksizes seems too similar to chunks; dims_chunks doesn't really seem great either.

There is a similar problem for dtype as @crusaderky points out here

TomNicholas · 2021-10-13T16:09:15Z

> It seems better to introduce a new property on both DataArrays and Datasets that always returns a dict

That's a good suggestion - then we can have backwards compatibility whilst also allowing intuitive code that treats dataarrays and datasets similarly, e.g:

def is_core_dim_chunked(obj, core_dim):
    return len(obj.chunksizes[core_dim]) > 1

> chunksizes seems too similar to chunks

I think chunksizes is quite good: it is in keeping with sizes, and auto-complete would also show both chunks and chunksizes when a user types .ch[tab] which I think is helpful.
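For reference, here is roughly how this looks with the `.chunksizes` property that was later added in #5900. This sketch assumes an xarray version that includes that change, with dask installed. Using `.get()` with an empty default is my own addition, so that a dimension with no recorded chunks counts as unchunked instead of raising a KeyError.

```python
import xarray as xr


def is_core_dim_chunked(obj, core_dim):
    # chunksizes maps dimension names to chunk tuples on both object types.
    return len(obj.chunksizes.get(core_dim, ())) > 1


da = xr.DataArray([[0, 1], [2, 3]], name="foo").chunk(1)
ds = da.to_dataset()

print(dict(da.chunksizes))
print(dict(ds.chunksizes))
print(is_core_dim_chunked(da, "dim_0"), is_core_dim_chunked(ds, "dim_0"))
```

The existing `.chunks` properties are untouched, so code that relies on the tuple form of `da.chunks` keeps working.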

max-sixty · 2021-10-23T21:23:52Z

Agree! Now we just need to decide between chunksizes and chunk_sizes...

TomNicholas added the API design and design question labels Oct 7, 2021

TomNicholas mentioned this issue Oct 8, 2021

Change return type of DataArray.chunks and Dataset.chunks to a dict #5846

Closed

5 tasks

dcherian added the topic-arrays related to flexible array support label Oct 13, 2021

TomNicholas mentioned this issue Oct 26, 2021

Add .chunksizes property #5900

Merged

5 tasks

TomNicholas closed this as completed in #5900 Oct 29, 2021

TomNicholas mentioned this issue Dec 8, 2021

map_blocks not converting dataarrays correctly #6052

Closed

Illviljan mentioned this issue Dec 11, 2021

DOC: Add "auto" to dataarray chunk method #6068

Merged

5 tasks
