
Support for __array_function__ implementers (sparse arrays) [WIP] #3117

Merged: 31 commits, Aug 5, 2019

Conversation

@nvictus (Contributor) commented Jul 13, 2019

Doing a SciPy sprint. Working towards #1375

Together with pydata/sparse#261, it seems to work.

  • Closes #xxxx
  • Tests added
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

@pep8speaks commented Jul 13, 2019

Hello @nvictus! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-05 15:49:19 UTC

@codecov bot commented Jul 13, 2019

Codecov Report

Merging #3117 into master will increase coverage by 0.03%.
The diff coverage is 91.89%.

@@            Coverage Diff             @@
##           master    #3117      +/-   ##
==========================================
+ Coverage    95.7%   95.73%   +0.03%     
==========================================
  Files          63       63              
  Lines       12861    12870       +9     
==========================================
+ Hits        12308    12321      +13     
+ Misses        553      549       -4

@andersy005 (Member) commented Jul 14, 2019

@nvictus, thank you for your work!

I just tried this on CuPy arrays, and it seems to be working during array creation:

In [1]: import cupy as cp                                                           

In [2]: import xarray as xr                                                         

In [3]: x = cp.arange(6).reshape(2, 3).astype('f')                                  

In [4]: y = cp.ones((2, 3), dtype='int')                           

In [5]: x                                                          
Out[5]: 
array([[0., 1., 2.],
       [3., 4., 5.]], dtype=float32)

In [6]: y                                                          
Out[6]: 
array([[1, 1, 1],
       [1, 1, 1]])

In [7]: y.device                                                   
Out[7]: <CUDA Device 0>

In [8]: x.device                                                   
Out[8]: <CUDA Device 0>

In [9]: ds = xr.Dataset()                                          

In [10]: ds['x'] = xr.DataArray(x, dims=['lat', 'lon'])            

In [11]: ds['y'] = xr.DataArray(y, dims=['lat', 'lon'])            

In [12]: ds                                                        
Out[12]: 
<xarray.Dataset>
Dimensions:  (lat: 2, lon: 3)
Dimensions without coordinates: lat, lon
Data variables:
    x        (lat, lon) float32 ...
    y        (lat, lon) int64 ...

In [13]: ds.x.data.device                                          
Out[13]: <CUDA Device 0>

In [14]: ds.y.data.device                                          
Out[14]: <CUDA Device 0>

Even though it failed when I tried applying an operation on the dataset, this is still awesome! I am very excited and looking forward to seeing this feature in xarray:

In [15]: m = ds.mean(dim='lat')                                    
-------------------------------------------------------------------
TypeError                         Traceback (most recent call last)
<ipython-input-15-8e4d5e7d5ee3> in <module>
----> 1 m = ds.mean(dim='lat')

/glade/work/abanihi/devel/pangeo/xarray/xarray/core/common.py in wrapped_func(self, dim, skipna, **kwargs)
     65                 return self.reduce(func, dim, skipna=skipna,
     66                                    numeric_only=numeric_only, allow_lazy=True,
---> 67                                    **kwargs)
     68         else:
     69             def wrapped_func(self, dim=None, **kwargs):  # type: ignore

/glade/work/abanihi/devel/pangeo/xarray/xarray/core/dataset.py in reduce(self, func, dim, keep_attrs, keepdims, numeric_only, allow_lazy, **kwargs)
   3532                                                  keepdims=keepdims,
   3533                                                  allow_lazy=allow_lazy,
-> 3534                                                  **kwargs)
   3535 
   3536         coord_names = set(k for k in self.coords if k in variables)

/glade/work/abanihi/devel/pangeo/xarray/xarray/core/variable.py in reduce(self, func, dim, axis, keep_attrs, keepdims, allow_lazy, **kwargs)
   1392         input_data = self.data if allow_lazy else self.values
   1393         if axis is not None:
-> 1394             data = func(input_data, axis=axis, **kwargs)
   1395         else:
   1396             data = func(input_data, **kwargs)

/glade/work/abanihi/devel/pangeo/xarray/xarray/core/duck_array_ops.py in mean(array, axis, skipna, **kwargs)
    370         return _to_pytimedelta(mean_timedeltas, unit='us') + offset
    371     else:
--> 372         return _mean(array, axis=axis, skipna=skipna, **kwargs)
    373 
    374 

/glade/work/abanihi/devel/pangeo/xarray/xarray/core/duck_array_ops.py in f(values, axis, skipna, **kwargs)
    257 
    258         try:
--> 259             return func(values, axis=axis, **kwargs)
    260         except AttributeError:
    261             if isinstance(values, dask_array_type):

/glade/work/abanihi/devel/pangeo/xarray/xarray/core/nanops.py in nanmean(a, axis, dtype, out)
    158         return dask_array.nanmean(a, axis=axis, dtype=dtype)
    159 
--> 160     return np.nanmean(a, axis=axis, dtype=dtype)
    161 
    162 

/glade/work/abanihi/softwares/miniconda3/envs/xarray-tests/lib/python3.7/site-packages/numpy/core/overrides.py in public_api(*args, **kwargs)
    163             relevant_args = dispatcher(*args, **kwargs)
    164             return implement_array_function(
--> 165                 implementation, public_api, relevant_args, args, kwargs)
    166 
    167         if module is not None:

TypeError: no implementation found for 'numpy.nanmean' on types that implement __array_function__: [<class 'cupy.core.core.ndarray'>]

@shoyer (Member) left a review comment:

I am really excited about this!

@shoyer (Member) commented Jul 14, 2019

> Even though it failed when I tried applying an operation on the dataset, this is still awesome!

Yes, it really is!

For this specific failure, we should think about adding an option for the default skipna value, or maybe making the semantics depend on the array type.

If someone is using xarray to wrap a computation-oriented library like CuPy, they probably almost always want to set skipna=False (along with join='exact'). I don't think I've seen any deep learning library that has bothered to implement nanmean.
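
For illustration, a minimal sketch of the workaround with today's API, assuming the CuPy session above: with skipna=False, xarray reduces with np.mean instead of np.nanmean, and CuPy dispatches np.mean through __array_function__:

# skipna=False avoids the nan-aware path, so np.nanmean is never called
m = ds.mean(dim='lat', skipna=False)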

@mrocklin mentioned this pull request Jul 14, 2019
Review comment from a Member on xarray/tests/test_nep18.py:

sparse = pytest.importorskip('sparse')

Xarray does automatic alignment, so it would also be really good to add a test case with multiple duck arrays with different coordinate labels, to verify that alignment works. xarray/tests/test_dask.py also has a bunch of good examples of things to test.
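
For example, a hypothetical test along these lines (not from this PR; it assumes indexing-based alignment works for the sparse backend) could exercise alignment between two sparse-backed arrays:

import sparse
import xarray as xr

# two sparse-backed DataArrays with partially overlapping labels on 'x'
a = xr.DataArray(sparse.random((4, 3)), dims=('x', 'y'), coords={'x': [0, 1, 2, 3]})
b = xr.DataArray(sparse.random((4, 3)), dims=('x', 'y'), coords={'x': [2, 3, 4, 5]})

a2, b2 = xr.align(a, b, join='inner')   # keeps only the shared labels x = 2, 3
assert isinstance(a2.data, sparse.SparseArray)
assert list(a2.x.values) == [2, 3]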

Commit: "This version should be much more compatible out of the box with duck typing."
@nvictus (Contributor, author) commented Jul 17, 2019

After writing more tests, it turned out that sparse-backed xarray objects were generating strange coordinates. On closer inspection, this is because sparse.COO objects have their own coords attribute, which stores the indices of their nonzero entries.

Given a serendipitous shape and density, a sparse array can have exactly the right number of entries in coords for them to be misinterpreted and converted into xarray coords:

>>> S = sparse.random((100,100)) 
>>> S.coords                                                                                                                                                                                                          
array([[ 0,  0,  3,  3,  4,  5,  6,  7,  7,  7,  9, 10, 11, 14, 14, 16,
        17, 18, 19, 19, 21, 21, 21, 21, 22, 23, 23, 24, 24, 25, 26, 28,
        29, 31, 35, 35, 36, 36, 38, 39, 41, 42, 42, 43, 44, 46, 46, 47,
        48, 48, 49, 49, 49, 49, 50, 52, 53, 54, 55, 56, 57, 57, 58, 60,
        60, 61, 61, 62, 64, 64, 68, 70, 72, 73, 76, 77, 78, 79, 79, 80,
        81, 83, 84, 85, 88, 89, 90, 90, 90, 92, 92, 93, 93, 94, 94, 96,
        97, 98, 98, 99],
       [14, 28, 58, 67, 66, 37, 50,  9, 66, 67,  6, 22, 44, 51, 64, 26,
        53, 91, 45, 81, 13, 25, 42, 86, 47, 22, 67, 77, 81, 22, 96, 18,
        23, 96, 34, 57,  6, 96, 91, 86, 75, 86, 88, 91, 30,  6, 65, 47,
        61, 74, 14, 73, 82, 83, 32, 42, 53, 68, 21, 38, 48, 50, 87,  8,
        89, 22, 57, 60, 92, 94, 19, 79, 38, 53, 32, 95, 69, 22, 46, 17,
        17, 86, 36,  7, 71, 35,  9, 58, 79, 22, 68, 10, 47, 48, 54, 72,
        24, 47, 63, 86]])
>>> xr.DataArray(S)
<xarray.DataArray (dim_0: 100, dim_1: 100)>
<COO: shape=(100, 100), dtype=float64, nnz=100, fill_value=0.0>
Coordinates:
  * dim_0    (dim_0) int64 0 0 3 3 4 5 6 7 7 7 ... 92 93 93 94 94 96 97 98 98 99
  * dim_1    (dim_1) int64 14 28 58 67 66 37 50 9 66 ... 47 48 54 72 24 47 63 86

A simple fix is to special-case SparseArrays in the DataArray constructor and not extract the sparse coords attribute in that case, but that would require importing sparse.

Would it make sense to just assume that all non-DataArray NEP-18 compliant arrays do not contain an xarray-compliant coords attribute?

@nvictus (Contributor, author) commented Jul 17, 2019

Hmm, looks like the DataWithCoords base type might come in handy here.

On this test code:

assert isinstance(func(A).data, sparse.SparseArray)


class TestSparseVariable:

@max-sixty (Collaborator) commented:

An aside (don't let this slow you down): these are tests that we could run across any eligible array, including the current numpy & dask arrays

You could pull tests from the current dask tests if that would be a shortcut to building up this test suite from scratch

We could do that by inheriting this class, or having var as a fixture

@nvictus (Contributor, author) replied:

Thanks @max-sixty. I did copy them over from the dask tests. I'm still figuring out which ones are transferable to this use case since I'm not a regular xarray user myself (yet) :)

@max-sixty replied:
Great! LMK if I can be helpful at all. Excited about the PR!

@shoyer (Member) commented Jul 17, 2019

> Would it make sense to just assume that all non-DataArray NEP-18 compliant arrays do not contain an xarray-compliant coords attribute?

Yes, let's switch:

coords = getattr(data, 'coords', None) 

to

if isinstance(data, DataArray):
    coords = data.coords
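
With that guard in place, a quick (hypothetical) check that sparse inputs no longer leak their coords:

S = sparse.random((100, 100))
da = xr.DataArray(S)
assert len(da.coords) == 0   # dims without coordinates; S.coords is ignored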

@nvictus (Contributor, author) commented Aug 1, 2019

Thanks for bumping this @mrocklin! I've put in some extra work in my free time, which hasn't been pushed yet. I'll try to write up a summary of my findings today. Briefly, though, it seems like the two limiting factors for NEP-18 duck array support are:

  1. Operations that ultimately coerce duck arrays to ndarrays (e.g. via np.asarray). Many, but maybe not all, of these operations can be fixed to dispatch to the duck array's implementation. But that leads to:

  2. Operations not supported by the duck type. This happens in a few cases with pydata/sparse and would have to be solved upstream, unless it's a special case where it might be okay to coerce, e.g. what happens with binary operations that mix array types?

I think NEP18-backed xarray structures can be supported in principle, but it won't prevent some operations from simply failing in some contexts. So maybe xarray will need to define a minimum required implementation subset of the array API for duck arrays.
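
To make the two failure modes concrete, here is a small sketch (assuming pydata/sparse with __array_function__ enabled, and its default refusal to densify implicitly):

import numpy as np
import sparse

S = sparse.random((10, 10), density=0.1)

# (1) implicit coercion: COO refuses silent densification, so code paths
# that call np.asarray fail instead of quietly materializing a dense array
try:
    np.asarray(S)
except RuntimeError:
    print('sparse refused implicit densification')

# functions sparse implements are dispatched via __array_function__ ...
np.sum(S)

# ... (2) but anything it does not implement fails loudly
try:
    np.nanmean(S)
except TypeError as e:
    print(e)   # "no implementation found for 'numpy.nanmean' ..."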

@shoyer (Member) commented Aug 1, 2019

> 2. Operations not supported by the duck type. This happens in a few cases with pydata/sparse and would have to be solved upstream, unless it's a special case where it might be okay to coerce, e.g. what happens with binary operations that mix array types?

This is totally fine for now, as long as there are clear errors when attempting to do an unsupported operation. We can write unit tests with expected failures, which should provide a clear roadmap for things to fix upstream in sparse.

We could attempt to define a minimum required implementation, but in practice I suspect this will be hard to nail down definitively. The ultimate determinant of what works will be xarray's implementation.
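
In pytest terms, that roadmap could be recorded roughly like this (a sketch; the reason string and test body are illustrative, not from this PR):

import pytest

@pytest.mark.xfail(reason='np.nanmean not implemented by sparse')
def test_mean_skipna(self):
    self.var.mean(skipna=True)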

@nvictus (Contributor, author) commented Aug 5, 2019

So, tests are passing now and I've documented the expected failures on sparse arrays. :)

As mentioned before, most fall into the categories of (1) implicit coercion to dense and (2) missing operations on sparse arrays.

It turns out a lot of the implicit coercion is due to the use of routines from bottleneck, with a few other cases of np.asarray remaining, sometimes via the .values attribute. At the moment, the behavior of Variables and DataArrays is such that .data provides the duck array while .values coerces to numpy, following the original behavior for dask arrays -- which made me realize: we never asked whether this behavior is desirable in general.

I also modified Variable.load() to be a no-op for duck arrays (unless the duck array is a dask array), so compute() now works. To make equality comparisons work between dense and sparse, I modified as_like_arrays() to coerce all arguments to sparse if at least one is sparse. As a side effect, a.identical(b) can return True when one is dense and the other sparse. Again, this mirrors the existing behavior when the duck array is a dask array, but extending it to other duck arrays may be questionable.

If bottleneck is turned off, most remaining issues are due to missing implementations in sparse. Some of these should be easy to add: pydata/sparse#261 already exposes the result_type function.
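
The Variable.load() change described above might look roughly like this (an approximation of the idea, not the exact diff):

def load(self, **kwargs):
    # dask arrays still get computed; other NEP-18 duck arrays are already
    # in memory, so loading becomes a no-op for them
    if isinstance(self._data, dask_array_type):
        self._data = as_compatible_data(self._data.compute(**kwargs))
    elif not hasattr(self._data, '__array_function__'):
        self._data = np.asarray(self._data)
    return self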

@shoyer (Member) left a review comment:

Looks great to me! I think we can merge this in after syncing again with master.

I probably wouldn't call this ready for widespread use until we get a few more features working (e.g., concat), but this is a great start, and it will be easier for others to contribute once this is merged.

On this test code:
(do('astype', int), True),
(do('broadcast_equals', make_xrarray({'x': 10, 'y': 5})), False),
(do('clip', min=0, max=1), True),
# (do('close'), False),
@shoyer (Member) commented:

I'm guessing the commented out cases don't work yet? :)

@nvictus (Contributor, author) replied:

Oops, forgot to finish those! :)
I've added tests for all except those I couldn't figure out how to use, like swap_dims, sel_points and isel_points.

@shoyer (Member) commented Aug 5, 2019

> At the moment, the behavior of Variables and DataArrays is such that .data provides the duck array while .values coerces to numpy, following the original behavior for dask arrays -- which made me realize: we never asked whether this behavior is desirable in general.

I think the right behavior is probably for .values to be implemented by calling np.asarray() on .data. That means it should raise on sparse arrays.
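
In other words, roughly (a sketch of the idea, not the actual property):

@property
def values(self):
    # np.asarray computes dask arrays and converts duck arrays to numpy;
    # sparse.COO refuses implicit densification, so this raises for sparse
    return np.asarray(self.data)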

@shoyer (Member) left a review comment:

thanks!

On this test code:

def test_binary_op(self):
    assert_sparse_eq((2 * self.var).data, 2 * self.data)
    assert_sparse_eq((self.var + self.var).data, self.data + self.data)
    # assert_eq((self.var[0] + self.var).data, self.data[0] + self.data)
@shoyer (Member) commented:

remove?

@shoyer (Member) commented Aug 5, 2019

@nvictus we are good to go ahead and merge, and do follow-ups in other PRs?

@nvictus (Contributor, author) commented Aug 5, 2019

Sounds good!

@shoyer merged commit bbd25c6 into pydata:master on Aug 5, 2019
@dcherian (Contributor) commented Aug 5, 2019

Thanks a lot, @nvictus

@mrocklin (Contributor) commented Aug 5, 2019

Woot! Thanks @nvictus !

dcherian added a commit to yohai/xarray that referenced this pull request Aug 7, 2019
* master:
  pyupgrade one-off run (pydata#3190)
  mfdataset, concat now support the 'join' kwarg. (pydata#3102)
  reduce the size of example dataset in dask docs (pydata#3187)
  add climpred to related-projects (pydata#3188)
  bump rasterio to 1.0.24 in doc building environment (pydata#3186)
  More annotations (pydata#3177)
  Support for __array_function__ implementers (sparse arrays) [WIP] (pydata#3117)
  Internal clean-up of isnull() to avoid relying on pandas (pydata#3132)
  Call darray.compute() in plot() (pydata#3183)
  BUG: fix + test open_mfdataset fails on variable attributes with list… (pydata#3181)
@@ -21,6 +21,7 @@ dependencies:
   - pip
   - scipy
   - seaborn
+  - sparse
   - toolz
@crusaderky (Contributor) commented Aug 8, 2019:

Why did you exclude py36 and Windows?

@max-sixty (Collaborator) commented:

Thanks @nvictus !

dcherian added a commit to dcherian/xarray that referenced this pull request Aug 8, 2019
* master:
  pyupgrade one-off run (pydata#3190)
  mfdataset, concat now support the 'join' kwarg. (pydata#3102)
  reduce the size of example dataset in dask docs (pydata#3187)
  add climpred to related-projects (pydata#3188)
  bump rasterio to 1.0.24 in doc building environment (pydata#3186)
  More annotations (pydata#3177)
  Support for __array_function__ implementers (sparse arrays) [WIP] (pydata#3117)
  Internal clean-up of isnull() to avoid relying on pandas (pydata#3132)
  Call darray.compute() in plot() (pydata#3183)
  BUG: fix + test open_mfdataset fails on variable attributes with list… (pydata#3181)