apply to dataset #4863

Open
wants to merge 36 commits into
base: main
Changes from 25 commits
Commits
5655571
add a apply_to_dataset method
keewis Jan 30, 2021
90f8d55
write a test for apply_to_dataset on a DataArray
keewis Jan 30, 2021
fd2c897
also add a test for dataset
keewis Jan 30, 2021
857c783
convert apply_to_dataset to a top-level function
keewis Feb 5, 2021
57e94b6
update whats-new.rst
keewis Feb 5, 2021
cdb0f3d
add the new function to api.rst [skip-ci]
keewis Feb 5, 2021
1d81a49
rephrase the note [skip-ci]
keewis Feb 5, 2021
88fe863
add a see also section [skip-ci]
keewis Feb 5, 2021
0daf42d
add examples [skip-ci]
keewis Feb 5, 2021
ef3f791
Merge branch 'master' into apply-to-dataset
keewis Feb 7, 2021
638d61c
Merge branch 'master' into apply-to-dataset
keewis Feb 11, 2021
559d8ef
rename to call_on_dataset
keewis Mar 15, 2021
0c424bf
preserve the name as much as possible
keewis Mar 15, 2021
8db9e7e
update api.rst
keewis Mar 15, 2021
c902dfe
Merge branch 'master' into apply-to-dataset
keewis Mar 15, 2021
43bf70d
update whats-new.rst
keewis Mar 15, 2021
31645e5
remove the notes
keewis Mar 15, 2021
293d9c1
remove the no-op
keewis Mar 15, 2021
d0de1ca
don't rename to None
keewis Mar 15, 2021
a822232
rename to "<this-array>"
keewis Mar 15, 2021
d278919
rewrite [skip-ci]
keewis Mar 15, 2021
0669da9
Merge branch 'master' into apply-to-dataset
keewis Mar 15, 2021
97d4338
rename back to None
keewis Mar 15, 2021
48109db
Merge branch 'master' into apply-to-dataset
keewis Mar 28, 2021
b15d45e
Merge branch 'master' into apply-to-dataset
keewis Apr 5, 2021
371f509
introduce a mandatory name parameter to use as a name for the data va…
keewis May 10, 2021
8f37872
Merge branch 'master' into apply-to-dataset
keewis May 10, 2021
c9459f7
move to the new section in whats-new.rst
keewis May 10, 2021
021ad36
fix the tests
keewis May 11, 2021
dcb747b
Merge branch 'master' into apply-to-dataset
keewis May 31, 2021
7081e15
update the input and expected values
keewis May 31, 2021
52a39f3
add the missing name for the dataset call
keewis May 31, 2021
fcfaaa5
use DataArray.to_dataset instead
keewis May 31, 2021
f2d2880
only convert if the result is a Dataset
keewis May 31, 2021
b59dd1e
Merge branch 'master' into apply-to-dataset
dcherian Jun 21, 2021
12400cb
Merge branch 'main' into apply-to-dataset
keewis Jul 23, 2021
1 change: 1 addition & 0 deletions doc/api.rst
@@ -19,6 +19,7 @@ Top-level functions
apply_ufunc
align
broadcast
call_on_dataset
concat
merge
combine_by_coords
3 changes: 3 additions & 0 deletions doc/whats-new.rst
@@ -64,6 +64,9 @@ New Features
:py:class:`~core.groupby.DataArrayGroupBy`, inspired by pandas'
:py:meth:`~pandas.core.groupby.GroupBy.get_group`.
By `Deepak Cherian <https://github.com/dcherian>`_.
- Add :py:func:`call_on_dataset` as a way to apply functions expecting
:py:class:`Dataset` objects to :py:class:`DataArray` objects (:issue:`4837`, :pull:`4863`).
By `Justus Magin <https://github.com/keewis>`_.
- Add a ``combine_attrs`` parameter to :py:func:`open_mfdataset` (:pull:`4971`).
By `Justus Magin <https://github.com/keewis>`_.
- Disable the `cfgrib` backend if the `eccodes` library is not installed (:pull:`5083`). By `Baudouin Raoult <https://github.com/b8raoult>`_.
11 changes: 10 additions & 1 deletion xarray/__init__.py
@@ -18,7 +18,15 @@
from .core.alignment import align, broadcast
from .core.combine import combine_by_coords, combine_nested
from .core.common import ALL_DIMS, full_like, ones_like, zeros_like
from .core.computation import apply_ufunc, corr, cov, dot, polyval, where
from .core.computation import (
    apply_ufunc,
    call_on_dataset,
    corr,
    cov,
    dot,
    polyval,
    where,
)
from .core.concat import concat
from .core.dataarray import DataArray
from .core.dataset import Dataset
@@ -46,6 +54,7 @@
    # Top-level functions
    "align",
    "apply_ufunc",
    "call_on_dataset",
    "as_variable",
    "broadcast",
    "cftime_range",
93 changes: 93 additions & 0 deletions xarray/core/computation.py
@@ -1151,6 +1151,99 @@ def earth_mover_distance(first_samples,
return apply_array_ufunc(func, *args, dask=dask)


def call_on_dataset(func, obj, *args, **kwargs):
    """Apply a function expecting a Dataset to an xarray object.

    Parameters
    ----------
    func : callable
        A function expecting a Dataset as its first parameter.
    obj : DataArray or Dataset
        The object to apply ``func`` to. If a ``DataArray``, convert it to a single
        variable ``Dataset`` first.
    *args, **kwargs
        Additional arguments to ``func``

    Returns
    -------
    DataArray or Dataset
        The result of ``func(obj, *args, **kwargs)`` with the same type as ``obj``.

    Notes
    -----
    DataArray objects without a name (or named ``None``) will be renamed to
    ``"<this-array>"`` before being passed to ``func``. The empty name will be restored
    for the result of the call.

    See Also
    --------
    Dataset.map
    Dataset.pipe
    DataArray.pipe

    Examples
    --------
    >>> def f(ds):
    ...     return xr.Dataset(
    ...         {
    ...             name: var * var.attrs.get("scale", 1)
    ...             for name, var in ds.data_vars.items()
    ...         },
    ...         coords=ds.coords,
    ...         attrs=ds.attrs,
    ...     )
    ...
    >>> ds = xr.Dataset(
    ...     {"a": ("x", [3, 4], {"scale": 0.5}), "b": ("x", [-1, 1], {"scale": 1.5})},
    ...     coords={"x": [0, 1]},
    ...     attrs={"attr": "value"},
    ... )
    >>> ds
    <xarray.Dataset>
    Dimensions:  (x: 2)
    Coordinates:
      * x        (x) int64 0 1
    Data variables:
        a        (x) int64 3 4
        b        (x) int64 -1 1
    Attributes:
        attr:     value
    >>> xr.call_on_dataset(f, ds)
    <xarray.Dataset>
    Dimensions:  (x: 2)
    Coordinates:
      * x        (x) int64 0 1
    Data variables:
        a        (x) float64 1.5 2.0
        b        (x) float64 -1.5 1.5
    Attributes:
        attr:     value
    >>> xr.call_on_dataset(f, ds.a)
    <xarray.DataArray 'a' (x: 2)>
    array([1.5, 2. ])
    Coordinates:
      * x        (x) int64 0 1
    """
    from .dataarray import _THIS_ARRAY, DataArray
    from .parallel import dataarray_to_dataset, dataset_to_dataarray

    if isinstance(obj, DataArray):
        ds = dataarray_to_dataset(obj)
        if obj.name is None:
            ds = ds.rename({_THIS_ARRAY: "<this-array>"})
Contributor

another option would be to use backends.api.DATAARRAY_VARIABLE, which is used when writing a DataArray to netCDF (I think). I don't feel strongly about this.

Collaborator Author

@keewis keewis Apr 5, 2021

yeah, I don't know which name would be better, either. _THIS_ARRAY is not part of the public API so we can't use it, and None obviously doesn't work, either. Using a string seems like a good choice but the exact value will almost always be arbitrary. The advantage of "<this-array>" is that it is the string representation of _THIS_ARRAY, but that's the only reason I chose that. DATAARRAY_VARIABLE or DATAARRAY_NAME have the value f"__xarray_dataarray_{type}__", but neither of them is actually part of the public API (I think?), which means they have the same issue as _THIS_ARRAY (not sure if that's actually a problem, though: they simply reference a string).

Collaborator

@max-sixty max-sixty Jun 21, 2021

I thought @keewis's idea of self from #5493 (comment) was good, to the extent that it could apply here

Edit: but then @dcherian pointed out this will fail if there's a dim called self!

Collaborator Author

I used this in xarray-contrib/pint-xarray#110, and for that at least it's actually an advantage that I have to pass the name. Not sure if that's the same for every other use case though (but defining the name explicitly is not much overhead so it should be fine)
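
The naming concern in this thread can be reproduced with released xarray through the public `DataArray.to_dataset` method. The snippet below is only an illustration, not code from this PR: the array, the coordinate called `"self"`, and the free name `"data"` are made up for the example.

```python
import xarray as xr

# Illustrative input: an unnamed array whose coordinate happens to be called "self".
da = xr.DataArray([1, 2], coords={"self": [0, 1]}, dims="self")

# Without a name, the conversion to a Dataset is refused.
try:
    da.to_dataset()
    unnamed_ok = True
except ValueError:
    unnamed_ok = False

# A fixed placeholder that matches an existing coordinate is refused as well.
try:
    da.to_dataset(name="self")
    clash_ok = True
except ValueError:
    clash_ok = False

# A caller-supplied name that is known to be free works.
ds = da.to_dataset(name="data")
```

Any hard-coded placeholder can clash in this way, which is one argument for letting the caller pass the name explicitly.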

    else:
        ds = obj

    result = func(ds, *args, **kwargs)

    if isinstance(obj, DataArray):
        if obj.name is None:
            result = result.rename({"<this-array>": None})
        result = dataset_to_dataarray(result)

    return result
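
For readers who want to try the round trip without the private helpers used above, here is a rough sketch built only on public API (`DataArray.to_dataset` and `Dataset.__getitem__`). It is an approximation under stated assumptions, not the PR's implementation: `call_on_dataset_sketch`, `PLACEHOLDER`, and `scale` are hypothetical names, and the sketch assumes `func` keeps the variable name unchanged.

```python
import xarray as xr

# Hypothetical placeholder; the review thread notes that any fixed string is arbitrary.
PLACEHOLDER = "<this-array>"


def call_on_dataset_sketch(func, obj, *args, **kwargs):
    """Public-API approximation of the wrapper (assumption: not the PR's code)."""
    if not isinstance(obj, xr.DataArray):
        # Datasets are passed through unchanged.
        return func(obj, *args, **kwargs)

    # Unnamed arrays cannot become data variables, so use a temporary name.
    name = PLACEHOLDER if obj.name is None else obj.name
    result = func(obj.to_dataset(name=name), *args, **kwargs)

    if isinstance(result, xr.Dataset):
        # Pull the single variable back out and restore the original name.
        result = result[name]
        result.name = obj.name
    return result


def scale(ds):
    # Example function that only makes sense for Dataset-like input.
    return ds * 2


unnamed = xr.DataArray([1, 2], dims="x")
doubled = call_on_dataset_sketch(scale, unnamed)
```

The sketch raises a `ValueError` if the placeholder collides with a coordinate name, which is the failure mode discussed in the review comments.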


def cov(da_a, da_b, dim=None, ddof=1):
    """
    Compute covariance between two DataArray objects along a shared dimension.
44 changes: 44 additions & 0 deletions xarray/tests/test_computation.py
@@ -468,6 +468,50 @@ def test_apply_groupby_add():
add(data_array.groupby("y"), data_array.groupby("x"))


@pytest.mark.parametrize(
    ["obj", "expected"],
    (
        pytest.param(
            xr.DataArray(
                [0, 1],
                coords={
                    "x": ("x", [-1, 1], {"a": 1, "b": 2}),
                    "u": ("x", [2, 3], {"c": 3}),
                },
                dims="x",
                attrs={"d": 4, "e": 5},
            ),
            xr.DataArray([0, 1], coords={"x": [-1, 1], "u": ("x", [2, 3])}, dims="x"),
            id="DataArray",
        ),
        pytest.param(
            xr.Dataset(
                {"a": ("x", [1, 2], {"a": 1, "b": 2}), "b": ("x", [0, 1], {"c": 3})},
                coords={
                    "x": ("x", [-1, 1], {"d": 4, "e": 5}),
                    "u": ("x", [2, 3], {"f": 6}),
                },
            ),
            xr.Dataset(
                {"a": ("x", [1, 2]), "b": ("x", [0, 1])},
                coords={"x": [-1, 1], "u": ("x", [2, 3])},
            ),
            id="Dataset",
        ),
    ),
)
def test_call_on_dataset(obj, expected):
    def clear_all_attrs(ds):
        new_ds = ds.copy()
        for var in new_ds.variables.values():
            var.attrs.clear()
        new_ds.attrs.clear()
        return new_ds

    actual = xr.call_on_dataset(clear_all_attrs, obj)
    assert_identical(actual, expected)
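
The behaviour this test checks can also be tried by hand with released xarray, since `xr.call_on_dataset` exists only on this branch. The sketch below performs the conversion manually for a named array; the name `"data"` and the concrete values are assumptions chosen for illustration.

```python
import xarray as xr


def clear_all_attrs(ds):
    # Same helper as in the test: drop attrs on every variable and on the Dataset.
    new_ds = ds.copy()
    for var in new_ds.variables.values():
        var.attrs.clear()
    new_ds.attrs.clear()
    return new_ds


da = xr.DataArray(
    [0, 1],
    coords={"x": ("x", [-1, 1], {"a": 1})},
    dims="x",
    attrs={"d": 4},
    name="data",  # illustrative name so that to_dataset() needs no argument
)

# Manual round trip: DataArray -> Dataset -> func -> DataArray.
result = clear_all_attrs(da.to_dataset())["data"]

expected = xr.DataArray([0, 1], coords={"x": [-1, 1]}, dims="x", name="data")
xr.testing.assert_identical(result, expected)
```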


def test_unified_dim_sizes():
    assert unified_dim_sizes([xr.Variable((), 0)]) == {}
    assert unified_dim_sizes([xr.Variable("x", [1]), xr.Variable("x", [1])]) == {"x": 1}