Add DataArray.pad, Dataset.pad, Variable.pad #3596

Merged
merged 25 commits into from
Mar 19, 2020
Changes from 23 commits
Commits
7c230aa
add pad method to Variable and add corresponding test
mark-boer Nov 5, 2019
5980234
move pad_with_fill value to dask_array_compat.py and make it default …
mark-boer Nov 18, 2019
b6a979b
add pad method to dataarray
mark-boer Nov 20, 2019
80abc3a
add docstrings for variable.pad and dataarray.pad
mark-boer Nov 28, 2019
ed3d88e
add tests for DataArray.pad
mark-boer Dec 3, 2019
d4e484d
improve pad method signature and support dictionaries as pad_options …
mark-boer Dec 4, 2019
65d7495
fix linting errors and remove typo from tests
mark-boer Dec 8, 2019
0d7f1a7
implement suggested changes: pad_width => padwidths, use pytest.mark.…
mark-boer Dec 8, 2019
1ee2950
move pad method to dataset
mark-boer Dec 28, 2019
11023c3
add helper function to variable.pad and fix some mypy errors
mark-boer Dec 29, 2019
3aae4ba
add some more tests for DataArray.pad and add docstrings to all pad m…
mark-boer Dec 31, 2019
742487e
Merge branch 'master' into feature/dataarray_pad
mark-boer Dec 31, 2019
314f007
add workaround for dask.pad mode=mean that converts integers to float…
mark-boer Jan 1, 2020
7515478
disable linear_ramp test and add pad to whats-new.rst and api.rst
mark-boer Jan 25, 2020
ba3f0a4
Merge branch 'master' into feature/dataarray_pad
mark-boer Jan 25, 2020
855c39e
fix small mege issue in test_unit
mark-boer Jan 26, 2020
d507d1d
fix DataArray.pad and Dataset.pad docstrings
mark-boer Jan 26, 2020
64ac8a2
implement suggested changes from code review: add option of integer p…
mark-boer Feb 12, 2020
71e11bb
apply isort and and set linear_ramp to xfail
mark-boer Feb 12, 2020
7060b07
Minor fixes.
dcherian Mar 5, 2020
588ff03
Merge remote-tracking branch 'upstream/master' into feature/dataarray…
mark-boer Mar 8, 2020
3e6f792
fix merge issue and make some minor changes as suggested in the code …
mark-boer Mar 8, 2020
6958da9
fix test_unit.test_pad_constant_values
mark-boer Mar 8, 2020
af0a4a1
Keewis review comments
dcherian Mar 18, 2020
f781f72
Add experimental warning
dcherian Mar 19, 2020
2 changes: 0 additions & 2 deletions doc/api-hidden.rst
@@ -379,7 +379,6 @@
Variable.min
Variable.no_conflicts
Variable.notnull
Variable.pad_with_fill_value
Variable.prod
Variable.quantile
Variable.rank
@@ -453,7 +452,6 @@
IndexVariable.min
IndexVariable.no_conflicts
IndexVariable.notnull
IndexVariable.pad_with_fill_value
IndexVariable.prod
IndexVariable.quantile
IndexVariable.rank
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -220,6 +220,7 @@ Reshaping and reorganizing
Dataset.to_stacked_array
Dataset.shift
Dataset.roll
Dataset.pad
Dataset.sortby
Dataset.broadcast_like

@@ -399,6 +400,7 @@ Reshaping and reorganizing
DataArray.to_unstacked_dataset
DataArray.shift
DataArray.roll
DataArray.pad
DataArray.sortby
DataArray.broadcast_like

2 changes: 2 additions & 0 deletions doc/whats-new.rst
@@ -125,6 +125,8 @@ Breaking changes

New Features
~~~~~~~~~~~~
- Implement :py:meth:`DataArray.pad` and :py:meth:`Dataset.pad`. (:issue:`2605`, :pull:`3596`).
By `Mark Boer <https://github.com/mark-boer>`_.
- :py:meth:`DataArray.sel` and :py:meth:`Dataset.sel` now support :py:class:`pandas.CategoricalIndex`. (:issue:`3669`)
By `Keisuke Fujii <https://github.com/fujiisoup>`_.
- Support using an existing, opened h5netcdf ``File`` with
47 changes: 47 additions & 0 deletions xarray/core/dask_array_compat.py
@@ -1,3 +1,4 @@
import warnings
from distutils.version import LooseVersion
from typing import Iterable

@@ -99,6 +100,52 @@ def meta_from_array(x, ndim=None, dtype=None):
    return meta


def _validate_pad_output_shape(input_shape, pad_width, output_shape):
    """ Validates the output shape of dask.array.pad, raising a RuntimeError if they do not match.
    In the current versions of dask (2.2/2.4), dask.array.pad with mode='reflect' sometimes returns
    an invalid shape.
    """
    isint = lambda i: isinstance(i, int)

    if isint(pad_width):
        pass
    elif len(pad_width) == 2 and all(map(isint, pad_width)):
        pad_width = sum(pad_width)
    elif (
        len(pad_width) == len(input_shape)
        and all(map(lambda x: len(x) == 2, pad_width))
        and all((isint(i) for p in pad_width for i in p))
    ):
        pad_width = np.sum(pad_width, axis=1)
    else:
        # unreachable: dask.array.pad should already have thrown an error
        raise ValueError("Invalid value for `pad_width`")

    if not np.array_equal(np.array(input_shape) + pad_width, output_shape):
        raise RuntimeError(
            "There seems to be something wrong with the shape of the output of dask.array.pad, "
            "try upgrading Dask, use a different pad mode e.g. mode='constant' or first convert "
            "your DataArray/Dataset to one backed by a numpy array by calling the `compute()` method."
            "See: https://github.com/dask/dask/issues/5303"
        )

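The shape check above amounts to adding the per-axis pad totals to the input shape. A minimal NumPy-only sketch of that arithmetic follows. `expected_padded_shape` is a hypothetical helper, not part of xarray or dask, and it assumes `numpy.pad`'s convention that a bare integer pads both sides of every axis.

```python
import numpy as np


def expected_padded_shape(input_shape, pad_width):
    """Hypothetical helper: the shape numpy.pad should return.

    Accepts the three pad_width spellings: an int, a (before, after)
    pair, or one (before, after) pair per axis.
    """
    # Normalize every spelling to an (ndim, 2) array of (before, after) pairs.
    widths = np.broadcast_to(np.asarray(pad_width), (len(input_shape), 2))
    # Each axis grows by before + after.
    return tuple(int(n) for n in np.asarray(input_shape) + widths.sum(axis=1))


x = np.zeros((3, 4))
for pw in (2, (1, 2), ((1, 2), (0, 3))):
    # Compare the prediction with what numpy.pad actually produces.
    assert np.pad(x, pw, mode="constant").shape == expected_padded_shape(x.shape, pw)
```

A check of this kind only compares shapes, so it can flag a wrongly sized result without computing any padded values.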

def pad(array, pad_width, mode="constant", **kwargs):
    padded = da.pad(array, pad_width, mode=mode, **kwargs)
    # workaround for inconsistency between numpy and dask: https://github.com/dask/dask/issues/5303
    if mode == "mean" and issubclass(array.dtype.type, np.integer):
        warnings.warn(
            'dask.array.pad(mode="mean") converts integers to floats. xarray converts '
            "these floats back to integers to keep the interface consistent. There is a chance that "
            "this introduces rounding errors. If you wish to keep the values as floats, first change "
            "the dtype to a float before calling pad.",
            UserWarning,
        )
        return da.round(padded).astype(array.dtype)
    _validate_pad_output_shape(array.shape, pad_width, padded.shape)
    return padded

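The `mode="mean"` workaround rounds the float result and casts it back to the input's integer dtype. The sketch below reproduces only that round-and-cast step with plain NumPy. `round_back_to_int` is a hypothetical name, and the float array stands in for what the warning text says dask returns for integer input.

```python
import numpy as np


def round_back_to_int(padded, original_dtype):
    """Hypothetical helper: round a float result and restore the integer dtype."""
    return np.round(padded).astype(original_dtype)


arr = np.array([1, 2, 4])
# Stand-in for a float-valued padded result: pad a float copy with the mean (7/3).
float_padded = np.pad(arr.astype(np.float64), 1, mode="mean")
restored = round_back_to_int(float_padded, arr.dtype)

print(float_padded)
print(restored, restored.dtype)
```

Here the pad value 7/3 becomes 2 after rounding, which is the kind of rounding error the warning mentions. Casting to a float dtype before calling `pad`, as the warning suggests, keeps the exact mean.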

if LooseVersion(dask_version) >= LooseVersion("2.8.1"):
    median = da.median
else:
164 changes: 164 additions & 0 deletions xarray/core/dataarray.py
@@ -3239,6 +3239,170 @@ def map_blocks(

        return map_blocks(func, self, args, kwargs)

    def pad(
        self,
        pad_width: Mapping[Hashable, Union[int, Tuple[int, int]]] = None,
        mode: str = "constant",
        stat_length: Union[
            int, Tuple[int, int], Mapping[Hashable, Tuple[int, int]]
        ] = None,
        constant_values: Union[
            int, Tuple[int, int], Mapping[Hashable, Tuple[int, int]]
        ] = None,
        end_values: Union[
            int, Tuple[int, int], Mapping[Hashable, Tuple[int, int]]
        ] = None,
        reflect_type: str = None,
        **pad_width_kwargs: Any,
    ) -> "DataArray":
        """Pad this array along one or more dimensions.

        When using one of the modes ("edge", "reflect", "symmetric", "wrap"),
        coordinates will be padded with the same mode, otherwise coordinates
        are padded using the "constant" mode with fill_value dtypes.NA.

        Parameters
        ----------
        pad_width : Mapping with the form of {dim: (pad_before, pad_after)}
            Number of values padded along each dimension.
            {dim: pad} is a shortcut for pad_before = pad_after = pad
        mode : str
            One of the following string values (taken from numpy docs)

            'constant' (default)
                Pads with a constant value.
            'edge'
                Pads with the edge values of array.
            'linear_ramp'
                Pads with the linear ramp between end_value and the
                array edge value.
            'maximum'
                Pads with the maximum value of all or part of the
                vector along each axis.
            'mean'
                Pads with the mean value of all or part of the
                vector along each axis.
            'median'
                Pads with the median value of all or part of the
                vector along each axis.
            'minimum'
                Pads with the minimum value of all or part of the
                vector along each axis.
            'reflect'
                Pads with the reflection of the vector mirrored on
                the first and last values of the vector along each
                axis.
            'symmetric'
                Pads with the reflection of the vector mirrored
                along the edge of the array.
            'wrap'
                Pads with the wrap of the vector along the axis.
                The first values are used to pad the end and the
                end values are used to pad the beginning.
        stat_length : int, tuple or mapping of the form {dim: tuple}
            Used in 'maximum', 'mean', 'median', and 'minimum'. Number of
            values at edge of each axis used to calculate the statistic value.
            {dim_1: (before_1, after_1), ... dim_N: (before_N, after_N)} unique
            statistic lengths along each dimension.
            ((before, after),) yields same before and after statistic lengths
            for each dimension.
            (stat_length,) or int is a shortcut for before = after = statistic
            length for all axes.
            Default is ``None``, to use the entire axis.
        constant_values : scalar, tuple or mapping of the form {dim: tuple}
            Used in 'constant'. The values to set the padded values for each
            axis.
            ``{dim_1: (before_1, after_1), ... dim_N: (before_N, after_N)}`` unique
            pad constants along each dimension.
            ``((before, after),)`` yields same before and after constants for each
            dimension.
            ``(constant,)`` or ``constant`` is a shortcut for ``before = after = constant`` for
            all dimensions.
            Default is 0.
        end_values : scalar, tuple or mapping of the form {dim: tuple}
            Used in 'linear_ramp'. The values used for the ending value of the
            linear_ramp and that will form the edge of the padded array.
            ``{dim_1: (before_1, after_1), ... dim_N: (before_N, after_N)}`` unique
            end values along each dimension.
            ``((before, after),)`` yields same before and after end values for each
            axis.
            ``(constant,)`` or ``constant`` is a shortcut for ``before = after = constant`` for
            all axes.
            Default is 0.
        reflect_type : {'even', 'odd'}, optional
            Used in 'reflect', and 'symmetric'. The 'even' style is the
            default with an unaltered reflection around the edge value. For
            the 'odd' style, the extended part of the array is created by
            subtracting the reflected values from two times the edge value.
        **pad_width_kwargs:
            The keyword arguments form of ``pad_width``.
            One of ``pad_width`` or ``pad_width_kwargs`` must be provided.

        Returns
        -------
        padded : DataArray
            DataArray with the padded coordinates and data.

        See also
        --------
        DataArray.shift, DataArray.roll, DataArray.bfill, DataArray.ffill, numpy.pad, dask.array.pad

        Notes
        -----
        By default when ``mode="constant"`` and ``constant_values=None``, integer types will be
        promoted to ``float`` and padded with ``np.nan``. To avoid type promotion
        specify ``constant_values=np.nan``

        Examples
        --------

        >>> arr = xr.DataArray([5, 6, 7], coords=[("x", [0,1,2])])
        >>> arr.pad(x=(1,2), constant_values=0)
        <xarray.DataArray (x: 6)>
        array([0, 5, 6, 7, 0, 0])
        Coordinates:
          * x        (x) float64 nan 0.0 1.0 2.0 nan nan

        >>> da = xr.DataArray([[0,1,2,3], [10,11,12,13]],
                              dims=["x", "y"],
                              coords={"x": [0,1], "y": [10, 20 ,30, 40], "z": ("x", [100, 200])}
            )
        >>> da.pad(x=1)
        <xarray.DataArray (x: 4, y: 4)>
        array([[nan, nan, nan, nan],
               [ 0.,  1.,  2.,  3.],
               [10., 11., 12., 13.],
               [nan, nan, nan, nan]])
        Coordinates:
          * x        (x) float64 nan 0.0 1.0 nan
          * y        (y) int64 10 20 30 40
            z        (x) float64 nan 100.0 200.0 nan
        >>> da.pad(x=1, constant_values=np.nan)
        <xarray.DataArray (x: 4, y: 4)>
        array([[-9223372036854775808, -9223372036854775808, -9223372036854775808,
Collaborator

Is this intended?

Contributor Author

Personally I don't like it, but it is the way that most xarray functions handle this. Although I should add a check for xarray.dtypes.NA. We have discussed adding xarray.dtypes.NA as the default value for the constant_values keyword. But unfortunately this kind of interferes with the fact that you cannot specify constant_values when you set the mode to anything other than "constant". So if you have any ideas, I'm all ears.

>>> da = xr.DataArray(np.arange(9).reshape(3,3), dims=("x", "y"))
>>> da.shift(x=1, fill_value=np.nan)
array([[-9223372036854775808, -9223372036854775808, -9223372036854775808],
       [                   0,                    1,                    2],
       [                   3,                    4,                    5]])
Dimensions without coordinates: x, y

>>> da.rolling(x=3).construct("new_axis", stride=3, fill_value=np.nan)
<xarray.DataArray (x: 1, y: 3, new_axis: 3)>
array([[[-9223372036854775808, -9223372036854775808, 0],
        [-9223372036854775808, -9223372036854775808, 1],
        [-9223372036854775808, -9223372036854775808, 2]]])
Dimensions without coordinates: x, y, new_axis

Collaborator

Yes. I think it currently works well now, but we should probably have a different example: passing np.nan is a mistake, and instead users can rely on the default and it'll coerce to float with dtypes.NA implied.

Is that right?

Contributor Author

Yes. That looks to me like a reasonable default.
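
The point agreed on in this thread, that an integer array has to be promoted to float before NaN can serve as the fill value, can be sketched with plain NumPy. This is only an illustration of the idea and not xarray's implementation: `pad_with_missing` is a hypothetical helper, and promoting every integer dtype to `float64` is a simplifying assumption.

```python
import numpy as np


def pad_with_missing(values, before, after):
    """Hypothetical helper: pad a 1-D array with NaN, promoting integers first."""
    values = np.asarray(values)
    # NaN has no integer representation, so promote integer input to float.
    if np.issubdtype(values.dtype, np.integer):
        values = values.astype(np.float64)
    return np.pad(values, (before, after), mode="constant", constant_values=np.nan)


out = pad_with_missing([5, 6, 7], before=1, after=2)
print(out, out.dtype)
```

Promoting first avoids the platform-dependent integer that appears when NaN is forced into an integer array, as in the `-9223372036854775808` values quoted above.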

                -9223372036854775808],
               [                   0,                    1,                    2,
                                   3],
               [                  10,                   11,                   12,
                                  13],
               [-9223372036854775808, -9223372036854775808, -9223372036854775808,
                -9223372036854775808]])
        Coordinates:
          * x        (x) float64 nan 0.0 1.0 nan
          * y        (y) int64 10 20 30 40
            z        (x) float64 nan 100.0 200.0 nan
        """
        ds = self._to_temp_dataset().pad(
            pad_width=pad_width,
            mode=mode,
            stat_length=stat_length,
            constant_values=constant_values,
            end_values=end_values,
            reflect_type=reflect_type,
            **pad_width_kwargs,
        )
        return self._from_temp_dataset(ds)

# this needs to be at the end, or mypy will confuse with `str`
# https://mypy.readthedocs.io/en/latest/common_issues.html#dealing-with-conflicting-names
str = property(StringAccessor)
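
The mode descriptions in the `pad` docstring are taken from the numpy docs, so the four modes that the docstring says are also applied to coordinates ("edge", "reflect", "symmetric", "wrap") can be previewed with plain `numpy.pad`. The sketch below is independent of xarray and uses a made-up three-element array.

```python
import numpy as np

values = np.array([1, 2, 3])
pad_width = (1, 2)  # one value before, two values after

# Pad the same array with each mode and collect the results for comparison.
results = {
    mode: np.pad(values, pad_width, mode=mode).tolist()
    for mode in ("edge", "reflect", "symmetric", "wrap")
}

for mode, padded in results.items():
    print(f"{mode:>9}: {padded}")
```

The difference between "reflect" and "symmetric" is whether the edge value itself is repeated: "reflect" mirrors around the first and last values, while "symmetric" mirrors along the array edge and so duplicates them.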