Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weighted quantile #6059

Merged
merged 52 commits into from
Mar 27, 2022
Merged
Show file tree
Hide file tree
Changes from 42 commits
Commits
Show all changes
52 commits
Select commit Hold shift + click to select a range
dcd1b24
Add weighted quantile
cjauvin Dec 10, 2021
875f766
Add weighted quantile to documentation
cjauvin Dec 10, 2021
217d2f0
Apply suggestions from code review
cjauvin Dec 10, 2021
278bed9
Apply suggestions from code review
cjauvin Dec 10, 2021
80ae229
Improve _weighted_quantile_type7_1d ufunc with suggestions
cjauvin Dec 11, 2021
77bb84e
Merge remote-tracking branch 'origin/main' into weighted-quantile
cjauvin Dec 11, 2021
58af567
Expand scope of invalid q value test
cjauvin Dec 11, 2021
15e3834
Fix weighted quantile with zero weights
cjauvin Dec 13, 2021
83e4210
Replace np.ones by xr.ones_like in weighted quantile test
cjauvin Dec 13, 2021
2237399
Process weighted quantile data with all nans
cjauvin Dec 13, 2021
b936e21
Fix operator precedence bug
cjauvin Dec 13, 2021
c94fa16
Merge branch 'main' into pr/6059
Illviljan Dec 29, 2021
ab810d7
Used effective sample size. Generalize to different quantile types su…
huard Jan 11, 2022
7bcf09e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 11, 2022
8427637
Apply suggestions from code review
huard Jan 12, 2022
3217962
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 12, 2022
7379d22
added missing Typing hints
huard Jan 12, 2022
c8871d1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 12, 2022
b26a5fc
update what's new and pep8 fixes
huard Jan 12, 2022
42ebcc2
merge
huard Jan 12, 2022
5aa22a4
Merge branch 'main' into weighted-quantile
huard Jan 12, 2022
abe253e
add docstring paragraph discussing weight interpretation
huard Jan 19, 2022
82147aa
recognize numpy names for quantile interpolation methods
huard Jan 19, 2022
784cedd
tweak to avoid warning with all nans data. simplify test
huard Jan 19, 2022
3ee62fd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 19, 2022
4d6a4fd
remove integers from quantile interpolation available methods
huard Jan 19, 2022
9f93f55
merge
huard Jan 19, 2022
8132320
Merge remote-tracking branch 'upstream/main' into weighted-quantile
huard Jan 19, 2022
4d7f5f5
remove merge artifacts
Illviljan Jan 19, 2022
c268ddd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 19, 2022
db706aa
[skip-ci] fix bad merge in whats-new
dcherian Jan 20, 2022
33ee96c
Add references
huard Jan 20, 2022
2ffd3d3
renamed htype argument to method in private functions
huard Feb 8, 2022
9806db8
Merge branch 'main' into weighted-quantile
huard Feb 8, 2022
15ee999
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 8, 2022
9559c87
Update xarray/core/weighted.py
huard Feb 9, 2022
73dde79
Add skipped test to verify equal weights quantile with methods
cjauvin Feb 17, 2022
59714af
Apply suggestions from code review
huard Feb 17, 2022
585b705
Update xarray/core/weighted.py
huard Feb 17, 2022
7443b82
modifications suggested by review: comments, remove align, clarify te…
huard Feb 17, 2022
42a6a49
adjust typing. resolve conflicts
huard Feb 17, 2022
2e0c16e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 17, 2022
4186a24
Apply suggestions from code review
huard Feb 21, 2022
1be1f92
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 21, 2022
5c251e0
use broadcast
mathause Feb 25, 2022
9060f8e
Merge branch 'main' into weighted-quantile
mathause Mar 4, 2022
c112d2c
move whatsnew entry
mathause Mar 4, 2022
d4ba8ee
Apply suggestions from code review
mathause Mar 4, 2022
fd0a54e
switch skipna determination
mathause Mar 4, 2022
8bd83f9
Merge branch 'weighted-quantile' of https://github.com/cjauvin/xarray…
mathause Mar 4, 2022
343b47e
use align and broadcast
mathause Mar 4, 2022
c298bd0
Merge branch 'main' into pr/6059
Illviljan Mar 27, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -943,6 +943,7 @@ Dataset

DatasetWeighted
DatasetWeighted.mean
DatasetWeighted.quantile
DatasetWeighted.sum
DatasetWeighted.std
DatasetWeighted.var
Expand All @@ -957,6 +958,7 @@ DataArray

DataArrayWeighted
DataArrayWeighted.mean
DataArrayWeighted.quantile
DataArrayWeighted.sum
DataArrayWeighted.std
DataArrayWeighted.var
Expand Down
8 changes: 7 additions & 1 deletion doc/user-guide/computation.rst
Original file line number Diff line number Diff line change
Expand Up @@ -263,7 +263,7 @@ Weighted array reductions

:py:class:`DataArray` and :py:class:`Dataset` objects include :py:meth:`DataArray.weighted`
and :py:meth:`Dataset.weighted` array reduction methods. They currently
support weighted ``sum``, ``mean``, ``std`` and ``var``.
support weighted ``sum``, ``mean``, ``std``, ``var`` and ``quantile``.

.. ipython:: python

Expand Down Expand Up @@ -291,6 +291,12 @@ Calculate the weighted mean:

weighted_prec.mean(dim="month")

Calculate the weighted quantile:

.. ipython:: python

weighted_prec.quantile(q=0.5, dim="month")

The weighted sum corresponds to:

.. ipython:: python
Expand Down
2 changes: 2 additions & 0 deletions doc/whats-new.rst
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,8 @@ New Features
~~~~~~~~~~~~
- New top-level function :py:func:`cross`. (:issue:`3279`, :pull:`5365`).
By `Jimmy Westling <https://github.com/illviljan>`_.
- Add ``quantile`` method to :py:class:`~core.weighted.DatasetWeighted` and :py:class:`~core.weighted.DataArrayWeighted` (:pull:`6059`).
huard marked this conversation as resolved.
Show resolved Hide resolved
By `Christian Jauvin <https://github.com/cjauvin>`_ and `David Huard <https://github.com/huard>`_.
- ``keep_attrs`` support for :py:func:`where` (:issue:`4141`, :issue:`4682`, :pull:`4687`).
By `Justus Magin <https://github.com/keewis>`_.
- Enable the limit option for dask array in the following methods :py:meth:`DataArray.ffill`, :py:meth:`DataArray.bfill`, :py:meth:`Dataset.ffill` and :py:meth:`Dataset.bfill` (:issue:`6112`)
Expand Down
212 changes: 209 additions & 3 deletions xarray/core/weighted.py
Original file line number Diff line number Diff line change
@@ -1,14 +1,25 @@
from __future__ import annotations

from typing import TYPE_CHECKING, Generic, Hashable, Iterable, cast
from typing import TYPE_CHECKING, Generic, Hashable, Iterable, Literal, Sequence, cast

import numpy as np

from . import duck_array_ops
from .computation import dot
from . import duck_array_ops, utils
from .computation import apply_ufunc, dot
from .npcompat import ArrayLike
from .pycompat import is_duck_dask_array
from .types import T_Xarray

# Weighted quantile methods are a subset of the numpy supported quantile methods.
QUANTILE_METHODS = Literal[
"linear",
"interpolated_inverted_cdf",
"hazen",
"weibull",
"median_unbiased",
"normal_unbiased",
]

_WEIGHTED_REDUCE_DOCSTRING_TEMPLATE = """
Reduce this {cls}'s data by a weighted ``{fcn}`` along some dimension(s).

Expand Down Expand Up @@ -56,6 +67,62 @@
New {cls} object with the sum of the weights over the given dimension.
"""

_WEIGHTED_QUANTILE_DOCSTRING_TEMPLATE = """
Apply a weighted ``quantile`` to this {cls}'s data along some dimension(s).

Weights are interpreted as *sampling weights* (or probability weights) and
describe how a sample is scaled to the whole population [1]_. Although there are
huard marked this conversation as resolved.
Show resolved Hide resolved
other possible interpretations for weights, *precision weights* describing the
precision of observations, or *frequency weights* counting the number of identical
observations, they are not implemented here.
huard marked this conversation as resolved.
Show resolved Hide resolved

For compatibility with NumPy's non-weighted ``quantile`` (which is used by
``DataArray.quantile`` and ``Dataset.quantile``), the only interpolation
method supported by this weighted version corresponds to the default "linear"
option of ``numpy.quantile``. This is "Type 7" option, described in Hyndman
and Fan (1996) [2]_. The implementation is largely inspired by a blog post
from A. Akinshin's [3]_.

Parameters
----------
q : float or sequence of float
Quantile to compute, which must be between 0 and 1 inclusive.
dim : str or sequence of str, optional
Dimension(s) over which to apply the weighted ``quantile``.
skipna : bool, optional
If True, skip missing values (as marked by NaN). By default, only
skips missing values for float dtypes; other dtypes either do not
have a sentinel missing value (int) or skipna=True has not been
implemented (object, datetime64 or timedelta64).
keep_attrs : bool, optional
If True, the attributes (``attrs``) will be copied from the original
object to the new one. If False (default), the new object will be
returned without attributes.

Returns
-------
quantiles : {cls}
New {cls} object with weighted ``quantile`` applied to its data and
the indicated dimension(s) removed.

See Also
--------
numpy.nanquantile, pandas.Series.quantile, Dataset.quantile
DataArray.quantile
cjauvin marked this conversation as resolved.
Show resolved Hide resolved
huard marked this conversation as resolved.
Show resolved Hide resolved

Notes
-----
Returns NaN if the ``weights`` sum to 0.0 along the reduced
dimension(s).

References
----------
.. [1] https://notstatschat.rbind.io/2020/08/04/weights-in-statistics/
.. [2] Hyndman, R. J. & Fan, Y. (1996). Sample Quantiles in Statistical Packages.
The American Statistician, 50(4), 361–365. https://doi.org/10.2307/2684934
.. [3] https://aakinshin.net/posts/weighted-quantiles
"""


if TYPE_CHECKING:
from .dataarray import DataArray
Expand Down Expand Up @@ -241,6 +308,130 @@ def _weighted_std(

return cast("DataArray", np.sqrt(self._weighted_var(da, dim, skipna)))

def _weighted_quantile(
self,
da: DataArray,
q: ArrayLike,
dim: Hashable | Iterable[Hashable] = None,
huard marked this conversation as resolved.
Show resolved Hide resolved
skipna: bool = None,
) -> DataArray:
"""Apply a weighted ``quantile`` to a DataArray along some dimension(s)."""

def _get_h(n: float, q: np.ndarray, method: QUANTILE_METHODS) -> np.ndarray:
"""Return the interpolation parameter."""
# Note that options are not yet exposed in the public API.
if method == "linear":
h = (n - 1) * q + 1
elif method == "interpolated_inverted_cdf":
h = n * q
elif method == "hazen":
h = n * q + 0.5
elif method == "weibull":
h = (n + 1) * q
elif method == "median_unbiased":
h = (n + 1 / 3) * q + 1 / 3
elif method == "normal_unbiased":
h = (n + 1 / 4) * q + 3 / 8
else:
raise ValueError(f"Invalid method: {method}.")
return h.clip(1, n)

def _weighted_quantile_1d(
data: np.ndarray,
weights: np.ndarray,
q: np.ndarray,
skipna: bool,
method: QUANTILE_METHODS = "linear",
) -> np.ndarray:
# This algorithm has been adapted from:
# https://aakinshin.net/posts/weighted-quantiles/#reference-implementation
is_nan = np.isnan(data)
if skipna:
# Remove nans from data and weights
not_nan = ~is_nan
data = data[not_nan]
weights = weights[not_nan]
elif is_nan.any():
# Return nan if data contains any nan
return np.full(q.size, np.nan)

# Filter out data (and weights) associated with zero weights, which also flattens them
nonzero_weights = weights != 0
data = data[nonzero_weights]
weights = weights[nonzero_weights]
n = data.size

if n == 0:
# Possibly empty after nan or zero weight filtering above
return np.full(q.size, np.nan)

# Kish's effective sample size
nw = weights.sum() ** 2 / (weights**2).sum()

# Sort data and weights
sorter = np.argsort(data)
data = data[sorter]
weights = weights[sorter]

# Normalize and sum the weights
weights = weights / weights.sum()
weights_cum = np.append(0, weights.cumsum())

# Vectorize the computation by transposing q with respect to weights
q = np.atleast_2d(q).T

# Get the interpolation parameter for each q
h = _get_h(nw, q, method)

# Find the samples contributing to the quantile computation (at *positions* between (h-1)/nw and h/nw)
u = np.maximum((h - 1) / nw, np.minimum(h / nw, weights_cum))

# Compute their relative weight
v = u * nw - h + 1
w = np.diff(v)

# Apply the weights
return (data * w).sum(axis=1)

if da.shape != self.weights.shape:
raise ValueError("da and weights must have the same shape")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think da and self.weights must be broadcast-able and not necessarily have the same shape. Potentially apply_ufunc takes care of this? (But I am not sure and would need to test).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to explicitly call align to make sure coordinates for da and weights match.

Copy link
Collaborator

@mathause mathause Feb 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still disagree with this - align="inner" will take care of the alignment. Or you need to convince me with a counter-example. I am still sure the following has to work:

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6).reshape(3, 2))
w = xr.DataArray([1, 1, 1])

da.weighted(w).quantile(0.5)

Check also:

da = xr.DataArray(np.arange(6).reshape(3, 2), coords={"dim_0": [1, 2, 3]})
w = xr.DataArray([1, 1, 1, 100_000], coords={"dim_0": [0, 1, 2, 3]})

da.weighted(w).mean()

or if you absolutely disagree with the second example do align="exact" this should allow the first one but not the second.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the principle, but in practice I'm having trouble getting it to work. Your first example doesn't work out of the box. I get

ValueError: operand to apply_ufunc has required core dimensions ['dim_0', 'dim_1'], but some of these dimensions are absent on an input variable: ['dim_1']


q = np.atleast_1d(np.asarray(q, dtype=np.float64))

if q.ndim > 1:
raise ValueError("q must be a scalar or 1d")

if np.any((q < 0) | (q > 1)):
raise ValueError("q values must be between 0 and 1")

if dim is None:
dim = da.dims

if utils.is_scalar(dim):
dim = [dim]

# To satisfy mypy
dim = cast(Sequence, dim)

result = apply_ufunc(
_weighted_quantile_1d,
da,
self.weights,
input_core_dims=[dim, dim],
output_core_dims=[["quantile"]],
output_dtypes=[np.float64],
join="inner",
dask_gufunc_kwargs=dict(output_sizes={"quantile": len(q)}),
dask="parallelized",
vectorize=True,
kwargs={"q": q, "skipna": skipna},
)

result = result.transpose("quantile", ...)
result = result.assign_coords(quantile=q).squeeze()

return result

def _implementation(self, func, dim, **kwargs):

raise NotImplementedError("Use `Dataset.weighted` or `DataArray.weighted`")
Expand Down Expand Up @@ -310,6 +501,19 @@ def std(
self._weighted_std, dim=dim, skipna=skipna, keep_attrs=keep_attrs
)

def quantile(
self,
q: ArrayLike,
*,
dim: Hashable | Sequence[Hashable] | None = None,
keep_attrs: bool = None,
skipna: bool = True,
) -> T_Xarray:

return self._implementation(
self._weighted_quantile, q=q, dim=dim, skipna=skipna, keep_attrs=keep_attrs
)

def __repr__(self):
"""provide a nice str repr of our Weighted object"""

Expand Down Expand Up @@ -360,6 +564,8 @@ def _inject_docstring(cls, cls_name):
cls=cls_name, fcn="std", on_zero="NaN"
)

cls.quantile.__doc__ = _WEIGHTED_QUANTILE_DOCSTRING_TEMPLATE.format(cls=cls_name)


_inject_docstring(DataArrayWeighted, "DataArray")
_inject_docstring(DatasetWeighted, "Dataset")
Loading