Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: masked ops for reductions (sum) #30982

Merged
merged 24 commits into from
Mar 27, 2020
Merged
Show file tree
Hide file tree
Changes from 18 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
9795f35
POC masked ops for reductions
jorisvandenbossche Jan 13, 2020
7fb1f88
Merge remote-tracking branch 'upstream/master' into masked-ops
jorisvandenbossche Feb 13, 2020
cd26920
fix mask for older numpy
jorisvandenbossche Feb 13, 2020
28cd331
also use in boolean
jorisvandenbossche Feb 13, 2020
6298fbd
add min_count support
jorisvandenbossche Feb 13, 2020
735a741
fix preserve_dtypes test
jorisvandenbossche Feb 13, 2020
6df454f
passthrough min_count for boolean as well
jorisvandenbossche Feb 13, 2020
5eb48d6
fix comment
jorisvandenbossche Feb 13, 2020
3ef1331
Merge remote-tracking branch 'upstream/master' into masked-ops
jorisvandenbossche Feb 20, 2020
d2230fd
add object to empty reduction test case
jorisvandenbossche Feb 20, 2020
19ac821
test platform int
jorisvandenbossche Feb 21, 2020
2776436
Test sum separately with platform int
jorisvandenbossche Feb 21, 2020
68b4dc2
Merge remote-tracking branch 'upstream/master' into masked-ops
jorisvandenbossche Mar 20, 2020
18d5bfa
share min_count checking helper function with nanops
jorisvandenbossche Mar 20, 2020
457efb1
Merge remote-tracking branch 'upstream/master' into masked-ops
jorisvandenbossche Mar 23, 2020
4df858f
type + add docstring for min_count
jorisvandenbossche Mar 23, 2020
f5120db
Merge remote-tracking branch 'upstream/master' into masked-ops
jorisvandenbossche Mar 25, 2020
76c5149
move sum algo from ops to array_algos
jorisvandenbossche Mar 25, 2020
476f768
Merge remote-tracking branch 'upstream/master' into masked-ops
jorisvandenbossche Mar 26, 2020
b2162dc
add Int64/boolean to some benchmarks
jorisvandenbossche Mar 26, 2020
d4746f5
add whatsnew
jorisvandenbossche Mar 26, 2020
d9c2cbf
add skipna default in function signature
jorisvandenbossche Mar 26, 2020
f8705c2
update type hint + deprivatize
jorisvandenbossche Mar 26, 2020
1a43e10
update another type hint
jorisvandenbossche Mar 26, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 47 additions & 0 deletions pandas/core/array_algos/masked_reductions.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
"""
masked_reductions.py is for reduction algorithms using a mask-based approach
for missing values.
"""

import numpy as np

from pandas._libs import missing as libmissing
from pandas.compat.numpy import _np_version_under1p17

from pandas.core.nanops import _below_min_count
jorisvandenbossche marked this conversation as resolved.
Show resolved Hide resolved


def sum(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid overriding a builtin-in name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid overriding a builtin-in name?

Do we care? In practice this is only used as masked_reductions.sum(..), so there is no risk of confusing it (eg numpy also has a sum ..).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't have a problem with renaming it to eg masked_sum, though (but all together it then becomes a bit long masked_reductions.masked_sum(..))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nansum? Not a strong enough opinion to be a blocker.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no nan's involved here ;), that's exactly the difference with nanops

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha, sounds good

values: np.ndarray, mask: np.ndarray, skipna: bool, min_count: int = 0,
):
"""
Sum for 1D masked array.

Parameters
----------
values : np.ndarray
Numpy array with the values (can be of any dtype that support the
operation).
mask : np.ndarray
Boolean numpy array (True values indicate missing values).
skipna : bool, default True
simonjayhawkins marked this conversation as resolved.
Show resolved Hide resolved
Whether to skip NA.
min_count : int, default 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

required >=0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a count, so yes

The required number of valid values to perform the operation. If fewer than
``min_count`` non-NA values are present the result will be NA.
"""
if not skipna:
if mask.any():
return libmissing.NA
else:
if _below_min_count(values.shape, None, min_count):
return libmissing.NA
return np.sum(values)
else:
if _below_min_count(values.shape, mask, min_count):
return libmissing.NA

if _np_version_under1p17:
return np.sum(values[~mask])
else:
return np.sum(values, where=~mask)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche This seems like a useful pattern for other reductions like min and max (maybe a special case is needed if we have a min_count?), wondering if it might make sense to generalize this to those? Could be more performant than using nanops, and could also be done in a way that's extensible to string dtype (where nanops can't be used)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's indeed the goal of this PR (I just wanted to limit it to a single function for the initial PR). For min/max, I think it will be better to write a separate function in module since they don't use min_count, but prod should be able to reuse this.
I was planning to do follow-up PRs the and of this week, but feel free to pick it up!

6 changes: 5 additions & 1 deletion pandas/core/arrays/boolean.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@
from pandas.core.dtypes.missing import isna, notna

from pandas.core import nanops, ops
from pandas.core.array_algos import masked_reductions
from pandas.core.indexers import check_array_indexer

from .masked import BaseMaskedArray
Expand Down Expand Up @@ -695,6 +696,9 @@ def _reduce(self, name: str, skipna: bool = True, **kwargs):
data = self._data
mask = self._mask

if name == "sum":
return masked_reductions.sum(data, mask, skipna=skipna, **kwargs)

# coerce to a nan-aware float if needed
if self._hasna:
data = self.to_numpy("float64", na_value=np.nan)
Expand All @@ -706,7 +710,7 @@ def _reduce(self, name: str, skipna: bool = True, **kwargs):
return libmissing.NA

# if we have numeric op that would result in an int, coerce to int if possible
if name in ["sum", "prod"] and notna(result):
if name == "prod" and notna(result):
jreback marked this conversation as resolved.
Show resolved Hide resolved
int_result = np.int64(result)
if int_result == result:
result = int_result
Expand Down
6 changes: 5 additions & 1 deletion pandas/core/arrays/integer.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@
from pandas.core.dtypes.missing import isna

from pandas.core import nanops, ops
from pandas.core.array_algos import masked_reductions
import pandas.core.common as com
from pandas.core.indexers import check_array_indexer
from pandas.core.ops import invalid_comparison
Expand Down Expand Up @@ -560,6 +561,9 @@ def _reduce(self, name: str, skipna: bool = True, **kwargs):
data = self._data
mask = self._mask

if name == "sum":
return masked_reductions.sum(data, mask, skipna=skipna, **kwargs)

# coerce to a nan-aware float if needed
# (we explicitly use NaN within reductions)
if self._hasna:
Expand All @@ -577,7 +581,7 @@ def _reduce(self, name: str, skipna: bool = True, **kwargs):

# if we have a preservable numeric op,
# provide coercion back to an integer type if possible
elif name in ["sum", "min", "max", "prod"]:
elif name in ["min", "max", "prod"]:
# GH#31409 more performant than casting-then-checking
result = com.cast_scalar_indexer(result)

Expand Down
37 changes: 31 additions & 6 deletions pandas/core/nanops.py
Original file line number Diff line number Diff line change
Expand Up @@ -1238,7 +1238,7 @@ def _maybe_null_out(
result: np.ndarray,
axis: Optional[int],
mask: Optional[np.ndarray],
shape: Tuple,
shape: Tuple[int],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does Tuple[int] describe a 1-tuple, or any-length-tuple?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, apparently that needs to be Tuple[int, ...] for a variable length tuple

min_count: int = 1,
) -> float:
"""
Expand All @@ -1260,16 +1260,41 @@ def _maybe_null_out(
# GH12941, use None to auto cast null
result[null_mask] = None
elif result is not NaT:
if mask is not None:
null_mask = mask.size - mask.sum()
else:
null_mask = np.prod(shape)
if null_mask < min_count:
if _below_min_count(shape, mask, min_count):
result = np.nan

return result


def _below_min_count(shape: Tuple[int], mask: Optional[np.ndarray], min_count: int):
jreback marked this conversation as resolved.
Show resolved Hide resolved
"""
Check for the `min_count` keyword. Returns True if below `min_count` (when
missing value should be returned from the reduction).

Parameters
----------
shape : tuple
The shape of the values (`values.shape`).
mask : ndarray or None
Boolean numpy array (typically of same shape as `shape`) or None.
min_count : int
Keyword passed through from sum/prod call.

Returns
-------
bool
"""
if min_count > 0:
if mask is None:
# no missing values, only check size
non_nulls = np.prod(shape)
else:
non_nulls = mask.size - mask.sum()
if non_nulls < min_count:
return True
return False


def _zero_out_fperr(arg):
# #18044 reference this behavior to fix rolling skew/kurt issue
if isinstance(arg, np.ndarray):
Expand Down
4 changes: 3 additions & 1 deletion pandas/tests/arrays/boolean/test_reduction.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,9 @@ def test_reductions_return_types(dropna, data, all_numeric_reductions):
if dropna:
s = s.dropna()

if op in ("sum", "prod"):
if op == "sum":
assert isinstance(getattr(s, op)(), np.int_)
elif op == "prod":
assert isinstance(getattr(s, op)(), np.int64)
elif op in ("min", "max"):
assert isinstance(getattr(s, op)(), np.bool_)
Expand Down
5 changes: 4 additions & 1 deletion pandas/tests/arrays/integer/test_dtypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,10 @@ def test_preserve_dtypes(op):

# op
result = getattr(df.C, op)()
assert isinstance(result, int)
if op == "sum":
assert isinstance(result, np.int64)
else:
assert isinstance(result, int)

# groupby
result = getattr(df.groupby("A"), op)()
Expand Down
29 changes: 18 additions & 11 deletions pandas/tests/reductions/test_reductions.py
Original file line number Diff line number Diff line change
Expand Up @@ -531,13 +531,14 @@ def test_sum_inf(self):
res = nanops.nansum(arr, axis=1)
assert np.isinf(res).all()

@pytest.mark.parametrize("dtype", ["float64", "Int64", "boolean", "object"])
@pytest.mark.parametrize("use_bottleneck", [True, False])
@pytest.mark.parametrize("method, unit", [("sum", 0.0), ("prod", 1.0)])
def test_empty(self, method, unit, use_bottleneck):
def test_empty(self, method, unit, use_bottleneck, dtype):
with pd.option_context("use_bottleneck", use_bottleneck):
# GH#9422 / GH#18921
# Entirely empty
s = Series([], dtype=object)
s = Series([], dtype=dtype)
# NA by default
result = getattr(s, method)()
assert result == unit
Expand All @@ -560,8 +561,14 @@ def test_empty(self, method, unit, use_bottleneck):
result = getattr(s, method)(skipna=True, min_count=1)
assert pd.isna(result)

result = getattr(s, method)(skipna=False, min_count=0)
assert result == unit

result = getattr(s, method)(skipna=False, min_count=1)
assert pd.isna(result)

# All-NA
s = Series([np.nan])
s = Series([np.nan], dtype=dtype)
# NA by default
result = getattr(s, method)()
assert result == unit
Expand All @@ -585,7 +592,7 @@ def test_empty(self, method, unit, use_bottleneck):
assert pd.isna(result)

# Mix of valid, empty
s = Series([np.nan, 1])
s = Series([np.nan, 1], dtype=dtype)
# Default
result = getattr(s, method)()
assert result == 1.0
Expand All @@ -604,22 +611,22 @@ def test_empty(self, method, unit, use_bottleneck):
result = getattr(s, method)(skipna=True, min_count=0)
assert result == 1.0

result = getattr(s, method)(skipna=True, min_count=1)
assert result == 1.0

# GH#844 (changed in GH#9422)
df = DataFrame(np.empty((10, 0)))
df = DataFrame(np.empty((10, 0)), dtype=dtype)
assert (getattr(df, method)(1) == unit).all()

s = pd.Series([1])
s = pd.Series([1], dtype=dtype)
result = getattr(s, method)(min_count=2)
assert pd.isna(result)

s = pd.Series([np.nan])
result = getattr(s, method)(skipna=False, min_count=2)
assert pd.isna(result)

s = pd.Series([np.nan], dtype=dtype)
result = getattr(s, method)(min_count=2)
assert pd.isna(result)

s = pd.Series([np.nan, 1])
s = pd.Series([np.nan, 1], dtype=dtype)
result = getattr(s, method)(min_count=2)
assert pd.isna(result)

Expand Down