
REF: avoid try/except in wrapping in cython_agg_blocks #38164

Merged 4 commits on Dec 2, 2020
92 changes: 45 additions & 47 deletions pandas/core/groupby/generic.py
@@ -58,7 +58,7 @@
     validate_func_kwargs,
 )
 import pandas.core.algorithms as algorithms
-from pandas.core.arrays import ExtensionArray
+from pandas.core.arrays import Categorical, ExtensionArray
 from pandas.core.base import DataError, SpecificationError
 import pandas.core.common as com
 from pandas.core.construction import create_series_with_explicit_dtype
@@ -1026,38 +1026,64 @@ def _cython_agg_blocks(
         if numeric_only:
             data = data.get_numeric_data(copy=False)

-        no_result = object()
-
         def cast_agg_result(result, values: ArrayLike, how: str) -> ArrayLike:
             # see if we can cast the values to the desired dtype
             # this may not be the original dtype
             assert not isinstance(result, DataFrame)
-            assert result is not no_result

             dtype = maybe_cast_result_dtype(values.dtype, how)
             result = maybe_downcast_numeric(result, dtype)

-            if isinstance(values, ExtensionArray) and isinstance(result, np.ndarray):
-                # e.g. values was an IntegerArray
-                # (1, N) case can occur if values was Categorical
-                # and result is ndarray[object]
-                # TODO(EA2D): special casing not needed with 2D EAs
-                assert result.ndim == 1 or result.shape[0] == 1
-                try:
-                    # Cast back if feasible
-                    result = type(values)._from_sequence(
-                        result.ravel(), dtype=values.dtype
-                    )
Comment on lines -1047 to -1050:

Member: Why is this no longer needed?

Member Author: Because it's only relevant for Categorical, and never raises.

Member: So the comment that values could be IntegerArray wasn't correct anymore? And why is cast_agg_result no longer called with other extension arrays?

Member Author: IntegerArray is now handled correctly within _cython_operation.

Member: OK, I see, thanks for the explanation.

But so that means that for a generic EA (not implemented by ourselves), we won't try to cast back? (Not sure if there are good use cases for that, but still.)

Member Author: I don't understand how we can avoid that in general.

I'm not clear what this is referring to.

> That's potentially a regression? What's the reason for this? (Now, I don't know how much of the groupby algos support object dtype, or would actually be faster than the python versions if they would.)

We fall back to the SeriesGroupBy implementation, so there shouldn't be a regression.

Many of these are ordinal-only, so using something like values_for_argsort is likely to be better than casting to object.
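The "IntegerArray is now handled correctly within _cython_operation" point can be checked with a small usage sketch. This assumes a pandas version that includes this change; the data is hypothetical:

```python
import pandas as pd

# Nullable-integer values go through _cython_operation directly,
# so the aggregation keeps the Int64 extension dtype rather than
# being cast to object and back.
df = pd.DataFrame(
    {"key": ["a", "a", "b"], "v": pd.array([1, 2, None], dtype="Int64")}
)
out = df.groupby("key")["v"].sum()
print(out.dtype)  # expected to remain the nullable Int64 dtype
```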

Member: I was referring to your "i am trying to get rid of that altogether".

Member Author:

> I was referring to your "i am trying to get rid of that altogether"

Thanks.

For the functions that go through _cython_operation, there are only a handful of them, and we can identify which are dtype-preserving and which are not. Some custom logic is in dtypes.cast.maybe_cast_result_dtype, and the rest at this point is in _ea_wrap_cython_operation. Basically, anything that is "custom logic" represents a way we are treating pandas-internal EAs differently from 3rd-party EAs, which is something we want to avoid. Identifying and isolating that custom logic will make it easier to remove.

The only other place where the "try to cast back" that I am looking to get rid of altogether is used is in _aggregate_series_pure_python, and #38254 removes that (buggy) usage.
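The "identify which are dtype-preserving" idea can be sketched as a plain classification table. The names below are illustrative stand-ins, not the actual pandas internals:

```python
# Hypothetical classification of cython groupby ops; illustrative only.
# Ops whose output can safely be cast back to the input dtype
# (e.g. min on an ordered Categorical returns one of the inputs):
DTYPE_PRESERVING = {"min", "max", "first", "last"}
# Ops whose output dtype is inherently different from the input:
DTYPE_CHANGING = {"mean", "median", "var", "std", "count"}


def preserves_dtype(how: str) -> bool:
    """Return True if `how` is safe to cast back to the input dtype."""
    if how in DTYPE_PRESERVING:
        return True
    if how in DTYPE_CHANGING:
        return False
    raise ValueError(f"unknown op: {how}")
```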

Member: I don't know if we want no custom handling of our own EAs at all. For example, are the things you wrote now in _ea_wrap_cython_operation things you want to further remove?

Member Author:

> I don't know if we want no custom handling of our own EAs at all?

I mean, in the sense that we also want world peace and unlimited ice cream, yes. For the foreseeable future we're going to have some non-zero amount of custom handling.

> For example, are the things you wrote now in _ea_wrap_cython_operation things you want to further remove?

In the short run I'm adding handling for FloatingArray and improving the handling for IntegerArray, so not removing at all.

-                except (ValueError, TypeError):
-                    # reshape to be valid for non-Extension Block
-                    result = result.reshape(1, -1)
+            if isinstance(values, Categorical) and isinstance(result, np.ndarray):
+                # If the Categorical op didn't raise, it is dtype-preserving
+                result = type(values)._from_sequence(result.ravel(), dtype=values.dtype)
+                # Note this will have result.dtype == dtype from above
+
+            elif isinstance(result, np.ndarray) and result.ndim == 1:
+                # We went through a SeriesGroupBy path and need to reshape
+                # GH#32223 includes case with IntegerArray values
+                result = result.reshape(1, -1)
+                # test_groupby_duplicate_columns gets here with
+                # result.dtype == int64, values.dtype=object, how="min"

             return result
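The new Categorical branch relies on the fact that an op which does not raise on a Categorical is dtype-preserving, so casting back via `_from_sequence` is safe. A usage sketch of the observable behavior, with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "cat": pd.Categorical(
            ["low", "high", "low"], categories=["low", "high"], ordered=True
        ),
    }
)
# min on an ordered Categorical does not raise in the cython path,
# so the result is cast back to the original categorical dtype
out = df.groupby("key")["cat"].min()
print(out.dtype)  # expected: category
```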

+        def py_fallback(bvalues: ArrayLike) -> ArrayLike:

Contributor: Just a general note: inline functions are really hard to follow (of course, inline really long if/else blocks are even harder). Request would be to move this code to module-level functions (later is totally fine).

Member Author: Yeah, there are a few follow-ups in the works.

Member Author:

> (later is totally fine)

Yeah, let's do it in a follow-up rather than clog the CI.

+            # if self.grouper.aggregate fails, we fall back to a pure-python
+            # solution
+
+            # We get here with a) EADtypes and b) object dtype
+            obj: FrameOrSeriesUnion
+
+            # call our grouper again with only this block
+            if isinstance(bvalues, ExtensionArray):
+                # TODO(EA2D): special case not needed with 2D EAs
+                obj = Series(bvalues)
+            else:
+                obj = DataFrame(bvalues.T)
+                if obj.shape[1] == 1:
+                    # Avoid call to self.values that can occur in DataFrame
+                    # reductions; see GH#28949
+                    obj = obj.iloc[:, 0]
+
+            # Create SeriesGroupBy with observed=True so that it does
+            # not try to add missing categories if grouping over multiple
+            # Categoricals. This will be done later by self._reindex_output()
+            # Doing it here creates an error. See GH#34951
+            sgb = get_groupby(obj, self.grouper, observed=True)
+            result = sgb.aggregate(lambda x: alt(x, axis=self.axis))
+
+            assert isinstance(result, (Series, DataFrame))  # for mypy
+            # In the case of object dtype block, it may have been split
+            # in the operation. We un-split here.
+            result = result._consolidate()
+            assert isinstance(result, (Series, DataFrame))  # for mypy
+            assert len(result._mgr.blocks) == 1
+
+            # unwrap DataFrame to get array
+            result = result._mgr.blocks[0].values
+            return result
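The observed=True comment can be illustrated with a usage sketch (hypothetical data): when grouping on multiple Categoricals, observed=False would produce the full cartesian product of categories, while observed=True keeps only the combinations that actually occur:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "c1": pd.Categorical(["a", "a"], categories=["a", "b"]),
        "c2": pd.Categorical(["x", "y"], categories=["x", "y"]),
        "v": [1, 2],
    }
)
# observed=True keeps only the group combinations that actually occur;
# reindexing to the full category product is deferred to a later step
n_observed = len(df.groupby(["c1", "c2"], observed=True)["v"].sum())
n_full = len(df.groupby(["c1", "c2"], observed=False)["v"].sum())
print(n_observed, n_full)  # 2 observed combinations vs 4 in the full product
```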

         def blk_func(bvalues: ArrayLike) -> ArrayLike:

             try:
@@ -1075,35 +1101,7 @@ def blk_func(bvalues: ArrayLike) -> ArrayLike:
                 assert how == "ohlc"
                 raise

-                # We get here with a) EADtypes and b) object dtype
-                obj: FrameOrSeriesUnion
-                # call our grouper again with only this block
-                if isinstance(bvalues, ExtensionArray):
-                    # TODO(EA2D): special case not needed with 2D EAs
-                    obj = Series(bvalues)
-                else:
-                    obj = DataFrame(bvalues.T)
-                    if obj.shape[1] == 1:
-                        # Avoid call to self.values that can occur in DataFrame
-                        # reductions; see GH#28949
-                        obj = obj.iloc[:, 0]
-
-                # Create SeriesGroupBy with observed=True so that it does
-                # not try to add missing categories if grouping over multiple
-                # Categoricals. This will be done later by self._reindex_output()
-                # Doing it here creates an error. See GH#34951
-                sgb = get_groupby(obj, self.grouper, observed=True)
-                result = sgb.aggregate(lambda x: alt(x, axis=self.axis))
-
-                assert isinstance(result, (Series, DataFrame))  # for mypy
-                # In the case of object dtype block, it may have been split
-                # in the operation. We un-split here.
-                result = result._consolidate()
-                assert isinstance(result, (Series, DataFrame))  # for mypy
-                assert len(result._mgr.blocks) == 1
-
-                # unwrap DataFrame to get array
-                result = result._mgr.blocks[0].values
+                result = py_fallback(bvalues)

         return cast_agg_result(result, bvalues, how)
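The control flow of blk_func, try the cython path and fall back to a pure-python path on failure, can be sketched standalone. Here fast_agg and slow_agg are hypothetical stand-ins for self.grouper.aggregate and py_fallback, not the pandas internals:

```python
def fast_agg(values):
    # stand-in for the cython aggregation: rejects object-dtype-like input
    if any(isinstance(v, str) for v in values):
        raise NotImplementedError("not supported in the fast path")
    return min(values)


def slow_agg(values):
    # stand-in for py_fallback: always works, but slower
    return min(values)


def blk_func(values):
    try:
        return fast_agg(values)
    except NotImplementedError:
        # the named fallback function replaces the old inlined block,
        # which is what this PR refactors
        return slow_agg(values)
```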
