Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistencies between python/cython groupby.agg behavior #35928

Closed
wants to merge 26 commits into from
Closed
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
4c5eddd
REF: remove unnecesary try/except
jbrockmendel Aug 21, 2020
c632c9f
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Aug 21, 2020
9e64be3
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Aug 21, 2020
42649fb
TST: add test for agg on ordered categorical cols (#35630)
mathurk1 Aug 21, 2020
47121dd
TST: resample does not yield empty groups (#10603) (#35799)
tkmz-n Aug 21, 2020
1decb3e
revert accidental rebase
jbrockmendel Aug 22, 2020
57c5dd3
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 22, 2020
a358463
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 23, 2020
ffa7ad7
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 23, 2020
e5e98d4
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 24, 2020
408db5a
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 24, 2020
d3493cf
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 25, 2020
75a805a
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 25, 2020
9f61070
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 25, 2020
2d10f6e
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 26, 2020
3e20187
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 26, 2020
51205a5
REF/BUG: don't go through cython for EA indexes
jbrockmendel Aug 27, 2020
f453c5b
Implement _aggregate_maybe_named
jbrockmendel Aug 27, 2020
2ae2124
de-duplicate
jbrockmendel Aug 27, 2020
98a91a3
avoid passing RangeIndex to libreduction
jbrockmendel Aug 27, 2020
5f73b03
Merge branch 'master' of https://github.com/pandas-dev/pandas into ca…
jbrockmendel Aug 30, 2020
065fc69
Merge branch 'master' of https://github.com/pandas-dev/pandas into ca…
jbrockmendel Sep 1, 2020
c230f72
simplify
jbrockmendel Sep 2, 2020
bf2e171
Merge branch 'master' of https://github.com/pandas-dev/pandas into ca…
jbrockmendel Sep 5, 2020
a4bcf43
Merge branch 'master' of https://github.com/pandas-dev/pandas into ca…
jbrockmendel Sep 8, 2020
607aea8
Merge branch 'master' of https://github.com/pandas-dev/pandas into ca…
jbrockmendel Sep 13, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
78 changes: 67 additions & 11 deletions pandas/core/groupby/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,14 @@
get_groupby,
)
from pandas.core.groupby.numba_ import generate_numba_func, split_for_numba
from pandas.core.indexes.api import Index, MultiIndex, all_indexes_same
from pandas.core.indexes.api import (
DatetimeIndex,
Index,
MultiIndex,
PeriodIndex,
TimedeltaIndex,
all_indexes_same,
)
import pandas.core.indexes.base as ibase
from pandas.core.internals import BlockManager, make_block
from pandas.core.series import Series
Expand Down Expand Up @@ -262,17 +269,46 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
if self.grouper.nkeys > 1:
return self._python_agg_general(func, *args, **kwargs)

try:
return self._python_agg_general(func, *args, **kwargs)
except (ValueError, KeyError):
# TODO: KeyError is raised in _python_agg_general,
# see see test_groupby.test_basic
result = self._aggregate_named(func, *args, **kwargs)
if isinstance(
self._selected_obj.index, (DatetimeIndex, TimedeltaIndex, PeriodIndex)
):
# using _python_agg_general would end up incorrectly patching
# _index_data in reduction.pyx
result = self._aggregate_maybe_named(func, *args, **kwargs)
else:
try:
return self._python_agg_general(func, *args, **kwargs)
except (ValueError, KeyError):
# TODO: KeyError is raised in _python_agg_general,
# see see test_groupby.test_basic
result = self._aggregate_maybe_named(func, *args, **kwargs)

index = Index(sorted(result), name=self.grouper.names[0])
if isinstance(index, (DatetimeIndex, TimedeltaIndex)):
# TODO: do we _always_ want to do this?
# shouldnt this be done later in eg _wrap_aggregated_output?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd it looks like this whole block L288-L307 can be replaced by setting index = self.grouper.result_index without breaking any tests. That de-facto works for existing tests, but I'd like to confirm that it works in general. do you have a good read on if it does?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say go for it if it passes the tests - it's certainly a lot simpler

index = index._with_freq("infer")

result_index = self.grouper.result_index

if (
result_index.dtype == index.dtype
and result_index.freq is not None
and index.freq is None
):
# TODO: will dtype equality always hold?
if len(index) == 1:
index.freq = result_index.freq

elif len(index) == 2:
if index[0] + result_index.freq == index[1]:
# infer_freq doesn't handle length-2 indexes
index.freq = result_index.freq

ret = create_series_with_explicit_dtype(
result, index=index, dtype_if_empty=object
)
ret.name = self._selected_obj.name # test_metadata_propagation_indiv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pass the name in L284


if not self.as_index: # pragma: no cover
print("Warning, ignoring as_index=True")
Expand Down Expand Up @@ -475,14 +511,34 @@ def _get_index() -> Index:
)
return self._reindex_output(result)

def _aggregate_named(self, func, *args, **kwargs):
def _aggregate_maybe_named(self, func, *args, **kwargs):
"""
Try the named-aggregator first, then unnamed, which better matches
what libreduction does.
"""
try:
return self._aggregate_named(func, *args, named=True, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this actually doing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trying to track down an answer to why i did this, may take a bit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK now i recall. in e.g. test_apply_columns_multilevel we do:

        cols = pd.MultiIndex.from_tuples([("A", "a", "", "one"), ("B", "b", "i", "two")])
        ind = date_range(start="2017-01-01", freq="15Min", periods=8)
        df = DataFrame(np.array([0] * 16).reshape(8, 2), index=ind, columns=cols)
        agg_dict = {col: (np.sum if col[3] == "one" else np.mean) for col in df.columns}
        result = df.resample("H").apply(lambda x: agg_dict[x.name](x))

so the function we are passing to apply is lambda x: agg_dict[x.name](x) depends on the Series name x.name but what we're actually doing ATM (and since this is tested, i guess it means we support it on purpose?) is patching the Series name to match the group name so that we end up applying a different function to each group.

But the user could also pass a function that depends on the non-patched Series name, so we end up guessing which regime we're in, which this is doing.

If it were up to me I'd shoot this name-patching thing into the sun.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grr, this is really complex.

except KeyError:
return self._aggregate_named(func, *args, named=False, **kwargs)

def _aggregate_named(self, func, *args, named: bool = True, **kwargs):
result = {}

for name, group in self:
group.name = name
for name, group in self: # TODO: could we have duplicate names?
if named:
group.name = name

output = func(group, *args, **kwargs)
if isinstance(output, (Series, Index, np.ndarray)):
raise ValueError("Must produce aggregated value")
if (
isinstance(output, Series)
and len(output) == 1
and name in output.index
):
# FIXME: kludge for test_resampler_grouper.test_apply
output = output.iloc[0]
else:
raise ValueError("Must produce aggregated value")
result[name] = output

return result
Expand Down
6 changes: 4 additions & 2 deletions pandas/core/groupby/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@
from pandas.core.frame import DataFrame
from pandas.core.generic import NDFrame
from pandas.core.groupby import base, grouper
from pandas.core.indexes.api import Index, MultiIndex, ensure_index
from pandas.core.indexes.api import Index, MultiIndex, RangeIndex, ensure_index
from pandas.core.series import Series
from pandas.core.sorting import (
compress_group_index,
Expand Down Expand Up @@ -620,8 +620,10 @@ def agg_series(
# TODO: can we get a performant workaround for EAs backed by ndarray?
return self._aggregate_series_pure_python(obj, func)

elif obj.index._has_complex_internals:
elif obj.index._has_complex_internals or isinstance(obj.index, RangeIndex):
# Preempt TypeError in _aggregate_series_fast
# exclude RangeIndex because patching it in libreduction would
# silently be incorrect
return self._aggregate_series_pure_python(obj, func)

try:
Expand Down