BUG: aggregations were getting overwritten if they had the same name #30858

Merged
merged 34 commits into from
Jul 14, 2020
Changes from 25 commits
Commits
20049c1
:bug: aggregations were getting overwritten if they had the same name
Jan 9, 2020
ab685fd
:art: shorten test for the sake of legibility
Jan 21, 2020
e38e450
:art: handle empty in , make whatsnewentry public-facing
Jan 21, 2020
cb849a2
:pencil: move whatsnew entry to v1.1.0
Jan 23, 2020
521bc1d
remove accidentally added whatsnewentry
MarcoGorelli Feb 2, 2020
ec93c4f
Merge branch 'master' into multiple-aggregations
MarcoGorelli Mar 3, 2020
6f9aac8
Update v1.1.0.rst
MarcoGorelli Mar 3, 2020
a8e9121
remove dataframe constructor
Mar 4, 2020
b857c6d
Dict instead of Mapping
Mar 4, 2020
44d00df
Merge branch 'master' into multiple-aggregations
MarcoGorelli Mar 5, 2020
523effb
Merge remote-tracking branch 'upstream/master' into multiple-aggregat…
MarcoGorelli Mar 15, 2020
552063a
remove no longer necessary setting of random seed
MarcoGorelli Mar 15, 2020
5e2e7d2
Merge remote-tracking branch 'upstream/master' into multiple-aggregat…
MarcoGorelli Apr 19, 2020
40f7e31
don't return slice in concat
MarcoGorelli Apr 19, 2020
f8f2d7f
Add test containing ohlc
MarcoGorelli Apr 19, 2020
dba7dde
Add named aggregation resample test, add to whatsnew
MarcoGorelli Apr 19, 2020
1b43ed1
revert empty line change
MarcoGorelli Apr 19, 2020
868a680
remove 30092 from whatsnew as the issue is already fixed in 1.0.3 and…
MarcoGorelli Apr 19, 2020
5d7f3db
Merge branch 'master' into multiple-aggregations
MarcoGorelli May 2, 2020
14b2402
catch performancewarning in test
MarcoGorelli May 2, 2020
829dce8
Merge remote-tracking branch 'upstream/master' into multiple-aggregat…
MarcoGorelli May 3, 2020
3469f5d
Merge remote-tracking branch 'upstream/master' into multiple-aggregat…
MarcoGorelli May 9, 2020
862b39e
make test same as in OP
MarcoGorelli May 10, 2020
5e3f333
make test match OP exactly
MarcoGorelli May 10, 2020
e7629f3
Merge remote-tracking branch 'upstream/master' into multiple-aggregat…
MarcoGorelli May 13, 2020
51158ef
split into two tests
MarcoGorelli May 18, 2020
447dfea
split into two tests
MarcoGorelli May 18, 2020
2693956
Merge remote-tracking branch 'upstream/master' into multiple-aggregat…
MarcoGorelli May 18, 2020
aa988a4
add test with namedtuple
MarcoGorelli May 27, 2020
7a62f5f
better layout
MarcoGorelli May 27, 2020
d80ddc5
better layout
MarcoGorelli May 27, 2020
4f954d4
Merge remote-tracking branch 'upstream/master' into multiple-aggregat…
MarcoGorelli Jun 27, 2020
62d91d1
dont special case empty output
MarcoGorelli Jun 27, 2020
fb3ba5c
Merge remote-tracking branch 'upstream/master' into multiple-aggregat…
MarcoGorelli Jul 14, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -824,6 +824,7 @@ Reshaping
 - Bug in :func:`crosstab` when inputs are two Series and have tuple names, the output will keep dummy MultiIndex as columns. (:issue:`18321`)
 - :meth:`DataFrame.pivot` can now take lists for ``index`` and ``columns`` arguments (:issue:`21425`)
 - Bug in :func:`concat` where the resulting indices are not copied when ``copy=True`` (:issue:`29879`)
+- Bug in :meth:`SeriesGroupBy.aggregate` was resulting in aggregations being overwritten when they shared the same name (:issue:`30880`)
Contributor

FYI: the link to this method won't render, since SeriesGroupBy isn't in the pandas namespace.

Member Author

Sorry about that - will make sure to build the whatsnew file in the future to check

 - Bug where :meth:`Index.astype` would lose the name attribute when converting from ``Float64Index`` to ``Int64Index``, or when casting to an ``ExtensionArray`` dtype (:issue:`32013`)
 - :meth:`Series.append` will now raise a ``TypeError`` when passed a DataFrame or a sequence containing Dataframe (:issue:`31413`)
 - :meth:`DataFrame.replace` and :meth:`Series.replace` will raise a ``TypeError`` if ``to_replace`` is not an expected type. Previously the ``replace`` would fail silently (:issue:`18634`)
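
The overwrite described in the whatsnew entry for GH 30880 can be illustrated with a minimal sketch in plain Python. It assumes, as in the pre-fix `_aggregate_multiple_funcs` in this PR, that results are stored in a dict keyed only by the callable's name. `quantile` and `callable_name` are hypothetical stand-ins written for this illustration, not pandas functions.

```python
from functools import partial


def quantile(values, q):
    """Hypothetical nearest-rank quantile, standing in for np.quantile."""
    ordered = sorted(values)
    return ordered[round(q * (len(ordered) - 1))]


def callable_name(func):
    """Hypothetical helper: unwrap functools.partial to find the function name."""
    while isinstance(func, partial):
        func = func.func
    return func.__name__


funcs = [partial(quantile, q=0.9), partial(quantile, q=0.1)]
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Keying results by name alone: both partials resolve to "quantile",
# so the second result silently replaces the first.
results = {}
for func in funcs:
    results[callable_name(func)] = func(data)

print(results)
```

Only one entry survives even though two aggregations were requested, which is the symptom the PR title describes.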
13 changes: 8 additions & 5 deletions pandas/core/groupby/generic.py
@@ -282,7 +282,7 @@ def aggregate(
         if isinstance(ret, dict):
             from pandas import concat

-            ret = concat(ret, axis=1)
+            ret = concat(ret.values(), axis=1, keys=[key.label for key in ret.keys()])
         return ret

     agg = aggregate
@@ -311,8 +311,8 @@ def _aggregate_multiple_funcs(self, arg):

             arg = zip(columns, arg)

-        results = {}
-        for name, func in arg:
+        results: Dict[base.OutputKey, Union[Series, DataFrame]] = {}
+        for idx, (name, func) in enumerate(arg):
             obj = self

             # reset the cache so that we
@@ -321,13 +321,16 @@ def _aggregate_multiple_funcs(self, arg):
                 obj = copy.copy(obj)
                 obj._reset_cache()
                 obj._selection = name
-            results[name] = obj.aggregate(func)
+            results[base.OutputKey(label=name, position=idx)] = obj.aggregate(func)

         if any(isinstance(x, DataFrame) for x in results.values()):
             # let higher level handle
             return results

-        return self.obj._constructor_expanddim(results, columns=columns)
+        if not results:
+            return DataFrame()
Contributor

Hmm, is this correct? Do we have tests that hit this? I would think we would have something, e.g. columns, even if this is empty.

Contributor

also why is this not just handled in wrap_aggregated_output?

Member Author

Here's a test that hits it: pandas/tests/groupby/aggregate/test_aggregate.py::TestNamedAggregationSeries::test_no_args_raises

When trying to move this to wrap_aggregated_output I ran into #34977, so I'll try to address that first

Contributor

This is still quite fishy. If you pass an empty result to self._wrap_aggregated_output, what do you get as output? I really don't like special cases like this, which inevitably hide errors and make grokking code way more complex.

So I'd prefer to have _wrap_aggregated_output handle this correctly. You may not even need L333; it's possible to pass columns to _wrap_aggregated_output.

Member Author
MarcoGorelli, Jun 25, 2020

@jreback if we pass {} to _wrap_aggregated_output we get a KeyError.

Here's the traceback:

============================= test session starts ==============================
platform linux -- Python 3.8.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
rootdir: /home/marco/pandas-dev, inifile: setup.cfg
plugins: xdist-1.32.0, cov-2.10.0, asyncio-0.12.0, hypothesis-5.16.1, forked-1.1.2
collected 1 item

pandas/tests/groupby/aggregate/test_aggregate.py F                       [100%]

=================================== FAILURES ===================================
________________ TestNamedAggregationSeries.test_no_args_raises ________________

self = <pandas.tests.groupby.aggregate.test_aggregate.TestNamedAggregationSeries object at 0x7f8835d975b0>

    def test_no_args_raises(self):
        gr = pd.Series([1, 2]).groupby([0, 1])
        with pytest.raises(TypeError, match="Must provide"):
            gr.agg()
    
        # but we do allow this
>       result = gr.agg([])

pandas/tests/groupby/aggregate/test_aggregate.py:555: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/groupby/generic.py:247: in aggregate
    ret = self._aggregate_multiple_funcs(func)
pandas/core/groupby/generic.py:328: in _aggregate_multiple_funcs
    output = self._wrap_aggregated_output(results)
pandas/core/groupby/generic.py:387: in _wrap_aggregated_output
    result = self._wrap_series_output(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandas.core.groupby.generic.SeriesGroupBy object at 0x7f8835d97ac0>
output = {}, index = Int64Index([0, 1], dtype='int64')

    def _wrap_series_output(
        self, output: Mapping[base.OutputKey, Union[Series, np.ndarray]], index: Index,
    ) -> Union[Series, DataFrame]:
        """
        Wraps the output of a SeriesGroupBy operation into the expected result.
    
        Parameters
        ----------
        output : Mapping[base.OutputKey, Union[Series, np.ndarray]]
            Data to wrap.
        index : pd.Index
            Index to apply to the output.
    
        Returns
        -------
        Series or DataFrame
    
        Notes
        -----
        In the vast majority of cases output and columns will only contain one
        element. The exception is operations that expand dimensions, like ohlc.
        """
        indexed_output = {key.position: val for key, val in output.items()}
        columns = Index(key.label for key in output)
    
        result: Union[Series, DataFrame]
        if len(output) > 1:
            result = self.obj._constructor_expanddim(indexed_output, index=index)
            result.columns = columns
        else:
            result = self.obj._constructor(
>               indexed_output[0], index=index, name=columns[0]
            )
E           KeyError: 0

pandas/core/groupby/generic.py:362: KeyError
-------------- generated xml file: /tmp/tmp-31663hvopHHCRFEu.xml ---------------
=========================== short test summary info ============================
FAILED pandas/tests/groupby/aggregate/test_aggregate.py::TestNamedAggregationSeries::test_no_args_raises
============================== 1 failed in 0.22s ===============================

The problem is this line, which accesses [0] on an empty object.
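
The failure in the traceback can be reproduced without pandas. The sketch below is a simplified stand-in for the single-output branch of `_wrap_series_output`, under the assumption that only the dict handling matters; `first_result` is a hypothetical name, and `OutputKey` here is a local namedtuple mirroring the `label`/`position` fields used in the diff.

```python
from collections import namedtuple

# Local stand-in for the OutputKey used in the diff (label plus position).
OutputKey = namedtuple("OutputKey", ["label", "position"])


def first_result(output):
    """Simplified stand-in for the single-output branch of _wrap_series_output."""
    indexed_output = {key.position: val for key, val in output.items()}
    # With an empty `output`, `indexed_output` is {} and this lookup fails.
    return indexed_output[0]


missing_key = None
try:
    first_result({})
except KeyError as err:
    missing_key = err.args[0]

print("KeyError for key:", missing_key)
```

With a single entry at position 0 the same function returns normally, so the error is specific to the empty case hit by `gr.agg([])`.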

Contributor

I would just fix this; you need to check len(indexed_output).

Member Author

So,

elif len(indexed_output):
    result = self.obj._constructor(
        indexed_output[0], index=index, name=columns[0]
    )
else:
    result = self.obj._constructor()

?

I can do that, but then I'll still have to address #34977 when the output of _wrap_aggregated_output is passed to self.obj._constructor_expanddim(results, columns=columns).

> it's possible to pass columns to _wrap_aggregated_output

Are you sure? It seems to only take one argument (other than self)

Member Author

@jreback would

elif len(indexed_output):
    result = self.obj._constructor(
        indexed_output[0], index=index, name=columns[0]
    )
else:
    return None

be an acceptable solution?

Member Author

Actually,

        elif not columns.empty:
            result = self.obj._constructor(
                indexed_output[0], index=index, name=columns[0]
            )
        else:
            result = self.obj._constructor_expanddim()

works, because

pd.DataFrame(pd.DataFrame(), columns=[])

is allowed.

No need to modify the return types like this :)
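
The three-way branch proposed in this comment can be sketched with plain Python containers. This is only an illustration of the control flow, assuming tuples as stand-ins for the Series and DataFrame results; `wrap_output` is a hypothetical name, and a plain list truthiness check plays the role of `not columns.empty`.

```python
from collections import namedtuple

# Local stand-in for the OutputKey used in the diff (label plus position).
OutputKey = namedtuple("OutputKey", ["label", "position"])


def wrap_output(output):
    """Return a tagged tuple describing which branch handled `output`."""
    indexed_output = {key.position: val for key, val in output.items()}
    columns = [key.label for key in output]

    if len(output) > 1:
        # Several outputs: frame-like result, one column per position.
        values = [indexed_output[i] for i in range(len(output))]
        return ("frame", columns, values)
    elif columns:
        # Exactly one output: series-like result named after its label.
        return ("series", columns[0], indexed_output[0])
    else:
        # No outputs: return an empty frame-like result instead of indexing [0].
        return ("empty_frame",)


print(wrap_output({}))
print(wrap_output({OutputKey("quantile", 0): 10}))
```

The empty case never reaches the `[0]` lookup, so the return type stays the same as in the other branches and no special case is needed in the caller.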

+        output = self._wrap_aggregated_output(results)
+        return self.obj._constructor_expanddim(output, columns=columns)

     def _wrap_series_output(
         self, output: Mapping[base.OutputKey, Union[Series, np.ndarray]], index: Index,
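
The core of the change in generic.py is keying results by label and position rather than by label alone. A minimal sketch of that idea follows, assuming a local namedtuple in place of `base.OutputKey` and made-up integer results in place of real aggregation output.

```python
from collections import namedtuple

# Local stand-in for base.OutputKey (label plus position).
OutputKey = namedtuple("OutputKey", ["label", "position"])

names = ["quantile", "quantile"]  # two aggregations sharing a name
values = [10, 2]                  # hypothetical aggregation results

results = {}
for idx, (name, value) in enumerate(zip(names, values)):
    # The position makes each key unique even when labels repeat.
    results[OutputKey(label=name, position=idx)] = value

# Mirrors the spirit of `keys=[key.label for key in ret.keys()]` in the diff:
# labels are recovered in insertion order and may contain duplicates.
labels = [key.label for key in results]
columns = list(zip(labels, results.values()))

print(columns)
```

Both results survive, and the duplicated label is preserved for the output columns rather than being collapsed by the dict.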
45 changes: 45 additions & 0 deletions pandas/tests/groupby/aggregate/test_aggregate.py
@@ -2,10 +2,13 @@
 test .agg behavior / note that .apply is tested generally in test_groupby.py
 """
 import functools
+from functools import partial

 import numpy as np
 import pytest

+from pandas.errors import PerformanceWarning
+
 from pandas.core.dtypes.common import is_integer_dtype

 import pandas as pd
@@ -252,6 +255,48 @@ def test_agg_multiple_functions_maintain_order(df):
     tm.assert_index_equal(result.columns, exp_cols)


+def test_agg_multiple_functions_same_name(df):
+    # GH 30880
+    df = pd.DataFrame(
+        np.random.randn(1000, 3),
+        index=pd.date_range("1/1/2012", freq="S", periods=1000),
+        columns=["A", "B", "C"],
+    )
+    result = df.resample("3T").agg(
+        {"A": [partial(np.quantile, q=0.9999), partial(np.quantile, q=0.1111)]}
+    )
+    expected_index = pd.date_range("1/1/2012", freq="3T", periods=6)
+    expected_columns = MultiIndex.from_tuples([("A", "quantile"), ("A", "quantile")])
+    expected_values = expected_values = np.array(
+        [df.resample("3T").A.quantile(q=q).values for q in [0.9999, 0.1111]]
+    ).T
+    expected = pd.DataFrame(
+        expected_values, columns=expected_columns, index=expected_index
+    )
+    tm.assert_frame_equal(result, expected)
+
+    # check what happens if ohlc (which expands dimensions) is present
+    result = df.resample("3T").agg(
+        {"A": ["ohlc", partial(np.quantile, q=0.9999), partial(np.quantile, q=0.1111)]}
+    )
+    expected_columns = pd.MultiIndex.from_tuples(
+        [
+            ("A", "ohlc", "open"),
+            ("A", "ohlc", "high"),
+            ("A", "ohlc", "low"),
+            ("A", "ohlc", "close"),
+            ("A", "quantile", "A"),
+            ("A", "quantile", "A"),
+        ]
+    )
+    expected_values = np.hstack([df.resample("3T").A.ohlc(), expected_values])
+    expected = pd.DataFrame(
+        expected_values, columns=expected_columns, index=expected_index
+    )
+    with tm.assert_produces_warning(PerformanceWarning):
+        tm.assert_frame_equal(result, expected)
+
+
 def test_multiple_functions_tuples_and_non_tuples(df):
     # #1359
     funcs = [("foo", "mean"), "std"]