
ENH: add Series & DataFrame .agg/.aggregate #14668

Merged: 3 commits into pandas-dev:master on Apr 14, 2017

Conversation

@jreback (Contributor) commented Nov 16, 2016

  • to provide convenient function application that mimics the groupby(..).agg/.aggregate
    interface
  • .apply is now a synonym for .agg/.aggregate, and will accept dict/list-likes
    for aggregations
  • automatic handling of both reductive and transformation functions, e.g. Series.agg(['min', 'sqrt'])
  • interpretation of string function names on a Series (with a fallback to numpy), e.g. sqrt, log

closes #1623
closes #14464

custom .describe: I included these issues because it is now quite easy to do a custom .describe (see the sketch below).
closes #14483
closes #7014
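
As an illustration (not part of the diff itself), here is a minimal sketch of a custom .describe built on the new API; the statistics chosen are arbitrary:

import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': 5})

# pick exactly the summary statistics you want, including ones that
# .describe() does not show (e.g. 'sum'); each name is a reducing method
df.agg(['count', 'mean', 'std', 'min', 'median', 'max', 'sum'])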

TODO:

Series:

In [2]: s = Series(range(6))

In [3]: s
Out[3]: 
0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [4]: s.agg(['min', 'max'])
Out[4]: 
min    0
max    5
dtype: int64

In [5]: s.agg(['sqrt', 'min'])
ValueError: cannot combine transform and aggregation operations

In [6]: s.agg({'foo' : 'min'})
Out[6]: 
foo    0
dtype: int64

In [7]: s.agg({'foo' : ['min','max']})
Out[7]: 
     foo
min    0
max    5

In [8]: s.agg({'foo' : ['min','max'], 'bar' : ['sum', 'mean']})
Out[8]: 
      foo   bar
max   5.0   NaN
mean  NaN   2.5
min   0.0   NaN
sum   NaN  15.0

DataFrame:

In [8]: df = pd.DataFrame({'A': range(5), 'B': 5})

In [9]: df
Out[9]: 
   A  B
0  0  5
1  1  5
2  2  5
3  3  5
4  4  5

In [10]: df.agg(['min', 'max'])
Out[10]: 
     A  B
min  0  5
max  4  5

In [11]: df.agg({'A': ['min', 'max'], 'B': ['sum', 'max']})
Out[11]: 
       A     B
max  4.0   5.0
min  0.0   NaN
sum  NaN  25.0

# with transforming functions, df.agg is equivalent to df.transform
In [15]: df.transform([np.sqrt, np.abs, lambda x: x**2])
Out[15]: 
          A                           B                  
       sqrt absolute <lambda>      sqrt absolute <lambda>
0  0.000000        0        0  2.236068        5       25
1  1.000000        1        1  2.236068        5       25
2  1.414214        2        4  2.236068        5       25
3  1.732051        3        9  2.236068        5       25
4  2.000000        4       16  2.236068        5       25

Not sure what to do in cases like this; we could skip the offending columns, or maybe raise a better message.

In [16]: df = pd.DataFrame({'A': range(5), 'B': 5, 'C':'foo'})

In [17]: df
Out[17]: 
   A  B    C
0  0  5  foo
1  1  5  foo
2  2  5  foo
3  3  5  foo
4  4  5  foo

In [18]: df.transform(['log', 'abs'])
AttributeError: 'str' object has no attribute 'log'

@jreback (Contributor, Author) commented Nov 16, 2016

@jorisvandenbossche (Member) commented Nov 16, 2016

@jreback Cool!

Some quick feedback based on your examples above:

  • I would keep agg/aggregate strictly for aggregations (so only allow functions that reduce the values to a single scalar). IMO this will make the scope of this function easier to grasp and leave fewer corner cases (e.g. no varying output shape depending on what kind of function is passed).
  • This would mean that apply is not an exact synonym, and that both reductive and transformation functions are not automatically handled (but we can do this for apply, which would then be the more general and flexible version; maybe we could also have a transform method to fully mimic the groupby methods?).
  • What happens when you pass a single function (not in a list) like s.agg('min')? I suppose you get a scalar result like s.min()?

@codecov-io commented Nov 16, 2016

Codecov Report

Merging #14668 into master will decrease coverage by 0.08%.
The diff coverage is 98.52%.


@@            Coverage Diff             @@
##           master   #14668      +/-   ##
==========================================
- Coverage   91.11%   91.02%   -0.09%     
==========================================
  Files         145      145              
  Lines       50332    50391      +59     
==========================================
+ Hits        45858    45869      +11     
- Misses       4474     4522      +48
Flag Coverage Δ
#multiple 88.82% <98.52%> (-0.09%) ⬇️
#single 40.33% <35.29%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 96.28% <100%> (+0.02%) ⬆️
pandas/core/base.py 96.19% <100%> (+3.87%) ⬆️
pandas/core/series.py 95.08% <100%> (+0.09%) ⬆️
pandas/core/frame.py 97.65% <96.55%> (+0.07%) ⬆️
pandas/core/groupby.py 92.03% <0%> (-3.48%) ⬇️
pandas/core/algorithms.py 94.46% <0%> (-0.16%) ⬇️
pandas/types/cast.py 86.89% <0%> (+0.74%) ⬆️


@jreback (Contributor, Author) commented Nov 16, 2016

@jorisvandenbossche

In [2]: s = Series(range(6))

In [3]: s.agg('min')
Out[3]: 0

In [4]: s.agg(['min'])
Out[4]:
min    0
dtype: int64

so typically in groupby/rolling/resample we don't do transform operations, unless we explicitly want them (e.g. by using .transform).

With a Series/DataFrame OTOH, I think this is fairly common. .apply does both ATM. So it is easy enough to separate out this behavior: reductions to .agg and transforms to .transform.

I may actually have to do the computations before I can determine this, though (and raise an appropriate error).
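
Roughly, the kind of after-the-fact check this implies (a sketch only, not the actual implementation; is_scalar is exposed as pandas.api.types.is_scalar in later releases):

import pandas as pd

def _looks_like_aggregation(obj, result):
    # an aggregating function collapses the input to a scalar or to something
    # of a different length; a transform keeps the original length
    return pd.api.types.is_scalar(result) or len(result) != len(obj)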

@jreback (Contributor, Author) commented Nov 17, 2016

I updated the top section a bit.

Not sure what to do in cases like the mixed-dtype example above (df.transform(['log', 'abs']) on a frame with an object column); we could skip, or maybe raise a better message.

@jreback (Contributor, Author) commented Nov 17, 2016

we could add a numeric_only=True default.
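
In the meantime, a user can sidestep the mixed-dtype failure shown above by selecting the numeric columns first; a rough workaround sketch, not part of this PR:

import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': 5, 'C': 'foo'})

# restrict to numeric columns so the string-valued 'C' is left out
df.select_dtypes(include=['number']).transform(['sqrt', 'abs'])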

.. versionadded:: 0.19.2

Parameters
----------
Contributor Author:

haven't really edited these yet

@jorisvandenbossche (Member) left a comment:

Bunch of comments


.. versionadded:: 0.20.0

The aggregation APi allows one to express possibly multiple aggregation operations in a single concise way.
Member:

APi -> API


tsdf.agg(np.sum)

tsdf.agg('sum')
Member:

maybe show here that this is the same as tsdf.sum()?
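
For reference, the equivalence being asked for (tsdf is the frame from the docs example):

# both produce the same Series of per-column sums
tsdf.agg('sum')
tsdf.sum()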


.. ipython:: python

tsdf.A.agg({'foo' : ['sum', 'mean']})
Member:

This feels a bit strange. IMO the more logical example would be

In [11]: tsdf.A.agg({'foo' : 'sum', 'bar':'mean'})
Out[11]: 
foo   -2.019230
bar   -0.336538
Name: A, dtype: float64

Contributor Author:

ok I added that too (conceptually both are useful)


.. ipython:: python

tsdf.A.agg({'foo' : ['sum', 'mean'], 'bar': ['min', 'max', lambda x: x.sum()+1]})
Member:

Another option is that this would give a MultiIndexed series instead of a DataFrame:

So now with this PR:

In [12]: tsdf.A.agg({'foo' : ['sum', 'mean'], 'bar': ['min', 'max', lambda x: x.sum()+1]})
Out[12]: 
               foo       bar
<lambda>       NaN -1.019230
max            NaN  1.118963
mean     -0.336538       NaN
min            NaN -1.476450
sum      -2.019230       NaN

But could also give something like:

In [18]: tsdf.A.agg({'foo' : ['sum', 'mean'], 'bar': ['min', 'max', lambda x: x.sum()+1]}).stack().swaplevel(0,1).sort_index()
Out[18]: 
foo  mean       -0.336538
     sum        -2.019230
bar  <lambda>   -1.019230
     max         1.118963
     min        -1.476450
dtype: float64

Contributor Author:

yes, I was just trying to produce frames, but maybe this is more useful. Let me see what I can do.

Contributor Author:

changing this fails a couple of the groupby tests, so I'm going to leave it for now. I think the logic is simpler anyhow.


.. versionadded:: 0.20.0

The ``transform`` method returns an object that is indexed the same (same size)
Member:

transform -> :method:`~DataFrame.transform`

- function
- list of functions
- dict of columns -> functions
- nested dict of names -> dicts of functions
Member:

What is meant by this?

Contributor Author:

going to remove, I copied it from somewhere :>

result = self.agg(func, *args, **kwargs)
if is_scalar(result) or len(result) != len(self):
    raise ValueError("transforms cannot produce "
                     "aggregated results")
Member:

I think the aggregated result should be broadcasted to the full DataFrame? (as is done for groupby.transform)
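
For reference, this is the groupby.transform behaviour being referred to: an aggregated result is broadcast back to the original shape (a small illustrative example):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 3]})

# the per-group mean is repeated for every row of its group
df.groupby('A')['B'].transform('mean')
# 0    1.5
# 1    1.5
# 2    3.0
# Name: B, dtype: float64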

except TypeError:
    pass
if result is None:
    return self.apply(func, axis=axis, args=args, **kwargs)
Member:

As I said before, I don't think we should do this (allow non-aggregating results in agg, I suppose that is the meaning of result=None?)

Member:

Hmm, I see that this is also for aggregating lambdas ... and not only for non-aggregating functions

Contributor Author:

yeah it doesn't broadcast things

try:
    result = self.apply(func, *args, **kwargs)
except (ValueError, AttributeError, TypeError):
    result = func(self, *args, **kwargs)
Member:

Do you have an example of where apply fails and this is needed? (i.e. is this covered in the test cases?)

Contributor Author:

yes, something like this (it's because Series.apply does row-by-row application)

In [4]: Series(range(5)).apply(lambda x: x-x.min())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-e65bceda88a4> in <module>()
----> 1 Series(range(5)).apply(lambda x: x-x.min())

/Users/jreback/miniconda3/envs/pandas/lib/python3.5/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2290             else:
   2291                 values = self.asobject
-> 2292                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2293 
   2294         if len(mapped) and isinstance(mapped[0], Series):

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66116)()

<ipython-input-4-e65bceda88a4> in <lambda>(x)
----> 1 Series(range(5)).apply(lambda x: x-x.min())

AttributeError: 'int' object has no attribute 'min'
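
The same function does work when called on the whole Series, which is what the fallback branch relies on (a quick check):

import pandas as pd

s = pd.Series(range(5))
f = lambda x: x - x.min()

# s.apply(f) raises AttributeError because each element is a plain int,
# but calling f on the whole Series is fine
f(s)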


result = self.frame.apply(np.sqrt)
assert_frame_equal(result, expected)

Member:

Can you also add frame.transform('sqrt') here?

@jreback (Contributor, Author) commented Dec 15, 2016

@jorisvandenbossche any more comments?

this is for 0.20, so it will be in master for a bit.

@jreback (Contributor, Author) commented Dec 26, 2016

any remaining comments? @TomAugspurger @wesm @sinhrks

@jorisvandenbossche all of your points were addressed.

This may not be perfect and may need tweaking, but I think banging on it in master for a while is useful.

@jreback (Contributor, Author) commented Feb 14, 2017

To fully illustrate the open points (on .groupby), here is a simple set of examples.

setup

In [11]: df = pd.DataFrame({'A':[1,1,1,2,2],'B':range(5),'C':range(5)})

In [12]: df
Out[12]: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  2  3  3
4  2  4  4

1. agg on a DataFrame vs renaming from a Series

In [13]: df.groupby('A').agg({'B':'mean'})
Out[13]: 
     B
A     
1  1.0
2  3.5

In [14]: df.groupby('A').B.agg({'avg':'mean'})
Out[14]: 
   avg
A     
1  1.0
2  3.5

2. nested syntax is a bit much

In [16]: df.groupby('A').agg({'B':{'foo':'mean','bar':'count'},'C':{'foo2':'mean'}})
Out[16]: 
     B        C
   foo bar foo2
A              
1  1.0   3  1.0
2  3.5   2  3.5

3. this is useful as well (ideally this would be: df.groupby('A').agg(avg='mean'))

In [22]: df.groupby('A').agg({col: {'avg':'mean'} for col in df.columns.difference(['A'])}).sort_index(axis=1)
Out[22]: 
     B    C
   avg  avg
A          
1  1.0  1.0
2  3.5  3.5
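
For what it's worth, keyword-based renaming of exactly this form ("named aggregation") did arrive in later pandas releases (0.25+); a sketch of that style, outside the scope of this PR:

# rename while aggregating a selected column
df.groupby('A').B.agg(avg='mean')

# or name the output columns directly on the frame-level agg
df.groupby('A').agg(avg_b=('B', 'mean'), n_b=('B', 'count'))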

@jreback force-pushed the agg2 branch 2 times, most recently from 73432fa to be3c065 on March 9, 2017, 21:22

.. ipython:: python

tsdf.A.agg({'foo' : 'sum', 'bar': 'mean'})
Contributor:

extra space after 'foo'

@jreback (Contributor, Author) commented Mar 25, 2017

Since we haven't had much conversation on this, I think we should simply merge as is.

If there is an objection to the current .groupby aggregation API, then that is a separate and distinct issue; see #14668 (comment).

Certainly open to changing / fixing this, but we need a concrete proposal.

I don't see any harm in exactly replicating a current, well-defined API (even if there are some issues around the edges).

Otherwise forward progress is lost.

@wesm (Member) commented Mar 27, 2017

Sorry for my radio silence on this. Since there have been many opinions expressed let me dig in to the questions raised and try to weigh in with an additional perspective.

@jreback force-pushed the agg2 branch 4 times, most recently from 3f0a869 to f08371b on April 3, 2017, 21:56
@jreback (Contributor, Author) commented Apr 6, 2017

mini-example

In [9]: df = DataFrame({'A':[1,2,3,4]})

In [10]: df.groupby(df.index // 2).agg(['sum', 'min', 'max'])
Out[10]: 
    A        
  sum min max
0   3   1   2
1   7   3   4

Current

In [11]: df.agg(['sum', 'min', 'max'])
Out[11]: 
      A
sum  10
min   1
max   4

In [12]: df.describe()
Out[12]: 
              A
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

Alternate

In [13]: df.agg(['sum', 'min', 'max']).T
Out[13]: 
   sum  min  max
A   10    1    4

In [14]: df.agg(['sum', 'min', 'max']).unstack()
Out[14]: 
A  sum    10
   min     1
   max     4
dtype: int64

jreback and others added 3 commits April 14, 2017 09:38
function application that mimics the groupby(..).agg/.aggregate
interface

.apply is now a synonym for .agg, and will accept dict/list-likes
for aggregations

CLN: rename .name attr -> ._selection_name from SeriesGroupby for compat (didn't exist on DataFrameGroupBy)
resolves conflicts w.r.t. setting .name on a groupby object

closes pandas-dev#1623
closes pandas-dev#14464

custom .describe
closes pandas-dev#14483
closes pandas-dev#15015
closes pandas-dev#7014
@jreback merged commit 8b40453 into pandas-dev:master on Apr 14, 2017
@jreback (Contributor, Author) commented Apr 14, 2017

merging.

@eduardochaves1 (Contributor) commented:

After all these years, here I am... I was building a dashboard for my job and wanted to show some statistics for the revenues DataFrame using Streamlit, but to add a sum row to the df.describe() output I had to do it manually.

So in my opinion there should at least be a parameter to enable it, because I know that in most cases we don't want sum. But letting the user have the option would be great.
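
A workaround along those lines using the .agg API from this PR (a sketch; revenues is an illustrative name for the commenter's numeric DataFrame):

import pandas as pd

# append a 'sum' row to the standard describe() output
summary = pd.concat([revenues.describe(), revenues.agg(['sum'])])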

Labels: API Design, Enhancement, Reshaping
8 participants