
ENH: add Series & DataFrame .agg/.aggregate #14668

Merged: 3 commits into pandas-dev:master on Apr 14, 2017

Conversation

@jreback (Contributor) commented Nov 16, 2016

  • to provide convenient function application that mimics the groupby(..).agg/.aggregate
    interface
  • .apply is now a synonym for .agg/.aggregate, and will accept dict/list-likes
    for aggregations
  • automatic handling of both reductive and transformation functions, e.g. Series.agg(['min', 'sqrt'])
  • interpretation of string function names on a Series (with a fallback to numpy), e.g. sqrt, log

closes #1623
closes #14464

custom .describe: I included these issues because it is now quite easy to do a custom .describe (see the sketch below).
closes #14483
closes #7014
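
As an illustration (not part of the diff itself), here is a minimal sketch of a custom .describe built on the new API; the statistics chosen are arbitrary:

import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': 5})

# pick exactly the summary statistics you want, including ones that
# .describe() does not show (e.g. 'sum'); each name is a reducing method
df.agg(['count', 'mean', 'std', 'min', 'median', 'max', 'sum'])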

TODO:

Series:

In [2]: s = Series(range(6))

In [3]: s
Out[3]: 
0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [4]: s.agg(['min', 'max'])
Out[4]: 
min    0
max    5
dtype: int64

In [5]: s.agg(['sqrt', 'min'])
ValueError: cannot combine transform and aggregation operations

In [6]: s.agg({'foo' : 'min'})
Out[6]: 
foo    0
dtype: int64

In [7]: s.agg({'foo' : ['min','max']})
Out[7]: 
     foo
min    0
max    5

In [8]: s.agg({'foo' : ['min','max'], 'bar' : ['sum', 'mean']})
Out[8]: 
      foo   bar
max   5.0   NaN
mean  NaN   2.5
min   0.0   NaN
sum   NaN  15.0

DataFrame:

In [8]: df = pd.DataFrame({'A': range(5), 'B': 5})

In [9]: df
Out[9]: 
   A  B
0  0  5
1  1  5
2  2  5
3  3  5
4  4  5

In [10]: df.agg(['min', 'max'])
Out[10]: 
     A  B
min  0  5
max  4  5

In [11]: df.agg({'A': ['min', 'max'], 'B': ['sum', 'max']})
Out[11]: 
       A     B
max  4.0   5.0
min  0.0   NaN
sum  NaN  25.0

# with transforming functions, df.agg is equivalent to df.transform
In [15]: df.transform([np.sqrt, np.abs, lambda x: x**2])
Out[15]: 
          A                           B                  
       sqrt absolute <lambda>      sqrt absolute <lambda>
0  0.000000        0        0  2.236068        5       25
1  1.000000        1        1  2.236068        5       25
2  1.414214        2        4  2.236068        5       25
3  1.732051        3        9  2.236068        5       25
4  2.000000        4       16  2.236068        5       25

Not sure what to do in cases like this; we could skip the offending columns, or maybe raise a better message.

In [16]: df = pd.DataFrame({'A': range(5), 'B': 5, 'C':'foo'})

In [17]: df
Out[17]: 
   A  B    C
0  0  5  foo
1  1  5  foo
2  2  5  foo
3  3  5  foo
4  4  5  foo

In [18]: df.transform(['log', 'abs'])
AttributeError: 'str' object has no attribute 'log'

@jreback (Contributor, Author) commented Nov 16, 2016

@jorisvandenbossche (Member) commented Nov 16, 2016

@jreback Cool!

Some quick feedback based on your examples above:

  • I would keep agg/aggregate strictly for aggregations (so only allow functions that reduce the values to a single scalar). IMO this will make the scope of this function easier to grasp and leave fewer corner cases (e.g. no varying output shape depending on what kind of function is passed).
  • This would mean that apply is not an exact synonym, and that both reductive and transformation functions are not automatically handled (but we can do this for apply, which would then be the more general and flexible version; maybe we could also have a transform method to fully mimic the groupby methods?).
  • What happens when you pass a single function (not in a list) like s.agg('min')? I suppose you get a scalar result like s.min()?

@codecov-io commented Nov 16, 2016

Codecov Report

Merging #14668 into master will decrease coverage by 0.08%.
The diff coverage is 98.52%.


@@            Coverage Diff             @@
##           master   #14668      +/-   ##
==========================================
- Coverage   91.11%   91.02%   -0.09%     
==========================================
  Files         145      145              
  Lines       50332    50391      +59     
==========================================
+ Hits        45858    45869      +11     
- Misses       4474     4522      +48
Flag Coverage Δ
#multiple 88.82% <98.52%> (-0.09%) ⬇️
#single 40.33% <35.29%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 96.28% <100%> (+0.02%) ⬆️
pandas/core/base.py 96.19% <100%> (+3.87%) ⬆️
pandas/core/series.py 95.08% <100%> (+0.09%) ⬆️
pandas/core/frame.py 97.65% <96.55%> (+0.07%) ⬆️
pandas/core/groupby.py 92.03% <0%> (-3.48%) ⬇️
pandas/core/algorithms.py 94.46% <0%> (-0.16%) ⬇️
pandas/types/cast.py 86.89% <0%> (+0.74%) ⬆️


@jreback (Contributor, Author) commented Nov 16, 2016

@jorisvandenbossche

In [2]: s = Series(range(6))

In [3]: s.agg('min')
Out[3]: 0

In [4]: s.agg(['min'])
Out[4]:
min    0
dtype: int64

so typically in groupby/rolling/resample we don't do transform operations, unless we explicitly want them (e.g. by using .transform).

With a Series/DataFrame OTOH, I think this is fairly common. .apply does both ATM. So it is easy enough to separate out this behavior: reductions to .agg and transforms to .transform.

I may actually have to do the computations before I can determine this, though (and raise an appropriate error).
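
Roughly, the kind of after-the-fact check this implies (a sketch only, not the actual implementation; is_scalar is exposed as pandas.api.types.is_scalar in later releases):

import pandas as pd

def _looks_like_aggregation(obj, result):
    # an aggregating function collapses the input to a scalar or to something
    # of a different length; a transform keeps the original length
    return pd.api.types.is_scalar(result) or len(result) != len(obj)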

@jreback (Contributor, Author) commented Nov 17, 2016

I updated the top section a bit.

Not sure what to do in cases like the mixed-dtype example above (df.transform(['log', 'abs']) on a frame with an object column); we could skip, or maybe raise a better message.

@jreback (Contributor, Author) commented Nov 17, 2016

we could add a numeric_only=True default.
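
In the meantime, a user can sidestep the mixed-dtype failure shown above by selecting the numeric columns first; a rough workaround sketch, not part of this PR:

import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': 5, 'C': 'foo'})

# restrict to numeric columns so the string-valued 'C' is left out
df.select_dtypes(include=['number']).transform(['sqrt', 'abs'])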

.. versionadded:: 0.19.2

Parameters
----------
Contributor Author:

haven't really edited these yet

@jorisvandenbossche (Member) left a comment:

Bunch of comments


.. versionadded:: 0.20.0

The aggregation APi allows one to express possibly multiple aggregation operations in a single concise way.
Member:

APi -> API


tsdf.agg(np.sum)

tsdf.agg('sum')
Member:

maybe show here that this is the same as tsdf.sum()?
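
For reference, the equivalence being asked for (tsdf is the frame from the docs example):

# both produce the same Series of per-column sums
tsdf.agg('sum')
tsdf.sum()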


.. ipython:: python

tsdf.A.agg({'foo' : ['sum', 'mean']})
Member:

This feels a bit strange. IMO the more logical example would be

In [11]: tsdf.A.agg({'foo' : 'sum', 'bar':'mean'})
Out[11]: 
foo   -2.019230
bar   -0.336538
Name: A, dtype: float64

Contributor Author:

ok I added that too (conceptually both are useful)


.. ipython:: python

tsdf.A.agg({'foo' : ['sum', 'mean'], 'bar': ['min', 'max', lambda x: x.sum()+1]})
Member:

Another option is that this would give a MultiIndexed series instead of a DataFrame:

So now with this PR:

In [12]: tsdf.A.agg({'foo' : ['sum', 'mean'], 'bar': ['min', 'max', lambda x: x.sum()+1]})
Out[12]: 
               foo       bar
<lambda>       NaN -1.019230
max            NaN  1.118963
mean     -0.336538       NaN
min            NaN -1.476450
sum      -2.019230       NaN

But could also give something like:

In [18]: tsdf.A.agg({'foo' : ['sum', 'mean'], 'bar': ['min', 'max', lambda x: x.sum()+1]}).stack().swaplevel(0,1).sort_index()
Out[18]: 
foo  mean       -0.336538
     sum        -2.019230
bar  <lambda>   -1.019230
     max         1.118963
     min        -1.476450
dtype: float64

Contributor Author:

yes, I was just trying to produce frames, but maybe this is more useful. Let me see what I can do.

Contributor Author:

changing this fails a couple of the groupby tests, so I'm going to leave it for now. I think the logic is simpler anyhow.


.. versionadded:: 0.20.0

The ``transform`` method returns an object that is indexed the same (same size)
Member:

transform -> :method:`~DataFrame.transform`

- function
- list of functions
- dict of columns -> functions
- nested dict of names -> dicts of functions
Member:

What is meant by this?

Contributor Author:

going to remove, I copied it from somewhere :>

result = self.agg(func, *args, **kwargs)
if is_scalar(result) or len(result) != len(self):
    raise ValueError("transforms cannot produce "
                     "aggregated results")
Member:

I think the aggregated result should be broadcasted to the full DataFrame? (as is done for groupby.transform)
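
For reference, this is the groupby.transform behaviour being referred to: an aggregated result is broadcast back to the original shape (a small illustrative example):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 3]})

# the per-group mean is repeated for every row of its group
df.groupby('A')['B'].transform('mean')
# 0    1.5
# 1    1.5
# 2    3.0
# Name: B, dtype: float64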

except TypeError:
    pass
if result is None:
    return self.apply(func, axis=axis, args=args, **kwargs)
Member:

As I said before, I don't think we should do this (allow non-aggregating results in agg, I suppose that is the meaning of result=None?)

Member:

Hmm, I see that this is also for aggregating lambdas ... and not only for non-aggregating functions

Contributor Author:

yeah it doesn't broadcast things

try:
    result = self.apply(func, *args, **kwargs)
except (ValueError, AttributeError, TypeError):
    result = func(self, *args, **kwargs)
Member:

Do you have an example of where apply fails and this is needed? (i.e. is this covered in the test cases?)

Contributor Author:

yes, something like this (it's because Series.apply does row-by-row application)

In [4]: Series(range(5)).apply(lambda x: x-x.min())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-e65bceda88a4> in <module>()
----> 1 Series(range(5)).apply(lambda x: x-x.min())

/Users/jreback/miniconda3/envs/pandas/lib/python3.5/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2290             else:
   2291                 values = self.asobject
-> 2292                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2293 
   2294         if len(mapped) and isinstance(mapped[0], Series):

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66116)()

<ipython-input-4-e65bceda88a4> in <lambda>(x)
----> 1 Series(range(5)).apply(lambda x: x-x.min())

AttributeError: 'int' object has no attribute 'min'
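
The same function does work when called on the whole Series, which is what the fallback branch relies on (a quick check):

import pandas as pd

s = pd.Series(range(5))
f = lambda x: x - x.min()

# s.apply(f) raises AttributeError because each element is a plain int,
# but calling f on the whole Series is fine
f(s)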


result = self.frame.apply(np.sqrt)
assert_frame_equal(result, expected)

Member:

Can you also add frame.transform('sqrt') here?

@jreback (Contributor, Author) commented Dec 15, 2016

@jorisvandenbossche any more comments?

this is for 0.20, so it will be in master for a bit.

@jreback (Contributor, Author) commented Dec 26, 2016

any remaining comments? @TomAugspurger @wesm @sinhrks

@jorisvandenbossche all of your points were addressed.

This may not be perfect and may need tweaking, but I think banging on it in master for a while is useful.

@jreback (Contributor, Author) commented Feb 14, 2017

To fully illustrate the open points (on .groupby), here is a simple set of examples.

setup

In [11]: df = pd.DataFrame({'A':[1,1,1,2,2],'B':range(5),'C':range(5)})

In [12]: df
Out[12]: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  2  3  3
4  2  4  4

1. agg on a DataFrame vs renaming from a Series

In [13]: df.groupby('A').agg({'B':'mean'})
Out[13]: 
     B
A     
1  1.0
2  3.5

In [14]: df.groupby('A').B.agg({'avg':'mean'})
Out[14]: 
   avg
A     
1  1.0
2  3.5

2. nested syntax is a bit much

In [16]: df.groupby('A').agg({'B':{'foo':'mean','bar':'count'},'C':{'foo2':'mean'}})
Out[16]: 
     B        C
   foo bar foo2
A              
1  1.0   3  1.0
2  3.5   2  3.5

3. this is useful as well (ideally this would be: df.groupby('A').agg(avg='mean'))

In [22]: df.groupby('A').agg({col: {'avg':'mean'} for col in df.columns.difference(['A'])}).sort_index(axis=1)
Out[22]: 
     B    C
   avg  avg
A          
1  1.0  1.0
2  3.5  3.5
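
For what it's worth, keyword-based renaming of exactly this form ("named aggregation") did arrive in later pandas releases (0.25+); a sketch of that style, outside the scope of this PR:

# rename while aggregating a selected column
df.groupby('A').B.agg(avg='mean')

# or name the output columns directly on the frame-level agg
df.groupby('A').agg(avg_b=('B', 'mean'), n_b=('B', 'count'))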

@jreback force-pushed the agg2 branch 2 times, most recently from 73432fa to be3c065 on March 9, 2017, 21:22

.. ipython:: python

tsdf.A.agg({'foo' : 'sum', 'bar': 'mean'})
Contributor:

extra space after 'foo'

@jreback (Contributor, Author) commented Mar 25, 2017

Since we haven't had much conversation on this, I think we should simply merge as is.

If there is an objection to the current .groupby aggregation API, then that is a separate and distinct issue; see #14668 (comment).

Certainly open to changing / fixing this, but we need a concrete proposal.

I don't see any harm in exactly replicating a current, well-defined API (even if there are some issues around the edges).

Otherwise forward progress is lost.

@wesm (Member) commented Mar 27, 2017

Sorry for my radio silence on this. Since there have been many opinions expressed let me dig in to the questions raised and try to weigh in with an additional perspective.

@jreback force-pushed the agg2 branch 4 times, most recently from 3f0a869 to f08371b on April 3, 2017, 21:56
@jreback (Contributor, Author) commented Apr 6, 2017

mini-example

In [9]: df = DataFrame({'A':[1,2,3,4]})

In [10]: df.groupby(df.index // 2).agg(['sum', 'min', 'max'])
Out[10]: 
    A        
  sum min max
0   3   1   2
1   7   3   4

Current

In [11]: df.agg(['sum', 'min', 'max'])
Out[11]: 
      A
sum  10
min   1
max   4

In [12]: df.describe()
Out[12]: 
              A
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

Alternate

In [13]: df.agg(['sum', 'min', 'max']).T
Out[13]: 
   sum  min  max
A   10    1    4

In [14]: df.agg(['sum', 'min', 'max']).unstack()
Out[14]: 
A  sum    10
   min     1
   max     4
dtype: int64

jreback and others added 3 commits April 14, 2017 09:38
function application that mimics the groupby(..).agg/.aggregate
interface

.apply is now a synonym for .agg, and will accept dict/list-likes
for aggregations

CLN: rename .name attr -> ._selection_name from SeriesGroupby for compat (didn't exist on DataFrameGroupBy)
resolves conflicts w.r.t. setting .name on a groupby object

closes pandas-dev#1623
closes pandas-dev#14464

custom .describe
closes pandas-dev#14483
closes pandas-dev#15015
closes pandas-dev#7014
@jreback merged commit 8b40453 into pandas-dev:master on Apr 14, 2017
@jreback (Contributor, Author) commented Apr 14, 2017

merging.

@eduardochaves1 (Contributor) commented:

After all these years, here I am... I was building a dashboard for my job and wanted to show some statistics for the revenues DataFrame using Streamlit, but to add a sum row to the df.describe() output I had to do it manually.

So in my opinion there should at least be a parameter to enable it, because I know that in most cases we don't want sum. But letting the user have the option would be great.
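
A workaround along those lines using the .agg API from this PR (a sketch; revenues is an illustrative name for the commenter's numeric DataFrame):

import pandas as pd

# append a 'sum' row to the standard describe() output
summary = pd.concat([revenues.describe(), revenues.agg(['sum'])])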

Labels: API Design, Enhancement, Reshaping
8 participants