API: Expanded resample #13961

Closed · wants to merge 10 commits

Conversation

@chris-b1 (Contributor) commented Aug 11, 2016

My only question here is whether on= is the right keyword name for picking a column - in a TimeGrouper that's called key. But on is what was used in the rolling selection (and elsewhere), so it seems consistent. (A comparison of the two spellings follows the examples below.)

In [63]: df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
   ....:                    'a': np.arange(5)},
   ....:                   index=pd.MultiIndex.from_arrays([
   ....:                            [1,2,3,4,5],
   ....:                            pd.date_range('2015-01-01', freq='W', periods=5)],
   ....:                        names=['v','d']))
   ....: 

In [64]: df
Out[64]: 
              a       date
v d                       
1 2015-01-04  0 2015-01-04
2 2015-01-11  1 2015-01-11
3 2015-01-18  2 2015-01-18
4 2015-01-25  3 2015-01-25
5 2015-02-01  4 2015-02-01

In [65]: df.resample('M', on='date').sum()
Out[65]: 
            a
date         
2015-01-31  6
2015-02-28  4

In [66]: df.resample('M', level='d').sum()
Out[66]: 
            a
d            
2015-01-31  6
2015-02-28  4
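
For comparison, a minimal sketch of how the same selection is spelled today via TimeGrouper's key= versus the on= keyword added here (df2 is just the example frame above without the MultiIndex; an illustration, not code from the PR):

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
                    'a': np.arange(5)})

# existing spelling: TimeGrouper picks the column via `key=`
# (equivalently pd.Grouper(key='date', freq='M'))
df2.groupby(pd.TimeGrouper(key='date', freq='M')).sum()

# spelling added in this PR: `on=` picks the column, matching .rolling()
df2.resample('M', on='date').sum()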

@@ -4164,12 +4169,16 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,
"""
from pandas.tseries.resample import (resample,
_maybe_process_deprecations)
if is_list_like(on):
raise ValueError("Only a single column may be passed to on")
if is_list_like(level):
Contributor

I would move these inside resample

Contributor

Actually, I think we might be able to remove these entirely. When TimeGrouper._set_grouper gets called, these are validated (same as in groupby); e.g. among other fixes, #13907 should work for this as well.

@jreback (Contributor) commented Aug 11, 2016

would be ok with deprecating 'key' in favor of 'on' for TimeGrouper as well

df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
                   'a': np.arange(5)},
                  index=pd.MultiIndex.from_arrays([
                      [1,2,3,4,5],
Contributor

would add to the main docs a similar example

@codecov-io commented Aug 11, 2016

Current coverage is 85.27% (diff: 83.33%)

Merging #13961 into master will decrease coverage by 0.02%

@@             master     #13961   diff @@
==========================================
  Files           139        139          
  Lines         50164      50510   +346   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42788      43071   +283   
- Misses         7376       7439    +63   
  Partials          0          0          

Powered by Codecov. Last update 257ac88...10c7280

@@ -377,6 +377,20 @@ Other enhancements

pd.Timestamp(year=2012, month=1, day=1, hour=8, minute=30)

- the ``.resample()`` function now accepts a ``on=`` or ``key=`` parameter for resampling on a column or ``MultiIndex`` level (:issue:`13500`)
Member

key -> level

@jorisvandenbossche (Member)

API design question: when using on, should that column be set as the index, or left as a column? (For rolling, we kept it as a column, but of course, there the original index is left intact, which is not the case for resample.)

@jorisvandenbossche jorisvandenbossche added this to the 0.19.0 milestone Aug 11, 2016
@jreback (Contributor) commented Aug 11, 2016

@jorisvandenbossche I don't think we actually set the index for anything that accepts key/on (.groupby / .merge) ATM, nor do I think we should. That's exactly the point: the index setting needs to be explicit.

@chris-b1 (Contributor, Author)

But this will have the index set in the results. I do think that's the right thing to do - consistent with how groupby works elsewhere. It'd be nice if this were an option, but it doesn't seem to be implemented:

In [11]: df.groupby(pd.TimeGrouper(key='date', freq='M'), as_index=False).sum()
---------------------------------------------------------------------------

AttributeError: 'BinGrouper' object has no attribute 'compressed'

@chris-b1 (Contributor, Author) commented Aug 11, 2016

I actually don't have a big problem with the key / on inconsistency. They do sort of represent different things - key is part of an abstract mapping, whereas on is a selection out of a concrete object?

@jreback (Contributor) commented Aug 11, 2016

@chris-b1 I don't think we set the index when on is indicated (in merging), and on doesn't exist in .groupby / .resample ATM (except via key). So can you show an example?

@chris-b1 (Contributor, Author)

I'm probably not being clear. By "setting" what I mean is that we return this:

In [19]: df.resample('M', on='date').sum()
Out[19]: 
            a
date         
2015-01-31  6
2015-02-28  4

Instead of this - i.e., the on column becomes an index.

In [20]: df.resample('M', on='date').sum().reset_index()
Out[20]: 
        date  a
0 2015-01-31  6
1 2015-02-28  4

@jreback (Contributor) commented Aug 11, 2016

@chris-b1 I understand. And I think the point IS to return [20]. We had this same discussion in #13358.

Certainly there is a case for this. E.g. time is the grouper here, so we should set the index. And maybe this is different from the merging-case rationale. The counterpoint is that the point of on is to indicate a grouper that we don't want to set in the first place (e.g. this defeats the purpose of the .set_index(on_column).resample(...).reset_index(on_column) idiom).

I just want to clearly delineate the cases where we should and should not set the resulting index.
I think a table (in the dev docs?) laying out the philosophy / rationale would be good (or just laying it out here is fine too).
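
To make the two options concrete, a sketch of the idiom being discussed next to the on= spelling (using a plain frame with a 'date' column; an illustration only, not code from the PR):

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
                    'a': np.arange(5)})

# the idiom `on=` is meant to shortcut
df2.set_index('date').resample('M').sum().reset_index()

# the `on=` spelling from this PR: the resampled column becomes the
# index of the result, so getting it back as a column stays explicit
df2.resample('M', on='date').sum().reset_index()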

@chris-b1 (Contributor, Author)

xref #5755 - whether or not this is exactly what should be done, the basic rules now seem to be (a short sketch follows the list):

  • Transformations (sort, rolling, groupby.transform) - never set the index
  • Reductions (df.sum, df.groupby.sum) - set the index to the reducing grouping. In some places (groupby is the only one I can think of) you have the option to keep the grouping as a column rather than an index
  • Relational ops
    • index based (concat, join, align) - set the index to the resulting set
    • column based (merge_*) - discard the existing index
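
A compact sketch of those rules with a plain groupby (the same toy frame used further down in this thread):

import pandas as pd

df3 = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})

# transformation: shape preserved, the grouping never becomes the index
df3.groupby('a').transform(lambda x: x)     # index 0..3, no 'a' column

# reduction: index set to the grouping ...
df3.groupby('a').sum()                      # 'a' is the index

# ... with an option (groupby only) to keep the grouping as a column
df3.groupby('a', as_index=False).sum()      # 'a' stays a column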

@chris-b1 (Contributor, Author)

Small updates pushed for the comments. I'd propose that setting the on column as the index (current impl) is the most consistent API design, but I can certainly change that if you feel otherwise.

@jorisvandenbossche (Member)

@chris-b1 With the current PR, if you have a non-reducing method, you get the following:

In [4]: df.resample('M', on='date').transform(lambda x: x)
Out[4]: 
   a
0  0
1  1
2  2
3  3
4  4

Which I think is not what we want? (The 'date' column should still be there?)

I think we should either follow the logic from groupby (reducing -> set as index (current PR for reducing methods), transforming -> keep as column), or keep it as a column in all cases (to distinguish it as a use case from set_index('date').resample(..)).

@chris-b1 (Contributor, Author)

You must not be using the same frame as my example? I show the index always being preserved, which is expected. I could see an argument that 'date' should be preserved as a column in [3].

In [1]: pd.__version__
Out[1]: '0.18.1+345.gc4db0e7'

In [2]: df
Out[2]: 
              a       date
v d                       
1 2015-01-04  0 2015-01-04
2 2015-01-11  1 2015-01-11
3 2015-01-18  2 2015-01-18
4 2015-01-25  3 2015-01-25
5 2015-02-01  4 2015-02-01

In [3]: df.resample('M', on='date').transform(lambda x: x)
Out[3]: 
              a
v d            
1 2015-01-04  0
2 2015-01-11  1
3 2015-01-18  2
4 2015-01-25  3
5 2015-02-01  4

In [4]: df.resample('M', level='d').transform(lambda x: x)
Out[4]: 
              a       date
v d                       
1 2015-01-04  0 2015-01-04
2 2015-01-11  1 2015-01-11
3 2015-01-18  2 2015-01-18
4 2015-01-25  3 2015-01-25
5 2015-02-01  4 2015-02-01

@jorisvandenbossche (Member) commented Aug 15, 2016

@chris-b1 ah yes, sorry, I used the same example dataframe, but without the index. I think it's because the index is identical to the column in your example that the result 'looks' good (the index, containing the same values as the dropped column, is still there):

In [63]: df2 = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
   ....:                     'a': np.arange(5)})

In [64]: df2.resample('M', on='date').transform(lambda x: x)
Out[64]: 
   a
0  0
1  1
2  2
3  3
4  4

@chris-b1 (Contributor, Author)

I actually still think the current approach may be right. Is it any different from 'a' being excluded from the results here?

In [11]: df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})

In [12]: df.groupby('a').transform(lambda x: x)
Out[12]: 
   b
0  1
1  2
2  3
3  4

@jorisvandenbossche (Member) commented Aug 15, 2016

Hmm, OK, I was under the impression that that preserved the 'a' column; apparently not ... Oh, what a jumble this grouping API is :-)
Other 'transforming' methods do keep the grouping column (e.g. ffill or head).

EDIT: after some thought, it is indeed useful that transform does not preserve the grouping column; this way, assigning the result to a new column of the original dataframe is much easier (see the sketch below).
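
A small illustration of that convenience, reusing the toy frame from the previous comment: because the grouping column is absent from the transform output, the result aligns on the original index and can be assigned straight back.

import pandas as pd

df3 = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4]})

# no 'a' column in the transform output, so the demeaned 'b' lines up directly
df3['b_demeaned'] = df3.groupby('a').transform(lambda x: x - x.mean())['b']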

@@ -1473,6 +1473,28 @@ Furthermore, you can also specify multiple aggregation functions for each column
r.agg({'A' : ['sum','std'], 'B' : ['mean','std'] })


If a ``DataFrame`` does not have a ``DatetimeIndex``, but instead you want
to resample based on a column in the frame, it can be passed to the ``on`` keyword.
Contributor

make it clear that the on column (currently) still must be datetimelike (we of course accept PeriodIndex/TimedeltaIndex here as well; add tests for those if we don't have them)

Contributor

use datetimelike rather than DatetimeIndex
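
A sketch of the kind of main-docs example being asked for, reusing the frame from the top of the thread without the MultiIndex (illustration only; per the review above, the column must be datetime-like, and period/timedelta columns should be covered as well):

import numpy as np
import pandas as pd

# a frame without a DatetimeIndex, but with a datetime-like column
df2 = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
                    'a': np.arange(5)})

# resample on the column instead of the index
df2.resample('M', on='date').sum()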

@jreback (Contributor) commented Aug 15, 2016

I think this is a fine extension of the API. It is in line with how other methods produce output.

@chris-b1 (Contributor, Author)

OK, one (hopefully) last API question.

Right now this blows up in various ways if an upsample is attempted, e.g. df.resample('D', level='d').ffill(). Maybe that's a valid use case, but it's a lot less compelling to me, and it doesn't map to groupby semantics, so what I'm proposing is to raise NotImplementedError for now.
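
A sketch of the proposed behavior with the MultiIndex frame from the top of the thread (only the NotImplementedError itself is what is proposed; the message text is whatever the implementation chooses):

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
                   'a': np.arange(5)},
                  index=pd.MultiIndex.from_arrays(
                      [[1, 2, 3, 4, 5],
                       pd.date_range('2015-01-01', freq='W', periods=5)],
                      names=['v', 'd']))

# upsampling weekly data to daily while resampling on a MultiIndex level
try:
    df.resample('D', level='d').ffill()
except NotImplementedError as err:
    print(err)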

@chris-b1 (Contributor, Author)

@jreback - expanded the tests like you suggested, moving as much of the API as possible to the Base class, so all three index types are hit.

This also now closes #14008; it's kind of a hack, but I'm not sure there's a much better way without rethinking the whole approach, xref #12884.

I'm using the NotImplementedError approach for upsampling suggested above.


on : string, optional
For a DataFrame, column to use instead of index for resampling.
Column must be datetime-like.
Contributor

add versionadded tags
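
A sketch of the requested tag in the numpydoc parameter description, assuming the 0.19.0 milestone shown earlier in the thread:

on : string, optional
    For a DataFrame, column to use instead of index for resampling.
    Column must be datetime-like.

    .. versionadded:: 0.19.0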

@jreback (Contributor) commented Aug 18, 2016

pls remove the 'hack'. make this PR simpler.

@chris-b1 (Contributor, Author)

I'm very open to suggestions, but this is as clean an approach as I can think of. I can always back out the changes and just leave #14008 open if the fix is worse than the bug.

@jreback (Contributor) commented Aug 18, 2016

yes pls back out

the period resampling needs to be fixed in a more core way

@chris-b1 (Contributor, Author)

If you'd like to have another look, I've removed the PeriodIndex changes. I did leave in the from_selection state (there may be a better name for this, though it's internal only), but it's only used to raise sensible errors for not-implemented things.

@@ -255,7 +255,8 @@ def _set_grouper(self, obj, sort=False):
Parameters
----------
obj : the subject object

sort : bool, default False
whether the resulting grouper should be sorted
Contributor

this was missing I guess?

# upsampling and PeriodIndex resampling do not work
# if resampling on a column or mi level
# this state used to catch and raise an error
self._from_selection = (self.groupby.key is not None or
Contributor

I meant make this a property

@property
def _from_selection(self):
    return self.groupby is not None and (self.groupby.key is not None or
                                         self.groupby.level is not None)

@chris-b1 (Contributor, Author)

@jreback - updated for that last comment, let me know if you see anything else.

@jreback jreback closed this in 8654a9e Aug 31, 2016
@jreback (Contributor) commented Aug 31, 2016

thanks @chris-b1
