
ENH: Richer options for interpolate and resample #4434

Closed
TomAugspurger opened this issue Aug 1, 2013 · 27 comments
Labels: API Design · Enhancement · Internals (related to non-user-accessible pandas implementation) · Numeric Operations (arithmetic, comparison, and logical operations)
@TomAugspurger
Contributor

related #1892, #1479

Is there any interest in giving interpolate and resample (to higher frequency) some additional methods?

For example:

import numpy as np
import pandas as pd
from scipy import interpolate
df = pd.DataFrame({'A': np.arange(10), 'B': np.exp(np.arange(10) + np.random.randn())})
xnew = np.arange(10) + .5

In [46]: df.interpolate(xnew, method='spline')

Could return something like

In [47]: pd.DataFrame(interpolate.spline(df.A, df.B, xnew, order=4), index=xnew)
Out[47]: 
               0
0.5     1.044413
1.5     0.798392
2.5     3.341909
3.5     8.000314
4.5    22.822819
5.5    60.957659
6.5   166.844351
7.5   451.760621
8.5  1235.969910
9.5     0.000000  # falls outside the original range so interpolate.spline sets it to 0.

I have never used the DataFrame's interpolate, but a quick glance says that something like the above wouldn't be backwards compatible with the current calling convention. Maybe a different name? This may be conflating two issues: interpolating over missing values and interpolating / predicting non-existent values. Or are they similar enough that they can be treated the same? I would think so.

These are just some quick thoughts before I forget. I haven't spent much time thinking a design through yet. I'd be happy to work on this in a month or so.

Also does this fall in the realm of statsmodels?
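A minimal sketch of the wrapping proposed above, hedged on one point: `scipy.interpolate.spline` has since been removed from scipy, so `interp1d` stands in for it here (the `spline_interpolate` name is hypothetical):

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

def spline_interpolate(df, xnew, kind='cubic'):
    # Interpolate column B against column A at new x-values and wrap the
    # result in a Series, as the proposed df.interpolate(xnew, method=...)
    # would. fill_value=0.0 mimics interpolate.spline's out-of-range behaviour.
    f = interp1d(df['A'], df['B'], kind=kind,
                 bounds_error=False, fill_value=0.0)
    return pd.Series(f(xnew), index=xnew)

df = pd.DataFrame({'A': np.arange(10), 'B': np.exp(np.arange(10))})
xnew = np.arange(10) + 0.5

result = spline_interpolate(df, xnew)
# xnew[-1] = 9.5 falls outside the original range, so it is filled with 0,
# mirroring the behaviour noted in the example output above.
```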

@cpcloud
Member

cpcloud commented Aug 2, 2013

@jreback thought we deprecated DataFrame.interpolate...? should we bring it back? splines sort of blur the line between pandas and statsmodels (i think leaning more towards statsmodels) but i like the idea.

@danielballan
Contributor

Yes, this is a basic task that really should [edit:] not call for statsmodels, in my opinion.

@danielballan
Contributor

Ugly workaround I offered a few days ago: http://stackoverflow.com/a/18276030/1221924

@jreback
Contributor

jreback commented Aug 21, 2013

since we do use statsmodels/scipy in other parts of the code why don't u peruse sm 5.0 for some available functions here?

@jreback
Contributor

jreback commented Aug 21, 2013

@jseabold do u have direct support in sm 5.0 for interpolation? or do u defer to scipy?

@jreback
Contributor

jreback commented Aug 21, 2013

http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html

should be straightforward to directly call these

@jreback
Contributor

jreback commented Aug 21, 2013

via a kind argument (to pandas interpolate) with some kinds passing to scipy functions which are then wrapped on the return

@TomAugspurger
Contributor Author

@jreback agreed about the ease of wrapping scipy.interpolate. My example in the first post is just calling

interpolate.spline(df.index, df['A'], xnew)

to get the interpolated values and then wrapping them up in a Series.

I've assumed that the DataFrame's index is the original x-values, which is probably fine for a default, but we'd want an argument to say "use this column".

I could probably start on this in a few weeks (I have to finish a paper, then I promised the statsmodels guys that I'd setup a vbench for them).
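The "use this column as x" idea could look something like this (the `frame_interpolate` name and signature are hypothetical, with `x=None` falling back to the index as the proposed default):

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

def frame_interpolate(df, xnew, x=None, kind='linear'):
    # x=None means "use the index as x-values"; otherwise use column x.
    xvals = df.index.values if x is None else df[x].values
    out = {col: interp1d(xvals, df[col].values, kind=kind)(xnew)
           for col in df.columns if col != x}
    return pd.DataFrame(out, index=xnew)

df = pd.DataFrame({'t': [0.0, 1.0, 2.0, 3.0], 'y': [0.0, 2.0, 4.0, 6.0]})
res = frame_interpolate(df, xnew=[0.5, 1.5], x='t')
# linear interpolation of y against column t gives [1.0, 3.0]
```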

@danielballan
Contributor

Can we incorporate this into resample and reindex? Anywhere that ffill and bfill are accepted, linear and cubic should also be accepted.

And if we do that, can we give the same options that Series.interpolate provides?
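For reference, this is essentially the shape the API eventually took in current pandas: interpolate accepts method names beyond the default, and upsampling via resample exposes interpolate on the resampler (a sketch against modern pandas, not the API as it stood in 2013):

```python
import numpy as np
import pandas as pd

# Fill NaNs in place of ffill/bfill-style padding.
s = pd.Series([0.0, np.nan, np.nan, 3.0])
s.interpolate(method='linear')  # -> 0.0, 1.0, 2.0, 3.0

# Upsample to a higher frequency, then interpolate the new points.
ts = pd.Series([1.0, 2.0],
               index=pd.date_range('2013-08-01', periods=2, freq='h'))
ts.resample('30min').interpolate('linear')
```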

@TomAugspurger
Contributor Author

@danielballan +1 on reusing parts or all of this for resample and reindex (and possibly fillna?). I think that it would be relatively easy to handle.

Not sure how this would fit in with Jeff's refactor of Series.

@jreback
Contributor

jreback commented Aug 21, 2013

so there exists right now a Series.interpolate and a generic.interpolate; Series.interpolate should be basically scrapped and it will then use the generic one (needs only a very slight mod to do this).

interpolate calls the pandas.core.internals.interpolate (which is the same routine actually called by method ffill/bfill), so this can be handled at a lower level (e.g. other kinds of fillers)

it's a bit to wrap your head around, but pretty straightforward

@jreback
Contributor

jreback commented Aug 21, 2013

the key is that Series and DataFrame both now have ._data (which is the BlockManager) and then call the methods on the blocks, so they end up calling the same methods

@jreback
Contributor

jreback commented Aug 21, 2013

lmk if you want to take a stab (or I could set it up for you with the structure and you can add in the other methods)

@TomAugspurger
Contributor Author

If there's no urgency, I'm fine with going through the code to refactor generic.interpolate. I may have to bug you with a few questions though!

If Series.interpolate is refactored though, backwards compatibility may be a problem? Or on second thought maybe not... Right now Series.interpolate is just for filling in NaNs. That could remain the default behavior, but we could also accept an array of new values at which to interpolate, defaulting to None. I think that should work.

@jreback
Contributor

jreback commented Aug 21, 2013

@TomAugspurger that sounds fine; there should be no back compat issue (well...have to make sure, but in theory we have tests for that.....)

if Series.interpolate is doing something that we want to keep (as the default?), which is probably linear interpolation, that can then be moved to a lower-level part of the code (e.g. core.common.interpolate_2d), where all of the interpolations will eventually happen

probably need some more tests to validate this
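The linear-interpolation default being discussed is essentially positional linear filling of NaNs, which can be sketched with numpy alone (the function name is hypothetical, not pandas' actual internal routine):

```python
import numpy as np

def interpolate_1d_linear(values):
    # Fill NaNs by linear interpolation against positional order,
    # the behaviour Series.interpolate provides by default.
    values = np.asarray(values, dtype=float)
    mask = np.isnan(values)
    idx = np.arange(len(values))
    out = values.copy()
    out[mask] = np.interp(idx[mask], idx[~mask], values[~mask])
    return out

interpolate_1d_linear([1.0, np.nan, 3.0, np.nan, 5.0])
# -> [1., 2., 3., 4., 5.]
```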

@jreback
Contributor

jreback commented Aug 21, 2013

@TomAugspurger see #1892 as well; this is not conceptually much harder, as the limit kw is already passed thru to these methods; actually implementing it might be a bit trickier. E.g. you might have to interpolate, then throw away all but the number of limited values (using a mask for the prior-to NaN values)... sounds more complicated than actually doing it (but this is an add-on feature)
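The "interpolate, then throw away all but the limited number of filled values" idea can be sketched like this (a hypothetical helper, not pandas' actual implementation; it restores NaNs beyond `limit` consecutive fills):

```python
import numpy as np

def limited_fill(values, filled, limit):
    # Keep at most `limit` consecutive filled values after each run of
    # observed data; any later NaNs in the same run are restored.
    values = np.asarray(values, dtype=float)
    out = np.asarray(filled, dtype=float).copy()
    mask = np.isnan(values)
    run = 0
    for i in range(len(values)):
        run = run + 1 if mask[i] else 0
        if run > limit:
            out[i] = np.nan
    return out

vals = np.array([1.0, np.nan, np.nan, np.nan, 5.0])
filled = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # fully interpolated first
limited_fill(vals, filled, limit=2)
# -> [1., 2., 3., nan, 5.]
```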

@jseabold
Contributor

Statsmodels uses scipy and will likely continue to do so. There is support for "benchmarking" in statsmodels, but this is such a specialized case, I don't think it's worth supporting on your end.

@jreback
Contributor

jreback commented Aug 22, 2013

@jseabold good to know; I think it makes sense for pandas to have some built-in methods, and a dispatch to scipy and/or sm for other methods...

@TomAugspurger
Contributor Author

Starting to take a look at this. Just to get some of the scaffolding straight in my head:

  • Series and DataFrame will both have .interpolate methods which will call generic.interpolate
  • generic.interpolate will call core.internals.interpolate
  • core.internals.interpolate will call core.common.interpolate_2d, which will handle both interpolation at a new array of values given by the user and filling of NaN values. This is where I'll be adding wrappers for the various new methods.

So I'll be adding bits along the way to point things down to core.common.interpolate_2d before handing it off to a scipy or statsmodels method, capturing that result, and reconstructing either a new Series/DataFrame in the case of interpolate, or filling in an existing Series/DataFrame in the case of fillna (or resample to a higher frequency?).
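The call chain described above can be sketched as follows (all names hypothetical stand-ins for the real internals; the per-method logic is stubbed with a linear fill):

```python
import numpy as np

VALID_METHODS = {'linear', 'spline', 'cubic'}

def interpolate_2d(values, method='linear'):
    # Low-level worker: validate the method, then fill NaNs row by row.
    # A real implementation would dispatch to scipy per method.
    if method not in VALID_METHODS:
        raise ValueError(f"invalid method {method!r}")
    out = values.astype(float)
    for row in out:
        mask = np.isnan(row)
        idx = np.arange(row.size)
        row[mask] = np.interp(idx[mask], idx[~mask], row[~mask])
    return out

def generic_interpolate(values, method='linear', **kwargs):
    # Stand-in for generic.interpolate: coerce to 2d and hand off,
    # as Series/DataFrame .interpolate would via the internals.
    return interpolate_2d(np.atleast_2d(values), method=method, **kwargs)

generic_interpolate(np.array([1.0, np.nan, 3.0]))
# -> [[1., 2., 3.]]
```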

A couple questions:

  • Should new unit tests go in test_common.py? Or in test_frame.py and test_series.py?
  • Do you want a new top level function pd.interpolate? Or will Series and Frame methods suffice?
  • What about Panels? I'd need to think more about what that would look like. Maybe hold off on that for now

@jtratner
Contributor

jtratner commented Sep 7, 2013

If they are calling generic.interpolate, why not just define it once in core/generic and use the axes abstractions there? If you want to opt-out panel, you could just have Panel raise an error...

@TomAugspurger
Contributor Author

I could be wrong but I think generic.interpolate is an abstraction for Series and DataFrame (and Panel) interpolate methods and core.common contains the abstraction for interpolate and fillna.

@jreback
Contributor

jreback commented Sep 7, 2013

@TomAugspurger

right now core.generic.NDFrame.interpolate is where the action is. You need to eliminate core.series.Series.interpolate in favor of that. It's treated the same in the BlockManager, so it should be straightforward.

However, there may exist some behavior in core.series.Series.interpolate that does not yet exist in the NDFrame one, so you need to integrate this.

See core.generic.NDFrame.fillna/replace for some strategies on this.

You can add a new generic tester in test_generic and move the existing tests in test_series/frame there. You can put them under the appropriate classes (e.g. TestSeries) and such.

You can easily not support Panel for now (and leave it till later) by just checking ndim in core.generic.NDFrame and raising. (Similarly, you don't want to allow an invalid axis; see how fillna does this.)

As far as actually making the useful change (the point of this PR!): I would simply allow method to take different values (this will need some validation, probably in common.interpolate_2d). If you need other parameters you can either have them passed in (all kwargs are already propagated down to Block.interpolate anyhow), or maybe make method a callable. Not sure exactly what the scipy functions do.

So most of the 'real' changes will occur in common.interpolate_2d. This always deals with 2d things. In fact, I would actually have you separate it out into interpolate_1d, 2d, 3d... (in common) to avoid some of the boilerplate, though that's not strictly necessary and it could still be done in one function.

This common.interpolate_2d is called from the individual Block (e.g. a single dtype), and returns an array of the same size (which could be the same one or not; it doesn't actually matter), as the Block then wraps it up in a new Block and returns it. (This is how all of the inplace stuff is handled.)

You should be getting your hands dirty here. Shout out if you need help.
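The block-level wrapping described above (same-shaped array in, new Block out) can be sketched with a toy stand-in for pandas' internal Block class (all names here are illustrative, not the real internals):

```python
import numpy as np

class Block:
    # Toy stand-in for pandas' internal Block: a 2d ndarray of one dtype.
    def __init__(self, values):
        self.values = values

    def interpolate(self, func):
        # Delegate to a 2d routine and wrap the same-shaped result in a
        # new Block, mirroring how the internals handle inplace vs. copy.
        new_values = func(self.values)
        assert new_values.shape == self.values.shape
        return Block(new_values)

def fill_linear_2d(values):
    # Minimal interpolate_2d stand-in: linear NaN fill along each row.
    out = values.astype(float)
    for row in out:
        mask = np.isnan(row)
        idx = np.arange(row.size)
        row[mask] = np.interp(idx[mask], idx[~mask], row[~mask])
    return out

blk = Block(np.array([[1.0, np.nan, 3.0]]))
new_blk = blk.interpolate(fill_linear_2d)  # a new Block, NaNs filled
```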

@jreback
Contributor

jreback commented Sep 20, 2013

@TomAugspurger how's this coming?

@TomAugspurger
Contributor Author

I've been a bit intimidated by the internals. I'll give it some time this weekend and maybe wave the white flag if I fail. Is it on the schedule for the next release?

@jreback
Contributor

jreback commented Sep 20, 2013

I think it should be.....lmk if you need help

internals are all about breaking stuff!!! lol

@jtratner
Contributor

the test suite's pretty good, so that's a helpful guide.

@jreback
Contributor

jreback commented Oct 9, 2013

closed via #4915

@jreback jreback closed this as completed Oct 9, 2013