
ENH: Richer options for interpolate and resample #4434

Closed
TomAugspurger opened this issue Aug 1, 2013 · 27 comments
Labels: API Design · Enhancement · Internals (related to non-user-accessible pandas implementation) · Numeric Operations (arithmetic, comparison, and logical operations)
@TomAugspurger
Contributor

related #1892, #1479

Is there any interest in giving interpolate and resample (to higher frequency) some additional methods?

For example:

import numpy as np
import pandas as pd
from scipy import interpolate
df = pd.DataFrame({'A': np.arange(10), 'B': np.exp(np.arange(10) + np.random.randn())})
xnew = np.arange(10) + .5

In [46]: df.interpolate(xnew, method='spline')

Could return something like

In [47]: pd.DataFrame(interpolate.spline(df.A, df.B, xnew, order=4), index=xnew)
Out[47]: 
               0
0.5     1.044413
1.5     0.798392
2.5     3.341909
3.5     8.000314
4.5    22.822819
5.5    60.957659
6.5   166.844351
7.5   451.760621
8.5  1235.969910
9.5     0.000000  # falls outside the original range so interpolate.spline sets it to 0.

I have never used the DataFrame's interpolate, but a quick glance says that something like the above wouldn't be backwards compatible with the current calling convention. Maybe a different name? This may be conflating two issues: interpolating over missing values and interpolating / predicting non-existent values. Or are they similar enough that they can be treated the same? I would think so.

These are just some quick thoughts before I forget. I haven't spent much time thinking a design through yet. I'd be happy to work on this in a month or so.

Also does this fall in the realm of statsmodels?
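A minimal sketch of the wrapping proposed above, hedged on one point: `scipy.interpolate.spline` has since been removed from scipy, so `interp1d` stands in for it here (the `spline_interpolate` name is hypothetical):

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

def spline_interpolate(df, xnew, kind='cubic'):
    # Interpolate column B against column A at new x-values and wrap the
    # result in a Series, as the proposed df.interpolate(xnew, method=...)
    # would. fill_value=0.0 mimics interpolate.spline's out-of-range behaviour.
    f = interp1d(df['A'], df['B'], kind=kind,
                 bounds_error=False, fill_value=0.0)
    return pd.Series(f(xnew), index=xnew)

df = pd.DataFrame({'A': np.arange(10), 'B': np.exp(np.arange(10))})
xnew = np.arange(10) + 0.5

result = spline_interpolate(df, xnew)
# xnew[-1] = 9.5 falls outside the original range, so it is filled with 0,
# mirroring the behaviour noted in the example output above.
```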

@cpcloud
Member

cpcloud commented Aug 2, 2013

@jreback thought we deprecated DataFrame.interpolate...? should we bring it back? splines sort of blur the line between pandas and statsmodels (i think leaning more towards statsmodels) but i like the idea.

@danielballan
Contributor

Yes, this is a basic task that really should [edit:] not call for statsmodels, in my opinion.

@danielballan
Contributor

Ugly workaround I offered a few days ago: http://stackoverflow.com/a/18276030/1221924

@jreback
Contributor

jreback commented Aug 21, 2013

since we do use statsmodels/scipy in other parts of the code why don't u peruse sm 5.0 for some available functions here?

@jreback
Contributor

jreback commented Aug 21, 2013

@jseabold do u have direct support in sm 5.0 for interpolation? or do u defer to scipy?

@jreback
Contributor

jreback commented Aug 21, 2013

http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html

should be straightforward to directly call these

@jreback
Contributor

jreback commented Aug 21, 2013

via a kind argument (to pandas interpolate) with some kinds passing to scipy functions which are then wrapped on the return

@TomAugspurger
Contributor Author

@jreback agreed about the ease of wrapping scipy.interpolate. My example in the first post is just calling

interpolate.spline(df.index, df['A'], xnew)

to get the interpolated values and then wrapping them up in a Series.

I've assumed that the DataFrame's index is the original x-values, which is probably fine for a default, but we'd want an argument to say "use this column".

I could probably start on this in a few weeks (I have to finish a paper, then I promised the statsmodels guys that I'd setup a vbench for them).
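The "use this column as x" idea could look something like this (the `frame_interpolate` name and signature are hypothetical, with `x=None` falling back to the index as the proposed default):

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

def frame_interpolate(df, xnew, x=None, kind='linear'):
    # x=None means "use the index as x-values"; otherwise use column x.
    xvals = df.index.values if x is None else df[x].values
    out = {col: interp1d(xvals, df[col].values, kind=kind)(xnew)
           for col in df.columns if col != x}
    return pd.DataFrame(out, index=xnew)

df = pd.DataFrame({'t': [0.0, 1.0, 2.0, 3.0], 'y': [0.0, 2.0, 4.0, 6.0]})
res = frame_interpolate(df, xnew=[0.5, 1.5], x='t')
# linear interpolation of y against column t gives [1.0, 3.0]
```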

@danielballan
Contributor

Can we incorporate this into resample and reindex? Anywhere that ffill and bfill are accepted, linear and cubic should also be accepted.

And if we do that, can we give the same options that Series.interpolate provides?
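For reference, this is essentially the shape the API eventually took in current pandas: interpolate accepts method names beyond the default, and upsampling via resample exposes interpolate on the resampler (a sketch against modern pandas, not the API as it stood in 2013):

```python
import numpy as np
import pandas as pd

# Fill NaNs in place of ffill/bfill-style padding.
s = pd.Series([0.0, np.nan, np.nan, 3.0])
s.interpolate(method='linear')  # -> 0.0, 1.0, 2.0, 3.0

# Upsample to a higher frequency, then interpolate the new points.
ts = pd.Series([1.0, 2.0],
               index=pd.date_range('2013-08-01', periods=2, freq='h'))
ts.resample('30min').interpolate('linear')
```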

@TomAugspurger
Contributor Author

@danielballan +1 on reusing parts or all of this for resample and reindex (and possibly fillna?). I think that it would be relatively easy to handle.

Not sure how this would fit in with Jeff's refactor of Series.

@jreback
Contributor

jreback commented Aug 21, 2013

so there exists right now a Series.interpolate and a generic.interpolate; Series.interpolate should be basically scrapped and it will then use the generic one (needs only a very slight mod to do this).

interpolate calls the pandas.core.internals.interpolate (which is the same routine actually called by method ffill/bfill), so this can be handled at a lower level (e.g. other kinds of fillers)

it's a bit to wrap your head around, but pretty straightforward

@jreback
Contributor

jreback commented Aug 21, 2013

the key is that Series and DataFrame both now have ._data (which is the BlockManager) and then call the methods on the blocks, so they end up calling the same methods

@jreback
Contributor

jreback commented Aug 21, 2013

lmk if you want to take a stab (or I could set it up for you with the structure and you can add in the other methods)

@TomAugspurger
Contributor Author

If there's no urgency, I'm fine with going through the code to refactor generic.interpolate. I may have to bug you with a few questions though!

If Series.interpolate is refactored though, backwards compatibility may be a problem? Or on second thought maybe not... Right now Series.interpolate is just for filling in NaNs. That could remain the default behavior, but we could also accept an array of new values at which to interpolate, defaulting to None. I think that should work.

@jreback
Contributor

jreback commented Aug 21, 2013

@TomAugspurger that sounds fine; there should be no back compat issue (well...have to make sure, but in theory we have tests for that.....)

if Series.interpolate is doing something that we want to keep (as the default?), which is probably linear interpolation, that can then be moved to a lower-level part of the code (e.g. core.common.interpolate_2d), where all of the interpolations will eventually happen

probably need some more tests to validate this
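The linear-interpolation default being discussed is essentially positional linear filling of NaNs, which can be sketched with numpy alone (the function name is hypothetical, not pandas' actual internal routine):

```python
import numpy as np

def interpolate_1d_linear(values):
    # Fill NaNs by linear interpolation against positional order,
    # the behaviour Series.interpolate provides by default.
    values = np.asarray(values, dtype=float)
    mask = np.isnan(values)
    idx = np.arange(len(values))
    out = values.copy()
    out[mask] = np.interp(idx[mask], idx[~mask], values[~mask])
    return out

interpolate_1d_linear([1.0, np.nan, 3.0, np.nan, 5.0])
# -> [1., 2., 3., 4., 5.]
```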

@jreback
Contributor

jreback commented Aug 21, 2013

@TomAugspurger see #1892 as well; this is not conceptually much harder, as the limit kw is already passed thru to these methods; actually implementing it might be a bit trickier. E.g. you might have to interpolate, then throw away all but the number of limited values (using a mask for the prior-to NaN values)... sounds more complicated than actually doing it (but this is an add-on feature)
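The "interpolate, then throw away all but the limited number of filled values" idea can be sketched like this (a hypothetical helper, not pandas' actual implementation; it restores NaNs beyond `limit` consecutive fills):

```python
import numpy as np

def limited_fill(values, filled, limit):
    # Keep at most `limit` consecutive filled values after each run of
    # observed data; any later NaNs in the same run are restored.
    values = np.asarray(values, dtype=float)
    out = np.asarray(filled, dtype=float).copy()
    mask = np.isnan(values)
    run = 0
    for i in range(len(values)):
        run = run + 1 if mask[i] else 0
        if run > limit:
            out[i] = np.nan
    return out

vals = np.array([1.0, np.nan, np.nan, np.nan, 5.0])
filled = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # fully interpolated first
limited_fill(vals, filled, limit=2)
# -> [1., 2., 3., nan, 5.]
```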

@jseabold
Contributor

Statsmodels uses scipy and will likely continue to do so. There is support for "benchmarking" in statsmodels, but this is such a specialized case, I don't think it's worth supporting on your end.

@jreback
Contributor

jreback commented Aug 22, 2013

@jseabold good to know; I think it makes sense for pandas to have some built-in methods, and a dispatch to scipy and/or sm for other methods...

@TomAugspurger
Contributor Author

Starting to take a look at this. Just to get some of the scaffolding straight in my head:

  • Series and DataFrame will both have .interpolate methods which will call generic.interpolate
  • generic.interpolate will call core.internals.interpolate
  • core.internals.interpolate will call core.common.interpolate_2d, which will handle both interpolation at a new array of values given by the user and filling of NaN values. This is where I'll be adding wrappers for the various new methods.

So I'll be adding bits along the way to point things down to core.common.interpolate_2d before handing it off to a scipy or statsmodels method, capturing that result, and reconstructing either a new Series/DataFrame in the case of interpolate, or filling in an existing Series/DataFrame in the case of fillna (or resample to a higher frequency?).
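The call chain described above can be sketched as follows (all names hypothetical stand-ins for the real internals; the per-method logic is stubbed with a linear fill):

```python
import numpy as np

VALID_METHODS = {'linear', 'spline', 'cubic'}

def interpolate_2d(values, method='linear'):
    # Low-level worker: validate the method, then fill NaNs row by row.
    # A real implementation would dispatch to scipy per method.
    if method not in VALID_METHODS:
        raise ValueError(f"invalid method {method!r}")
    out = values.astype(float)
    for row in out:
        mask = np.isnan(row)
        idx = np.arange(row.size)
        row[mask] = np.interp(idx[mask], idx[~mask], row[~mask])
    return out

def generic_interpolate(values, method='linear', **kwargs):
    # Stand-in for generic.interpolate: coerce to 2d and hand off,
    # as Series/DataFrame .interpolate would via the internals.
    return interpolate_2d(np.atleast_2d(values), method=method, **kwargs)

generic_interpolate(np.array([1.0, np.nan, 3.0]))
# -> [[1., 2., 3.]]
```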

A couple questions:

  • Should new unit tests go in test_common.py? Or in test_frame.py and test_series.py?
  • Do you want a new top level function pd.interpolate? Or will Series and Frame methods suffice?
  • What about Panels? I'd need to think more about what that would look like. Maybe hold off on that for now

@jtratner
Contributor

jtratner commented Sep 7, 2013

If they are calling generic.interpolate, why not just define it once in core/generic and use the axes abstractions there? If you want to opt-out panel, you could just have Panel raise an error...

@TomAugspurger
Contributor Author

I could be wrong but I think generic.interpolate is an abstraction for Series and DataFrame (and Panel) interpolate methods and core.common contains the abstraction for interpolate and fillna.

@jreback
Contributor

jreback commented Sep 7, 2013

@TomAugspurger

right now core.generic.NDFrame.interpolate is where the action is. You need to eliminate core.series.Series.interpolate in favor of that. It's treated the same in the BlockManager, so it should be straightforward.

However, there may exist some behavior in core.series.Series.interpolate that does not yet exist in the NDFrame one, so you need to integrate this.

See core.generic.NDFrame.fillna/replace for some strategies on this.

You can add a new generic tester in test_generic and move the existing tests in test_series/frame there. You can put them under the appropriate classes (e.g. TestSeries) and such.

You can easily not support Panel for now (and leave it till later) by just checking ndim in core.generic.NDFrame and raising. (Similarly, you don't want to allow an invalid axis; see how fillna does this.)

As far as actually making the useful change (the point of this PR!): I would simply allow method to take different values (this will need some validation, probably in common.interpolate_2d). If you need other parameters you can either have them passed in (all kwargs are already propagated down to Block.interpolate anyhow), or maybe make method a callable. Not sure exactly what the scipy functions do.

So most of the 'real' changes will occur in common.interpolate_2d. This always deals with 2d things. In fact, I would actually have you separate it out into interpolate_1d, 2d, 3d... (in common) to avoid some of the boilerplate, though that's not strictly necessary and it could still be done in one function.

This common.interpolate_2d is called from the individual Block (e.g. a single dtype), and returns an array of the same size (which could be the same one or not; it doesn't actually matter), as the Block then wraps it up in a new Block and returns it. (This is how all of the inplace stuff is handled.)

You should be getting your hands dirty here. Shout out if you need help.
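The block-level wrapping described above (same-shaped array in, new Block out) can be sketched with a toy stand-in for pandas' internal Block class (all names here are illustrative, not the real internals):

```python
import numpy as np

class Block:
    # Toy stand-in for pandas' internal Block: a 2d ndarray of one dtype.
    def __init__(self, values):
        self.values = values

    def interpolate(self, func):
        # Delegate to a 2d routine and wrap the same-shaped result in a
        # new Block, mirroring how the internals handle inplace vs. copy.
        new_values = func(self.values)
        assert new_values.shape == self.values.shape
        return Block(new_values)

def fill_linear_2d(values):
    # Minimal interpolate_2d stand-in: linear NaN fill along each row.
    out = values.astype(float)
    for row in out:
        mask = np.isnan(row)
        idx = np.arange(row.size)
        row[mask] = np.interp(idx[mask], idx[~mask], row[~mask])
    return out

blk = Block(np.array([[1.0, np.nan, 3.0]]))
new_blk = blk.interpolate(fill_linear_2d)  # a new Block, NaNs filled
```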

@jreback
Contributor

jreback commented Sep 20, 2013

@TomAugspurger how's this coming?

@TomAugspurger
Contributor Author

I've been a bit intimidated by the internals. I'll give it some time this weekend and maybe wave the white flag if I fail. Is it on the schedule for the next release?

@jreback
Contributor

jreback commented Sep 20, 2013

I think it should be.....lmk if you need help

internals are all about breaking stuff!!! lol

@jtratner
Contributor

the test suite's pretty good, so that's a helpful guide.

@jreback
Contributor

jreback commented Oct 9, 2013

closed via #4915

@jreback jreback closed this as completed Oct 9, 2013