Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for padding with sentinel value for roll* #24

Open
bkamins opened this issue Oct 5, 2022 · 57 comments
Open

Add support for padding with sentinel value for roll* #24

bkamins opened this issue Oct 5, 2022 · 57 comments

Comments

@bkamins
Copy link

bkamins commented Oct 5, 2022

Example:

julia> rollmean(1:5, 3)
3-element Vector{Float64}:
 2.0
 3.0
 4.0

julia> runmean(1:5, 3)
5-element Vector{Float64}:
 1.0
 1.5
 2.0
 3.0
 4.0

julia> [fill(missing, 2); rollmean(1:5, 3)]
5-element Vector{Union{Missing, Float64}}:
  missing
  missing
 2.0
 3.0
 4.0

It would be nice to add a kwarg to roll* to allow such padding, e.g. by writing rollmean(1:5, 3; pad=missing).

@JeffreySarnoff
Copy link
Owner

What would the sentinel value do, simply provide the value with which to pad?
Is this different from the idea put forth on Discourse, or is it a reminder (which would be fine)?
If it is different, please recap what the other functionality is -- how it ought behave, Thanks.

@bkamins
Copy link
Author

bkamins commented Oct 5, 2022

This is a new idea. The point is that e.g. rollmean shortens a vector. While runmean keeps the vector length, but shortens the window on initial values. Often one needs rollmean (i.e. no window shortening), but also to keep the original length of the vector (e.g. if you store it in a data frame that has a fixed number of rows). In such cases you need to add some sentinel (usually missing to make the vector returned by rollmean have the original length)

@JeffreySarnoff
Copy link
Owner

I can see how missing may want to occupy the low indices (in the past) or may want to occupy the high indices (current data, perhaps unresolvable with fewer observations than the window length), does this filler ever have reason to be within the data sequence rather than at the start or at the end? Does this filler ever have reason to be interspersed with the observations -- if so, to what end?

@bkamins
Copy link
Author

bkamins commented Oct 5, 2022

This is how pandas does it in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html:
image
(I did not request for this full functionality, but something simpler, but if you want to investigate it then this is I think a standard reference)

In short:

  • pandas allows to specify when values are filled with np.nan (I proposed to be more flexible and allow any filler) this decision is set in min_periods parameter (I proposed not to add it rollingmean assumes that min_periods is just window size), but we could add it if you wanted
  • pandas puts filler at the beginning or evenly in beginning and at the end; this is handled by center argument (we could have three values - fill at beginning, fill at end, fill evenly on both sides; I do not see a reason to have a filler in the middle)

@JeffreySarnoff
Copy link
Owner

Introducing missing suggests that each application of a rolled function must first check ismissing or dispatch differently on encountering x::Missing. I could keep a displacement count instead, that would have no slow-down. If the displacement were signed to indicate from which place to offset, the early indices or the recent indices, then perhaps your request would be met without disrupting processing. For situations where there may be missing entries within the body of the data -- no events to observe on a weekend, but all dates are tracked -- a different approach is required. Maybe a choice of LOCF or avg of most recent 2/3/? observations? Any thoughts.

@JeffreySarnoff
Copy link
Owner

JeffreySarnoff commented Oct 5, 2022

I prefer keeping advances to that which can be stated simply, and having specific requests gel together.
Nonetheless, there will be resituating the underlying computational-ness required. So, best to have a clear view of the essentials.

Is the discourse item replaced by this refinement, or is there a distinction I may have forgotten?

@bkamins
Copy link
Author

bkamins commented Oct 5, 2022

So my original proposal was an essential. Currently user needs to do padding manually:

julia> [fill(missing, 2); rollmean(1:5, 3)] # fill at front
5-element Vector{Union{Missing, Float64}}:
  missing
  missing
 2.0
 3.0
 4.0

or

julia> [missing; rollmean(1:5, 3); missing] # centered
5-element Vector{Union{Missing, Float64}}:
  missing
  missing
 2.0
 3.0
 4.0

This is quite inconvenient, especially for new users of Julia.

What I would think makes sense:

  • keep the "default" implementation as is and make it as fast as possible
  • accept that if user passes some "padding"-related keyword argument then things can be slower, but should do what user needs

This is not a restatement of https://discourse.julialang.org/t/rollingfunctions-with-a-variable-window-width/87601.
In that thread OP wanted to be able to specify different window width for each observation. So this was a different requirement (this issue is a requirement for padding not for variable width).

@JeffreySarnoff
Copy link
Owner

I don't know how to read "specify a different window for each observation." Windows span multiple observations by definition. Does this mean at each observation use however many earlier observations to determine the current function value and that however many may be unique to each observation [index]? If so, "no". Your request above is much more widely useful.

@JeffreySarnoff
Copy link
Owner

JeffreySarnoff commented Oct 5, 2022

Is there a reason to actually pad rather than know where to begin/end? The data sequence belongs to the client. I would be more comfortable providing a very easy, perhaps nearly invisible way to copy with padding inserted rather than grab the information and alter its e.g. type. A keyword or two would allow this. Another function that actually modifies the source data sequence could be exported for advanced users.

roll(fn, window, dataseq; pad = :first, skip = 20)
roll(fn, window, dataseq; padding = (from = :first, count = 20))

then currying or structifying (fn, window) is possible

@bkamins
Copy link
Author

bkamins commented Oct 5, 2022

Does this mean at each observation use however many earlier observations to determine the current function value and that however many may be unique to each observation [index]?

This is what OP on Discourse wanted.

@bkamins
Copy link
Author

bkamins commented Oct 5, 2022

Is there a reason to actually pad rather than know where to begin/end?

This is also OK, but would require an additional step later. The point is often users want to do something like:

julia> df = DataFrame(a=1:5)
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4
   5 │     5

julia> df.m = rollmean(df.a, 2)
ERROR: ArgumentError: New columns must have the same length as old columns

and now, as you see, they get an error.

They could do:

julia> df.m = runmean(df.a, 2)
5-element Vector{Float64}:
 1.0
 1.5
 2.5
 3.5
 4.5

julia> df
5×2 DataFrame
 Row │ a      m
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      1.5
   3 │     3      2.5
   4 │     4      3.5
   5 │     5      4.5

but often they want the first observation that does not have a full window size to be some sentinel (like missing) rather than applied function just with a shorter window.

@JeffreySarnoff
Copy link
Owner

ahh. ic

@kpa28-git
Copy link

@bkamins
I'm developing a fork of RollingFunctions that has padded rolling function mappadroll: https://github.com/kevindirect/MapSlide.jl
It's a WIP, but it does work.

@bkamins
Copy link
Author

bkamins commented Jan 9, 2023

Thank you for this work.

@JeffreySarnoff - what is the status of your efforts towards 1.0 release of RollingFunctions.jl?

@JeffreySarnoff
Copy link
Owner

JeffreySarnoff commented Jan 10, 2023

reasonably good -- external constraints put the prerelease at Saturday Jan 21 and the release (or release candidate) at Sunday Feb 5th.

@bkamins
Copy link
Author

bkamins commented Jan 18, 2023

@JeffreySarnoff - I saw a mention in this issue, but now I do not see it, so I am not sure what it was exactly. However, let me re-state the original request. When using rolling functions to get the result of the same length as input vector padded with some sentinel (missing is a reasonable default).

@JeffreySarnoff
Copy link
Owner

JeffreySarnoff commented Jan 18, 2023 via email

@bkamins
Copy link
Author

bkamins commented Jan 18, 2023

What is bugging me is that I cannot process your views without including all of data frames.

Can you please explain the problem? I would assume that RollingFunctions.jl does not need to be aware of DataFrames.jl at all. It takes a vector as an input and should produce vector in output.

@JeffreySarnoff
Copy link
Owner

That is true. People have requested that it work with DataFrame column views, and just for general symmetry -- that we have two well developed sorts of view and there is no common superview type nor a good way to fake it with a Union is less than ideal. I will forget about it though.

@bkamins
Copy link
Author

bkamins commented Jan 18, 2023

Indeed you can get a view as an input to your rolling function. But it should not matter I thought. I assumed that I would have a signature that restricts input type to AbstractVector without caring if it is a view or not.

@JeffreySarnoff
Copy link
Owner

Yes, I use AbstractVector{T} and also AbstractMatrix{T} (for running some function over multiple columns at one go).

@bkamins
Copy link
Author

bkamins commented Jan 18, 2023

And why this is not enough?

@JeffreySarnoff
Copy link
Owner

these are vector examples (fyi)

@JeffreySarnoff
Copy link
Owner

it is -- really it all goes well.
I have been musing about some shallow type hierarchies for Julia v2.

@bkamins
Copy link
Author

bkamins commented Jan 18, 2023

Thank you, https://github.com/JeffreySarnoff/RollingFunctions.jl/blob/v1/docs/src/intro/padding.md seems exactly what is needed (in other frameworks there is also an option to pad in the middle so that half of padding is in front and half of padding is at end).

Conceptually this means that if we move the window is the observation to which we attach the result in:

  • the beginning of the window (pad last)
  • in the center of the window (pad both ends equally)
  • in the end of the window (pad at beggining)

Padding to the center can be added later if it is a problem (but it would be good to keep in mind that people sometimes want to say that Tuesday is a mean of Monday, Tuesday and Wednesday).

Regarding the types challenge - I discussed the issue with Jeff Bezanson yesterday - and it is a tough thing.

@JeffreySarnoff
Copy link
Owner

with padding in the middle and an odd number of paddings, which end gets more or is that another settable choice?

@JeffreySarnoff
Copy link
Owner

I had considered letting people specify n1 pads at the start and n2 pads at the end -- that seemed an opportunity for people to get it wrong.

@bkamins
Copy link
Author

bkamins commented Jan 18, 2023

which end gets more or is that another settable choice?

😄. We seem to think alike. I was copying the functionality that people asked for, but when I did this I started thinking about exactly the same problem. Maybe a more general solution would be to add a numeric padding_offset (tentative name). Where when it is set to 0 (the default and minimum allowed size) we get padding first, and if you set it to window_size-1 (the maximum allowed size) you get padding last.

(now I see it is similar to what you said with n1 and n2, but I think it is clearer to use window_size and offset).

@JeffreySarnoff
Copy link
Owner

At the next-to-last minute, I decided to keep compatibility with current code.
Realigning to that -- still expect to unveil on Sunday Feb 5th.

@JeffreySarnoff
Copy link
Owner

Here be something (unregistered while I heal).

using Pkg
Pkg.add(url="https://github.com/JeffreySarnoff/RollingFunctions.jl",rev="main")

The entire edifice has been reworked.
You can pad at the start or at the end with almost any value.
You can copy padding values from a vector, in order.
This helps with connecting one roll to another.
This keeps the result clean when a function is not well defined until it has more than 1, 2, .. values.

The current docs reflect the tests and other working details in need of tests.

IncrementalAccumulators exists to be incorporated, and will replace the older implementations (rollmean, ...).



this would add substantive performance

I have not been able to figure out how to get @turbo to work with this. Experiments show @turbo would improve the performance of many internal functions by 2x-3x. In the example below, data is a view, and using @views on the result line adds some throughput to the more direct internal functions.

lo = 1
hi = window_span
@turbo for idx in 1:n
  result[idx] = window_fn(data[lo:hi])
  lo += 1
  hi += 1
end

fails with

ERROR: MethodError: no method matching 
    _gesp(::LayoutPointers.StridedPointer{Float64, 1, …}, 
              ::Static.StaticInt{1}, ::UnitRange{Int64}, ::Static.StaticInt{1})

@bkamins
Copy link
Author

bkamins commented Feb 12, 2023

Thank you for working on this. I have checked the code and have the following comments:

Sentinel value in padding

You use Nothing as sentinel. It would be better to define an internal:

struct NoPadding end

structure and set NoPadding() as a default (and do not export NoPadding).

The reason is that Nothing is a valid value in Base Julia. In a case when someone wanted to pad with it it will not work as expected (I agree it is a corner case, but I usually prefer a clean design):

julia> rolling(sum, [1, 2, 3], 2, padding=Missing)
3-element Vector{Union{Int64, DataType}}:
  Missing
 3
 5

julia> rolling(sum, [1, 2, 3], 2, padding=Nothing) # unexpected
2-element Vector{Int64}:
 3
 5

In some cases rolling errors unexpectedly

Here is an example:

julia> rollmax([1, 2, 3], 2)
2-element Vector{Int64}:
 2
 3

julia> rolling(max, [1, 2, 3], 2)
ERROR: BoundsError: attempt to access 0-element Vector{Any} at index [1]

I would have expected these two operations to be equivalent (but I might be wrong).

padding is supported only for rolling

Example:

julia> rolling(mean, [1, 2, 3], 2, padding=missing)
3-element Vector{Union{Missing, Float64}}:
  missing
 1.5
 2.5

julia> rollmean([1, 2, 3], 2, padding=missing)
ERROR: MethodError: no method matching rollmean(::Vector{Int64}, ::Int64; padding::Missing)

I would expect padding and padlast work in both cases

In general functions lack docstrings

Example:

help?> rolling
search: rolling RollingFunctions rollmin rollmedian rollvariation rollkurtosis rollmad_normalized

  No documentation found.

  RollingFunctions.rolling is a Function.

I think it would be very useful to add them.


If you would like my help with any of these points please comment and I will gladly do it.

@JeffreySarnoff
Copy link
Owner

JeffreySarnoff commented Feb 15, 2023

supplemental information attached (to be updated on occasion)

RollingFunctions.jl v1

notes for JeffreySarnoff, bkamins 2023-02-15

Looking Back

tapering

When I researched rolling functions over windowed data initially (2017), there were packages and products that provided padding with a given value. I saw nothing about tapering to provide values for incomplete window spans. To the best of my knowledge (without claiming anything more) RunningFunctions.jl introduced efficient, well-structured, and functionally broad tapering to many. Now, this capability appears in many packages and in multiple languages.

genericity

The implementation followed a few stylized approaches, allowing reasonably concise and compact code to cover many sorts of windowed summarization.

omissions

most often noted

  • There was no internal support for padding.
  • There was no type-specific transfer of eltype(data) to eltype(result)
  • There were no function-specific accelerations
    • incremental sum and other incrementable statistics

others of import

  • [add here]

Looking Forward

design goals

api consistancy

  • throughout

    • style
    • naming
    • utilization
  • features

    • organization
    • syntax
    • semantics

argument order

  • positional args

    • datasequence must be positional to support dispatch properly.
    • summarizer should be positional to support curried compilation.
    • windowspan should be positional to support optimized library building.
  • compatible with v0.7

exportedfunction(summarizer, datasequence, windowspan)
  • preferred over v0.7 order

Many clients will use some preferred set of windowspans with some selected set of summarization functions, each specialized to the nature of the datasequence. This suggests that the datasequence appear after the windowspan and the summarizer args. These two orderings meet that constraint. Which is preferable in use with respect to currying and maybe some use of closures?

exportedfunction(summarizer, windowspan, datasequence)
exportedfunction(windowspan, summarizer, datasequence)
  • keyword args
    The padding specifier could be a kwarg. As a positional arg with a default value, pads that not given and pads that are a given value and pads that are a vector all support distinct direct dispatch. The padding qualifier padlast::Bool=false seems best as a kwarg. Does that mean making padding=NoPadding a kwarg is better for consistancy? Or is the built-in dispatch as a presumably final positioned arg more beneficial?

performance

  • current best practice

    • and better practices for situational advantage
  • SIMD almost everywhere

    • If reasonable to do
      • @[t]turbo with function application over an indexed span
        • using views over the indexing
    • Otherwise
      • Take advantage of SIMD.jl using an approach that is
        • robustly applicable
        • smooth / clear / natural
  • Multithreading wherever helpful

Technical Notes

taper

  • current approach

  • enhancements

pad

  • with a specified value

  • with an AbstractVector

  • with an NTuple?

pad then taper

  • with a specified value

  • with an AbstractVector

  • with an NTuple?

consistant relationships
- the unit is _tally of successive indices_

n = length(data[column])

s = window_span
u = window_prespan == s - 1

p = count of indices to pad (0 or initial indices in 1..u)
t = count of indices to taper (0 or u-p indices adjacent to the initial p)

u == p + t   ∧   p, t ∈ [0..u]

@JeffreySarnoff
Copy link
Owner

re: padding is supported only for rolling
This was intentional (not the lack of support) -- the prior versions of roll<op> (v0.7) are to be improved usiing implementation logic from IncrementalAccumulators. These should perform much better.

They all are of a consistent, flexible design and share a nice api (thank you).

@JeffreySarnoff
Copy link
Owner

JeffreySarnoff commented Feb 16, 2023

The new, more capable api is call signature compatible with release v0.7.x -- that is, call signatures used with high level exports of v0.7,x (rolling and running) yield very nearly the same results when used with the new api as it is specified at present.

We could support greater generality with fewer lines of code by changing from distinguishing functions of arity1, 2, 3, and 4 by dispatch over 1, 2, 3, or 4 vectors to a Tuple form and capturing the number of args (as given by N, the number of vectors):

op( windowfunc:::Function, windowspan:::Integer, 
      data::Tuple{Vararg{<:AbstractVector,N}})  where {N}

We would code unary window functions explicitly: (a) providing well-grounded reference implementations code (b) providing evaluative optimizations and applicative speed-ups.

op( windowfunc:::Function, windowspan:::Integer, data::AbstractVector[T})  where {T}

And we could do the same with window functions of 2 args, as e.g. cov, cor are widely used
so some code optimization makes sense. While we may choose code 2 arg functions explicitly, that decision is not complexity driven. Using a generalized , shared signature implementation for 2+ arg functions may make better sense. Either way, we process the generalized input using looping with reductive computation in concert with gobbling observations to supply the args simultaneous-ish-ly.

Gaining performance (2x-3x in throughput generally, 5x-10x and in reappearing client situations) is available (sometimes not straightforward, sometimes requires additional dev work with third party packages). Work applying e.g. LoopVectorization, SIMD, ..? ?.. to our task is necessary.

@JeffreySarnoff
Copy link
Owner

@bkamins The previous note explains my remaining high-level API decision for v1.

(+) I prefer to support rolling/running a multi-arity functions that way.
(-) Clients now rolling cov or cor would need to wrap data as 2-tuples.

 - (old) `cor(window_span, datavec1, datavec2)
 - (new) `cor(window_span, (datavec1, datavec2))

What do see?

@bkamins
Copy link
Author

bkamins commented Feb 16, 2023

I am not fully clear what the benefit of cor(window_span, (datavec1, datavec2)) over cor(window_span, datavec1, datavec2) is.

Two considerations:

  1. With Tuple notation it seems not consistent if you should pass a single vector unwrapped or wrapped in a tuple like this (vector,).
  2. Also you do not mention how you would want to pass weight vectors, which I think also should be thought of.
  3. Finally, I think we should try to ensure that we can deprecate old syntax (but keep supporting it with depwarn + add a new syntax).

For example, looking at the signature of rolling I imagine that the way you could define new signature could be:

rolling(window_fn<:Function, data::AbstractVector...;
        window_span::Int, # kwarg, required
        weights::Union{AbstractVector, Nothing}=nothing, # kwarg, optional
        padding=NoPadding(): # NoPadding unexported sentinel type instance,
        padlast::Bool=false)

also with this approach the pre-1.0 release signatures could be kept as deprecated

@JeffreySarnoff
Copy link
Owner

Tuple wrapping the two data vectors provided for two-arg window_fns is less compelling than is tuple wrapping for 3..12 arg window_fns. Clients working with timestamped and time-aggregated (the reading at 5min intervals or the trimmed mean taken through 5min intervals) financial market data often obtain many field values for every stamped or marked time.

These 6 data items provide market information about the current trading day as it concludes: PriorClose, Open, High, Low, Close, Volume. Estimators, Indicators, and Accuracy/Loss Measures are developed through their Arithmetic and Algebraic interconnections. The more compelling use provides clients with a semantically consistent, syntactically uniform, generalized, and robust mechanism for working with the whole of it through any parts partitioning deemed worthwhile.

@bkamins
Copy link
Author

bkamins commented Feb 16, 2023

Tuple wrapping the two data vectors provided for two-arg window_fns is less compelling than is tuple wrapping for 3..12 arg window_fns.

Can you please explain on an example what you mean exactly?

If we discuss convenience then I believe that not requiring to wrap values in a tuple is more convenient.
The reason is that if you have a tuple you can easily splat it. However, if you do not have a tuple and it is required then you would need to create it.

@JeffreySarnoff
Copy link
Owner

Why imo window_span is better as an early positional arg than as a kwarg:
window_fn and window_span in
rolling(window_fn<:Function, window_span::Int, args...; kwargs...)
are a natural subspecification pairing (roll or run some fn over some span for various data) it is convenient to simplify currying over the two when reusing an approach that has been developed for those specs (e.g. a 200-day exponential moving average).

There may be more than one data vector and there may be one weighting for each data vector [multiple weight vecs is a feature that seems would simplify some windowed summarizations]. Only one sort can be a Vararg. The data vecs are likely to change more often than the weight vecs. However weight vecs are not always used, while data vecs are. Having the weight vec[s] available as a positional arg simplifies dispatch somewhat -- so perhaps

rolling(window_fn<:Function, window_span::Int, data::AbstractVector...;
padding=NoPadding(): # NoPadding unexported sentinel type instance,
weights::Union{AbstractVector, NTuple{N,AbstractVector}, Nothing}=nothing, # kwarg, optional
)

@bkamins
Copy link
Author

bkamins commented Mar 2, 2023

OK - I see your point. Let me ask on Slack on #data what users would think. Thank you!

@bkamins
Copy link
Author

bkamins commented Mar 2, 2023

Ah - and also we discussed that the window_span could be a vector (different window span for different observations).

@nilshg
Copy link

nilshg commented Mar 2, 2023

Just to bring over what I said on Slack:

I voted for positional arg originally, as I thought footguns of the type optimize(f, lower, upper, initial) where one easily can confuse the three last positional args don't apply given that window_span is an integer. I didn't realize that both window_span and data can be vectors now, somewhat increasing the likelihood of footguns.

I then wondered whether there are any constraints about the arguments which could be used to error early and prevent users from silently getting wrong results, e.g. if we require length(data) > length(window_span) that could be used to prevent users getting it wrong.

My assumption is that the vast majority of use cases is still the simple rolling(f, 3, data) so it would be nice to keep the more succinct positional arg version.

@jeremiedb
Copy link

Would the integration of the equivalent of R's roll's min_obs be desirable?
I for example have use cases where I want to roll over a length of given width, but still prefer to have non missing returned if less data than the width is present:

> roll_sum(1:3, width = 2)
[1] NA  3  5
> roll_sum(1:3, width = 2, min_obs = 1)
[1] 1 3 5
> roll_sum(1:3, width = 4, min_obs = 2)
[1] NA  3  6

Note that in the last case, although it may seem wrong to specify a width longer than the actual data, such situation may legitimately happens in cases where such rolling function is passed to a grouped DataFrame whose length isn't guaranteed to be >= window_span.

@JeffreySarnoff
Copy link
Owner

JeffreySarnoff commented Mar 2, 2023

To roll a function F over a data vector of length N using a window_span of W where W > N is the same as applying F to the (entire) data vector. That is reasonable iff N is large enough so F(data) is numeric (e.g. N > k for some k that depends on the nature of F .. which of F([]), F([x]), F([x1,x2]) is legit). The responsibility would be on the caller.


Note that there are analytic issues with doing this; consider situations where the result is compared or combined with other results that have been applied to data of length N >= W. This "feature" should not be used in such situations. I think it may be overkill to provide a keyword arg to permit this -- so it would be all on the user. For less experienced users, that is problematic. What do you think?

@JeffreySarnoff
Copy link
Owner

from @nilshg : Maybe my point wasn't clear - length(window_span) < length(data) to me means the vector of window spans is shorter than the data, i.e. it wouldn't make sense to supply 10 window lenghts if I only have 5 observations.

Functions will ignore any extra spans (spans in a vector of window spans beyond where cumsum(spans) exceeds length(data). If the last useful span extends beyond length(data), either the prior span could be extended or the that last span could be shrunk or that last span could be trimmed. Preferences?

@jeremiedb
Copy link

I'm not strongly opiniated on this, just thought that min_obs could be a nice to have flexibility. Since it's meant to be an optional kwarg that defaults to width, I wouldn't expect this to be exposed so much to foot shooting as the user needs to be intentional in its usage of this option.

@JeffreySarnoff
Copy link
Owner

There was no post giving an example for wanting to process distinct window_spans while traversing a vector (e.g. the first 12 obs, then the next 18 obs ..). My guess is that the request was to process a sequence of window_spans, one-at-a-time, without explicitly coding the loop.

@bkamins
Copy link
Author

bkamins commented Mar 8, 2023

If it is not clear how winow_spans as a vector should be processed now I guess we can skip it, but I would propose to have a design that ensures it can be added (in general: whatever is available in R or pandas sooner or later will be asked for from my DataFrames.jl experience 😄).

@JeffreySarnoff
Copy link
Owner

One way to allow support for that: there would be a kwarg windows_spans or some such that would override the positional arg if it were set to a vector or tuple.

@JeffreySarnoff
Copy link
Owner

Otherwise, it becomes difficult to know the data vectors from the span vector (without always making the span a kwarg -- which we discussed above)

@bkamins
Copy link
Author

bkamins commented Mar 8, 2023

Otherwise, it becomes difficult to know the data vectors from the span vector (without always making the span a kwarg -- which we discussed above)

This is the reason I preferred kwarg over positional arg as I believed it is more flexible for future development.

@JeffreySarnoff
Copy link
Owner

I will allow an Int and allow NTuple{N,Int} for window_span without implementing the NTuple version yet.

@BeitianMa
Copy link

BeitianMa commented Mar 11, 2023

In the future, will the keyword of padding be applied to rolling functions in general? As I do research in asset pricing, I often run rolling regressions of stock returns on certain other variables, and it would be handy if the following procedure worked

beta = rolling(get_beta, df.return, df.char, rolling_window, padding=missing)

But despite now needing to manually add padding values, your package has already given me a 20x+ speedup over Python's rolling.apply() solution, so thank you very much for your work!

@JeffreySarnoff
Copy link
Owner

@BeitianMa yes and it is working in the dev v1 Pkg.add(url="https://github.com/JeffreySarnoff/RollingFunctions.jl",rev="v1")
maybe like this beta = rolling(get_beta, rolling_window, df.return, df.char; padding=missing) JUST FOR TESTING
I assume your get_beta works this way get_beta(df.return[..], df.char[..])

@JeffreySarnoff
Copy link
Owner

JeffreySarnoff commented Apr 9, 2023

@bkamins I have settled on an implementation pattern

rolling(func, span, datavec1; padding<=nopadding>, padlast<=false>)
and, where weights<:StatsBase.AbstractWeights
rolling(func, span, datavec1, weights1; padding<=nopadding>, padlast<=false>)
are passing their tests, however more comprehensive tests are indicated

@JeffreySarnoff
Copy link
Owner

JeffreySarnoff commented Jun 19, 2023

On development version 0.9.75 v1 adjacent
Test coverage is decent and all tests pass.

the image of v1
image

Next is to comb the docs into relevance. meanwhile:

# rolling advances one step at a time

rolling(func, window_span, data_vec [, data_vec2[, data_vec3]] 
           [, StatsBase.weights [,weights2[, weights3]] ];
           padding=nopadding, atend=false)

rolling(func, window_span, data_mat
            [, StatsBase.weights [...]]; 
            padding=nopadding, atend=false)
#          func is applied to each [[individually] weighted] column of the data_mat

# tiling gulps span elements and then advanced over those same elements,
#    to gulp the successor span elements ..

tiling(func, window_span_is_tile_span, data_vec [, data_vec2[, data_vec3]] 
           [, StatsBase.weights [,weights2[, weights3]] ];
           padding=nopadding, atend=false)

tiling(func, window_span_is_tile_span, data_mat 
            [, StatsBase.weights [...]]; 
           padding=nopadding, atend=false)
#         func is applied to each [[individually] weighted] column of the data_mat

# running tapers the function at the start or at the end of the data

running(func, window_span, data_vec [, data_vec2[, data_vec3]] 
              [, StatsBase.weights [,weights2[, weights3]] ];
             atend=false)

running(func, window_span, data_mat \
             [, StatsBase.weights [...]];
             atend = false)
#           func is applied to each [[individually] weighted] column of the data_mat

[roll|tile|run]_
where _ is {min, max, extrema, mean, sum, var, std, cov, cor}

@JeffreySarnoff
Copy link
Owner

Early access to RollingFunctions.jl v1 is given as WindowedFunctions.jl.
This version does more, has padding, and provides tiling. (see the docs).

add the package
Pkg.add(url="https://github.com/JeffreySarnoff/WindowedFunctions.jl")

This version is radically redesigned. The new approach offers the most requested capabilities,
provides much flexibility, and performs well. All of the widely used call signatures will remain
supported for many months following the official release of v1, identified as deprecated calls
with detailed information about how to revise them.

It would be a great help if you would take this prerelease for a walk.
Any issues or suggestions should be posted here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants