Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISC: Remove DTA/TDA/PA freq #24566

Open
jbrockmendel opened this issue Jan 2, 2019 · 15 comments
Open

DISC: Remove DTA/TDA/PA freq #24566

jbrockmendel opened this issue Jan 2, 2019 · 15 comments
Labels
Deprecate Functionality to remove in pandas Frequency DateOffsets Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@jbrockmendel
Copy link
Member

For PA, freq is redundant with dtype. For DTA/TDA, it means something very different from what it means for PA. The only thing it is really used for is _time_shift (and along with it, add/sub of integers, which is deprecated in DTI/TDI anyway)

Without freq going into the constructors, DTA/TDA/PA would all just need dtype, so some simplification/unification could be done.

frequency validation in __init__ could be saved for DTI/TDI and omitted from DTA/TDA.

If we ever come to our senses and allow 2D EAs (so we can simplify a ton of internals), freq stops making sense for 2D, so would be unused in those cases anyway.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 2, 2019

I'd probably support this, but would need to think things through a bit more :) The different meaning of freq between period and timedelta / datetime is quite confusing.

Additionally

  1. There's a (perhaps surprising) performance penalty to specifying freq when creating the new array. It's hard to say in the abstract whether downstream operations would benefit from it.
  2. It's hard to know when operations should invalidate / re-infer the freq (Bad freq invalidation in DatetimeIndex.where #24555)

Thinking more though, those apply equally to the index classes, and I suppose that freq would not be deprecated there (nor would we want to?).

@jorisvandenbossche
Copy link
Member

The idea would be that freq is then a cached attribute on DatetimeIndex/TimedeltaIndex, just like it has other cached attributes (is_unique, ..) ?

Some use cases of freq is in timeseries are resampling operations and plotting. For both of these, it would be useful to cache the freq to avoid recomputing it (but also a way to never have to compute it, if a regular DatetimeIndex is created)

@TomAugspurger
Copy link
Contributor

In theory, I suppose it could be cached on DatetimeArray/TimedeltaArray as well? I believe that __setitem__ is the only method that mutates inplace, and we already invalidate the freq there.

@mroeschke mroeschke added Frequency DateOffsets Needs Discussion Requires discussion from core team before further action labels Jan 13, 2019
@jbrockmendel
Copy link
Member Author

Thinking about this again following 1.0 deprecation policy discussion. I'm considering this in conjunction with removing Timestamp.freq (#15146). The difficult part of the implementation for both DTA and Timestamp is that without self.freq, what do we do with the is_month_start etc properties that depend on self.freq.

One option would be to make them into methods to which you can pass a freq. But it isnt obvious what warning to issue for that deprecation. We want to tell users "do X now and it will be compatible with this version and future versions" but AFAICT there is no way to do that.

@mroeschke
Copy link
Member

I did something vaguely similar with weekday_name. We deprecated the weekday_name attribute and implemented day_name as a function that takes a locale at the same time.

Similarly we can deprecate freq, is_month_start, is_*_[start|end], and introduce a is_start(freq) method.

@jorisvandenbossche
Copy link
Member

@jbrockmendel your proposal here is to remove it from Datetime/TimedeltaArray, but keep it for the Index classes, right?

So dealing with freq, caching it, etc, would be dealt with on the Index-level ?

@jbrockmendel
Copy link
Member Author

So dealing with freq, caching it, etc, would be dealt with on the Index-level ?

Correct.

@jorisvandenbossche
Copy link
Member

And what do you see as the advantage of dealing with this on the Index level instead of the array level?

The places where freq gets used (eg resample, plotting?) also support datetimes in columns, not only in the index. For those, it would be beneficial if it was handled on the array-level?

@jbrockmendel
Copy link
Member Author

And what do you see as the advantage of dealing with this on the Index level instead of the array level?

__setitem__ on a DTA/TDA invalidates the freq, but i haven't found a good way to propagate that to views. As a result, we can into situations where the freq attr is inaccurate.

@jorisvandenbossche
Copy link
Member

setitem on a DTA/TDA invalidates the freq, but i haven't found a good way to propagate that to views.

Can you describe that issue in a bit more detail? (or give an example) How is setitem related to views?

@jbrockmendel
Copy link
Member Author

Can you describe that issue in a bit more detail? (or give an example) How is setitem related to views?

dti = pd.date_range("2016-01-01", periods=4)
dta = dti._data

view = dta[:]
dta[0] = pd.NaT

assert dta.freq is None
assert view[0] is pd.NaT
assert view.freq is None   # <-- fails

@jorisvandenbossche
Copy link
Member

Thanks for that example! That clarifies ;)

That's an interesting problem. So Index circumvents this by being "immutable". I am wondering if that doesn't point to a general issue with EAs and views.

Do we have other EAs with attributes / metadata?
IntervalArray has the closed attribute, but that is something that doesn't depend on the actual values (meaning, you can't "invalidate" closed by setting a value in the IntervalArray).
In GeoPandas, we just added a crs attribute (holding metadata information about the coordinate reference system), but there, when setting values in place (__setitem__), we assume the user is responsible to set a value that matches the CRS (so setting a value also shouldn't invalidate the crs attribute). Although there might be some esoteric cases where you can run into problems with this.

We have been thinking about having a faster hasna metadata for the masked array EAs. One idea is to have this check the mask being None, if we can make the mask optional. Another idea would be to actually "cache" this and invalidate when the array gets mutated. But, so that would run into the same problem as the freq of DTA.

@jorisvandenbossche
Copy link
Member

To put it another way: 1) would it technically be possible to implement something similar to "cache_readonly" for EAs (that can handle those problems) and 2) if possible, would it be useful?

@jbrockmendel
Copy link
Member Author

To put it another way: 1) would it technically be possible to implement something similar to "cache_readonly" for EAs (that can handle those problems) and 2) if possible, would it be useful?

So something that, when cleared/invalidated, would also clear the cache on any other EAs that are a view on the same data?

@mroeschke mroeschke added the Deprecate Functionality to remove in pandas label Jun 28, 2020
@jbrockmendel jbrockmendel added this to the 2.0 milestone Dec 21, 2021
@mroeschke mroeschke modified the milestones: 2.0, 3.0 Feb 8, 2023
@jbrockmendel
Copy link
Member Author

I've been experimenting on this, am finding that it is a lot of churn. Having second thoughts about whether this is worth it.

A DTA/TDA inside a Series/DataFrame has its freq set to none, so there isn't a real risk there. The only real risk is if a user is dealing with DTA/TDA directly, which is very rare.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Frequency DateOffsets Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants