Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to i8data for DatetimeIndex #24559

Closed
TomAugspurger opened this issue Jan 2, 2019 · 68 comments
Closed

Changes to i8data for DatetimeIndex #24559

TomAugspurger opened this issue Jan 2, 2019 · 68 comments
Labels
API Design Blocker Blocking issue or pull request for an upcoming release Datetime Datetime data dtype
Milestone

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 2, 2019

Master currently has an (undocumented) (maybe-) API-breaking change from 0.23.4 when passed integer values

0.23.4

In [2]: i8data = np.arange(5) * 3600 * 10**9

In [3]: pd.DatetimeIndex(i8data, tz="US/Central")
Out[3]:
DatetimeIndex(['1970-01-01 00:00:00-06:00', '1970-01-01 01:00:00-06:00',
               '1970-01-01 02:00:00-06:00', '1970-01-01 03:00:00-06:00',
               '1970-01-01 04:00:00-06:00'],
              dtype='datetime64[ns, US/Central]', freq=None)

Master

In [3]: pd.DatetimeIndex(i8data, tz="US/Central")
Out[3]:
DatetimeIndex(['1969-12-31 18:00:00-06:00', '1969-12-31 19:00:00-06:00',
               '1969-12-31 20:00:00-06:00', '1969-12-31 21:00:00-06:00',
               '1969-12-31 22:00:00-06:00'],
              dtype='datetime64[ns, US/Central]', freq=None)

Attempt to explain the behavior: In 0.23.4, passing an ndarray[i8] was equivalent to passing data.view("M8[ns]")

# 0.23.4
In [4]: pd.DatetimeIndex(i8data.view("M8[ns]"), tz="US/Central")
Out[4]:
DatetimeIndex(['1970-01-01 00:00:00-06:00', '1970-01-01 01:00:00-06:00',
               '1970-01-01 02:00:00-06:00', '1970-01-01 03:00:00-06:00',
               '1970-01-01 04:00:00-06:00'],
              dtype='datetime64[ns, US/Central]', freq=None)

On master, integer values are treated as unix timestamps, while M8[ns] values are treated as wall-times in the given timezone.

# master
In [4]: pd.DatetimeIndex(i8data.view("M8[ns]"), tz="US/Central")
Out[4]:
DatetimeIndex(['1970-01-01 00:00:00-06:00', '1970-01-01 01:00:00-06:00',
               '1970-01-01 02:00:00-06:00', '1970-01-01 03:00:00-06:00',
               '1970-01-01 04:00:00-06:00'],
              dtype='datetime64[ns, US/Central]', freq=None)

Reason for the change

There are four cases of interest:

In [4]: arr = np.arange(5) * 24 * 3600 * 10**9
In [5]: tz = 'US/Pacific'

In [6]: a = pd.DatetimeIndex(arr, tz=tz)
In [7]: b = pd.DatetimeIndex(arr.view('M8[ns]'), tz=tz)
In [8]: c = pd.DatetimeIndex._simple_new(arr, tz=tz)
In [9]: d = pd.DatetimeIndex._simple_new(arr.view('M8[ns]'), tz=tz)

In [10]: a
Out[10]: 
DatetimeIndex(['1970-01-01 00:00:00-08:00', '1970-01-02 00:00:00-08:00',
               '1970-01-03 00:00:00-08:00', '1970-01-04 00:00:00-08:00',
               '1970-01-05 00:00:00-08:00'],
              dtype='datetime64[ns, US/Pacific]', freq=None)

In [11]: b
Out[11]: 
DatetimeIndex(['1970-01-01 00:00:00-08:00', '1970-01-02 00:00:00-08:00',
               '1970-01-03 00:00:00-08:00', '1970-01-04 00:00:00-08:00',
               '1970-01-05 00:00:00-08:00'],
              dtype='datetime64[ns, US/Pacific]', freq=None)

In [12]: c
Out[12]: 
DatetimeIndex(['1969-12-31', '1970-01-01', '1970-01-02', '1970-01-03',
               '1970-01-04'],
              dtype='datetime64[ns, US/Pacific]', freq=None)

In [13]: d
Out[13]: 
DatetimeIndex(['1969-12-31', '1970-01-01', '1970-01-02', '1970-01-03',
               '1970-01-04'],
              dtype='datetime64[ns, US/Pacific]', freq=None)

In 0.23.4 we have a.equals(b) and c.equals(d) but no way to pass data in a way that was constructor-neutral. In master we now have a match c and d. At some point in the refactoring process we changed that, but off the top of my head I don't remember when or if this was the precise motivation or just a side-benefit.

BTW _simple_new was also way too much:

        if getattr(values, 'dtype', None) is None:
            # empty, but with dtype compat
            if values is None:
                values = np.empty(0, dtype=_NS_DTYPE)
                return cls(values, name=name, freq=freq, tz=tz,
                           dtype=dtype, **kwargs)
            values = np.array(values, copy=False)

        if is_object_dtype(values):
            return cls(values, name=name, freq=freq, tz=tz,
                       dtype=dtype, **kwargs).values
        elif not is_datetime64_dtype(values):
            values = _ensure_int64(values).view(_NS_DTYPE)

Was this documented?

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html mentions that it's "represented internally as int64".

The (imprecise) type on data is "Optional datetime-like data"

I don't see anything in http://pandas.pydata.org/pandas-docs/stable/timeseries.html suggesting that integers can be passed to DatetimeIndex.

@TomAugspurger TomAugspurger added Datetime Datetime data dtype API Design labels Jan 2, 2019
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Jan 2, 2019
@TomAugspurger
Copy link
Contributor Author

@jbrockmendel could you fill in the "Reason for the change" section? IIUC, it was to simplify DatetimeIndex._simple_new?

@mroeschke
Copy link
Member

Possibly related, I worked on this topic in #21216. Rational was that DatetimeIndex(ints, tz=tz) should behave similarly as Timestamp(int, tz=tz), and ints cannot necessarily represent wall times.

I have an open issue #20257 to document passing integers into Timestamp which, similarly to DatetimeIndex, will treat integers as epoch (unix) timestamps

@jbrockmendel
Copy link
Member

@mroeschke thanks. IIRC consistency with to_datetime was also a consideration

@jorisvandenbossche
Copy link
Member

IIRC consistency with to_datetime was also a consideration

to_datetime has no tz keyword, only utc=True, but for UTC wall time or unix epochs are the same?

@jorisvandenbossche
Copy link
Member

The Timestamp / DatetimeIndex consistency is a good reason to change one of both (since they are inconsistent in 0.23.4).
And I think the unix epoch way makes sense for integer input. Although I think it can also be confusing that DatetimeIndex(int_array, ..) and DatetimeIndex(int_array.view('M8[ns]), ..) give different results.

Regarding"was this documented": it might not have been documented clearly, but it is long standing behaviour: either we think people don't use it / shouldn't use it (but then why bother changing it? Shouldn't we then rather deprecate the whole ability of passing integer data, instead of changing the behaviour?), or either people do use it, and this change will break that usage.
Of course, a break for a limited number of people might be worth the trade-off for a big win within our code base. But is this consistency between __new__ and _simple_new that important code-technical?

@jorisvandenbossche jorisvandenbossche added the Blocker Blocking issue or pull request for an upcoming release label Jan 4, 2019
@jorisvandenbossche
Copy link
Member

Any other thoughts / replies here?

@TomAugspurger
Copy link
Contributor Author

Although I think it can also be confusing that DatetimeIndex(int_array, ..) and DatetimeIndex(int_array.view('M8[ns]), ..) give different results.

I think this surprised me early on. That may have been because I was used to the old way; I'm not really sure.

Regarding"was this documented":

My intent there was "If this was documented before, then we definitely can't change it." I wasn't advocating "It wasn't documented, so we can change it.". Just "It wasn't documented, so we can maybe change it" :)

I don't really have an opinion on the technical merits of wall time vs. epochs. I don't think I know enough to vote one way or the other.

@jbrockmendel
Copy link
Member

I think the overriding internal-consistency concern is the one @mroeschke reminded us of: this should behave like Timestamp constructor.

@TomAugspurger would "fixing" this behavior be difficult? Last time I tried something similar I got test_packers failures that I couldn't figure out.

@TomAugspurger
Copy link
Contributor Author

Demo of the inconsistency on 0.23.4

In [9]: i8 = pd.Timestamp('2000', tz='CET').value

In [10]: pd.Timestamp(i8)
Out[10]: Timestamp('1999-12-31 23:00:00')

In [11]: pd.Timestamp(i8, tz="CET")
Out[11]: Timestamp('2000-01-01 00:00:00+0100', tz='CET')

In [12]: pd.DatetimeIndex(np.array([i8]))
Out[12]: DatetimeIndex(['1999-12-31 23:00:00'], dtype='datetime64[ns]', freq=None)

In [13]: pd.DatetimeIndex(np.array([i8]), tz="CET")
Out[13]: DatetimeIndex(['1999-12-31 23:00:00+01:00'], dtype='datetime64[ns, CET]', freq=None)

On master, Out[13] matches Out[11].

That consistency is certainly worth striving for.


If we wanted a graceful deprecation warning, then the DTI constructor would

  1. check for i8data & tz
  2. tell users to... what? What recourse would they have? If they want wall times (the previous behavior) then they can .view("dateime64[ns]") and pass that. If they want unix epochs (the future behavior) then they should ...?
  3. Update all of pandas to use the recommended behavior from 2 (which is TBD).

Putting that warning in place would give us an idea of how difficult this would be to change. I'll see what turns up. Since _simple_new already does the "right" (future) thing, it may not be too bad...

@jorisvandenbossche
Copy link
Member

If they want unix epochs (the future behavior) then they should ...?

to_datetime ? (if it would finally have a tz option)

The other option could also be to deprecate passing integers, if we find it too confusing what it should result in.

@mroeschke
Copy link
Member

Since tz effectively means tz_localize, in an ideal world I would only want integer data to only accept tz='UTC'/None. I think Timestamp(i8, tz=tz)'s localization to UTC then conversion to tz is a little too magic although it's the more appropriate behavior. Therefore I had thought the DatetimeIndex(i8_array, tz=tz) behavior was a bug, but since our policy wasn't set in documentation deprecation is appropriate.

For 2, if a user wants epoch timestamps (future behavior) they should use DatetimeIndex(i8_array, tz='UTC').tz_convert(tz) (or even to_datetime since it even has an origin parameter). I have a feeling we may rely on passing i8 into these constructors internally so deprecating the behavior may warrant a lot of refactoring down the line?

@jbrockmendel
Copy link
Member

#24623 is now passing. PTAL.

@jorisvandenbossche
Copy link
Member

@jbrockmendel it is not fully clear to me what #24623 exactly does related to the discussion above. Can you summarize it here?
(e.g. does it change back the behaviour of DatimeIndex? (but I thought you were opposed to that?) And I don't think it adds a deprecation warning? ..)

@jbrockmendel
Copy link
Member

it is not fully clear to me what #24623 exactly does related to the discussion above. Can you summarize it here?

There are two central changes in #24623, both of which were the status quo before #24024.

  1. make DatetimeArray.__init__ behave like DatetimeIndex.__new__ and Timestamp.__new__ by treating int64 values as unix timestamps and datetime64 as wall-times.

  2. Do frequency validation in DatetimeArray.__init__, preventing users from creating invalid instances with the public constructor.

@jorisvandenbossche
Copy link
Member

Then how is that related to this issue? This issue is about the change in behaviour of handling integer values in the DatetimeIndex constructor, not about the DatetimeArray constructor

@jbrockmendel
Copy link
Member

Then how is that related to this issue?

If it isn't closely enough related for your tastes then feel free to ignore.

@jorisvandenbossche
Copy link
Member

You wrote the PR, I think you are better placed to judge to what extent it is related to the discussion here ;)
It's just because you mentioned it here as if it is resolving this issue (but maybe I misunderstood that) that I am asking for clarification.

@jbrockmendel
Copy link
Member

I'm actually currently getting side-tracked by trying to figure out exactly what is and isn't internally-consistent. It looks like 24623 makes i8 behavior consistent between DTA/DTI/Timestamp and M8[ns] behavior consistent between DTA/DTI, but breaks the existing consistency in master between DTA/Timestamp.

I'll try to put together a grid for these...

@jorisvandenbossche
Copy link
Member

And an answer to the comment of @mroeschke

Since tz effectively means tz_localize, in an ideal world I would only want integer data to only accept tz='UTC'/None.

That sounds as a logical rule to me.

@jbrockmendel
Copy link
Member

Since tz effectively means tz_localize, in an ideal world I would only want integer data to only accept tz='UTC'/None.

I'm not sure I understand this rule. Is the idea that the constructor would only be able to create tz-aware {n \choose k}(DTA/DTI/Timestamp) with UTC and them make the user do tz_convert as another step?

@mroeschke
Copy link
Member

mroeschke commented Jan 7, 2019

In an ideal world I would prefer that since it's more explicit, but I think it would require deprecating Timestamp behavior with ints and tz.

I would be fine just propagating the current Timestamp behavior and more clearly documenting interger with tz arg behavior across the board since I think we would have to deprecate less?

@jbrockmendel
Copy link
Member

I've made things more difficult here by voicing opinions while being misinformed. Apologies all around.

Recap/Double-Checking behavior from 0.23.4 and master. In all cases passing tz/dtype corresponding to US/Pacific.

ts = pd.Timestamp.now('US/Pacific').round('s')
dti = pd.DatetimeIndex([ts])
tz = dti.tz
dtype = dti.dtype

results = [
    pd.Timestamp(np.int64(ts.value), tz=ts.tz),
    pd.Timestamp(np.datetime64(ts.value, "ns"), tz=ts.tz),
    pd.Timestamp(float(ts.value), tz=ts.tz),
]
assert all(x == ts for x in results)   # <-- holds in both 0.23.4 and in master

args = [
    dti.asi8,
    dti.asi8.view('datetime64[ns]'),
    dti.asi8.astype(np.float64),
    dti.asi8.astype(object),
    dti.asi8.view('datetime64[ns]').astype(object),
    dti.asi8.astype(np.float64).astype(object)
]

news = [pd.DatetimeIndex(x, dtype=dtype) for x in args]
simples = [pd.DatetimeIndex._simple_new(x, tz=tz)
           for x in args if x.dtype == 'i8' or x.dtype == 'M8[ns]']
inits = [pd.core.arrays.DatetimeArray(x, dtype=dtype)
         for x in args if x.dtype == 'i8' or x.dtype == 'M8[ns]']  # <-- master only

master

>>> for x in news:
...     x[0]
... 
Timestamp('2019-01-07 12:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 20:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 20:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 12:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 12:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 20:28:45-0800', tz='US/Pacific')

>>> for x in simples:
...     x[0]
... 
Timestamp('2019-01-07 12:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 12:28:45-0800', tz='US/Pacific')

>>> for x in inits:
...     x[0]
... 
Timestamp('2019-01-07 12:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 12:28:45-0800', tz='US/Pacific')

0.23.4

>>> for x in news:
...     x[0]
... 
Timestamp('2019-01-07 20:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 20:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 20:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 20:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 20:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 20:28:45-0800', tz='US/Pacific')
>>> 
>>> for x in simples:
...     x[0]
... 
Timestamp('2019-01-07 12:28:45-0800', tz='US/Pacific')
Timestamp('2019-01-07 12:28:45-0800', tz='US/Pacific')

In both 0.23.4 and master, Timestamp.__new__ treats all i8, M8[ns], and f8 as unix timestamps.

In both 0.23.4 and master, both DTA.__init__ and DTI._simple_new treat both i8 and M8[ns] as unix timestamps.

In 0.23.4 DTI.__new__ treats all the cases as wall-times.

In master, DTI.__new__ treats i8, i8+object, and M8[ns]+object as unix timestamps, and the other cases as wall times.

In master, DTA.__init__ treats both i8 and M8[ns] as unix timestamps.


I think there is a consensus that the preferred behavior is for all of these to match Timestamp, right? So the options for DTI are:

  1. revert to 0.23.4 behavior and issue warnings that this behavior is going to change.
  2. keep the half-and-half behavior, issue warnings that more is going to change, document that some has changed
  3. move all cases to match Timestamp behavior, document the change, possibly issue warnings(?)

And the options for DTA:
a) if option 3 is chosen for DTI, then DTA is fine as-is.
b) otherwise:
i) change DTA to match DTI, issue equivalent warnings and change them at the same time in the future
ii) live with them behaving differently.

The question I'd pose to the more senior folks: can we get away with option 3?

@TomAugspurger
Copy link
Contributor Author

I'm having some trouble following the samples in
#24559 (comment), but I think that your

I think there is a consensus that the preferred behavior is for all of these to match Timestamp, right?

is correct.

Can you clarify the difference between your option 1 and 3? Is the difference the handling of i8 values in DTI._simple_new?

@jbrockmendel
Copy link
Member

Can you clarify the difference between your option 1 and 3? Is the difference the handling of i8 values in DTI._simple_new?

None of the options I have in mind touch _simple_new.

In option 1, DTI.__new__ behavior is reverted to 0.23.4, where both int64 and datetime64[ns] are treated as wall-times. This is the maximally-different-from-Timestamp behavior, but the minimal-user-facing-change option.

In option 3, DTI.__new__ is behavior changed to fully match Timestamp behavior (and DTA.__init__ behavior, for the subset of inputs that it supports). This would be the clear choice if backward-compatibility and deprecation cycles weren't a consideration.

@jorisvandenbossche
Copy link
Member

And in case it might be useful, the experiment I did earlier today: https://github.com/pandas-dev/pandas/compare/master...jorisvandenbossche:i8revert?expand=1 (only changed the DatetimeIndex constructor without fixing any of the consequential failures)

@TomAugspurger
Copy link
Contributor Author

Back to the deprecation warning. What's the expected behavior of Int64Index.astype('datetime64[ns, tz]')? That errored on 0.23.4. Should it just be the same as DatetimeIndex(Index, dtype='datetime64[ns, tz]')? Should it throw a warning?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jan 10, 2019

Assuming we eventually want the behavior for DTI that's currently on master, how's this for a deprecation warning?

The meaning of integer-dtype 'data' in DatetimeIndex when a timezone is 
specified is changing. Previously, these were viewed as datetime64[ns] values,
and represented the wall time *in the specified timezone*.

In the future, these will be viewed as datetime64[ns] values
and represent the wall time *in UTC*.

To accept the future behavior, use

    pd.to_datetime(integer_data, utc=True).tz_convert(tz)

To keep the previous behavior, use

    pd.to_datetime(integer_data).tz_localize(tz)

@TomAugspurger
Copy link
Contributor Author

One more question: should tz='UTC' emit a warning? IIUC, the behavior of

pd.DatetimeIndex([1457537600000000000, 1458059600000000000], tz='UTC')

will always be the same.


Gotta take a break from this until this afternoon. WIP at https://github.com/TomAugspurger/pandas/tree/revert-i8

That's based off @jorisvandenbossche's start. Should be easy to swap in @jbrockmendel's changes to _from_sequence if we prefer that to the extra warn parameter.

@mroeschke
Copy link
Member

I think Int64Index.astype('datetime64[ns, tz]') should raise the same warning as DatetimeIndex(Index, dtype='datetime64[ns, tz]') as Index(int64s, dtype='datetime64[ns, tz]')

I would raise on UTC even thought the outcome is the same just to acclimate the user to the fact that passing any tz with ints will change behavior.

@jorisvandenbossche
Copy link
Member

Nice deprecation warning! (I would maybe only mention something regarding the reason for the change? integers seen as unix epoch which is defined to be UTC? (although its nanoseconds so not exactly unix epochs))

One more question: should tz='UTC' emit a warning?

Since it stays the same, ideally no, I would say

What's the expected behavior of Int64Index.astype('datetime64[ns, tz]')

Ideally the new behaviour IMO (although that might make the implementation more complicated?)
(if we are sure about the new behaviour of course .. )

@jorisvandenbossche
Copy link
Member

I would raise on UTC even thought the outcome is the same just to acclimate the user to the fact that passing any tz with ints will change behavior.

Hmm, yes, that might be a good argument (although I think, as a user, I would have the reflex to complain about it: "I already specify UTC!")

@mroeschke
Copy link
Member

Fair enough. Yeah from a user perspective, it may feel like a false positive warning. My opinion about warning with UTC isn't that strong so I'd be fine not emitting a warning in that case.

@TomAugspurger
Copy link
Contributor Author

FWIW, I'm ambivalent about warning in the UTC case. Either way has downsides.

@jbrockmendel
Copy link
Member

If long term we want option 5, do we need to get the deprecation for Timestamp behavior in for 0.24.0?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jan 10, 2019 via email

@jreback
Copy link
Contributor

jreback commented Jan 21, 2019

anything more on this for 0.24.0?

@TomAugspurger
Copy link
Contributor Author

I think we're good for 0.24.0.

We do need to settle on a desired long-term behavior though.

@TomAugspurger TomAugspurger modified the milestones: 0.24.0, 0.25.0 Jan 21, 2019
@jbrockmendel
Copy link
Member

How close are we to consensus on this? It looks like @mroeschke, @jreback, and I have expressed preference for options 4 or 5, with a slight preference towards 5. @jorisvandenbossche has expressed a preference for option 6.

Anyone else want to weigh in?

@jbrockmendel
Copy link
Member

@TomAugspurger @jorisvandenbossche @mroeschke @jreback I'd like to settle on a desired long-term behavior and start the appropriate deprecation warnings before 0.25.0. How far are we from consensus?

@jreback
Copy link
Contributor

jreback commented Jun 9, 2019

@jbrockmendel can u do. short summary of this and just confirm that 0.24.2
and master have the same behavior

@TomAugspurger
Copy link
Contributor Author

We discussed this on the call. IIRC someone volunteered / was volunteered to summarize, and maybe close this issue? (was I the one to volunteer?)

@TomAugspurger
Copy link
Contributor Author

IIUC, this can be closed since the changes have been made and the deprecation enforced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Blocker Blocking issue or pull request for an upcoming release Datetime Datetime data dtype
Projects
None yet
Development

No branches or pull requests

5 participants