
Implement DatetimeArray._from_sequence #24074

Merged
merged 9 commits on Dec 5, 2018

Conversation

jbrockmendel (Member)

Removes the dependence of DatetimeArray.__new__ on DatetimeIndex and de-duplicates DatetimeIndex.__new__/DatetimeArray.__new__.

The contents of DatetimeArray._from_sequence are basically just moved from DatetimeIndex.__new__. This is feasible because #23675 disentangled to_datetime from DatetimeIndex.__new__.

cc @TomAugspurger this is the last thing on my todo list for DTA/TDA. LMK if I can be helpful with the composition transition.
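A rough sketch of what moved: the parsing that DatetimeIndex.__new__ used to do inline now lives in DatetimeArray._from_sequence, so the array can be built without going through the index. Using today's public entry points to illustrate the same direction of dependency (index wraps array, not the other way around):

```python
import pandas as pd

# Build the array first from a messy sequence (strings here), then wrap it
# in an index. Pre-#24074, this conversion logic lived in DatetimeIndex.
arr = pd.array(["2018-12-03", "2018-12-04"], dtype="datetime64[ns]")
idx = pd.DatetimeIndex(arr)  # the index is now a thin wrapper over the array

print(type(arr).__name__)  # DatetimeArray
print(len(idx))            # 2
```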

@pep8speaks

Hello @jbrockmendel! Thanks for submitting the PR.

pandas/core/arrays/datetimes.py (outdated review thread)
# go through _simple_new instead
warnings.simplefilter("ignore")
result = cls.__new__(cls, verify_integrity=False, **d)
if "data" in d and not isinstance(d["data"], DatetimeIndex):
Contributor:

Could you explain the reasoning behind this change? I don't see why just DatetimeIndex would need to have the integrity verified.

Member Author:

Without this we get a couple of failing pickle-based tests. tests.frame.test_block_internals.test_pickle and tests.series.test_timeseries.test_pickle

Contributor:

Gotcha, I've been fighting those two errors too. I suspect they were working before because DatetimeIndex accepts data=None for range-based constructors, which we don't want for the arrays.

Contributor:

This is because you need to define __reduce__ on ExtensionArray. This should never be hit by a non-DTI.

Member Author:

It isn't. The issue is that this function calls DatetimeIndex.__new__ with verify_integrity=False (since it is unpickling a previously-constructed DTI, integrity has presumably already been verified, so we can skip that somewhat-costly step). The pickle-tested cases raise ValueError because they fail when we try to verify their integrity.

Member Author:

This is fixed by #24096

Contributor:

Let's fix that one first, then; this needs to be changed here.

Contributor:

The changes over in #24096 seem to be... different? I don't know how to explain it, but doesn't the fact that we're having to copy data over in #24096 seem disconnected from pickling?

Member Author:

but doesn't the fact that we're having to copy data over in #24096 seem disconnected from pickling?

Pickling turns out to be only tangentially related to the "real" problem. In the status quo, altering a datetime64tz column alters the DatetimeIndex that backs it, but doesn't set its freq to None. When that DataFrame is pickled and then unpickled, it tries to reconstruct that DatetimeIndex, but is passing arguments that should raise ValueError. ATM that gets side-stepped by passing verify_integrity=False.

So the goal of #24096 is to not corrupt the DatetimeIndex in the first place, making verify_integrity=False unnecessary.

That's still pretty roundabout. Any clearer?
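The failure mode described above can be reproduced in miniature (names and data are simplified here): an index that claims a freq whose underlying values no longer conform will fail the integrity check when reconstructed, which is exactly what unpickling a mutated DatetimeIndex runs into.

```python
import numpy as np
import pandas as pd

# Values that do NOT conform to a daily frequency, standing in for a
# DatetimeIndex whose backing data was mutated without resetting freq.
vals = np.array(["2018-01-01", "2018-01-05", "2018-01-03"],
                dtype="datetime64[ns]")
try:
    pd.DatetimeIndex(vals, freq="D")  # integrity check rejects the freq
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```

Passing verify_integrity=False skips this check, which is how the corrupted state survived pickling in the status quo.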

Contributor:

Yeah, that does help.

@TomAugspurger (Contributor)

LMK if I can be helpful with the composition transition.

Mind if I ping you on specific xfails that I've added in that PR after this, #23601, and #23990 are merged? I'm starting to whittle them down.

@codecov

codecov bot commented Dec 3, 2018

Codecov Report

Merging #24074 into master will decrease coverage by <.01%.
The diff coverage is 79.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24074      +/-   ##
==========================================
- Coverage   42.38%   42.38%   -0.01%     
==========================================
  Files         161      161              
  Lines       51701    51691      -10     
==========================================
- Hits        21914    21907       -7     
+ Misses      29787    29784       -3
Flag Coverage Δ
#single 42.38% <79.66%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/datetimes.py 49.6% <33.33%> (-2.73%) ⬇️
pandas/core/arrays/datetimes.py 65.56% <88%> (+1.76%) ⬆️
pandas/tseries/frequencies.py 70.8% <0%> (+0.72%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08395af...be745aa. Read the comment docs.

2 similar comments

@codecov

codecov bot commented Dec 3, 2018

Codecov Report

Merging #24074 into master will increase coverage by <.01%.
The diff coverage is 99.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24074      +/-   ##
==========================================
+ Coverage    92.2%    92.2%   +<.01%     
==========================================
  Files         162      162              
  Lines       51729    51717      -12     
==========================================
- Hits        47697    47686      -11     
+ Misses       4032     4031       -1
Flag Coverage Δ
#multiple 90.6% <99.18%> (-0.01%) ⬇️
#single 43.04% <77.04%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/datetimelike.py 96.41% <100%> (+0.06%) ⬆️
pandas/core/indexes/period.py 93.06% <100%> (-0.02%) ⬇️
pandas/core/arrays/timedeltas.py 87.13% <100%> (-0.15%) ⬇️
pandas/core/indexes/datetimes.py 96.32% <100%> (-0.24%) ⬇️
pandas/core/arrays/datetimes.py 98.56% <100%> (+0.31%) ⬆️
pandas/core/arrays/period.py 98.29% <92.3%> (-0.18%) ⬇️
pandas/core/indexes/base.py 96.27% <0%> (-0.06%) ⬇️
pandas/core/ops.py 94.26% <0%> (+0.13%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8ea7744...9d7cb39. Read the comment docs.

@jbrockmendel (Member Author)

Mind if I ping you on specific xfails that I've added

Sounds good.

@jreback added the Datetime, ExtensionArray, and Reshaping labels Dec 4, 2018
@jreback jreback added this to the 0.24.0 milestone Dec 4, 2018
# assume this data are epoch timestamps
if data.dtype != _INT64_DTYPE:
data = data.astype(np.int64, copy=False)
subarr = data.view(_NS_DTYPE)
Contributor:

Might be able to pull subarr = data.view(_NS_DTYPE) out of the if/else and just do it always (or maybe just do it in _simple_new), but this is for later.

Member Author:

That's what I tried originally, but it breaks the recently-implemented tests.arrays.test_datetimelike.test_from_array_keeps_base (#23956)

Contributor:

Hrm, sorry. FWIW, that was a hackish thing that I introduced to avoid segfaults. If the reduction code was doing the right thing, we may be able to remove that test / requirement. But I'm not sure what the right thing is.

Member Author:

For the purposes of this PR, it sounds like the options are either to remove that test and unindent this line, or to keep this line as it is. It sounds like the first option would cause problems in your DTA branch. I don't have a strong preference here.
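For context on the thread above, a minimal sketch of both NumPy behaviors in play: viewing int64 epoch nanoseconds as datetime64[ns] relabels the same buffer without copying, and the resulting view records the source array via its .base attribute, which is (roughly) what test_from_array_keeps_base guards.

```python
import numpy as np

# 1_543_795_200 seconds after the epoch is 2018-12-03 00:00 UTC; viewing the
# int64 nanosecond values as M8[ns] reinterprets the same memory, no copy.
ints = np.array([1_543_795_200_000_000_000], dtype=np.int64)
stamps = ints.view("M8[ns]")
print(stamps[0])            # 2018-12-03T00:00:00.000000000
print(stamps.base is ints)  # True: the view keeps a reference to its source
```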

if freq is None and hasattr(values, "freq"):
# i.e. DatetimeArray, DatetimeIndex
freq = values.freq
@classmethod
Contributor:

is there a reason you don't want to add verify_integrity here (as maybe _verify_integrity=True)?

Member Author:

We've deprecated the kwarg in the DatetimeIndex constructor in order to get rid of it. In cases where verify_integrity is not needed, a different constructor (e.g. _simple_new) should be used.

pandas/core/arrays/datetimes.py (outdated review thread)

@TomAugspurger (Contributor)

On the function signature, it would be nice if it matched the interface, but that's not strictly necessary. Apparently, https://en.wikipedia.org/wiki/Liskov_substitution_principle is the "rule" here, and I don't think adding optional arguments like freq=None, which are mere optimizations, breaks that rule. Things like dayfirst probably do violate it though :/

Fundamentally, I think this is because we're overloading _from_sequence to handle things that ExtensionArray._from_sequence doesn't expect / require. According to the interface, the type of scalars would be, roughly, Sequence[Timestamp]. We're also allowing things like Sequence[strings-that-can-maybe-be-parsed-as-timestamps].

With PeriodArray, we got around this with a top-level period_array method that handles all the mess that people can throw at us, reserving _from_sequence for Sequence[Period], and __init__ for simply setting the attributes.

What do you think about moving this PR's current _from_sequence to a different name (not sure what to call it; maybe a top-level datetime_array to mirror period_array)?

I see that TimedeltaArray._from_sequence also has some extra args (freq, unit), and the positional argument's name is different.
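The PeriodArray split described above can be seen through today's public entry point (pd.array dispatches messy inputs such as strings to the period parser, playing the role period_array does internally):

```python
import pandas as pd

# Strings are "mess people can throw at us": the public constructor parses
# them, leaving _from_sequence free to assume a sequence of Period scalars.
parr = pd.array(["2018-01", "2018-02"], dtype="period[M]")
print(type(parr).__name__)  # PeriodArray
```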

@jbrockmendel (Member Author)

and the positional argument's name is different

definitely +1 on having these match

What do you think about moving this PR's current _from_sequence to a different name (not sure what to call it; maybe a top-level datetime_array to mirror period_array)?

I'm on board with separating most of the method out into a sequence_to_dt64ns mirroring the sequence_to_td64ns we have in timedeltas.

I still think having period_array instead of having the PeriodArray constructor be user-friendly is silly. I've made my peace with not getting my way on this one, but draw the line at implementing it myself. It will be easy for e.g. Joris to "fix" these in a follow-up.

(plus there's also to_datetime. -1 on proliferation when DatetimeArray.__init__ is an obvious fit and DatetimeArray._from_sequence also exists)

sequence_to_dt64ns notwithstanding, I think the main outstanding concern for this PR is the pickle/verify_integrity thing, which should be handled by either a) @jreback being OK with the change to _new_DatetimeIndex implemented here or b) #24096 getting sorted out.

@TomAugspurger (Contributor)

That's all totally fair.

I think the main outstanding concern for this PR is the pickle/verify_integrity thing, which should be handled by either a) @jreback being OK with the change to _new_DatetimeIndex implemented here or b) #24096 getting sorted out

Of those two, I'm not sure which is preferred. Some scattered observations

  1. If the only thing that failed on #24096 (BUG: fix mutation of DTI backing Series/DataFrame) was the recent test from #23956 (Ensure that DatetimeArray keeps reference to original data), then feel free to ignore / break that. I suspect that the groupby issue segfaulting there isn't properly fixed yet, but...
  2. I think that all of these pickle concerns are going to go away once we just use __init__ instead of __new__. It may be worth trying to see if we can do that before #24024 (REF: DatetimeLikeArray).

@jbrockmendel (Member Author)

I'm about to push a new commit, the relevant changes being:

  • updated TimedeltaArray._from_sequence signature to have positional arguments match ExtensionArray._from_sequence
  • separated most of DatetimeArray._from_sequence into sequence_to_dt64ns, mirroring timedelta version sequence_to_td64ns
  • moved validate_tz_from_dtype from arrays.datetimelike to arrays.datetimes, since it is only used there
  • added to validate_tz_from_dtype a check for tz-naive dtype and non-None tz, which lets us move that particular check to before the call to _simple_new
  • implemented datetimelike.validate_inferred_freq to share some more validation code between TimedeltaArray._from_sequence and DatetimeArray._from_sequence

sequence_to_dt64ns still needs a docstring

@jbrockmendel (Member Author)

if the only thing that failed on #24096 was the recent test from #23956,

That test failed here, not in #24096. It was fixed by avoiding calling data = data.view(_NS_DTYPE) in the case where data.dtype is already M8[ns]. The difference in the code amounts to one indentation. Not worth worrying about.
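The one-indentation difference can be sketched as follows; coerce_to_ns is a hypothetical name for illustration, not the actual pandas helper, and "M8[ns]" stands in for _NS_DTYPE:

```python
import numpy as np

def coerce_to_ns(data):
    # Only reinterpret when the dtype actually differs; an input that is
    # already M8[ns] is passed through untouched, so no new view is layered
    # on top of it (which would reset what .base points at).
    if data.dtype != "M8[ns]":
        if data.dtype != np.int64:
            data = data.astype(np.int64, copy=False)
        data = data.view("M8[ns]")
    return data

ints = np.array([0, 86_400_000_000_000], dtype=np.int64)  # epoch ns
out = coerce_to_ns(ints)
print(out.dtype)                  # datetime64[ns]
print(coerce_to_ns(out) is out)   # True: already-converted input is a no-op
```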

I think that all of these pickle concerns are going to go away once we just use __init__ instead of __new__. It may be worth trying to see if we can do that before #24024.

I don't see how that will work, but pickling is an area where I frequently find new and exciting ways to screw up. My preference would be to merge this with the small _new_DatetimeIndex edits and revisit it after #24024. This would make #24096 not-a-blocker and let me go through those constructors a little more thoroughly (which motivated #24100).

pandas/core/arrays/datetimes.py (outdated review thread)

@jreback (Contributor)

jreback commented Dec 5, 2018

ok, rebase just in case here

@jbrockmendel (Member Author)

Rebased. Haven't reverted the _new_DatetimeIndex change because there is one other case that still fails: there is an overflow in _generate_range that we're not handling correctly. I'll address that in a separate PR before long.


Raises
------
TypeError : if both timezones are present but do not match
TypeError : if PeriodDtype data is passed
Contributor:

Is this explicitly handled?

Member Author:

Via maybe_convert_dtype

@jreback jreback merged commit 4ae63aa into pandas-dev:master Dec 5, 2018
@jreback (Contributor)

jreback commented Dec 5, 2018

thanks @jbrockmendel merging to unblock things. but as they say in boxing: keep it clean! hahah

@jbrockmendel jbrockmendel deleted the from_sequence branch December 6, 2018 00:13
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Dec 6, 2018
commit 28c61d770f6dfca6857fd0fa6979d4119a31129e
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 12:18:19 2018 -0600

    uncomment

commit bae2e322523efc73a1344464f51611e2dc555ccb
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 12:17:09 2018 -0600

    maybe fixes

commit 6cb4db05c9d6ceba3794096f0172cae5ed5f6019
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:57:37 2018 -0600

    we back

commit d97ab57fb32cb23371169d9ed659ccfac34cfe45
Merge: a117de4 b78aa8d
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:51:51 2018 -0600

    Merge remote-tracking branch 'upstream/master' into disown-tz-only-rebased2

commit b78aa8d
Author: gfyoung <gfyoung17+GitHub@gmail.com>
Date:   Thu Dec 6 07:18:44 2018 -0500

    REF/TST: Add pytest idiom to reshape/test_tile (pandas-dev#24107)

commit 2993b8e
Author: gfyoung <gfyoung17+GitHub@gmail.com>
Date:   Thu Dec 6 07:17:55 2018 -0500

    REF/TST: Add more pytest idiom to scalar/test_nat (pandas-dev#24120)

commit b841374
Author: evangelineliu <hsiyinliu@gmail.com>
Date:   Wed Dec 5 18:21:46 2018 -0500

    BUG: Fix concat series loss of timezone (pandas-dev#24027)

commit 4ae63aa
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 14:44:50 2018 -0800

    Implement DatetimeArray._from_sequence (pandas-dev#24074)

commit 2643721
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 14:43:45 2018 -0800

    CLN: Follow-up to pandas-dev#24100 (pandas-dev#24116)

commit 8ea7744
Author: chris-b1 <cbartak@gmail.com>
Date:   Wed Dec 5 14:21:23 2018 -0600

    PERF: ascii c string functions (pandas-dev#23981)

commit cb862e4
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 12:19:46 2018 -0800

    BUG: fix mutation of DTI backing Series/DataFrame (pandas-dev#24096)

commit aead29b
Author: topper-123 <contribute@tensortable.com>
Date:   Wed Dec 5 19:06:00 2018 +0000

    API: rename MultiIndex.labels to MultiIndex.codes (pandas-dev#23752)
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Labels
Datetime, ExtensionArray, Reshaping
4 participants