
Implement DatetimeArray._from_sequence #24074

Merged
merged 9 commits on Dec 5, 2018

Conversation

jbrockmendel (Member)

Removes the dependence of DatetimeArray.__new__ on DatetimeIndex and de-duplicates DatetimeIndex.__new__/DatetimeArray.__new__.

The contents of DatetimeArray._from_sequence are basically just moved from DatetimeIndex.__new__. This is feasible because #23675 disentangled to_datetime from DatetimeIndex.__new__.

cc @TomAugspurger this is the last thing on my todo list for DTA/TDA. LMK if I can be helpful with the composition transition.
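A rough sketch of what moved: the parsing that DatetimeIndex.__new__ used to do inline now lives in DatetimeArray._from_sequence, so the array can be built without going through the index. Using today's public entry points to illustrate the same direction of dependency (index wraps array, not the other way around):

```python
import pandas as pd

# Build the array first from a messy sequence (strings here), then wrap it
# in an index. Pre-#24074, this conversion logic lived in DatetimeIndex.
arr = pd.array(["2018-12-03", "2018-12-04"], dtype="datetime64[ns]")
idx = pd.DatetimeIndex(arr)  # the index is now a thin wrapper over the array

print(type(arr).__name__)  # DatetimeArray
print(len(idx))            # 2
```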

@pep8speaks

Hello @jbrockmendel! Thanks for submitting the PR.

pandas/core/arrays/datetimes.py (outdated review thread)
# go through _simple_new instead
warnings.simplefilter("ignore")
result = cls.__new__(cls, verify_integrity=False, **d)
if "data" in d and not isinstance(d["data"], DatetimeIndex):
Contributor:

Could you explain the reasoning behind this change? I don't see why just DatetimeIndex would need to have the integrity verified.

Member Author:

Without this we get a couple of failing pickle-based tests. tests.frame.test_block_internals.test_pickle and tests.series.test_timeseries.test_pickle

Contributor:

Gotcha, I've been fighting those two errors too. I suspect they were working before because DatetimeIndex accepts data=None for range-based constructors, which we don't want for the arrays.

Contributor:

This is because you need to define __reduce__ on ExtensionArray. This should never be hit by a non-DTI.

Member Author:

It isn't. The issue is that this function calls DatetimeIndex.__new__ with verify_integrity=False (since it is unpickling a previously-constructed DTI, integrity has presumably already been verified, so we can skip that somewhat-costly step). The pickle-tested cases raise ValueError because they fail when we try to verify their integrity.

Member Author:

This is fixed by #24096

Contributor:

Let's fix that one first, then; this needs to be changed here.

Contributor:

The changes over in #24096 seem to be... different? I don't know how to explain it, but doesn't the fact that we're having to copy data over in #24096 seem disconnected from pickling?

Member Author:

but doesn't the fact that we're having to copy data over in #24096 seem disconnected from pickling?

Pickling turns out to be only tangentially related to the "real" problem. In the status quo, altering a datetime64tz column alters the DatetimeIndex that backs it, but doesn't set its freq to None. When that DataFrame is pickled and then unpickled, it tries to reconstruct that DatetimeIndex, but is passing arguments that should raise ValueError. ATM that gets side-stepped by passing verify_integrity=False.

So the goal of #24096 is to not corrupt the DatetimeIndex in the first place, making verify_integrity=False unnecessary.

That's still pretty roundabout. Any clearer?
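The failure mode described above can be reproduced in miniature (names and data are simplified here): an index that claims a freq whose underlying values no longer conform will fail the integrity check when reconstructed, which is exactly what unpickling a mutated DatetimeIndex runs into.

```python
import numpy as np
import pandas as pd

# Values that do NOT conform to a daily frequency, standing in for a
# DatetimeIndex whose backing data was mutated without resetting freq.
vals = np.array(["2018-01-01", "2018-01-05", "2018-01-03"],
                dtype="datetime64[ns]")
try:
    pd.DatetimeIndex(vals, freq="D")  # integrity check rejects the freq
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```

Passing verify_integrity=False skips this check, which is how the corrupted state survived pickling in the status quo.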

Contributor:

Yeah, that does help.

@TomAugspurger (Contributor)

LMK if I can be helpful with the composition transition.

Mind if I ping you on specific xfails that I've added in that PR after this, #23601, and #23990 are merged? I'm starting to whittle them down.

@codecov

codecov bot commented Dec 3, 2018

Codecov Report

Merging #24074 into master will decrease coverage by <.01%.
The diff coverage is 79.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24074      +/-   ##
==========================================
- Coverage   42.38%   42.38%   -0.01%     
==========================================
  Files         161      161              
  Lines       51701    51691      -10     
==========================================
- Hits        21914    21907       -7     
+ Misses      29787    29784       -3
Flag Coverage Δ
#single 42.38% <79.66%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/datetimes.py 49.6% <33.33%> (-2.73%) ⬇️
pandas/core/arrays/datetimes.py 65.56% <88%> (+1.76%) ⬆️
pandas/tseries/frequencies.py 70.8% <0%> (+0.72%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08395af...be745aa. Read the comment docs.

2 similar comments

@codecov

codecov bot commented Dec 3, 2018

Codecov Report

Merging #24074 into master will increase coverage by <.01%.
The diff coverage is 99.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24074      +/-   ##
==========================================
+ Coverage    92.2%    92.2%   +<.01%     
==========================================
  Files         162      162              
  Lines       51729    51717      -12     
==========================================
- Hits        47697    47686      -11     
+ Misses       4032     4031       -1
Flag Coverage Δ
#multiple 90.6% <99.18%> (-0.01%) ⬇️
#single 43.04% <77.04%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/datetimelike.py 96.41% <100%> (+0.06%) ⬆️
pandas/core/indexes/period.py 93.06% <100%> (-0.02%) ⬇️
pandas/core/arrays/timedeltas.py 87.13% <100%> (-0.15%) ⬇️
pandas/core/indexes/datetimes.py 96.32% <100%> (-0.24%) ⬇️
pandas/core/arrays/datetimes.py 98.56% <100%> (+0.31%) ⬆️
pandas/core/arrays/period.py 98.29% <92.3%> (-0.18%) ⬇️
pandas/core/indexes/base.py 96.27% <0%> (-0.06%) ⬇️
pandas/core/ops.py 94.26% <0%> (+0.13%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8ea7744...9d7cb39. Read the comment docs.

@jbrockmendel (Member Author)

Mind if I ping you on specific xfails that I've added

Sounds good.

@jreback added the Datetime, ExtensionArray, and Reshaping labels Dec 4, 2018
@jreback jreback added this to the 0.24.0 milestone Dec 4, 2018
# assume this data are epoch timestamps
if data.dtype != _INT64_DTYPE:
data = data.astype(np.int64, copy=False)
subarr = data.view(_NS_DTYPE)
Contributor:

Might be able to pull subarr = data.view(_NS_DTYPE) out of the if/else and just do it always (or maybe just do it in _simple_new), but this is for later.

Member Author:

That's what I tried originally, but it breaks the recently-implemented tests.arrays.test_datetimelike.test_from_array_keeps_base (#23956)

Contributor:

Hrm, sorry. FWIW, that was a hackish thing that I introduced to avoid segfaults. If the reduction code was doing the right thing, we may be able to remove that test / requirement. But I'm not sure what the right thing is.

Member Author:

For the purposes of this PR, it sounds like the options are either to remove that test and unindent this line, or to keep this line as it is. It sounds like the first option would cause problems in your DTA branch. I don't have a strong preference here.
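For context on the thread above, a minimal sketch of both NumPy behaviors in play: viewing int64 epoch nanoseconds as datetime64[ns] relabels the same buffer without copying, and the resulting view records the source array via its .base attribute, which is (roughly) what test_from_array_keeps_base guards.

```python
import numpy as np

# 1_543_795_200 seconds after the epoch is 2018-12-03 00:00 UTC; viewing the
# int64 nanosecond values as M8[ns] reinterprets the same memory, no copy.
ints = np.array([1_543_795_200_000_000_000], dtype=np.int64)
stamps = ints.view("M8[ns]")
print(stamps[0])            # 2018-12-03T00:00:00.000000000
print(stamps.base is ints)  # True: the view keeps a reference to its source
```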

if freq is None and hasattr(values, "freq"):
# i.e. DatetimeArray, DatetimeIndex
freq = values.freq
@classmethod
Contributor:

is there a reason you don't want to add verify_integrity here (as maybe _verify_integrity=True)?

Member Author:

We've deprecated the kwarg in the DatetimeIndex constructor in order to get rid of it. In cases where verify_integrity is not needed, a different constructor (e.g. _simple_new) should be used.

pandas/core/arrays/datetimes.py (outdated review thread)

@TomAugspurger (Contributor)

On the function signature, it would be nice if it matched the interface, but that's not strictly necessary. Apparently, https://en.wikipedia.org/wiki/Liskov_substitution_principle is the "rule" here, and I don't think adding optional arguments like freq=None, which are mere optimizations, breaks that rule. Things like dayfirst probably do violate it though :/

Fundamentally, I think this is because we're overloading _from_sequence to handle things that ExtensionArray._from_sequence doesn't expect / require. According to the interface, the type of scalars would be, roughly, Sequence[Timestamp]. We're also allowing things like Sequence[strings-that-can-maybe-be-parsed-as-timestamps].

With PeriodArray, we got around this with a top-level period_array method that handles all the mess that people can throw at us, reserving _from_sequence for Sequence[Period], and __init__ for simply setting the attributes.

What do you think about moving this PR's current _from_sequence to a different name (not sure what to call it; maybe a top-level datetime_array to mirror period_array)?

I see that TimedeltaArray._from_sequence also has some extra args (freq, unit), and the positional argument's name is different.
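The PeriodArray split described above can be seen through today's public entry point (pd.array dispatches messy inputs such as strings to the period parser, playing the role period_array does internally):

```python
import pandas as pd

# Strings are "mess people can throw at us": the public constructor parses
# them, leaving _from_sequence free to assume a sequence of Period scalars.
parr = pd.array(["2018-01", "2018-02"], dtype="period[M]")
print(type(parr).__name__)  # PeriodArray
```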

@jbrockmendel (Member Author)

and the positional argument's name is different

definitely +1 on having these match

What do you think about moving this PR's current _from_sequence to a different name (not sure what to call it; maybe a top-level datetime_array to mirror period_array)?

I'm on board with separating most of the method out into a sequence_to_dt64ns mirroring the sequence_to_td64ns we have in timedeltas.

I still think having period_array instead of having the PeriodArray constructor be user-friendly is silly. I've made my peace with not getting my way on this one, but draw the line at implementing it myself. It will be easy for e.g. Joris to "fix" these in a follow-up.

(plus there's also to_datetime. -1 on proliferation when DatetimeArray.__init__ is an obvious fit and DatetimeArray._from_sequence also exists)

sequence_to_dt64ns notwithstanding, I think the main outstanding concern for this PR is the pickle/verify_integrity thing, which should be handled by either a) @jreback being OK with the change to _new_DatetimeIndex implemented here or b) #24096 getting sorted out.

@TomAugspurger (Contributor)

That's all totally fair.

I think the main outstanding concern for this PR is the pickle/verify_integrity thing, which should be handled by either a) @jreback being OK with the change to _new_DatetimeIndex implemented here or b) #24096 getting sorted out

Of those two, I'm not sure which is preferred. Some scattered observations

  1. If the only thing that failed on #24096 (BUG: fix mutation of DTI backing Series/DataFrame) was the recent test from #23956 (Ensure that DatetimeArray keeps reference to original data), then feel free to ignore / break that. I suspect that the groupby issue segfaulting there isn't properly fixed yet, but...
  2. I think that all of these pickle concerns are going to go away once we just use __init__ instead of __new__. It may be worth trying to see if we can do that before #24024 (REF: DatetimeLikeArray).

@jbrockmendel (Member Author)

I'm about to push a new commit, the relevant changes being:

  • updated TimedeltaArray._from_sequence signature to have positional arguments match ExtensionArray._from_sequence
  • separated most of DatetimeArray._from_sequence into sequence_to_dt64ns, mirroring timedelta version sequence_to_td64ns
  • moved validate_tz_from_dtype from arrays.datetimelike to arrays.datetimes, since it is only used there
  • added to validate_tz_from_dtype a check for tz-naive dtype and non-None tz, which lets us move that particular check to before the call to _simple_new
  • implemented datetimelike.validate_inferred_freq to share some more validation code between TimedeltaArray._from_sequence and DatetimeArray._from_sequence

sequence_to_dt64ns still needs a docstring

@jbrockmendel (Member Author)

if the only thing that failed on #24096 was the recent test from #23956,

That test failed here, not in #24096. It was fixed by avoiding calling data = data.view(_NS_DTYPE) in the case where data.dtype is already M8[ns]. The difference in the code amounts to one indentation. Not worth worrying about.
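The one-indentation difference can be sketched as follows; coerce_to_ns is a hypothetical name for illustration, not the actual pandas helper, and "M8[ns]" stands in for _NS_DTYPE:

```python
import numpy as np

def coerce_to_ns(data):
    # Only reinterpret when the dtype actually differs; an input that is
    # already M8[ns] is passed through untouched, so no new view is layered
    # on top of it (which would reset what .base points at).
    if data.dtype != "M8[ns]":
        if data.dtype != np.int64:
            data = data.astype(np.int64, copy=False)
        data = data.view("M8[ns]")
    return data

ints = np.array([0, 86_400_000_000_000], dtype=np.int64)  # epoch ns
out = coerce_to_ns(ints)
print(out.dtype)                  # datetime64[ns]
print(coerce_to_ns(out) is out)   # True: already-converted input is a no-op
```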

I think that all of these pickle concerns are going to go away once we just use __init__ instead of __new__. It may be worth trying to see if we can do that before #24024.

I don't see how that will work, but pickling is an area where I frequently find new and exciting ways to screw up. My preference would be to merge this with the small _new_DatetimeIndex edits and revisit it after #24024. This would make #24096 not-a-blocker and let me go through those constructors a little more thoroughly (which motivated #24100).

pandas/core/arrays/datetimes.py (outdated review thread)

@jreback (Contributor)

jreback commented Dec 5, 2018

ok, rebase just in case here

@jbrockmendel (Member Author)

Rebased. Haven't reverted the _new_DatetimeIndex change because there is one other case that still fails: there is an overflow in _generate_range that we're not handling correctly. I'll address that in a separate PR before long.


Raises
------
TypeError : if both timezones are present but do not match
TypeError : if PeriodDtype data is passed
Contributor:

Is this explicitly handled?

Member Author:

Via maybe_convert_dtype

@jreback jreback merged commit 4ae63aa into pandas-dev:master Dec 5, 2018
@jreback (Contributor)

jreback commented Dec 5, 2018

thanks @jbrockmendel merging to unblock things. but as they say in boxing: keep it clean! hahah

@jbrockmendel jbrockmendel deleted the from_sequence branch December 6, 2018 00:13
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Dec 6, 2018
commit 28c61d770f6dfca6857fd0fa6979d4119a31129e
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 12:18:19 2018 -0600

    uncomment

commit bae2e322523efc73a1344464f51611e2dc555ccb
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 12:17:09 2018 -0600

    maybe fixes

commit 6cb4db05c9d6ceba3794096f0172cae5ed5f6019
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:57:37 2018 -0600

    we back

commit d97ab57fb32cb23371169d9ed659ccfac34cfe45
Merge: a117de4 b78aa8d
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:51:51 2018 -0600

    Merge remote-tracking branch 'upstream/master' into disown-tz-only-rebased2

commit b78aa8d
Author: gfyoung <gfyoung17+GitHub@gmail.com>
Date:   Thu Dec 6 07:18:44 2018 -0500

    REF/TST: Add pytest idiom to reshape/test_tile (pandas-dev#24107)

commit 2993b8e
Author: gfyoung <gfyoung17+GitHub@gmail.com>
Date:   Thu Dec 6 07:17:55 2018 -0500

    REF/TST: Add more pytest idiom to scalar/test_nat (pandas-dev#24120)

commit b841374
Author: evangelineliu <hsiyinliu@gmail.com>
Date:   Wed Dec 5 18:21:46 2018 -0500

    BUG: Fix concat series loss of timezone (pandas-dev#24027)

commit 4ae63aa
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 14:44:50 2018 -0800

    Implement DatetimeArray._from_sequence (pandas-dev#24074)

commit 2643721
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 14:43:45 2018 -0800

    CLN: Follow-up to pandas-dev#24100 (pandas-dev#24116)

commit 8ea7744
Author: chris-b1 <cbartak@gmail.com>
Date:   Wed Dec 5 14:21:23 2018 -0600

    PERF: ascii c string functions (pandas-dev#23981)

commit cb862e4
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 12:19:46 2018 -0800

    BUG: fix mutation of DTI backing Series/DataFrame (pandas-dev#24096)

commit aead29b
Author: topper-123 <contribute@tensortable.com>
Date:   Wed Dec 5 19:06:00 2018 +0000

    API: rename MultiIndex.labels to MultiIndex.codes (pandas-dev#23752)
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Labels
Datetime, ExtensionArray, Reshaping
4 participants