EA: revert treatment of i8values #24623

jbrockmendel · 2019-01-04T19:36:43Z

One option for #24559, with some spillover into #24567.

Restores freq validation (needs test)
Retains a private fastpath

Closes Accept PandasArray (with correct dtype) in DatetimeArray constructor #24615

jbrockmendel · 2019-01-04T20:54:05Z

No idea how we would go about fixing the failing feather tests.

jbrockmendel · 2019-01-05T00:45:26Z

@TomAugspurger any idea about the feather thing. AFAICT any change that restores the DTA/Timestamp i8 equivalence will run into this problem.

TomAugspurger · 2019-01-05T03:37:48Z

AFAICT any change that restores the DTA/Timestamp i8 equivalence will run into this problem.

I think you're correct. On your branch

In [32]: df = pd.DataFrame({"A": pd.date_range('2000', periods=2, tz="US/Central")})

In [33]: df
Out[33]:
                          A
0 2000-01-01 00:00:00-06:00
1 2000-01-02 00:00:00-06:00

In [34]: pa.Table.from_pandas(df).to_pandas()
Out[34]:
                          A
0 2000-01-01 06:00:00-06:00
1 2000-01-02 06:00:00-06:00

Which seems strange to me... Why would us changing back to the old way break pyarrow? I'll do some digging.

jreback · 2019-01-05T23:52:21Z

pandas/core/arrays/datetimes.py

-
-        if values.dtype != _NS_DTYPE:
-            msg = (
+        if values.dtype == np.bool_:


why do u only check for book here? shouldn’t this. e inside _from_sequeneve?

Because up until moments ago I thought we had a test checking that pd.to_datetime(False) returned NaT. No idea where I got that idea... Just moved this check down.

well, the good news is that I'm not hallucinating non-existent test cases. the relevant test was for to_datetime(False, errors="coerce"), so I'm moving this check back up to __init__. To be clear, I don't think the three checks in __init__ should be there at all, but for the moment I'm trying to minimize behavior changes and just get the tests passing.

It's not because to_datetime has that behaviour that there should be a check for that here?

It's not because to_datetime has that behaviour that there should be a check for that here?

It's because to_datetime has a coerce kwarg that specified fallback behavior. If we want to disallow bools (and other dtypes presumably, bools were just the only one tested in 24024) in DatetimeArray but not in sequence_to_dt64ns, the check needs to be here.

codecov · 2019-01-06T00:28:14Z

Codecov Report

Merging #24623 into master will decrease coverage by 49.34%.
The diff coverage is 83.33%.

@@             Coverage Diff             @@
##           master   #24623       +/-   ##
===========================================
- Coverage   92.37%   43.03%   -49.35%     
===========================================
  Files         166      166               
  Lines       52384    52368       -16     
===========================================
- Hits        48390    22536    -25854     
- Misses       3994    29832    +25838

Flag	Coverage Δ
#multiple	`?`
#single	`43.03% <83.33%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/io/packers.py	`14.65% <0%> (-73.57%)`	⬇️
pandas/core/frame.py	`35.74% <100%> (-61.19%)`	⬇️
pandas/core/indexes/datetimes.py	`48.45% <100%> (-47.81%)`	⬇️
pandas/core/internals/blocks.py	`52.71% <50%> (-41.6%)`	⬇️
pandas/core/arrays/datetimes.py	`66.21% <87.5%> (-31.48%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
... and 128 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc91f4c...a46a2cc. Read the comment docs.

codecov · 2019-01-06T00:28:15Z

Codecov Report

Merging #24623 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24623      +/-   ##
==========================================
+ Coverage   92.38%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52327    52293      -34     
==========================================
- Hits        48343    48312      -31     
+ Misses       3984     3981       -3

Flag	Coverage Δ
#multiple	`90.8% <100%> (-0.01%)`	⬇️
#single	`43.07% <86.2%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
pandas/io/packers.py	`88.21% <100%> (ø)`	⬆️
pandas/core/arrays/datetimelike.py	`97.67% <100%> (ø)`	⬆️
pandas/core/internals/blocks.py	`94.3% <100%> (ø)`	⬆️
pandas/core/frame.py	`96.92% <100%> (ø)`	⬆️
pandas/core/indexes/datetimes.py	`96.44% <100%> (+0.16%)`	⬆️
pandas/core/arrays/datetimes.py	`97.81% <100%> (+0.09%)`	⬆️
pandas/core/arrays/timedeltas.py	`87.86% <0%> (-0.24%)`	⬇️
pandas/core/indexes/timedeltas.py	`90.25% <0%> (+0.02%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46a31c9...20b91dd. Read the comment docs.

jreback · 2019-01-06T16:18:10Z

@jbrockmendel this needs additional tests?

jbrockmendel · 2019-01-06T18:43:52Z

this needs additional tests?

Yes. Waiting to get weigh-in from @TomAugspurger @jorisvandenbossche @mroeschke.

jorisvandenbossche

I am personally -1 on adding yet another constructor that can parse datetime-like values. We already have to_datetime (and DatetimeIndex and to some extent pd.array and Series) that can do that. Users that want to parse such values, can use those.

I would personally keep the discussion around the handling of integer values separated from that.

Adding a few comments asking for clarifications on some changes.

jorisvandenbossche · 2019-01-06T20:47:53Z

pandas/core/arrays/datetimes.py

-
-        if values.dtype != _NS_DTYPE:
-            msg = (
+        if values.dtype == np.bool_:


It's not because to_datetime has that behaviour that there should be a check for that here?

pandas/core/indexes/datetimes.py

pandas/core/internals/blocks.py

jorisvandenbossche · 2019-01-06T20:50:22Z

pandas/tests/arrays/test_datetimelike.py

@@ -340,7 +340,7 @@ def test_from_array_keeps_base(self):
        arr = np.array(['2000-01-01', '2000-01-02'], dtype='M8[ns]')
        dta = DatetimeArray(arr)

-        assert dta._data is arr
+        assert dta._data.base is arr


why this change?

because a view of arr is taken, so the assertion is untrue as-is.

then why was it passing on master? Because you changed it to take a view? Is it important to do that?

Because I changed it back to use _from_sequence, which uses sequence_to_dt64ns, which yes takes a view. And it is important in as much as in the PR implementing sequence_to_dt64ns when the original version only took a view if it was necessary, there was a request to do it unconditionally to save a line

TomAugspurger · 2019-01-07T13:14:07Z

pandas/core/arrays/datetimes.py

+            # TODO: get tz out of here altogether
+            assert dtype is None
+            tz = timezones.tz_standardize(tz)
+            dtype = DatetimeTZDtype(tz=tz)


I think this tz normalization needs to be applied to the case of user-provided dtype=DatetimeTZDtype(...) as well. I'll try to construct a failing test case.

OK, this (roughly) passes on master, fails here

def test_foo(): idx = pd.DatetimeIndex(['2000', '2001'], tz='US/Central') t1 = pd.DatetimeTZDtype(tz=idx.tz) t2 = pd.DatetimeTZDtype(tz=idx[0].tz) x = pd.arrays.DatetimeArray._simple_new(idx.view("i8"), dtype=t1) y = pd.arrays.DatetimeArray._simple_new(idx.view("i8"), dtype=t2) assert x.dtype.tz == y.dtype.tz

I think it should pass.

mroeschke · 2019-01-07T19:59:00Z

I've been out of the loop with DatetimeArray development, but overall I am +1 with the goals outlined in #24559 (comment)

It would be nice if there was 1 entry point to parse all datetime values, but given that we have so many constructors already; this PR's step of maintaining consistency with the others is good.

TomAugspurger · 2019-01-07T20:16:08Z

I am personally -1 on adding yet another constructor that can parse datetime-like values.

@jorisvandenbossche can you clarify this objection? You would like the following to raise, as it does on master?

In [6]: pd.arrays.DatetimeArray(np.array(['2015-01-01']))
Out[6]:
<DatetimeArray>
['2015-01-01 00:00:00']
Length: 1, dtype: datetime64[ns]

jreback · 2019-01-08T13:13:34Z

@jbrockmendel can you rebase

jorisvandenbossche · 2019-01-08T15:09:30Z

can you clarify this objection?

Yes, I would prefer the situation as it is on master.
I would strongly prefer to keep the main DatetimeArray constructor simple: only accepting the actual values, and nothing more. (with some exceptions of course, like unboxing containers (Series/Index) that contain such values)

Reasons for my current thinking on that:

It's not needed. We already have the user-facing pd.array entry-point, and to_datetime for custom string parsing.
If we add similar ability of pd.array and to_datetime to DatetimeArray as well, users will start using it for that, and it is just yet another general constructor that we have to maintain
Keeping the range of accepted values limited for DatetimeArray gives clearer and more explicit code. If you read this in code, you can know that the values are already an array of datetime64 values. If have something else (scalars, strings, ..), then you can use the specific functionality to deal with those.
(although on this aspect: we are still mixing integers and datetimes, and also giving it different behaviour, which already is a bit annoying)

It's basically what we have been discussing regularly before (in #23212, and probably some other places that I don't directly recall), and what we have done for IntegerArray.

TomAugspurger · 2019-01-08T15:22:26Z

Thanks. There's a 3rd (less important) argument: performance. The slowdown in #24666 comes down to using sequence_to_td64ns.
#24666 (comment). That can be resolved, but at the cost of additional code complexity.

jbrockmendel · 2019-01-08T16:13:33Z

This PR is on hold pending discussion in #24559.

There are parts of this PR that I expect to remain unchanged regardless of how that is resolved:

going through sequence_to_dt64ns rather than re-implementing a bunch of code
- crippling the public constructor can be done in the block that disallows bool dtype; just disallow a bunch more stuff in that same line.
doing freq validation and never letting the public constructor return an invalid instance
- yes I know that has perf implications. That is why there is a private constructor _simple_new that doesn't do validation

TomAugspurger · 2019-01-08T16:17:16Z

@jbrockmendel the performance in
#24666 (comment) is unrelated to freq validation.

going through sequence_to_dt64ns rather than re-implementing a bunch of code

that's fine by me, as long as there's a fastpath when the input is known to be an ndarray, Datetime/TimedeltaArray (or box around those) with the correct dtype.

TomAugspurger · 2019-01-09T13:13:53Z

Are we holding the RC for this? I haven't followed too closely, but IIRC there's two issues

DTA.__init__ has (incidentally) become more flexible as a result of this PR. That's relatively easy to fix by being bit stricter in the __init__
The handling of the i8 values?

Where are we on number 2? Do we have agreement on the desired behavior, and does this PR implement that?

TomAugspurger · 2019-01-09T13:18:23Z

On the desired behavior (let's keep the discussion here now that the PR is open):

#24559 (comment) has a nice summary.

@jorisvandenbossche prefers

Reverting the change to DTI.__new__ for passing integers. Add a warning that it'll change in the future.
Keep DTA.__init__ as it is now (same as Timestamp IIRC)

I think that's my preference as well, if I understand things correctly. @jbrockmendel do you need / want any help pushing this across the line? I'm busy with there things most of today, but can carve out some time if needed. It'd be good to do the RC today I think.

jorisvandenbossche · 2019-01-09T13:42:13Z

As far as I understand, this PR is not related to the handling of integers (at least that is what I understood from asking @jbrockmendel about it), so I would prefer to keep the discussion about that in the issue.

jorisvandenbossche · 2019-01-09T13:44:10Z

About this PR, if we want the changes of this PR in general, it's indeed easy to defer the discussion about the flexibility of the constructor by simply adding a check if dtype is M8 or int, and otherwise raising an error.

But I thought there was still disagreement about whether we want this freq validation or not? (because of the performance hit) Or is that resolved?

jorisvandenbossche · 2019-01-09T13:49:49Z

As far as I understand, this PR is not related to the handling of integers

One thing that it does, though (from #24559 (comment)) is changing the behaviour of DatetimeArray. But so not the reverting of DatetimeIndex + adding a warning there. I would personally keep that for a separate PR, to keep it distinct from the unrelated discussion here (eg flexibility of constructor, freq validation)

jbrockmendel · 2019-01-09T16:30:43Z

I propose to do the following:

rebase
revert parts of this PR that affect dtype-treatment, leaving only parts that:
- ensure freq is validated, so the public DTA constructor never gives an invalid instance
- retains a private fastpath _simple_new for DTA
- closes Accept PandasArray (with correct dtype) in DatetimeArray constructor #24615
address concerns about the DTI 0.23.4/master behavior separately

Everyone on board?

jorisvandenbossche · 2019-01-09T17:15:00Z

Not fully on board, that still changes the behaviour of DatetimeArray regarding M8 input, which is something that first should be decided upon (see the table I added in #24559, writing yet another comment with possible options)

jbrockmendel · 2019-01-09T17:16:05Z

Not fully on board, that still changes the behaviour of DatetimeArray regarding M8 input,

I'm specifically proposing to revert changes to the behavior of DatetimeArray with regard to M8 input.

edit: "revert" as in revert changes made in an earlier version of this PR.

jorisvandenbossche · 2019-01-09T17:33:11Z

But so without using _from_seqence then? As that does handle M8 differently

jbrockmendel · 2019-01-09T17:51:57Z

But so without using _from_seqence then? As that does handle M8 differently

The suggested game-plan didn't discuss how to get from here to there. Just that the DTA behavior would revert to how it in master.

pep8speaks · 2019-01-09T17:53:48Z

Hello @jbrockmendel! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 09, 2019 at 17:54 Hours UTC

jbrockmendel · 2019-01-09T18:25:48Z

ATM I'm trying to put together a branch that only adds freq validation. De-duplicating code, making _from_sequence match __init__, worrying about DTI... later.

jorisvandenbossche · 2019-01-09T18:26:25Z

Can't you move the freq validation of _from_sequence to a separate helper function(s), and call that in __init__ so you don't need to call _from_sequence/sequence_to_dt64ns in the init? That might be much simpler?

jorisvandenbossche · 2019-01-09T18:26:59Z

(that comment was about the current diff, not about your last comment, to be clear)

jbrockmendel · 2019-01-09T18:29:45Z

Can't you move the freq validation of _from_sequence to a separate helper function(s)

That's close to what I'm doing in the other branch I mentioned. It's a little inelegant to do as a helper function since it needs to do a bit at the beginning and a bit at the end. But It's only 3 lines in the new branch, so I'm not going to sweat it for now.

jbrockmendel · 2019-01-09T18:34:46Z

See #24686.

jbrockmendel · 2019-01-10T16:30:09Z

Closing in favor of #24686 and #24708

jbrockmendel added 2 commits January 4, 2019 11:32

revert treatment of i8values

a562769

remove debug pound

21d7228

jbrockmendel added 2 commits January 5, 2019 15:22

Merge branch 'master' of https://github.com/pandas-dev/pandas into i8

891663b

best guess at feather fix

401abd8

jreback reviewed Jan 5, 2019

View reviewed changes

troubleshoot pyarrow fail

a46a2cc

revert

27e5fb7

jbrockmendel mentioned this pull request Jan 6, 2019

Changes to i8data for DatetimeIndex #24559

Closed

jreback added Datetime Datetime data dtype Internals Related to non-user accessible pandas implementation labels Jan 6, 2019

jorisvandenbossche reviewed Jan 6, 2019

View reviewed changes

remove unnecessary casting, add tests

1715eba

TomAugspurger reviewed Jan 7, 2019

View reviewed changes

jbrockmendel added 2 commits January 9, 2019 08:02

Merge branch 'master' of https://github.com/pandas-dev/pandas into i8

8e6ca59

standardize tzinfos

c109018

revert DTA behavior

14cf136

flake8 fixup

20b91dd

jbrockmendel closed this Jan 10, 2019

jbrockmendel deleted the i8 branch January 10, 2019 16:30

EA: revert treatment of i8values #24623

EA: revert treatment of i8values #24623

Conversation

jbrockmendel commented Jan 4, 2019 • edited Loading

jbrockmendel commented Jan 4, 2019

jbrockmendel commented Jan 5, 2019

TomAugspurger commented Jan 5, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Jan 6, 2019

Codecov Report

codecov bot commented Jan 6, 2019 • edited Loading

Codecov Report

jreback commented Jan 6, 2019

jbrockmendel commented Jan 6, 2019

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mroeschke commented Jan 7, 2019

TomAugspurger commented Jan 7, 2019 • edited Loading

jreback commented Jan 8, 2019

jorisvandenbossche commented Jan 8, 2019

TomAugspurger commented Jan 8, 2019

jbrockmendel commented Jan 8, 2019

TomAugspurger commented Jan 8, 2019 • edited Loading

TomAugspurger commented Jan 9, 2019

TomAugspurger commented Jan 9, 2019 • edited Loading

jorisvandenbossche commented Jan 9, 2019

jorisvandenbossche commented Jan 9, 2019

jorisvandenbossche commented Jan 9, 2019

jbrockmendel commented Jan 9, 2019 • edited Loading

jorisvandenbossche commented Jan 9, 2019

jbrockmendel commented Jan 9, 2019 • edited Loading

jorisvandenbossche commented Jan 9, 2019

jbrockmendel commented Jan 9, 2019

pep8speaks commented Jan 9, 2019 • edited Loading

Comment last updated on January 09, 2019 at 17:54 Hours UTC

jbrockmendel commented Jan 9, 2019

jorisvandenbossche commented Jan 9, 2019

jorisvandenbossche commented Jan 9, 2019

jbrockmendel commented Jan 9, 2019

jbrockmendel commented Jan 9, 2019

jbrockmendel commented Jan 10, 2019

jbrockmendel commented Jan 4, 2019 •

edited

Loading

codecov bot commented Jan 6, 2019 •

edited

Loading

TomAugspurger commented Jan 7, 2019 •

edited

Loading

TomAugspurger commented Jan 8, 2019 •

edited

Loading

TomAugspurger commented Jan 9, 2019 •

edited

Loading

jbrockmendel commented Jan 9, 2019 •

edited

Loading

jbrockmendel commented Jan 9, 2019 •

edited

Loading

pep8speaks commented Jan 9, 2019 •

edited

Loading