Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EA: revert treatment of i8values #24623

Closed
wants to merge 11 commits into from
Closed

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Jan 4, 2019

One option for #24559, with some spillover into #24567.

Restores freq validation (needs test)
Retains a private fastpath

@jbrockmendel
Copy link
Member Author

No idea how we would go about fixing the failing feather tests.

@jbrockmendel
Copy link
Member Author

@TomAugspurger any idea about the feather thing. AFAICT any change that restores the DTA/Timestamp i8 equivalence will run into this problem.

@TomAugspurger
Copy link
Contributor

AFAICT any change that restores the DTA/Timestamp i8 equivalence will run into this problem.

I think you're correct. On your branch

In [32]: df = pd.DataFrame({"A": pd.date_range('2000', periods=2, tz="US/Central")})

In [33]: df
Out[33]:
                          A
0 2000-01-01 00:00:00-06:00
1 2000-01-02 00:00:00-06:00

In [34]: pa.Table.from_pandas(df).to_pandas()
Out[34]:
                          A
0 2000-01-01 06:00:00-06:00
1 2000-01-02 06:00:00-06:00

Which seems strange to me... Why would us changing back to the old way break pyarrow? I'll do some digging.


if values.dtype != _NS_DTYPE:
msg = (
if values.dtype == np.bool_:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do u only check for book here? shouldn’t this. e inside _from_sequeneve?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because up until moments ago I thought we had a test checking that pd.to_datetime(False) returned NaT. No idea where I got that idea... Just moved this check down.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, the good news is that I'm not hallucinating non-existent test cases. the relevant test was for to_datetime(False, errors="coerce"), so I'm moving this check back up to __init__. To be clear, I don't think the three checks in __init__ should be there at all, but for the moment I'm trying to minimize behavior changes and just get the tests passing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not because to_datetime has that behaviour that there should be a check for that here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not because to_datetime has that behaviour that there should be a check for that here?

It's because to_datetime has a coerce kwarg that specified fallback behavior. If we want to disallow bools (and other dtypes presumably, bools were just the only one tested in 24024) in DatetimeArray but not in sequence_to_dt64ns, the check needs to be here.

@codecov
Copy link

codecov bot commented Jan 6, 2019

Codecov Report

Merging #24623 into master will decrease coverage by 49.34%.
The diff coverage is 83.33%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24623       +/-   ##
===========================================
- Coverage   92.37%   43.03%   -49.35%     
===========================================
  Files         166      166               
  Lines       52384    52368       -16     
===========================================
- Hits        48390    22536    -25854     
- Misses       3994    29832    +25838
Flag Coverage Δ
#multiple ?
#single 43.03% <83.33%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/packers.py 14.65% <0%> (-73.57%) ⬇️
pandas/core/frame.py 35.74% <100%> (-61.19%) ⬇️
pandas/core/indexes/datetimes.py 48.45% <100%> (-47.81%) ⬇️
pandas/core/internals/blocks.py 52.71% <50%> (-41.6%) ⬇️
pandas/core/arrays/datetimes.py 66.21% <87.5%> (-31.48%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
... and 128 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dc91f4c...a46a2cc. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 6, 2019

Codecov Report

Merging #24623 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24623      +/-   ##
==========================================
+ Coverage   92.38%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52327    52293      -34     
==========================================
- Hits        48343    48312      -31     
+ Misses       3984     3981       -3
Flag Coverage Δ
#multiple 90.8% <100%> (-0.01%) ⬇️
#single 43.07% <86.2%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/io/packers.py 88.21% <100%> (ø) ⬆️
pandas/core/arrays/datetimelike.py 97.67% <100%> (ø) ⬆️
pandas/core/internals/blocks.py 94.3% <100%> (ø) ⬆️
pandas/core/frame.py 96.92% <100%> (ø) ⬆️
pandas/core/indexes/datetimes.py 96.44% <100%> (+0.16%) ⬆️
pandas/core/arrays/datetimes.py 97.81% <100%> (+0.09%) ⬆️
pandas/core/arrays/timedeltas.py 87.86% <0%> (-0.24%) ⬇️
pandas/core/indexes/timedeltas.py 90.25% <0%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46a31c9...20b91dd. Read the comment docs.

@jreback jreback added Datetime Datetime data dtype Internals Related to non-user accessible pandas implementation labels Jan 6, 2019
@jreback
Copy link
Contributor

jreback commented Jan 6, 2019

@jbrockmendel this needs additional tests?

@jbrockmendel
Copy link
Member Author

this needs additional tests?

Yes. Waiting to get weigh-in from @TomAugspurger @jorisvandenbossche @mroeschke.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am personally -1 on adding yet another constructor that can parse datetime-like values. We already have to_datetime (and DatetimeIndex and to some extent pd.array and Series) that can do that. Users that want to parse such values, can use those.

I would personally keep the discussion around the handling of integer values separated from that.

Adding a few comments asking for clarifications on some changes.


if values.dtype != _NS_DTYPE:
msg = (
if values.dtype == np.bool_:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not because to_datetime has that behaviour that there should be a check for that here?

pandas/core/indexes/datetimes.py Outdated Show resolved Hide resolved
pandas/core/internals/blocks.py Outdated Show resolved Hide resolved
@@ -340,7 +340,7 @@ def test_from_array_keeps_base(self):
arr = np.array(['2000-01-01', '2000-01-02'], dtype='M8[ns]')
dta = DatetimeArray(arr)

assert dta._data is arr
assert dta._data.base is arr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because a view of arr is taken, so the assertion is untrue as-is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then why was it passing on master? Because you changed it to take a view? Is it important to do that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I changed it back to use _from_sequence, which uses sequence_to_dt64ns, which yes takes a view. And it is important in as much as in the PR implementing sequence_to_dt64ns when the original version only took a view if it was necessary, there was a request to do it unconditionally to save a line

# TODO: get tz out of here altogether
assert dtype is None
tz = timezones.tz_standardize(tz)
dtype = DatetimeTZDtype(tz=tz)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this tz normalization needs to be applied to the case of user-provided dtype=DatetimeTZDtype(...) as well. I'll try to construct a failing test case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this (roughly) passes on master, fails here

def test_foo():
    idx = pd.DatetimeIndex(['2000', '2001'], tz='US/Central')
    t1 = pd.DatetimeTZDtype(tz=idx.tz)
    t2 = pd.DatetimeTZDtype(tz=idx[0].tz)

    x = pd.arrays.DatetimeArray._simple_new(idx.view("i8"), dtype=t1)
    y = pd.arrays.DatetimeArray._simple_new(idx.view("i8"), dtype=t2)
    assert x.dtype.tz == y.dtype.tz    

I think it should pass.

@mroeschke
Copy link
Member

I've been out of the loop with DatetimeArray development, but overall I am +1 with the goals outlined in #24559 (comment)

It would be nice if there was 1 entry point to parse all datetime values, but given that we have so many constructors already; this PR's step of maintaining consistency with the others is good.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 7, 2019

I am personally -1 on adding yet another constructor that can parse datetime-like values.

@jorisvandenbossche can you clarify this objection? You would like the following to raise, as it does on master?

In [6]: pd.arrays.DatetimeArray(np.array(['2015-01-01']))
Out[6]:
<DatetimeArray>
['2015-01-01 00:00:00']
Length: 1, dtype: datetime64[ns]

@jreback
Copy link
Contributor

jreback commented Jan 8, 2019

@jbrockmendel can you rebase

@jorisvandenbossche
Copy link
Member

can you clarify this objection?

Yes, I would prefer the situation as it is on master.
I would strongly prefer to keep the main DatetimeArray constructor simple: only accepting the actual values, and nothing more. (with some exceptions of course, like unboxing containers (Series/Index) that contain such values)

Reasons for my current thinking on that:

  • It's not needed. We already have the user-facing pd.array entry-point, and to_datetime for custom string parsing.
    If we add similar ability of pd.array and to_datetime to DatetimeArray as well, users will start using it for that, and it is just yet another general constructor that we have to maintain

  • Keeping the range of accepted values limited for DatetimeArray gives clearer and more explicit code. If you read this in code, you can know that the values are already an array of datetime64 values. If have something else (scalars, strings, ..), then you can use the specific functionality to deal with those.
    (although on this aspect: we are still mixing integers and datetimes, and also giving it different behaviour, which already is a bit annoying)

It's basically what we have been discussing regularly before (in #23212, and probably some other places that I don't directly recall), and what we have done for IntegerArray.

@TomAugspurger
Copy link
Contributor

Thanks. There's a 3rd (less important) argument: performance. The slowdown in #24666 comes down to using sequence_to_td64ns.
#24666 (comment). That can be resolved, but at the cost of additional code complexity.

@jbrockmendel
Copy link
Member Author

This PR is on hold pending discussion in #24559.

There are parts of this PR that I expect to remain unchanged regardless of how that is resolved:

  • going through sequence_to_dt64ns rather than re-implementing a bunch of code
    • crippling the public constructor can be done in the block that disallows bool dtype; just disallow a bunch more stuff in that same line.
  • doing freq validation and never letting the public constructor return an invalid instance
    • yes I know that has perf implications. That is why there is a private constructor _simple_new that doesn't do validation

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 8, 2019

@jbrockmendel the performance in
#24666 (comment) is unrelated to freq validation.

going through sequence_to_dt64ns rather than re-implementing a bunch of code

that's fine by me, as long as there's a fastpath when the input is known to be an ndarray, Datetime/TimedeltaArray (or box around those) with the correct dtype.

@TomAugspurger
Copy link
Contributor

Are we holding the RC for this? I haven't followed too closely, but IIRC there's two issues

  1. DTA.__init__ has (incidentally) become more flexible as a result of this PR. That's relatively easy to fix by being bit stricter in the __init__
  2. The handling of the i8 values?

Where are we on number 2? Do we have agreement on the desired behavior, and does this PR implement that?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 9, 2019

On the desired behavior (let's keep the discussion here now that the PR is open):

#24559 (comment) has a nice summary.

@jorisvandenbossche prefers

  1. Reverting the change to DTI.__new__ for passing integers. Add a warning that it'll change in the future.
  2. Keep DTA.__init__ as it is now (same as Timestamp IIRC)

I think that's my preference as well, if I understand things correctly. @jbrockmendel do you need / want any help pushing this across the line? I'm busy with there things most of today, but can carve out some time if needed. It'd be good to do the RC today I think.

@jorisvandenbossche
Copy link
Member

As far as I understand, this PR is not related to the handling of integers (at least that is what I understood from asking @jbrockmendel about it), so I would prefer to keep the discussion about that in the issue.

@jorisvandenbossche
Copy link
Member

About this PR, if we want the changes of this PR in general, it's indeed easy to defer the discussion about the flexibility of the constructor by simply adding a check if dtype is M8 or int, and otherwise raising an error.

But I thought there was still disagreement about whether we want this freq validation or not? (because of the performance hit) Or is that resolved?

@jorisvandenbossche
Copy link
Member

As far as I understand, this PR is not related to the handling of integers

One thing that it does, though (from #24559 (comment)) is changing the behaviour of DatetimeArray. But so not the reverting of DatetimeIndex + adding a warning there. I would personally keep that for a separate PR, to keep it distinct from the unrelated discussion here (eg flexibility of constructor, freq validation)

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Jan 9, 2019

I propose to do the following:

Everyone on board?

@jorisvandenbossche
Copy link
Member

Not fully on board, that still changes the behaviour of DatetimeArray regarding M8 input, which is something that first should be decided upon (see the table I added in #24559, writing yet another comment with possible options)

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Jan 9, 2019

Not fully on board, that still changes the behaviour of DatetimeArray regarding M8 input,

I'm specifically proposing to revert changes to the behavior of DatetimeArray with regard to M8 input.

edit: "revert" as in revert changes made in an earlier version of this PR.

@jorisvandenbossche
Copy link
Member

But so without using _from_seqence then? As that does handle M8 differently

@jbrockmendel
Copy link
Member Author

But so without using _from_seqence then? As that does handle M8 differently

The suggested game-plan didn't discuss how to get from here to there. Just that the DTA behavior would revert to how it in master.

@pep8speaks
Copy link

pep8speaks commented Jan 9, 2019

Hello @jbrockmendel! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 09, 2019 at 17:54 Hours UTC

@jbrockmendel
Copy link
Member Author

ATM I'm trying to put together a branch that only adds freq validation. De-duplicating code, making _from_sequence match __init__, worrying about DTI... later.

@jorisvandenbossche
Copy link
Member

Can't you move the freq validation of _from_sequence to a separate helper function(s), and call that in __init__ so you don't need to call _from_sequence/sequence_to_dt64ns in the init? That might be much simpler?

@jorisvandenbossche
Copy link
Member

(that comment was about the current diff, not about your last comment, to be clear)

@jbrockmendel
Copy link
Member Author

Can't you move the freq validation of _from_sequence to a separate helper function(s)

That's close to what I'm doing in the other branch I mentioned. It's a little inelegant to do as a helper function since it needs to do a bit at the beginning and a bit at the end. But It's only 3 lines in the new branch, so I'm not going to sweat it for now.

@jbrockmendel
Copy link
Member Author

See #24686.

@jbrockmendel
Copy link
Member Author

Closing in favor of #24686 and #24708

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants