Fix Series[timedelta64]+DatetimeIndex[tz] bugs #18884

Merged
merged 23 commits into pandas-dev:master on Jan 2, 2018

Conversation

@jbrockmendel
Member

jbrockmendel commented Dec 20, 2017

ser + index lost the timezone
index + ser retained the timezone but returned a DatetimeIndex (instead of a Series)

closes pandas-dev#13905
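
A minimal, illustrative sketch of the behavior being fixed (the data and timezone here are made up, not taken from the linked issue):

import pandas as pd

dti = pd.date_range('2016-01-01', periods=3, tz='US/Eastern')
ser = pd.Series(pd.to_timedelta(['1 day', '2 days', '3 days']))

# Previously: ser + dti dropped the timezone, while dti + ser kept the
# timezone but returned a DatetimeIndex rather than a Series.
# With the fix, both return a tz-aware datetime64 Series.
result = ser + dti
result2 = dti + ser
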
result = op(pd.TimedeltaIndex(left), right)
return construct_result(left, result,
index=left.index, name=left.name,
dtype=result.dtype)
Member Author

should name be passed using com._maybe_match_name here? Not sure what the convention is.

Contributor

yes
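
A small sketch of that helper, assuming com._maybe_match_name behaves as it did at the time of this PR (returning the shared name, or None on a mismatch); the construct_result call above would then pass name=com._maybe_match_name(left, right) instead of name=left.name:

import pandas as pd
import pandas.core.common as com

a = pd.Series([1, 2], name='x')
b = pd.TimedeltaIndex(['1 day', '2 days'], name='x')

com._maybe_match_name(a, b)  # 'x'; returns None if the names differ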

@jreback added the Bug, Timedelta (Timedelta data type), Datetime (Datetime data dtype), and Timezones (Timezone data dtype) labels Dec 21, 2017
@@ -709,6 +709,15 @@ def wrapper(left, right, name=name, na_op=na_op):

left, right = _align_method_SERIES(left, right)

if is_timedelta64_dtype(left) and isinstance(right, pd.DatetimeIndex):
Contributor

don't special case things here, this is the point of the _TimeOp class.

Member Author

This is the appropriate place. Adding additional wrapping/unwrapping in _TimeOp obscures what's going on. Even if it weren't a third(!) level of wrapping, because of namespacing/closure issues it is essentially impossible to handle overflow checks correctly with the current setup.

Eventually this is going to have to look like:

def wrapper(...):
    if isinstance(right, ABCDataFrame):
        return NotImplemented
    elif is_timedelta64_dtype(left):
        [dispatch to TimedeltaIndex op]
    elif is_datetime64_dtype(left) or is_datetime64tz_dtype(left):
        [dispatch to DatetimeIndex op]

    [everything else]

Everything _TimeOp does is done (better) in the index classes. Anything other than dispatching to those methods is unnecessary duplication and an invitation to inconsistency.

Contributor

so let's untangle that first. This bug fix just adds more things to undo later.

Member Author

This bug fix just adds more things to undo later.

This won't need to be undone. Eventually the "and isinstance(right, pd.DatetimeIndex)" part of "if is_timedelta64_dtype(left) and isinstance(right, pd.DatetimeIndex):" will be removed and the timedelta64_dtype case will be complete.

I've given the order of edits quite a bit of thought. Transitioning towards the dispatch approach case-by-case and implementing corresponding tests along the way is the way to go.

so let's untangle that first.

Untangle which first? The mess of closures that prevents overflow checks?

I guess if you want to untangle _TimeOp independently, #18832 is a step in that direction and is orthogonal to the other outstanding PRs.

Contributor

well, until #18832 this needs to change as I have indicated.

Member Author

#18832 is orthogonal.

What change have you indicated? Not clear on what "untangle this first" refers to.

Contributor

well I don't want this here. put it where the other conversions are. that can be refactored at some point if it's worthwhile.

Member Author

I guess if I squint it kind of makes sense to avoid fixing the overflow-check bug in this PR and instead do it separately. Will change.

index=index, name=names[1])
expected = pd.Series(index + pd.Timedelta(seconds=5),
index=index, name=names[2])
# passing name arg isn't enough when names[2] is None
Contributor

pls add blank lines before comments

expected.name = names[2]
assert expected.dtype == index.dtype
res = ser + index
tm.assert_series_equal(res, expected)
Contributor

pls use result=

@codecov

codecov bot commented Dec 21, 2017

Codecov Report

Merging #18884 into master will increase coverage by <.01%.
The diff coverage is 93.33%.


@@            Coverage Diff             @@
##           master   #18884      +/-   ##
==========================================
+ Coverage   91.57%   91.57%   +<.01%     
==========================================
  Files         150      150              
  Lines       48941    48948       +7     
==========================================
+ Hits        44817    44824       +7     
  Misses       4124     4124
Flag        Coverage Δ
#multiple   89.93% <93.33%> (ø) ⬆️
#single     41.75% <26.66%> (-0.01%) ⬇️

Impacted Files                        Coverage Δ
pandas/core/ops.py                    90.41% <100%> (+0.17%) ⬆️
pandas/core/indexes/datetimelike.py   97.08% <100%> (+0.02%) ⬆️
pandas/core/indexes/datetimes.py      95.17% <50%> (-0.28%) ⬇️
pandas/util/testing.py                84.95% <0%> (+0.21%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c19bdc9...190197c. Read the comment docs.

@jbrockmendel
Member Author

well I don't want this here. put it where the other conversions are. that can be refactored at some point if it's worthwhile.

I'm in the process of implementing this, but please reconsider.

Moving this into _TimeOp requires implementing logic to figure out what the output dtype is. While this is doable, it is entirely unnecessary because it is essentially re-implementing (worse) the logic in the TimedeltaIndex operation.
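
For illustration, the index operation already produces the correct tz-aware dtype, so the Series op can simply delegate to it and re-wrap the result (a minimal sketch, not the code in this PR):

import pandas as pd

tdi = pd.TimedeltaIndex(['1 day', '2 days'])
dti = pd.date_range('2016-01-01', periods=2, tz='US/Eastern')

result = tdi + dti           # DatetimeIndex, timezone preserved
ser = pd.Series(result)      # re-wrapped; dtype is datetime64[ns, US/Eastern]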

@jbrockmendel
Member Author

The most recent attempt to kludge this into _TimeOp still got the dtypes wrong. The way presented is the right way to do this.

@jreback
Contributor

jreback commented Dec 24, 2017

then let’s refactor this first
i don’t want to add a sub-optimal solution which will then be removed

@jreback jreback closed this Dec 24, 2017
@jbrockmendel
Member Author

Huh? The change here is in the direction of the long-term solution, is not going to be removed, and is not sub-optimal.

@jbrockmendel
Member Author

Pls reopen.

@jbrockmendel
Member Author

Got the requested (but much worse) version working. Please reopen.

@jreback
Contributor

jreback commented Dec 26, 2017

ok let's see what you have.

@jreback jreback reopened this Dec 26, 2017
@jbrockmendel
Member Author

ok let's see what you have.

Pushed yesterday, looks like CI just finished.

dtype=dtype,
)

return wrapper


def _get_series_result_name(left, right):
Contributor

rather than write new code, this should be threaded into the existing routine

Member Author

OK. Threading it in separates name-convention logic into multiple places and makes the existing kludge kludgier, but will change.

@@ -515,7 +515,8 @@ def _convert_to_array(self, values, name=None, other=None):

# a datelike
elif isinstance(values, pd.DatetimeIndex):
values = values.to_series()
# TODO: why are we casting to_series in the first place?
Contributor

and if you change this does it work?

Member Author

Tentative yes, but only with the change made by _get_series_result_name below.

Member Author

Update: tinkering with this and doing the threading requested below don't play nicely together. We should keep this as-is and address separately. This fixes a bug and we should call it a win. There are a bunch of these bug-fixes to get in before we can clean up the mess that is _TimeOp.

@@ -515,7 +515,8 @@ def _convert_to_array(self, values, name=None, other=None):

# a datelike
elif isinstance(values, pd.DatetimeIndex):
Contributor

change to use ABCDatetimeIndex
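
A minimal sketch of the ABC-based check, assuming the generic classes in pandas.core.dtypes.generic (illustrative, not the exact diff):

import pandas as pd
from pandas.core.dtypes.generic import ABCDatetimeIndex

values = pd.date_range('2016-01-01', periods=3, tz='UTC')

# the ABC check avoids referencing the concrete DatetimeIndex class directly
if isinstance(values, ABCDatetimeIndex):
    values = values.to_series()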

dtype=dtype,
)

return wrapper


def _get_series_op_result_name(left, right):
# `left` is always a Series object
if isinstance(right, (pd.Series, pd.Index)):
Contributor

use generic instance checks, though why can't you just duck type, e.g.

name = _maybe_match_name(left, getattr(right, 'name', None))

Member Author

On the generics, sure. On the ducktype:

  1. pretty sure what you have in mind is name = _maybe_match_name(left, right)

  2. That would represent a non-trivial change in pandas convention/policy for name propagation. Up until a few days ago even Index names were ignored (except for unintentional corner cases caused by conversion within _TimeOp). Allowing through anything with a name attribute would include e.g. most DateOffset subclasses, which I don't think is desired. If this is something you'd like to see changed, I'd ask you to open an issue or something and consider it out of scope for this bug-fixing PR.
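
A hedged sketch of the generic-check version being discussed, assuming the generic ABCs and com._maybe_match_name available at the time of this PR (illustrative; the check could equally be inlined rather than kept as a helper):

import pandas.core.common as com
from pandas.core.dtypes.generic import ABCSeries, ABCIndexClass

def _get_series_op_result_name(left, right):
    # `left` is always a Series; only propagate names from Series/Index
    # operands, not from arbitrary objects that happen to have a `name`.
    if isinstance(right, (ABCSeries, ABCIndexClass)):
        return com._maybe_match_name(left, right)
    return left.name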

@jreback
Contributor

jreback commented Dec 29, 2017

@jbrockmendel FYI I cancelled a couple of your builds on travis. trying to get 0.22 out. most of these PRs have comments anyhow.

@jbrockmendel
Member Author

FYI I cancelled a couple of your builds on travis. trying to get 0.22 out. most of these PRs have comments anyhow.

Sounds good. I'll flag any comments that need clarification.

BTW appveyor usually seems to be the constraining factor when the pipeline gets clogged. In cases where I screw up, is there a way to cancel builds for my own PRs there?

@jreback
Contributor

jreback commented Dec 29, 2017

when you push again appveyor (and travis) both cancel the previous build. so it doesn't matter really. travis does it as soon as you push, while appveyor won't cancel until it actually runs (so it looks like it's taking longer).

@jbrockmendel
Member Author

Just pushed fixes, including for #18989. You may need to cancel the build again.

@jbrockmendel
Member Author

Repush or hang tight?

@jreback
Contributor

jreback commented Dec 31, 2017

all good now
go ahead and repush

@@ -535,6 +536,11 @@ def _convert_to_array(self, values, name=None, other=None):
elif inferred_type in ('timedelta', 'timedelta64'):
# have a timedelta, convert to ns here
values = to_timedelta(values, errors='coerce', box=False)
if isinstance(other, pd.DatetimeIndex):
Contributor

use ABC here

Member Author

Will change.

dtype=dtype,
)

return wrapper


def _get_series_op_result_name(left, right):
# `left` is always a Series object
Contributor

i suggested a simplification

Member Author

i suggested a simplification

I'm guessing you're referring to this, which I responded to here.

Contributor

ok, rather than creating another function like this, just inline it as it's only used once.

tz=tz, name=names[0])
ser = pd.Series([pd.Timedelta(seconds=5)] * 2,
index=index, name=names[1])
expected = pd.Series(index + pd.Timedelta(seconds=5),
Contributor

convention is not to use pd.

Member Author

Will change.


# passing name arg isn't enough when names[2] is None
expected.name = names[2]
assert expected.dtype == index.dtype
Contributor

can you also test adding with ser.values as well (obviously it will be an index result). It was part of the OP as an example.
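
One way the requested check might look (variable names follow the test fragment above; the exact assertions in the merged test may differ):

# ser.values drops the Series wrapper, so the result should be an Index
result = index + ser.values
assert isinstance(result, pd.DatetimeIndex)
assert result.dtype == index.dtype  # timezone preserved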

@jreback
Contributor

jreback commented Jan 1, 2018

rebase and push again, fixed some hanging builds on Travis CI

@jbrockmendel
Member Author

This should be ready, and is orthogonal to #19024.

@jreback jreback added this to the 0.23.0 milestone Jan 2, 2018
@jreback jreback merged commit 04beec7 into pandas-dev:master Jan 2, 2018
@jreback
Contributor

jreback commented Jan 2, 2018

thanks!

Labels
Bug · Datetime (Datetime data dtype) · Timedelta (Timedelta data type) · Timezones (Timezone data dtype)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DatetimeIndex + TimeDelta gives wrong results when timezone is set
2 participants