Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: isin incorrectly casting ints to datetimes #37528

Merged
merged 8 commits into from
Nov 22, 2020

Conversation

jbrockmendel
Copy link
Member

De-duplication in follow-up.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this going to need perf tests as this is very sensitive

also i am surprised this is needed at all where you did it

# Dispatch to DatetimeLikeIndexMixin.isin
from pandas import Index

return Index(comps).isin(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure?

ensure_data and below does this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way we call ensure_data ends up casting dt64 to int64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't you fix it in _ensure_data then, I am not averse to dispatching to the EA itself, which is what we are already doing for Categorical, or is that the plan. If that is the plan, then I would think you could actually remove all off this and simply dispatch (e.g. most of this logic then would actually be handled by PandasArray or the EA as appropriate).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a bunch of GH issues about isin (added a new label yesterday), many of which are about unwanted casting. I haven't looked at them all closely tet.

why can't you fix it in _ensure_data then

Even if we fix that (which yes, i'll take a look, but probably in a separate PR), we still need to dispatch to the (Period|Timedelta|Datetime)Index to get correct casting of the other (assuming we want algos.isin(dtlike, other) to match dtlike.isin(other), which I for one do)

am not averse to dispatching to the EA itself, which is what we are already doing for Categorical, or is that the plan

ATM Categorical has isin but thats it. #20617 suggests adding isin to EA, but thats definitely outside the scope of this PR.

@jbrockmendel
Copy link
Member Author

this going to need perf tests as this is very sensitive

will run the appropriate asvs

@jbrockmendel
Copy link
Member Author

actually, i think this can be simplified to just use _from_sequence following #37179

# Dispatch to DatetimeLikeIndexMixin.isin
from pandas import Index

return Index(comps).isin(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't you fix it in _ensure_data then, I am not averse to dispatching to the EA itself, which is what we are already doing for Categorical, or is that the plan. If that is the plan, then I would think you could actually remove all off this and simply dispatch (e.g. most of this logic then would actually be handled by PandasArray or the EA as appropriate).

@jreback jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Datetime Datetime data dtype labels Oct 31, 2020
@jbrockmendel
Copy link
Member Author

gentle ping. id like to see if we can share some code between this and equals

@jbrockmendel
Copy link
Member Author

Found we don't have any asvs that hit this, so just added a few. No change for cases with matching dtype, nice pickup for others:

       before           after         ratio
     [f823a859]       [9758b021]
     <ref-dtlike-indexers>       <bug-dti-isin>
-        691±30μs          143±3μs     0.21  series_methods.IsInDatetime64.time_isin_empty
-     1.23±0.08ms          141±8μs     0.11  series_methods.IsInDatetime64.time_isin_mismatched_dtype

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@jreback
Copy link
Contributor

jreback commented Nov 21, 2020

I guess as a short term measure this is ok as you fixed some bugs. the problem here is that you are bouncing between algos and the indexes, where everything should be in algos. this feels like a step backwards, if its tactical, then ok.

@jbrockmendel
Copy link
Member Author

the problem here is that you are bouncing between algos and the indexes, where everything should be in algos

its not unusual for core.algorithms functions to call EA methods. There's no way the casting logic belongs in core.algorithms.

@jreback
Copy link
Contributor

jreback commented Nov 21, 2020

the problem here is that you are bouncing between algos and the indexes, where everything should be in algos

its not unusual for core.algorithms functions to call EA methods. There's no way the casting logic belongs in core.algorithms.

that's not what's happening here, these are calling Index methods. This seems quite unusual and a very odd depenency.
Why do you not simply move the logic in DTI.isin back to algorithms? what's the point of adding this depenedecy.?

@jbrockmendel
Copy link
Member Author

that's not what's happening here, these are calling Index methods

Good point, my mistake. At some point this should be consolidated in the EA. IIRC theres an issue about adding isin to the interface.

Why do you not simply move the logic in DTI.isin back to algorithms?

because the relevant logic is casting logic that belongs in the DTA. My current thought is that once we get the _from_sequence-strictness PR in we'll be able to de-duplicate some logic between isin+equals using that.

@jbrockmendel
Copy link
Member Author

The new hashtables have me looking at algorithms._ensure_data, and I'm pretty confident that this will allow us to simplify it quite a bit

@jreback jreback added this to the 1.2 milestone Nov 22, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok fair

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: isin - dtype conversions (Timestamp vs ints)
2 participants