Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: isin incorrectly casting ints to datetimes #37528

Merged
merged 8 commits into from
Nov 22, 2020
4 changes: 4 additions & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -380,6 +380,10 @@ Datetimelike
- :class:`Timestamp` and :class:`DatetimeIndex` comparisons between timezone-aware and timezone-naive objects now follow the standard library ``datetime`` behavior, returning ``True``/``False`` for ``!=``/``==`` and raising for inequality comparisons (:issue:`28507`)
- Bug in :meth:`DatetimeIndex.equals` and :meth:`TimedeltaIndex.equals` incorrectly considering ``int64`` indexes as equal (:issue:`36744`)
- Bug in :meth:`TimedeltaIndex.sum` and :meth:`Series.sum` with ``timedelta64`` dtype on an empty index or series returning ``NaT`` instead of ``Timedelta(0)`` (:issue:`31751`)
- Bug in :meth:`Series.isin` with ``datetime64[ns]`` dtype and :meth:`DatetimeIndex.isin` incorrectly casting integers to datetimes (:issue:`36621`)
- Bug in :meth:`Series.isin` with ``datetime64[ns]`` dtype and :meth:`DatetimeIndex.isin` failing to consider timezone-aware and timezone-naive datetimes as always different (:issue:`???`)
- Bug in :meth:`Series.isin` with ``PeriodDtype`` dtype and :meth:`PeriodIndex.isin` failing to consider arguments with different ``PeriodDtype`` as always different (:issue:`???`)


Timedelta
^^^^^^^^^
Expand Down
6 changes: 6 additions & 0 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -430,6 +430,12 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
# handle categoricals
return cast("Categorical", comps).isin(values)

if needs_i8_conversion(comps):
# Dispatch to DatetimeLikeIndexMixin.isin
from pandas import Index

return Index(comps).isin(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure?

ensure_data and below does this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way we call ensure_data ends up casting dt64 to int64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't you fix it in _ensure_data then, I am not averse to dispatching to the EA itself, which is what we are already doing for Categorical, or is that the plan. If that is the plan, then I would think you could actually remove all off this and simply dispatch (e.g. most of this logic then would actually be handled by PandasArray or the EA as appropriate).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a bunch of GH issues about isin (added a new label yesterday), many of which are about unwanted casting. I haven't looked at them all closely tet.

why can't you fix it in _ensure_data then

Even if we fix that (which yes, i'll take a look, but probably in a separate PR), we still need to dispatch to the (Period|Timedelta|Datetime)Index to get correct casting of the other (assuming we want algos.isin(dtlike, other) to match dtlike.isin(other), which I for one do)

am not averse to dispatching to the EA itself, which is what we are already doing for Categorical, or is that the plan

ATM Categorical has isin but thats it. #20617 suggests adding isin to EA, but thats definitely outside the scope of this PR.


comps, dtype = _ensure_data(comps)
values, _ = _ensure_data(values, dtype=dtype)

Expand Down
28 changes: 28 additions & 0 deletions pandas/core/indexes/datetimelike.py
Original file line number Diff line number Diff line change
Expand Up @@ -473,12 +473,40 @@ def isin(self, values, level=None):
if level is not None:
self._validate_index_level(level)

if not hasattr(values, "dtype"):
values = np.asarray(values)

if values.dtype.kind in ["f", "i", "u", "c"]:
# TODO: de-duplicate with equals, validate_comparison_value
return np.zeros(self.shape, dtype=bool)

if not isinstance(values, type(self)):
inferrable = [
"timedelta",
"timedelta64",
"datetime",
"datetime64",
"date",
"period",
]
if values.dtype == object:
inferred = lib.infer_dtype(values, skipna=False)
if inferred not in inferrable:
if "mixed" in inferred:
return self.astype(object).isin(values)
return np.zeros(self.shape, dtype=bool)

try:
values = type(self)(values)
except ValueError:
return self.astype(object).isin(values)

try:
self._data._check_compatible_with(values)
except (TypeError, ValueError):
# Includes tzawareness mismatch and IncompatibleFrequencyError
return np.zeros(self.shape, dtype=bool)

return algorithms.isin(self.asi8, values.asi8)

@Appender(Index.where.__doc__)
Expand Down
55 changes: 55 additions & 0 deletions pandas/tests/series/methods/test_isin.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
import pandas as pd
from pandas import Series, date_range
import pandas._testing as tm
from pandas.core.arrays import PeriodArray


class TestSeriesIsIn:
Expand Down Expand Up @@ -90,6 +91,60 @@ def test_isin_read_only(self):
expected = Series([True, True, True])
tm.assert_series_equal(result, expected)

@pytest.mark.parametrize("dtype", [object, None])
def test_isin_dt64_values_vs_ints(self, dtype):
# GH#36621 dont cast integers to datetimes for isin
dti = date_range("2013-01-01", "2013-01-05")
ser = Series(dti)

comps = np.asarray([1356998400000000000], dtype=dtype)

res = dti.isin(comps)
expected = np.array([False] * len(dti), dtype=bool)
tm.assert_numpy_array_equal(res, expected)

res = ser.isin(comps)
tm.assert_series_equal(res, Series(expected))

res = pd.core.algorithms.isin(ser, comps)
tm.assert_numpy_array_equal(res, expected)

def test_isin_tzawareness_mismatch(self):
dti = date_range("2013-01-01", "2013-01-05")
ser = Series(dti)

other = dti.tz_localize("UTC")

res = dti.isin(other)
expected = np.array([False] * len(dti), dtype=bool)
tm.assert_numpy_array_equal(res, expected)

res = ser.isin(other)
tm.assert_series_equal(res, Series(expected))

res = pd.core.algorithms.isin(ser, other)
tm.assert_numpy_array_equal(res, expected)

def test_isin_period_freq_mismatch(self):
dti = date_range("2013-01-01", "2013-01-05")
pi = dti.to_period("M")
ser = Series(pi)

# We construct another PeriodIndex with the same i8 values
# but different dtype
dtype = dti.to_period("Y").dtype
other = PeriodArray._simple_new(pi.asi8, dtype=dtype)

res = pi.isin(other)
expected = np.array([False] * len(pi), dtype=bool)
tm.assert_numpy_array_equal(res, expected)

res = ser.isin(other)
tm.assert_series_equal(res, Series(expected))

res = pd.core.algorithms.isin(ser, other)
tm.assert_numpy_array_equal(res, expected)


@pytest.mark.slow
def test_isin_large_series_mixed_dtypes_and_nan():
Expand Down