BUG: Fix join on MultiIndex for mixed Datetimelike and string levels #32739

nrebena · 2020-03-15T22:25:01Z

Problem

From #26558, a minimal example is:

import pandas as pd
import datetime

i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])

mi1 = pd.MultiIndex.from_tuples([('2019-05-31',)])
mi2 = pd.MultiIndex.from_tuples([(datetime.datetime(2019, 5, 31),)])

print(i1.join(i2, return_indexers=True))
print(mi1.join(mi2, return_indexers=True))
   
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)
# (MultiIndex([('2019-05-31',)], ), None, array([-1]))

Here we can see that one of the indexers for multiindex is wrong.

Proposed fix 🩹

The difference of handling for single index is explain by the join method for DatetimeTimedeltaMixin.
I did some refactoring in pandas/core/indexes/datetimelike.py to be able to reuse the same datetime handling method.
Then in the Index.join method, where MultiIndex are handled, I apply this method to each level of the MultiIndex that is a DatetimeTimedeltaMixin.

PR checkboxes ☑️

closes .add gives incorrect result with MI Dataframe with mix of object and datetimes on index. #26558
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback

your proposed patch is very very specific to DTI/TDI conversions. Please try to narrow down exactly what is going wrong here, this is introducing a massive level of complexity.

jreback · 2020-03-16T02:44:19Z

pandas/core/indexes/base.py

@@ -3408,7 +3409,16 @@ def join(self, other, how="left", level=None, return_indexers=False, sort=False)

            # have the same levels/names so a simple join
            if self.names == other.names:
-                pass
+                # For datetimelike levels, we maybe convert the corresponding level


woa, we don't want to do anything like this.

nrebena · 2020-03-16T10:38:48Z

your proposed patch is very very specific to DTI/TDI conversions. Please try to narrow down exactly what is going wrong here, this is introducing a massive level of complexity.

The bug is very specific to DTI/TDI conversions.

What I did here is reproduce what happen when joining a DatetimeLike index with another index, see code here

pandas/pandas/core/indexes/datetimelike.py

Line 833 in 23f8763

if self._is_convertible_to_index_for_join(other):

The reason this bug happen it that join between DatetimeLike Index and Object Index works, ie, the returned join_index has casted the object to the corresponding datetimelike, but then ,when calling get_indexer,

pandas/pandas/core/indexes/base.py

Line 3505 in a59fcc9

lindexer = self.get_indexer(join_index)

, the Object Index return no result for the datetimelike index (as it should).

In brief, in Index.join, for multiindex, we join datetimelike and object together with maybe_casting of the object, but this doesn't happen at the get_indexer level.
When using single Index, the object index is maybe_casted before attempting to join, and the get_indexer then work.

woa, we don't want to do anything like this.

That is what happen actually for joining DatetimeLikeIndex with ObjectIndex

I am open to suggestion if one could tell me what we expect here.
The actual patch try to make the comportement of multiindex join consistent with the one on single index.
An alternativ is to prevent joining datetimelike with object, and raise by asking to explicitly cast to datetimelike before end, but I suspect this would be a breaking change.

jreback · 2020-03-16T10:54:22Z

@nrebena if indices type are not the same then they must be joined as object

nrebena · 2020-03-16T11:05:21Z

@jreback This is actually not the case for single index, see this example, where the indice type differs, but the resulting type is DatetimeIndex.

import pandas as pd
import datetime

i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])

print(i1.join(i2, return_indexers=True))
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)

Is it an acceptable solution to fix Index.join to return an object index in this case ?

jreback · 2020-03-16T11:29:55Z

yes see what fixing this breaks (likely would need to depreciate this)

nrebena · 2020-03-21T12:34:07Z

Update on root cause and advice needed.
In the following example we have one datetime index and one string index with a datetime, with their comportment regarding join, intersection, union and equals.
I am not sure what are the desired comportment, but actually, anytime a string can be use as a datetime, it is casted so. See the following result from branch master.

import pandas as pd
import datetime

i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])

i2.join(i1, how='outer', return_indexers=True))
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)
i1.join(i2, how='outer', return_indexers=True) 
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)

i1.union(i2)
# Index(['2019-05-31', 2019-05-31 00:00:00], dtype='object')
i2.union(i1)                                                                                                 
# Index([2019-05-31 00:00:00, '2019-05-31'], dtype='object')

i1.intersection(i2)
# Index(['2019-05-31'], dtype='object')
i2.intersection(i1)                                                                                          
# DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None)

i1.equals(i2)
# True
i2.equals(i1)                                                                                                
# True

date = "2020-03-16"
values = [date, pd.to_datetime(date)]
pd.Categorical(values, ordered=False)
# [2020-03-16, 2020-03-16]
# Categories (1, datetime64[ns]): [2020-03-16]

Are this results what should be expected for each one? For the one with unexpected result, what should be the right one?

jreback · 2020-03-22T20:54:52Z

Update on root cause and advice needed.
In the following example we have one datetime index and one string index with a datetime, with their comportment regarding join, intersection, union and equals.
I am not sure what are the desired comportment, but actually, anytime a string can be use as a datetime, it is casted so. See the following result from branch master.

import pandas as pd
import datetime

i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])

i2.join(i1, how='outer', return_indexers=True))
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)
i1.join(i2, how='outer', return_indexers=True) 
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)

i1.union(i2)
# Index(['2019-05-31', 2019-05-31 00:00:00], dtype='object')
i2.union(i1)                                                                                                 
# Index([2019-05-31 00:00:00, '2019-05-31'], dtype='object')

i1.intersection(i2)
# Index(['2019-05-31'], dtype='object')
i2.intersection(i1)                                                                                          
# DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None)

i1.equals(i2)
# True
i2.equals(i1)                                                                                                
# True

date = "2020-03-16"
values = [date, pd.to_datetime(date)]
pd.Categorical(values, ordered=False)
# [2020-03-16, 2020-03-16]
# Categories (1, datetime64[ns]): [2020-03-16]

Are this results what should be expected for each one? For the one with unexpected result, what should be the right one?

all of the above should be deprecated. We shouldn't every compare strings and datetimes. We may currently do this, but we need to deprecate it.

jreback · 2020-04-10T22:07:46Z

@nrebena if you would like to update this to the above behavior would be fine; we should not be joining datetimelike things with non-datetimelike things at all.

nrebena · 2020-04-10T22:15:32Z

Sure thing. I'll try to get to it.

nrebena · 2020-04-13T17:39:13Z

I decided to split this, and started another PR here #33531, and will continue this one regarding the behaviour of pd.Categorical.

jreback · 2020-06-14T15:34:59Z

closing this until / unless #33531 is resolved

nrebena added 3 commits March 15, 2020 23:07

TST: Add test for pandas-dev#26558

a50bfe9

BUG: pandas-dev#26558 Fix join on MultiIndex with datetime and string

311ad1c

DOC: Bugfix for MultiIndex in whatsnew

a59fcc9

jreback requested changes Mar 16, 2020

View reviewed changes

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Apr 10, 2020

nrebena mentioned this pull request Apr 13, 2020

BUG: Do not use string Index like Datetimelike Index #33531

Closed

5 tasks

jreback closed this Jun 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix join on MultiIndex for mixed Datetimelike and string levels #32739

BUG: Fix join on MultiIndex for mixed Datetimelike and string levels #32739

nrebena commented Mar 15, 2020

jreback left a comment

jreback Mar 16, 2020

nrebena commented Mar 16, 2020

jreback commented Mar 16, 2020

nrebena commented Mar 16, 2020 •

edited

Loading

jreback commented Mar 16, 2020

nrebena commented Mar 21, 2020

jreback commented Mar 22, 2020

jreback commented Apr 10, 2020

nrebena commented Apr 10, 2020

nrebena commented Apr 13, 2020

jreback commented Jun 14, 2020

BUG: Fix join on MultiIndex for mixed Datetimelike and string levels #32739

BUG: Fix join on MultiIndex for mixed Datetimelike and string levels #32739

Conversation

nrebena commented Mar 15, 2020

Problem

Proposed fix 🩹

PR checkboxes ☑️

jreback left a comment

Choose a reason for hiding this comment

jreback Mar 16, 2020

Choose a reason for hiding this comment

nrebena commented Mar 16, 2020

jreback commented Mar 16, 2020

nrebena commented Mar 16, 2020 • edited Loading

jreback commented Mar 16, 2020

nrebena commented Mar 21, 2020

jreback commented Mar 22, 2020

jreback commented Apr 10, 2020

nrebena commented Apr 10, 2020

nrebena commented Apr 13, 2020

jreback commented Jun 14, 2020

nrebena commented Mar 16, 2020 •

edited

Loading