Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix join on MultiIndex for mixed Datetimelike and string levels #32739

Closed
wants to merge 3 commits into from

Conversation

nrebena
Copy link
Contributor

@nrebena nrebena commented Mar 15, 2020

Problem

From #26558, a minimal example is:

import pandas as pd
import datetime

i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])

mi1 = pd.MultiIndex.from_tuples([('2019-05-31',)])
mi2 = pd.MultiIndex.from_tuples([(datetime.datetime(2019, 5, 31),)])

print(i1.join(i2, return_indexers=True))
print(mi1.join(mi2, return_indexers=True))
   
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)
# (MultiIndex([('2019-05-31',)], ), None, array([-1]))

Here we can see that one of the indexers for multiindex is wrong.

Proposed fix 🩹

The difference of handling for single index is explain by the join method for DatetimeTimedeltaMixin.
I did some refactoring in pandas/core/indexes/datetimelike.py to be able to reuse the same datetime handling method.
Then in the Index.join method, where MultiIndex are handled, I apply this method to each level of the MultiIndex that is a DatetimeTimedeltaMixin.

PR checkboxes ☑️

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your proposed patch is very very specific to DTI/TDI conversions. Please try to narrow down exactly what is going wrong here, this is introducing a massive level of complexity.

@@ -3408,7 +3409,16 @@ def join(self, other, how="left", level=None, return_indexers=False, sort=False)

# have the same levels/names so a simple join
if self.names == other.names:
pass
# For datetimelike levels, we maybe convert the corresponding level
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woa, we don't want to do anything like this.

@nrebena
Copy link
Contributor Author

nrebena commented Mar 16, 2020

your proposed patch is very very specific to DTI/TDI conversions. Please try to narrow down exactly what is going wrong here, this is introducing a massive level of complexity.

The bug is very specific to DTI/TDI conversions.

What I did here is reproduce what happen when joining a DatetimeLike index with another index, see code here

if self._is_convertible_to_index_for_join(other):

The reason this bug happen it that join between DatetimeLike Index and Object Index works, ie, the returned join_index has casted the object to the corresponding datetimelike, but then ,when calling get_indexer,

lindexer = self.get_indexer(join_index)
, the Object Index return no result for the datetimelike index (as it should).

In brief, in Index.join, for multiindex, we join datetimelike and object together with maybe_casting of the object, but this doesn't happen at the get_indexer level.
When using single Index, the object index is maybe_casted before attempting to join, and the get_indexer then work.

woa, we don't want to do anything like this.

That is what happen actually for joining DatetimeLikeIndex with ObjectIndex

I am open to suggestion if one could tell me what we expect here.
The actual patch try to make the comportement of multiindex join consistent with the one on single index.
An alternativ is to prevent joining datetimelike with object, and raise by asking to explicitly cast to datetimelike before end, but I suspect this would be a breaking change.

@jreback
Copy link
Contributor

jreback commented Mar 16, 2020

@nrebena if indices type are not the same then they must be joined as object

@nrebena
Copy link
Contributor Author

nrebena commented Mar 16, 2020

@jreback This is actually not the case for single index, see this example, where the indice type differs, but the resulting type is DatetimeIndex.

import pandas as pd
import datetime

i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])

print(i1.join(i2, return_indexers=True))
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)

Is it an acceptable solution to fix Index.join to return an object index in this case ?

@jreback
Copy link
Contributor

jreback commented Mar 16, 2020

yes see what fixing this breaks (likely would need to depreciate this)

@nrebena
Copy link
Contributor Author

nrebena commented Mar 21, 2020

Update on root cause and advice needed.
In the following example we have one datetime index and one string index with a datetime, with their comportment regarding join, intersection, union and equals.
I am not sure what are the desired comportment, but actually, anytime a string can be use as a datetime, it is casted so. See the following result from branch master.

import pandas as pd
import datetime

i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])

i2.join(i1, how='outer', return_indexers=True))
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)
i1.join(i2, how='outer', return_indexers=True) 
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)

i1.union(i2)
# Index(['2019-05-31', 2019-05-31 00:00:00], dtype='object')
i2.union(i1)                                                                                                 
# Index([2019-05-31 00:00:00, '2019-05-31'], dtype='object')

i1.intersection(i2)
# Index(['2019-05-31'], dtype='object')
i2.intersection(i1)                                                                                          
# DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None)

i1.equals(i2)
# True
i2.equals(i1)                                                                                                
# True

date = "2020-03-16"
values = [date, pd.to_datetime(date)]
pd.Categorical(values, ordered=False)
# [2020-03-16, 2020-03-16]
# Categories (1, datetime64[ns]): [2020-03-16]

Are this results what should be expected for each one? For the one with unexpected result, what should be the right one?

@jreback
Copy link
Contributor

jreback commented Mar 22, 2020

Update on root cause and advice needed.
In the following example we have one datetime index and one string index with a datetime, with their comportment regarding join, intersection, union and equals.
I am not sure what are the desired comportment, but actually, anytime a string can be use as a datetime, it is casted so. See the following result from branch master.

import pandas as pd
import datetime

i1 = pd.Index(['2019-05-31'])
i2 = pd.Index([datetime.datetime(2019, 5, 31)])

i2.join(i1, how='outer', return_indexers=True))
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)
i1.join(i2, how='outer', return_indexers=True) 
# (DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None), None, None)

i1.union(i2)
# Index(['2019-05-31', 2019-05-31 00:00:00], dtype='object')
i2.union(i1)                                                                                                 
# Index([2019-05-31 00:00:00, '2019-05-31'], dtype='object')

i1.intersection(i2)
# Index(['2019-05-31'], dtype='object')
i2.intersection(i1)                                                                                          
# DatetimeIndex(['2019-05-31'], dtype='datetime64[ns]', freq=None)

i1.equals(i2)
# True
i2.equals(i1)                                                                                                
# True

date = "2020-03-16"
values = [date, pd.to_datetime(date)]
pd.Categorical(values, ordered=False)
# [2020-03-16, 2020-03-16]
# Categories (1, datetime64[ns]): [2020-03-16]

Are this results what should be expected for each one? For the one with unexpected result, what should be the right one?

all of the above should be deprecated. We shouldn't every compare strings and datetimes. We may currently do this, but we need to deprecate it.

@jreback
Copy link
Contributor

jreback commented Apr 10, 2020

@nrebena if you would like to update this to the above behavior would be fine; we should not be joining datetimelike things with non-datetimelike things at all.

@nrebena
Copy link
Contributor Author

nrebena commented Apr 10, 2020

Sure thing. I'll try to get to it.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Apr 10, 2020
@nrebena
Copy link
Contributor Author

nrebena commented Apr 13, 2020

I decided to split this, and started another PR here #33531, and will continue this one regarding the behaviour of pd.Categorical.

@jreback
Copy link
Contributor

jreback commented Jun 14, 2020

closing this until / unless #33531 is resolved

@jreback jreback closed this Jun 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.add gives incorrect result with MI Dataframe with mix of object and datetimes on index.
2 participants