Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Do not use string Index like Datetimelike Index #33531

Closed
wants to merge 2 commits into from

Conversation

nrebena
Copy link
Contributor

@nrebena nrebena commented Apr 13, 2020

Follow up of conversation in #32739 (comment).

The goal is to prevent index of string that look like datetimelike to be used as datetimelike.

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@jbrockmendel
Copy link
Member

so the point is that you dont want pd.Index(["2020-04-15"]) to be a DatetimeIndex? That's a breaking change.

cc @jorisvandenbossche @TomAugspurger xref #33558

@TomAugspurger
Copy link
Contributor

I'd want to deprecate that behavior.

@nrebena
Copy link
Contributor Author

nrebena commented Apr 15, 2020

pd.Index(["2020-04-15"]) is already not a DatetimeIndex, but, when compared with one, it is cast as one.

If we are to deprecated this behavior, I have some question on how the depreciation it is expected to behave.
Should it show a warning when the Index is cast?
To retains the old behavior, it suffice to cast the Index to DatetimeIndex (i.e with pd.to_datetime), but should we converse the old behavior with a warning that it won't behave the same in the future, or switch to the new behavior directly, with a warning on how to re achieve the old behavior?

@jorisvandenbossche
Copy link
Member

So to make it explicit, currently we have this behaviour:

In [1]: idx = pd.Index(["2020-04-15"]) 

In [2]: idx     
Out[2]: Index(['2020-04-15'], dtype='object')

In [3]: dtidx = pd.DatetimeIndex(["2020-04-15"])      

In [4]: dtidx 
Out[4]: DatetimeIndex(['2020-04-15'], dtype='datetime64[ns]', freq=None)

So a object dtype index with strings, and a DatetimeIndex.

Comparing them elementwise correctly gives False values:

In [6]: idx == dtidx   
Out[6]: array([False])

but calling the equals method actually returns True:

In [5]: idx.equals(dtidx)  
Out[5]: True

And I think we want this to give False. Correct?

The same also happens for other "coercible" values (not just strings), at least when DatetimeIndex is the calling index. For example:

In [7]: idx = pd.Int64Index(dtidx.asi8)    

In [8]: idx.equals(dtidx)    
Out[8]: False

In [9]: dtidx.equals(idx)   
Out[9]: True

I would say we also want to deprecate this behaviour?
If the Index classes are different, the Index objects can't be equal.

So some questions around this:

  • Do we already agree we don't want this behaviour of trying to coerce the other Index? I am in favor of not doing coercing, but so not only for strings but eg also integers. However, where to draw the line? (eg what with an object dtype index of Timestamp objects?)
  • Do we need to deprecate this, or can we just fix it as a bug fix? Since it seems this was done quite on purpose in the code, I would say we need to deprecate this?
  • The above examples are for the equals method. But has this further reaching implications? Does this affect the behaviour of other methods? (eg in the linked PR, I saw mention of union/intersection etc methods)

@jreback
Copy link
Contributor

jreback commented Apr 16, 2020

we should deprecate this

@nrebena
Copy link
Contributor Author

nrebena commented Apr 18, 2020

Do we already agree we don't want this behaviour of trying to coerce the other Index? I am in favor of not doing coercing, but so not only for strings but eg also integers. However, where to draw the line? (eg what with an object dtype index of Timestamp objects?)

From here

def _is_convertible_to_index_for_join(cls, other: Index) -> bool:

you can see that many types where already excluded from being coerced.

           "floating",
            "mixed-integer",
            "integer",
            "integer-na",
            "mixed-integer-float",
            "mixed",

Maybe instead of an exclusion list, we could have an list of which type can could be coerced.

The above examples are for the equals method. But has this further reaching implications? Does this affect the behaviour of other methods? (eg in the linked PR, I saw mention of union/intersection etc methods)

Union and intersection behavior follow equals behavior. The same is true for join and merge.
But changed are needed in pd.Categorical

date = "2020-03-16"
values = [date, pd.to_datetime(date)]
pd.Categorical(values, ordered=False)
# Categories (1, datetime64[ns]): [2020-03-16]

@jreback
Copy link
Contributor

jreback commented Jun 14, 2020

@nrebena i think the idea of this is fine, can you merge master and can re-evaluate

@jreback jreback added Index Related to the Index class or subclasses Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 14, 2020
@nrebena
Copy link
Contributor Author

nrebena commented Jun 14, 2020

@jreback Rebased.
I am not sure how to both deprecate and test the new behavior.
Deprecating could be in

def _is_convertible_to_index_for_join(cls, other: Index) -> bool:
by warning when using string typed index eventually.

@@ -903,6 +926,15 @@ def _is_convertible_to_index_for_join(cls, other: Index) -> bool:
return True
return False

def _wrap_joined_index(self, joined: np.ndarray, other):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad I failed my rebase, this is old code I missed.

@@ -408,3 +408,20 @@ def test_isocalendar_returns_correct_values_close_to_new_year_with_tz():
dtype="UInt32",
)
tm.assert_frame_equal(result, expected_data_frame)


def test_datetimelike_string():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally could paramaterize this with things that don't match here (e.g. more than just 1 thing).

Also this is not in the right location, search for were we test .equals

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will move this in pandas/tests/indexes/datetimes/test_ops.py

By trying to parametrize I have now more question about the expected behaviors, see comment that follows.

@nrebena
Copy link
Contributor Author

nrebena commented Jun 14, 2020

By trying to parametrize, I did see more case of index of different type that are equals.

import pandas as pd 
i1 = pd.Index([1586736000.0]) 
i2 = pd.Index(pd.to_datetime([1586736000.0])) 
print(i1) 
print(i2) 
print(i1.equals(i2))                                                                                              
# Float64Index([1586736000.0], dtype='float64')
# DatetimeIndex(['1970-01-01 00:00:01.586736'], dtype='datetime64[ns]', freq=None)
# True

I am not sure this is not the expected behavior either?

@jreback
Copy link
Contributor

jreback commented Jul 17, 2020

can u merge master and can look again

@simonjayhawkins
Copy link
Member

@nrebena can you add a release note

@dsaxton
Copy link
Member

dsaxton commented Oct 8, 2020

Closing as stale, @nrebena let us know if you'd like to reopen

@dsaxton dsaxton closed this Oct 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses Reshaping Concat, Merge/Join, Stack/Unstack, Explode Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants