Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: membership checks on ExtensionArray containing NA values #37867

Merged
merged 26 commits into from
Nov 29, 2020

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Nov 15, 2020

Membership checks on ExtensionArrays containing NA values raises ValueError in some circumstances (but not in other):

>>> arr1 = pd.array(["a", pd.NA])
>>> arr2 = pd.array([pd.NA, "a"])
>>> "a" in arr1
True  # ok
>>> "a" in arr2
TypeError: boolean value of NA is ambiguous  # not ok
>>> pd.NA in arr1
TypeError: boolean value of NA is ambiguous  # not ok
>>> pd.NA in arr2
True  # ok

So overall quite random failures. This PR fixes this problem by adding a custom __contains__ method on ExtensionArray.

I assume that we want pd.NA in arr1 to keep returning True. Note however that np.nan in np.array([np.nan]) return False, so pandas' behaviour is different.

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 15, 2020
@jreback jreback added this to the 1.2 milestone Nov 15, 2020
"""
# comparisons of any item to pd.NA always return pd.NA, so e.g. "a" in [pd.NA]
# would raise a TypeError. The implementation below works around that.
if isna(item):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_valid_nat_for_dtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand you here. Are you saying that pd.NaT in arr should return False if the array is not a datetime-like array? Can you expand a bit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying that pd.NaT in arr should return False if the array is not a datetime-like array?

correct (unless maybe object dtype). Some other examples:

dta = pd.date_range("2016-01-01", periods=3)._data  # <-- DateTimeArray
dta[-1] = pd.NaT
pa = dta.to_period("D")  # <-- PeriodArray
tda = dta - dta  # <-- TimedeltaArray
arr = pd.array([1, 2 ,3, pd.NA])  # <-- IntegerArray
flt = pd.array([1.0, 2.0, 3.0, np.nan])  # <-- FloatArray

tdnat = np.timedelta64("NaT", "ns")
dtnat = np.timedelta64("NaT", "ns")

>>> tdnat in dta
False
>>> dtnat in dta
True  # <-- actually this returns False, but thats a newly-discovered bug

>>> tdnat in pa
False
>>> dtnat in pa
False

>>> tdnat in tda
True   # <-- actually this raises TypeError, but thats a newly-discovered bug
>>> dtnat in tda
False    # <-- actually this raises TypeError, but thats a newly-discovered bug

>>> tdnat in flt
False
>>> dtnat in flt
False    # <-- actually this raises TypeError, but thats a newly-discovered bug

>>> tdnat in arr
False
>>> dtnat in arr
False     # <-- actually this raises TypeError, but thats a newly-discovered bug

side-note after checking all of these: yikes!

Copy link
Contributor Author

@topper-123 topper-123 Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't use datetime-likes very much, so I'm a bit weak on what these comparisons of nan-likes should return in datatime-likes and you certainly didn't help :-)

Maybe do as @jreback suggests, and handle these later/separately? An alternative could be in the base ExtensionArray check for pd.NA only. It sub-classes wan to do it differently, they can implement a __contains__ method themselves. That seems the cleanest to me.

@jreback
Copy link
Contributor

jreback commented Nov 17, 2020

ok so this looks fine. merge and then open issues / followup for recently discovered cases?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should first discuss / decide on what behaviour we actually want for containment involving missing values (as the item being checked).

As it seems (even apart from the noted errors with pd.NA) an inconsistent situation right now? (np.nan gives False, pd.NaT gives True)

Also, can you add a base extension test?

@jorisvandenbossche
Copy link
Member

Now, I don't really know myself at the moment what I would expect for behaviour here. Listing all possible options (eg for pd.NA in pd.array([1, 2, pd.NA]):

  • Return True (because pd.NA is present in the array)
  • Return False (because pd.NA is not equal to itself, so cannot be "found")
  • Raise an error (because it is ambiguous (since there is an argument for both True/False), or because there is simply no "correct" answer, in the same logic as we also raise an error for bool(pd.NA))
    • Counterargument: in dictionaries pd.NA can be used as key, so you could argue that containement is similar as dictionary key checking
  • Return pd.NA (similar reasons as above, + since the pd.NA values are considered as "unknown", it is also "unknown" if the unknown value is contained in the array)

Quickly looking at some other languages to see what they do:

R returns True:

> NA %in% c(1, 2, 3, NA)
[1] TRUE

Julia is "strict" about the unknown part of missing, and if a missing value is present in the array, even any other value not present in the array is not known to not be in the array (so not even returns False):

julia> arr = [1 2 3 missing]
1×4 Array{Union{Missing, Int64},2}:
 1  2  3  missing

julia> 1 in arr
true

julia> 5 in arr
missing

julia> missing in arr
missing

julia> arr = [1 2 3]
1×3 Array{Int64,2}:
 1  2  3

julia> 5 in arr
false

julia> missing in arr
missing

@topper-123
Copy link
Contributor Author

Yes, @jorisvandenbossche, my doubts on nan-likes as well: There doesn't seem to be any inherantly correct choice, as it look to me.

We have however already in pandas that:

>>> arr2 = pd.array([pd.NA, "a"])
>>> pd.NA in arr2  # see OP
True
>>> dta = pd.date_range("2016-01-01", periods=2)._data
>>> dta[0] = pd.NaT
>>> pd.NaT in dta
True

So from at backward compat stand point there is some argument for returning True in Pandas, if looking for a nan-like in the ExtensionArray.

@jorisvandenbossche
Copy link
Member

Specifically that pd.NA in pd.array([pd.NA, "a"]) is returning True, I wouldn't use as a "prior" art argument, as I think this is mostly accidental behaviour and we never really considered yet the desired behaviour for __contains__ for nullable EAs (and those types are not yet considered stable, so we can change behaviour if we decide this is needed).

The fact that it returns True for pd.NaT certainly has a much longer history, but I don't know to what extent this was intentional (@jbrockmendel ?). For example the numpy counterpart actually returns False:

In [20]: arr = np.array(["NaT"], dtype="datetime64[ns]")

In [21]: arr[0]
Out[21]: numpy.datetime64('NaT')

In [22]: arr[0] in arr
Out[22]: False

Something else, the current isna(item) check is also too broad, as (I think from looking at the code), it will make that things like None in pd.array([pd.NA, "a"]) or np.nan in pd.array([pd.NA, "a"]) will also return True, which I think we certainly don't want?

@jbrockmendel
Copy link
Member

The fact that it returns True for pd.NaT certainly has a much longer history, but I don't know to what extent this was intentional

I'm pretty sure that is unintentional. I've been fixing some isna checks to use is_valid_nat_for_dtype, but haven't gotten all of them.

Something else, the current isna(item) check is also too broad, as (I think from looking at the code), it will make that things like None in pd.array([pd.NA, "a"]) or np.nan in pd.array([pd.NA, "a"]) will also return True, which I think we certainly don't want?

I'd advocate updating is_valid_nat_for_dtype to reflect those exclusions. (And renaming valid_nat -> valid_na)

In [22]: arr[0] in arr
Out[22]: False

That is surprising. My intuition is that arr[0] in arr should be True for any arr. I'm open to being convinced otherwise.

@jorisvandenbossche
Copy link
Member

That is surprising. My intuition is that arr[0] in arr should be True for any arr. I'm open to being convinced otherwise.

For NaN/NaT, those values are not equal to itself, and I think numpy uses equality to implement __contains__

@jbrockmendel
Copy link
Member

For NaN/NaT, those values are not equal to itself, and I think numpy uses equality to implement contains

My intuition is based more around our Index implementations that use get_loc.

What would the idiomatic way of checking any(x is pd.NA for x in arr) be?

@topper-123
Copy link
Contributor Author

topper-123 commented Nov 21, 2020

What would the idiomatic way of checking any(x is pd.NA for x in arr) be?

arr.isna().any() IMO.

I think there will be be a surprise if users check for nan-likes using __contains__ for some users, because there are two reasoably choices:

  1. pd.NA in arr should be interpreted as (pd.NA == arr).any(), resulting in False even if a nan-like is in arr
  2. pd.NA in arr should be interpreted as arr.isna().any(), resulting in True if a nan-like is in arr, but will also return True for e.g. pd.NaT in arr, which may be unintended (or has to be decided upon.

A solution to avoid ambiguity is to raise a TypeError, if a nan-like is supplied to __contains__:

    def __contains__(self, item) -> bool:
        if isna(item):
            raise TypeError("NA value not allowed. Use .isna method to check for NA values.")
        else:
            return (item == self).any()

This would at least ensure that "a" in arr doesn't raise unexpectedly.

@jbrockmendel
Copy link
Member

What would the idiomatic way of checking any(x is pd.NA for x in arr) be?

arr.isna().any() IMO.

My complaint about that is it doesn't distinguish between different NAs, most relevant for object dtype.

This would at least ensure that "a" in arr doesn't raise unexpectedly.

Short-term, can we fix this without taking a stance on the NA topic?

My intuition is that obj in pd.array(arr) should match obj in pd.Index(arr) and obj in pd.Series(arr). Is this controversial?

@jreback
Copy link
Contributor

jreback commented Nov 24, 2020

What would the idiomatic way of checking any(x is pd.NA for x in arr) be?

arr.isna().any() IMO.

My complaint about that is it doesn't distinguish between different NAs, most relevant for object dtype.

This would at least ensure that "a" in arr doesn't raise unexpectedly.

Short-term, can we fix this without taking a stance on the NA topic?

My intuition is that obj in pd.array(arr) should match obj in pd.Index(arr) and obj in pd.Series(arr). Is this controversial?

agree with @jbrockmendel

I think have NaN != itself is different then a membership test. we do not want special cases.

pandas/tests/arrays/categorical/test_operators.py Outdated Show resolved Hide resolved
pandas/tests/arrays/string_/test_string.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 24, 2020

Short-term, can we fix this without taking a stance on the NA topic?

How is fixing this not taking a stance? ;)
Because to fix it we need to choose a certain behaviour regardless, which then defacto becomes the __contains__ behaviour for pd.NA ..

The obj in pd.Index(arr) example from @jbrockmendel is a good comparison, because it basically does a engine/hashtable lookup operation under the hood and so I suppose will be consistent with indexing (series.loc[pd.NA] if the index has missing values).

So that basically gives two possible interpretations / implementations of __contains__:

  • equality-based: ((arr == obj).any())
  • lookup-based (eg pd.NA in dict.fromkeys(arr), or our indexing engine lookup)

While numpy uses the first one, it seems to make sense to use the second option for membership test like __contains__.

But, I think there is also something to say for using the "unknown value" interpretation of pd.NA. And then we don't know if some "unknown value" is present in the array, and we shouldn't return True or False.

@jorisvandenbossche jorisvandenbossche changed the title BUG: membership checks on ExtensionArray containing NA values API: membership checks on ExtensionArray containing NA values Nov 24, 2020

def test_contains():
# GH-37867
arr = pd.arrays.StringArray(np.array(["a", "b"], dtype=object))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
arr = pd.arrays.StringArray(np.array(["a", "b"], dtype=object))
arr = pd.array(["a", "b"], dtype="string")

is a bit shorter to construct those

@jorisvandenbossche
Copy link
Member

@topper-123 since you are adding a method on the base extension array class, I think we should also add a test for this in the generic base extension tests. There are fixtures for data with/without missing values and for the na_value, so it should be possible to test this in general.

@topper-123
Copy link
Contributor Author

@topper-123 since you are adding a method on the base extension array class, I think we should also add a test for this in the generic base extension tests. There are fixtures for data with/without missing values and for the na_value, so it should be possible to test this in general.

I've looked, but can't find the base extension tests. Could you say where they're located?

@jorisvandenbossche
Copy link
Member

In pandas/tests/extension/base/, and then in one of those files. Not sure which file is best fitting, maybe interface.py, or methods.py

@jbrockmendel
Copy link
Member

Short-term, can we fix this without taking a stance on the NA topic?

How is fixing this not taking a stance? ;)

I was specifically referring to "This would at least ensure that "a" in arr doesn't raise unexpectedly.", which I think we can ensure without taking a stance on the behavior of pd.NA in arr.

@jorisvandenbossche
Copy link
Member

I was specifically referring to "This would at least ensure that "a" in arr doesn't raise unexpectedly."

Ah, yes, that's something we certainly agree on I think that this at least shouldn't raise!
(but to fix that we still need to choose a behaviour for pd.NA in .. as well)

Now, I am fine with following our behaviour for Index containment / indexing, which is to treat NaN and pd.NA etc equal to itself (in contrast to == equality).

@topper-123
Copy link
Contributor Author

topper-123 commented Nov 25, 2020

I've updated the PR.

For comparisons with nan_likes, it nows returns True if the na value is arr.dtype.na_value else False.

@topper-123
Copy link
Contributor Author

The Travis failure looks unrelated. I'll run the PR again after comments.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, some comments.

# comparisons of any item to pd.NA always return pd.NA, so e.g. "a" in [pd.NA]
# would raise a TypeError. The implementation below works around that.
if item is self.dtype.na_value:
return isna(self).any() if self._can_hold_na else False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use self.isna()

pandas/tests/extension/base/interface.py Show resolved Hide resolved
pandas/core/arrays/base.py Show resolved Hide resolved
# the settled on rule is: `nan_like in arr` is True if nan_like is
# arr.dtype.na_value and arr.isna().any() is True. Else the check returns False.

for this_data in [data, data_missing]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments below on how to actually parameterize this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't its possible to parametrize fixtures? do have an example or hint on how that's done?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grep for

pytest.getfixturevalue in tests
we just had a use of this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't get it to work, I'll try it again tonight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I would simply leave it as is, using pytest.getfixturevalue only makes it way more complicated as it needs to be

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@@ -50,6 +50,10 @@ def test_view(self, data):
# __setitem__ does not work, so we only have a smoke-test
data.view()

@pytest.mark.xfail(raises=AssertionError, reason="Not implemented yet")
def test_contains(self, data, data_missing):
super().test_contains(data, data_missing)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Arrow arrays have Arrow Nulls as the dtype.na_value, while data_missing[0] gives ´None´.

Maybe @TomAugspurger could look into that as part of his Arrow work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Arrow arrays have Arrow Nulls as the dtype.na_value, while data_missing[0] gives ´None´.

It's fine to simply skip it as you did. This are only test EAs, that were needed for certain specific tests, so no problem this don't fully work otherwise.

The actual public arrays using Arrow under the hood (eg the new ArrowStringArray) has a different implementation and is fully tested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also base these test arrays on pd.NA? Otherwise we can use the same code as in the string array to replace the scalar return value of None with na_value.

@topper-123
Copy link
Contributor Author

The travis failure looks unrelated and I can't reproduce it on my computer. So IMo this PR passes currently, but I could make another run, if there's agreement on the content.

@jorisvandenbossche
Copy link
Member

Yes, don't worry about the travis failure on this PR

pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
def test_contains(self, data, data_missing):
# GH-37867
# na value handling in Categorical.__contains__ is deprecated.
# See base.BaseInterFaceTests.test_contains for more details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicates what you have in tests/arrays/categorical/test_operators.py ?

Copy link
Contributor Author

@topper-123 topper-123 Nov 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not: Categorical already had a __contains__ method and it's more permissive than the new one. So, in this file we have (below) assert na_value_type in data_missing, while the base tests method is assert na_value_type not in data_missing (notice the not).

na values is also more complicated in categoricals, because in some cases we want to accept pd.NaT and in other cases not. I'd like to take it in another round (or let it slide)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I was not referring to the base tests that this is overriding, but the original test you added to tests/arrays/categorical/test_operators.py which also tests this more permissive behaviour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I'll delete the ones in tests/arrays/categorical/test_operators.py.

@@ -51,6 +51,13 @@ def numpy_dtype(self) -> np.dtype:
"""
return self._dtype

@property
def na_value(self) -> object:
if issubclass(self.type, np.floating):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this always nan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was that np.float64("nan") is not np.nan. However, I've fixed this, so this method is actuallt not necessary.

pandas/tests/extension/decimal/array.py Show resolved Hide resolved
@@ -143,6 +143,11 @@ def test_custom_asserts(self):
with pytest.raises(AssertionError, match=msg):
self.assert_frame_equal(a.to_frame(), b.to_frame())

@pytest.mark.xfail(reason="comparison method not implemented on JSONArray")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is than issue number for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue number is the first line in the method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the xfail pls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good and fine merging like this, but should change to use the nulls_fixture which is more comprehensive. if you can do this today can get this in

assert na_value in data_missing
assert na_value not in data

# the data can never contain other nan-likes than na_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might consider actually using the nulls_fixture as this is more comprehensive (sure it will run the other checks multiple times but no big deal).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am personally -1 on using a fixture for this (we could move the list of nulls that is currently used for defining the fixture to constant in _testing.py and use that constant in both places, though)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case?

we already have a nulls fixture and use it in a great many places

we need to be comprehensive and general in testing - specific cases are ok sometimes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok here's the fixture contents

@pytest.fixture(params=[None, np.nan, pd.NaT, float("nan"), pd.NA], ids=str)

so if you add float('nan') then this should cover

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the generality of using the nulls_fixture too. I've added it in the newest commit.

assert na_value in data_missing
assert na_value not in data

# the data can never contain other nan-likes than na_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here.

@jreback
Copy link
Contributor

jreback commented Nov 29, 2020

lgtm. thanks @topper-123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants