BUG: Row-wise comparison between two series always evaluates to all False when one series contains pd.NA #45599

Closed
2 of 3 tasks
wl2522 opened this issue Jan 24, 2022 · 4 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Numeric Operations Arithmetic, Comparison, and Logical operations Usage Question

Comments

@wl2522

wl2522 commented Jan 24, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([1, 2, pd.NA])

print(a == b)

0    False
1    False
2    False
dtype: bool

print(a.eq(b))

0    False
1    False
2    False
dtype: bool

Issue Description

In my actual use case, I'm performing row-wise comparisons between an integer column and the same column shifted by various periods

column == column.shift(periods=i) for 1 <= i <= 6

to check if a previous row's value is the same as the current row's value.

Because of the behavior described in my example, these comparisons all incorrectly evaluate to columns filled with False values, even when some rows hold the same value in both columns.
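For readers following along, here is a minimal sketch of the shifted comparison described above, assuming the column is stored with the nullable `Int64` dtype so that the missing values introduced by `shift` stay inside a masked array rather than an object array. The helper name `same_as_previous` and the sample data are hypothetical and are not taken from the original report.

```python
import pandas as pd


def same_as_previous(column: pd.Series, periods: int) -> pd.Series:
    """Hypothetical helper: compare each row with the row `periods` earlier."""
    # With Int64 dtype, shift() fills the leading rows with pd.NA and keeps
    # the Int64 dtype, so the comparison yields a nullable "boolean" Series.
    matches = column == column.shift(periods=periods)
    # Treat "no earlier row to compare against" as not matching.
    return matches.fillna(False)


column = pd.Series([1, 1, 2, 2, 2, 3], dtype="Int64")  # illustrative data
for i in range(1, 3):
    print(i, same_as_previous(column, i).tolist())
```

Whether `fillna(False)` is the right choice depends on how rows without a predecessor should be treated; keeping the `<NA>` values is equally valid.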

Note: I did not try reproducing this bug with the main branch version of pandas but I scanned through the list of commit messages from commits pushed since pandas version 1.4.0 and did not notice any that sound like they would address this issue.

Expected Behavior

Since pandas.Series.eq performs an element-wise comparison between the two series, I would expect comparisons involving pd.NA to behave like those involving np.nan:

import numpy as np

c = pd.Series([1.0, 2.0, 3.0])
d = pd.Series([1.0, 2.0, np.nan])
print(c == d)

0     True
1     True
2    False
dtype: bool

print(c.eq(d))

0     True
1     True
2    False
dtype: bool

Installed Versions

INSTALLED VERSIONS
------------------
commit           : bb1f651536508cdfef8550f93ace7849b00046ee
python           : 3.8.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-1059-aws
Version          : #62~18.04.1-Ubuntu SMP Fri Oct 22 21:51:38 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.4.0
numpy            : 1.21.2
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.2.4
setuptools       : 58.0.4
Cython           : 0.29.25
pytest           : 6.2.5
hypothesis       : None
sphinx           : 4.2.0
blosc            : None
feather          : None
xlsxwriter       : 3.0.2
lxml.etree       : 4.7.1
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.29.0
pandas_datareader: None
bs4              : 4.10.0
bottleneck       : 1.3.2
fastparquet      : None
fsspec           : 2022.01.0
gcsfs            : None
matplotlib       : 3.5.0
numba            : 0.51.2
numexpr          : 2.8.1
odfpy            : None
openpyxl         : 3.0.9
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.3
sqlalchemy       : 1.4.27
tables           : 3.6.1
tabulate         : 0.8.9
xarray           : None
xlrd             : 2.0.1
xlwt             : 1.3.0
zstandard        : None
@wl2522 wl2522 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 24, 2022
@phofl
Member

phofl commented Jan 25, 2022

The Series with the NA value has dtype object, hence this result is correct. When you specify the Int64 dtype, you get the expected result:

a = pd.Series([1, 2, 3])
b = pd.Series([1, 2, pd.NA], dtype="Int64")
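As a rough illustration of the suggestion above, the sketch below runs the comparison with the nullable `Int64` dtype and also converts an existing object-dtype Series with `astype("Int64")`. The variable names `b_object`, `b_nullable`, and `result` are illustrative only. Note that the row containing `pd.NA` compares as `<NA>` rather than `False`, which differs from the `np.nan` behavior.

```python
import pandas as pd

a = pd.Series([1, 2, 3])

# Without an explicit dtype, mixing ints with pd.NA gives an object Series.
b_object = pd.Series([1, 2, pd.NA])
print(b_object.dtype)  # object

# Converting to the nullable integer dtype keeps pd.NA as a masked value.
b_nullable = b_object.astype("Int64")

result = a == b_nullable
print(result)
# The first two rows are True; the third row is <NA>, not False,
# and the result has the nullable "boolean" dtype.
```

If a plain `bool` result is needed, `result.fillna(False)` is one option, assuming a missing comparison should count as unequal.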

@phofl phofl closed this as completed Jan 25, 2022
@phofl phofl added ExtensionArray Extending pandas with custom dtypes or arrays. Usage Question Numeric Operations Arithmetic, Comparison, and Logical operations and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 25, 2022
@jbrockmendel
Member

@phofl I don't think this is cut-and-dried; we'd expect the first two elements to be True. There are a couple of issues about pd.NA not behaving well inside object-dtype Series/arrays.

@wl2522
Author

wl2522 commented Jan 25, 2022

thank you @phofl for pointing out the dtype difference between the two series!

as @jbrockmendel implied, it seems like there are at least two issues/behaviors happening here:

  1. creating a series that mixes int with pd.NA without specifying dtype automatically creates an object series, which was reported in BUG: mix of int and pd.NA defaults to object dtype #33662 and sounds like it may be changed in the future?
  2. there's some underlying behavior beyond just the dtype difference that i'm not understanding because modifying my example so that it doesn't involve pd.NA gives the expected result:
a = pd.Series([1, 2, 3], dtype="Int64")
c = pd.Series([1, 2, 5], dtype=object)
print(a == c)

0     True
1     True
2    False
dtype: boolean

edit: replacing the 5 in the last row of series c with other types of data such as 'a' or np.nan also gives the expected result of True, True, False

@jbrockmendel
Member

creating a series that mixes int with pd.NA without specifying dtype automatically creates an object series

That is expected for the foreseeable future.

there's some underlying behavior beyond just the dtype difference that i'm not understanding because modifying my example so that it doesn't involve pd.NA gives the expected result:

I'm pretty sure this is driven by pd.NA's PITA behavior when inside an object-dtype array-like (xref #32931, #33066).
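A small sketch of the scalar behavior that likely sits behind this: comparing `pd.NA` with anything returns `pd.NA` rather than a boolean, and `pd.NA` refuses to be coerced to `bool`. How an object-dtype comparison handles that internally is not shown here and may differ between pandas versions; the variable names are illustrative.

```python
import pandas as pd

# A comparison with pd.NA propagates the missing value instead of
# returning True or False.
scalar_result = pd.NA == 1
print(scalar_result is pd.NA)  # True

# pd.NA has no truth value, so code that needs a plain bool from an
# element-wise comparison cannot use the result directly.
try:
    bool(pd.NA)
    coerced = True
except TypeError as exc:
    coerced = False
    print(f"TypeError: {exc}")
```

By contrast, `np.nan == 1` evaluates to a plain `False`, which is why the float example earlier in the thread gives True, True, False.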
