BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nan #35392

Closed
mhabets opened this issue Jul 23, 2020 · 3 comments · Fixed by #35498
Labels
Bug; Indexing (Related to indexing on series/frames, not to indexes themselves); Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

mhabets commented Jul 23, 2020

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
axis = pd.Index([np.nan, 'var1', np.nan])
axis.get_indexer_for([np.nan])

Current Output

array([-1], dtype=int64)
This implies that np.nan is not in axis, which is incorrect, and it causes df.drop(columns=[np.nan]) to fail when the columns contain multiple NaN values.

Expected Output

array([0, 2], dtype=int64)
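
Until the lookup itself is fixed, one possible workaround is to locate the missing labels with a NaN-aware mask rather than with get_indexer_for. The sketch below is not part of the original report; it assumes that Index.isna() flags both NaN labels, and the names nan_positions, df and without_nan_columns are illustrative only.

```python
import numpy as np
import pandas as pd

axis = pd.Index([np.nan, "var1", np.nan])

# Locate the missing labels with a boolean mask instead of a hash-based lookup.
nan_positions = np.flatnonzero(axis.isna())  # positions 0 and 2

# Illustrative frame whose columns contain two NaN labels.
df = pd.DataFrame([[1, 2, 3]], columns=axis)

# Instead of df.drop(columns=[np.nan]), keep the columns whose label is not missing.
without_nan_columns = df.loc[:, ~df.columns.isna()]
```

This avoids the lookup path entirely, so it only helps when the goal is to find or drop missing labels, not arbitrary ones.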

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.0.5
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200714
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.20.3
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.2
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.9
numba : 0.50.1

@mhabets mhabets added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 23, 2020
@mhabets mhabets changed the title BUG: Index.get_indexer_non_unique misbehaves when passed with duplicated nan BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nans Jul 23, 2020
@mhabets mhabets changed the title BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nans BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nan Jul 23, 2020
@simonjayhawkins simonjayhawkins added Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 23, 2020
@simonjayhawkins
Member

Thanks @mhabets for the report. I can confirm that this issue persists on master and also on 0.25.3, so not a recent regression.

@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Jul 23, 2020
@SanthoshBala18
Contributor

One additional point to note: if we pass any other value along with np.nan, the output is returned as expected.
For example:

axis = pd.Index([np.nan, 'var1', np.nan])
axis.get_indexer_for([np.nan, 'var1'])

This returns: array([0, 2, 1])

alexhlim added a commit to alexhlim/pandas that referenced this issue Jul 31, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.3 Dec 28, 2020
@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021
@jbrockmendel
Member

> One additional point to be noted, if we pass any other value along with np.nan, output is returned as expected

Not quite. If we pass another value that is also numeric, the problem remains. When the input is all-numeric, it gets coerced to a numeric-dtype Index. Then, when that Index is passed to get_indexer_for, it does an astype(object), which creates a new NaN object and leads to set-semantics problems.
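
The set-semantics problem can be reproduced in plain Python, without pandas. The sketch below is only an illustration of the mechanism, not the pandas implementation: is_nan and nan_aware_positions are hypothetical helpers, and nan_b stands in for the new NaN object that astype(object) is said to create.

```python
import math

nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object

# NaN never compares equal, not even to itself.
self_equal = nan_a == nan_a  # False

# Set membership checks identity before equality, so the very same
# object is found while a different NaN object is not.
targets = {nan_a}
same_object_found = nan_a in targets   # True
other_object_found = nan_b in targets  # False


def is_nan(value):
    """True only for float NaN; labels such as 'var1' are never NaN."""
    return isinstance(value, float) and math.isnan(value)


def nan_aware_positions(labels, target):
    """Hypothetical helper: positions matching target, treating all NaNs as equal."""
    return [
        i
        for i, label in enumerate(labels)
        if label == target or (is_nan(label) and is_nan(target))
    ]


labels = [nan_a, "var1", float("nan")]
positions = nan_aware_positions(labels, nan_b)  # [0, 2]
```

An explicit NaN check like this does not rely on object identity, which is one way to sidestep the set-based lookup; whether pandas should take that route is the open question in this thread.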

@realead think we should just avoid using python sets altogether here? see discussion in #35498
