Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Index.get_indexer_non_unique misbehaves with multiple nan #35498

Merged
merged 4 commits into from
Aug 5, 2021

Conversation

alexhlim
Copy link
Member

Looking at the implementation in index.pyx, I notice that when [np.nan] is passed to get_indexer_non_unique, the code was not able to get passed the __contains__ method for stargets (line 314).

Further testing (python=3.7.7, numpy=1.18.5):

import numpy as np

# Case 1: Does not work -> prints nothing
# nan dtype: np.float64, ndarray dtype: np.float64
targets = np.array([np.nan])

# Case 2: Works -> prints 0, 1, 2
# nan dtype: U3, ndarray dtype: <U32
targets = np.array([np.nan, 'var1'])

values = np.array([np.nan, 'var1', np.nan])

stargets = set(targets)

for i, v in enumerate(values):
    if v in stargets:
        print(i)

Case 1 and 2 results differ because of the dtype of nan (U3 vs float64).

Upon further research, I figured out that np.nan != np.nan as per IEEE (when it is a float) and creating a set from a np.array could lead to some bizarre results (numpy/numpy#9358). Also, since a dictionary is the main data structure in this method to keep track of the targets indices, I don't think it is ideal to use nans as keys (https://stackoverflow.com/questions/6441857/nans-as-key-in-dictionaries).

I thought it would be appropriate to replace nans (with 0s) in the targets and values arrays in order to avoid the problems stated above. When considering where to replace the nans, I thought of two places where it could potentially happen:

  1. In get_indexer_non_unique (/pandas/core/indexes/base.py)
  2. In get_indexer_non_unique (/pandas/_libs/index.pyx)

Including the changes in 1. would mean overwriting the the Index object's properties, so I decided to include the changes in 2.

FYI -- I wasn't sure if the test I included was in the correct file. Please let me know if you would like this test to be in another file.

Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, some comments

pandas/_libs/index.pyx Outdated Show resolved Hide resolved
pandas/tests/base/test_misc.py Outdated Show resolved Hide resolved
@alexhlim
Copy link
Member Author

Thanks for the comments, @arw2019! I just committed the changes.

Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

Might want to move the release note to 1.2 - I think 1.1.1 is intended for regressions from 1.0.5

alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 1, 2020
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run the indexing benchmarks to check for performance regressions?

pandas/_libs/index.pyx Outdated Show resolved Hide resolved
@WillAyd WillAyd added the Indexing Related to indexing on series/frames, not to indexes themselves label Aug 4, 2020
pandas/tests/indexes/test_base.py Outdated Show resolved Hide resolved
pandas/_libs/index.pyx Outdated Show resolved Hide resolved
pandas/tests/indexes/test_base.py Outdated Show resolved Hide resolved
alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020
alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020
alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020
alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020
alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 11, 2020
pandas/_libs/index.pyx Outdated Show resolved Hide resolved
pandas/_libs/index.pyx Outdated Show resolved Hide resolved
alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 18, 2020
@alexhlim alexhlim requested a review from WillAyd August 18, 2020 15:53
pandas/tests/indexes/test_indexing.py Outdated Show resolved Hide resolved
alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 21, 2020
alexhlim added a commit to alexhlim/pandas that referenced this pull request Aug 21, 2020
@alexhlim
Copy link
Member Author

alexhlim commented Aug 22, 2020

@WillAyd I added a check to disambiguate between 0 and np.nan. I believe the CI failure is unrelated (test_s3_roundtrip_for_dir). Let me know what you think!

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you merge master and a few comments.

pandas/_libs/index.pyx Outdated Show resolved Hide resolved
pandas/_libs/index.pyx Outdated Show resolved Hide resolved
pandas/tests/indexes/test_indexing.py Outdated Show resolved Hide resolved
alexhlim added a commit to alexhlim/pandas that referenced this pull request Sep 14, 2020
alexhlim added a commit to alexhlim/pandas that referenced this pull request Sep 14, 2020
alexhlim added a commit to alexhlim/pandas that referenced this pull request Sep 14, 2020
alexhlim added a commit to alexhlim/pandas that referenced this pull request Sep 14, 2020
@alexhlim
Copy link
Member Author

@jbrockmendel I noticed you submitted a PR that will close the original issue. if that's the case, would you like me to close this PR? or are there some changes that I have incorporated in this PR that you would like merged?

@jbrockmendel
Copy link
Member

I noticed you submitted a PR that will close the original issue. if that's the case, would you like me to close this PR? or are there some changes that I have incorporated in this PR that you would like merged?

I was hoping you'd port whatever parts of my PR are useful and then I'll close mine. You did most of the work here.

@alexhlim
Copy link
Member Author

I noticed you submitted a PR that will close the original issue. if that's the case, would you like me to close this PR? or are there some changes that I have incorporated in this PR that you would like merged?

I was hoping you'd port whatever parts of my PR are useful and then I'll close mine. You did most of the work here.

Thanks for the guidance, I will definitely port your work into this PR!

@alexhlim
Copy link
Member Author

alexhlim commented Jul 3, 2021

@jbrockmendel I liked the simplicity of your implementation better, so I decided to use it instead of what I currently have. I added a test case for the matching-but-not-identical nans. also, I rebased this PR so there's a cleaner commit history.

@alexhlim
Copy link
Member Author

alexhlim commented Jul 3, 2021

@jreback ran the index asv: some performance decreased:

asv continuous -f 1.1 upstream/master HEAD -b ^indexing

       before           after         ratio
     [5675cd8a]       [85dd25ce]
     <master>         <nui-regression>
+        81.5±2ms          107±2ms     1.31  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
+        80.4±1ms          105±3ms     1.31  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
+        81.8±2ms          106±1ms     1.29  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
+      83.3±0.9ms        105±0.9ms     1.26  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
+        81.9±2ms          101±2ms     1.24  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
+      8.14±0.3ms       10.0±0.2ms     1.23  indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
+        88.4±2ms          106±2ms     1.19  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+        88.5±1ms          106±1ms     1.19  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+         128±2μs          143±3μs     1.12  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@jbrockmendel
Copy link
Member

LGTM @alexhlim can you merge master

@jbrockmendel
Copy link
Member

LGTM pending green cc @jreback

@alexhlim
Copy link
Member Author

@jbrockmendel noticed that CI / Checks is failing:

  1. [doctest] pandas.core.generic.NDFrame.to_xarray
  2. Typing validation

Are these errors something that my PR caused? If not, should I wait until these errors are resolved on master and re-merge?

@jbrockmendel
Copy link
Member

Looks like those are unrelated, #42716

@alexhlim
Copy link
Member Author

alexhlim commented Aug 3, 2021

@jbrockmendel @jreback pinging b/c green

Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM cc @jreback

@jreback jreback added this to the 1.4 milestone Aug 4, 2021
@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 4, 2021
@jreback
Copy link
Contributor

jreback commented Aug 4, 2021

thanks @alexhlim very nice, will merge on green.

@jreback jreback merged commit bf267b4 into pandas-dev:master Aug 5, 2021
@jreback
Copy link
Contributor

jreback commented Aug 5, 2021

thanks @alexhlim

@alexhlim
Copy link
Member Author

alexhlim commented Aug 5, 2021

Thank you everyone, especially @jbrockmendel and @jreback for guiding me through this PR. I definitely learned a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nan
8 participants