Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: get_indexer_non_unique with np.nan #42289

Conversation

jbrockmendel
Copy link
Member

One of the tests is copied from #35498

@realead
Copy link
Contributor

realead commented Jun 29, 2021

I think in order to be consistent, one could use _lib.hashtable.PyObjectHashTable (instead of Python's set and dict) - it has the "expected" behavior when it comes to nan-floats and can also handle more complex cases (like (float("nan),) or complex(0, float("nan"))) out-of-the-box.

The downside is, that one has to misuse a table for a set (#39799 would fix that).

@jbrockmendel jbrockmendel added Bug Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jun 30, 2021
@jbrockmendel
Copy link
Member Author

@realead IIUC what you're suggesting is viable medium-term but likely not short-term? i.e. is it worth doing a temporary fix like this PR without what you're describing?

@realead
Copy link
Contributor

realead commented Jul 2, 2021

I must confess, I assumed the solution would be within reach.

But there are some stumbling blocks:

Panda's PyObjectHashTable doesn't perform reference counting

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/hashtable_class_helper.pxi.in#L1126-L1138

so using it as set/dict with temporary objects is problematic.

Also PyObjectHashTable maps objects to int64 and not pyobject, which is at least a problem for performance. Maybe it is worth to introduce PandasSet and PandasDict which would have the same handling of nans as pandas algorithm. But until then a fix like that is probably best we can do.

@jbrockmendel
Copy link
Member Author

But until then a fix like that is probably best we can do.

Makes sense, thanks for taking a look.

Maybe it is worth to introduce PandasSet and PandasDict which would have the same handling of nans as pandas algorithm

I'll defer to you on this, as you've mentioned using other khash functionality before. I'm a bit wary about the increasing build size.

@jbrockmendel
Copy link
Member Author

closing in favor of #35498

@jbrockmendel jbrockmendel deleted the bug-get_indexer_non_unique-nan branch July 24, 2021 22:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nan
2 participants