BUG: get_indexer_non_unique with np.nan #42289

jbrockmendel · 2021-06-28T21:31:40Z

closes BUG: Index.get_indexer_non_unique misbehaves when index contains multiple nan #35392
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

One of the tests is copied from #35498
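For readers unfamiliar with the method, the intended semantics can be sketched in pure Python. This is only an illustrative reference, not the pandas implementation: the function name `get_indexer_non_unique_reference` and the helper `_is_nan` are hypothetical, and the sketch assumes that every NaN in the target should match every NaN position in the index.

```python
import math


def _is_nan(value):
    # np.nan is a plain Python float, so this check covers it as well.
    return isinstance(value, float) and math.isnan(value)


def get_indexer_non_unique_reference(index_values, target_values):
    """Return (indexer, missing) as plain lists.

    indexer: for each target value, all positions in the index that match,
             or a single -1 when nothing matches.
    missing: positions in the target that had no match.
    """
    positions = {}      # non-NaN value -> list of positions in the index
    nan_positions = []  # all NaNs are grouped together (assumed semantics)
    for pos, value in enumerate(index_values):
        if _is_nan(value):
            nan_positions.append(pos)
        else:
            positions.setdefault(value, []).append(pos)

    indexer, missing = [], []
    for target_pos, value in enumerate(target_values):
        hits = nan_positions if _is_nan(value) else positions.get(value, [])
        if hits:
            indexer.extend(hits)
        else:
            indexer.append(-1)
            missing.append(target_pos)
    return indexer, missing


# Two distinct NaN objects in the index, a third one in the target.
result = get_indexer_non_unique_reference(
    [float("nan"), "var1", float("nan")],
    [float("nan"), "var2"],
)
```

Here `result` is `([0, 2, -1], [1])`: the NaN in the target matches both NaN positions, and `"var2"` is reported as missing.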

realead · 2021-06-29T04:04:55Z

I think that, in order to be consistent, one could use _libs.hashtable.PyObjectHashTable (instead of Python's set and dict): it has the "expected" behavior when it comes to nan-floats and can also handle more complex cases (like (float("nan"),) or complex(0, float("nan"))) out of the box.

The downside is that one has to misuse a table as a set (#39799 would fix that).
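A minimal sketch of why Python's built-in set and dict are awkward here, using only the standard library. It relies on two facts about CPython containers: NaN never compares equal to itself, and membership tests check identity before equality. The variable names are illustrative only.

```python
nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object

# NaN != NaN, so two distinct NaN objects end up as two separate set members.
two_objects = {nan_a, nan_b}

# The identity shortcut deduplicates the *same* object.
one_object = {nan_a, nan_a}

# Dict lookups therefore succeed only for the identical NaN object.
lookup = {nan_a: "position 0"}
found_same = nan_a in lookup
found_other = nan_b in lookup

# The same applies to NaN nested inside a tuple ...
tuple_same = (nan_a,) in {(nan_a,)}
tuple_other = (nan_b,) in {(nan_a,)}

# ... and to complex numbers with a NaN component.
complex_other = complex(0, nan_b) in {complex(0, nan_a)}
```

So whether a NaN "matches" depends on object identity rather than on the value, which is the inconsistency a NaN-aware hash table avoids.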

jbrockmendel · 2021-06-30T14:58:18Z

@realead IIUC what you're suggesting is viable medium-term but likely not short-term? i.e. is it worth doing a temporary fix like this PR without what you're describing?

realead · 2021-07-02T05:59:53Z

I must confess, I assumed the solution would be within reach.

But there are some stumbling blocks:

pandas' PyObjectHashTable doesn't perform reference counting

https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/hashtable_class_helper.pxi.in#L1126-L1138

so using it as set/dict with temporary objects is problematic.

Also, PyObjectHashTable maps objects to int64 and not to PyObject, which is a problem at least for performance. Maybe it is worth introducing PandasSet and PandasDict, which would have the same handling of nans as the pandas algorithms. But until then, a fix like that is probably the best we can do.
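To make the idea concrete, here is a rough pure-Python sketch of what a NaN-aware mapping could behave like. It is not the proposed khash-based PandasDict: the class name `NanAwareDict`, the helper `_normalize`, and the sentinels are hypothetical, and the sketch swaps in key normalization on top of a regular dict. Because it wraps a dict, it holds strong references to its keys and can map to arbitrary Python objects rather than int64.

```python
import math

_NAN = object()      # hypothetical sentinel standing in for "any float NaN"
_COMPLEX = object()  # hypothetical tag for complex numbers containing NaN


def _normalize(key):
    """Map NaN-containing keys to a canonical form so that they compare equal."""
    if isinstance(key, float) and math.isnan(key):
        return _NAN
    if isinstance(key, complex) and (math.isnan(key.real) or math.isnan(key.imag)):
        return (_COMPLEX, _normalize(key.real), _normalize(key.imag))
    if isinstance(key, tuple):
        return tuple(_normalize(item) for item in key)
    return key


class NanAwareDict:
    """Dict-like container in which all float NaNs are treated as one key."""

    def __init__(self):
        self._data = {}  # a regular dict, so keys and values are kept alive

    def __setitem__(self, key, value):
        self._data[_normalize(key)] = value

    def __getitem__(self, key):
        return self._data[_normalize(key)]

    def __contains__(self, key):
        return _normalize(key) in self._data

    def __len__(self):
        return len(self._data)


table = NanAwareDict()
table[float("nan")] = [0, 2]                # values can be any Python object
table[(float("nan"),)] = "tuple key"
table[complex(0, float("nan"))] = "complex key"
```

Every lookup below uses a freshly created temporary NaN, yet finds the entry stored earlier: `table[float("nan")]`, `table[(float("nan"),)]`, and `table[complex(0, float("nan"))]`.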

jbrockmendel · 2021-07-03T18:34:40Z

> But until then, a fix like that is probably the best we can do.

Makes sense, thanks for taking a look.

> Maybe it is worth introducing PandasSet and PandasDict, which would have the same handling of nans as the pandas algorithms

I'll defer to you on this, as you've mentioned using other khash functionality before. I'm a bit wary about the increasing build size.

jbrockmendel · 2021-07-24T22:25:44Z

closing in favor of #35498

BUG: get_indexer_non_unique with np.nan

13ee8fa

jbrockmendel added the Bug, Indexing, and Missing-data labels Jun 30, 2021

jbrockmendel closed this Jul 24, 2021

jbrockmendel deleted the bug-get_indexer_non_unique-nan branch July 24, 2021 22:25
