Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/ERR/ENH: Allow MultiIndex.from_tuples to handle NaNs #23578

Closed
h-vetinari opened this issue Nov 8, 2018 · 4 comments · Fixed by #52560
Closed

API/ERR/ENH: Allow MultiIndex.from_tuples to handle NaNs #23578

h-vetinari opened this issue Nov 8, 2018 · 4 comments · Fixed by #52560
Labels
good first issue Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Needs Tests Unit test(s) needed to prevent regressions

Comments

@h-vetinari
Copy link
Contributor

h-vetinari commented Nov 8, 2018

This is the origin of #23558 and #23677, but IMO worthy to fix in its own right:

Trying to create a MultiIndex from a list of tuples containing NaNs (like Index.str.partition does if there are NaNs present) yields:

>>> pd.MultiIndex.from_tuples([('a', 'b', 'c'), np.nan, ('d', '', '')])
[...]
TypeError: object of type 'float' has no len()

However, it works easily when passing a tuple of NaNs

>>> pd.MultiIndex.from_tuples([('a', 'b', 'c'), (np.nan,) * 3, ('d', '', '')])
MultiIndex(levels=[['a', 'd'], ['', 'b'], ['', 'c']],
           labels=[[0, -1, 1], [1, -1, 0], [1, -1, 0]])

In fact, the length of the tuple is irrelevant:

>>> pd.MultiIndex.from_tuples([('a', 'b', 'c'), (np.nan,), ('d', '', '')])
MultiIndex(levels=[['a', 'd'], ['', 'b'], ['', 'c']],
           labels=[[0, -1, 1], [1, -1, 0], [1, -1, 0]])

Aside from the inconvenience, it is also a real problem to set elements of an array (say if there are several NaNs) to tuples, because they almost always get interpreted as another axis/list-like and then give creative errors.

I'm thinking this extra fail-safe should not be controversial, and I've got a fix prepared already, which is blocked by #23582 (at least in terms of adding tests).

@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex labels Nov 11, 2018
@gfyoung
Copy link
Member

gfyoung commented Nov 11, 2018

cc @toobaz

@toobaz
Copy link
Member

toobaz commented Nov 13, 2018

I guess there should be a link between the two following proposals I see in this issue, but I'm failing to grasp it.

  1. pd.MultiIndex.from_tuples([('a', 'b', 'c'), np.nan, ('d', '', '')]) fails: I think that's perfectly fine. Or at least, I would not call it a bug. Sure, there is scope for an enhancement, but one that I do not feel very important, and that at the same time doesn't seem easy to implement/maintain.
    Incidentally, I think that BUG: Index.str.partition not nan-safe #23558 should be fixed by just passing a tuple of NaNs, and that pd.MultiIndex.from_tuples([('a', 'b', 'c'), (np.nan,), ('d', '', '')]) should have raised - but never mind because it is not harmful and changing it now might break code.

Aside from the inconvenience, it is also a real problem to set elements of an array (say if there are several NaNs) to tuples, because they almost always get interpreted as another axis/list-like and then give creative errors.

What do you mean by "an array"? Creating a MultiIndex with tuple labels is easy:

In [2]: pd.MultiIndex.from_tuples([('a', 'b', 'c'), ('d', (2, 3), '')])
Out[2]: 
MultiIndex(levels=[['a', 'd'], [(2, 3), 'b'], ['', 'c']],
           labels=[[0, 1], [1, 0], [1, 0]])

In any case, I fail to see the link with point 1).

@mroeschke
Copy link
Member

This failing example seem to work now. Could use a test

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Enhancement labels Apr 4, 2023
@srkds
Copy link
Contributor

srkds commented Apr 8, 2023

This failing example seem to work now. Could use a test

Hi @mroeschke

I went through the issue and related discussions and I found commit , mentioned here

It contains the same taste case that I think we want to add.
but I think this is only committed to Fork and I didn't find it on the main.

I would like to contribute, could you please guide me on how should I proceed next?

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants