Separate MultiIndex names from levels #27242

topper-123 · 2019-07-05T04:47:17Z

progress towards Add MultiIndex._data and MultiIndex.array #27138
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

In #27138 I proposed some changes to MultiIndex so that the index type can have its data collected in _data as type List[Categorical], plus adding MultiIndex.arrays in order to access each full level as a zero-copy Categorical.

This is the first part of that proposal, and drops setting the names on the levels[x].name attribute and instead sets the names on the MultiIndex._names attribute.

This PR is a minor backward-incompatible change (so it would be good to get it into 0.25), while the follow-up will not break anything.
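
As a rough illustration of the direction described above, names can be read and changed through the MultiIndex itself rather than through the individual level objects. The sketch below uses only public pandas API (`names`, `set_names`, `get_level_values`); the sample data is made up for illustration and it does not touch the private `_names` attribute.

```python
import pandas as pd

# Sample data is illustrative only.
mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

# Names are exposed on the MultiIndex itself.
names_before = list(mi.names)

# set_names returns a new MultiIndex; the original is left unchanged.
renamed = mi.set_names("z", level=0)
names_after = list(renamed.names)

# Level values materialised from the MultiIndex carry its name for that level.
first_level_values = renamed.get_level_values(0)

print(names_before, names_after, first_level_values.name)
```

Code that needs a level's name can therefore ask the MultiIndex for it, which keeps working regardless of whether the level objects themselves carry names.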

jreback

In MultiIndex, there's no problem setting the level.name attribute directly, but in frame.py or reshape.py I would avoid doing this and instead use the name= parameter of ._shallow_copy(), or use .rename() (on a level).

jreback · 2019-07-05T20:48:35Z

pandas/core/indexes/multi.py

@@ -259,6 +259,7 @@ def __new__(
        result._set_levels(levels, copy=copy, validate=False)
        result._set_codes(codes, copy=copy, validate=False)

+        result._names = [None for _ in levels]


[None] * len(levels)
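
The two spellings build equal lists. A small check, using a made-up `levels` list as a stand-in: repeating `None` with `*` is safe because `None` is immutable, whereas the same idiom with a mutable element would alias a single object.

```python
# Stand-in for the levels sequence; the contents are illustrative only.
levels = [["a", "b"], [1, 2, 3]]

names_comprehension = [None for _ in levels]
names_repeat = [None] * len(levels)

# Both forms build a list with one None per level.
same = names_comprehension == names_repeat

# Caveat: with a mutable element, repetition aliases one object.
aliased = [[]] * 2
aliased[0].append("x")  # the change shows up in both slots
```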

jreback · 2019-07-05T20:49:42Z

pandas/core/reshape/reshape.py

@@ -260,10 +260,13 @@ def get_new_values(self):
    def get_new_columns(self):
        if self.value_columns is None:
            if self.lift == 0:
-                return self.removed_level
+                lev = self.removed_level._shallow_copy()


why wouldn't you do
lev = self.removed_level._shallow_copy(name=self.removed_name) ?

_shallow_copy, rename, and the other indirect ways of setting .name all call ._set_names, which does a lot of checks. Those checks are not needed in this internal functionality, as the name has already been validated.

Perhaps have a fastpath parameter in _set_names?
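
To make the fastpath idea concrete, here is a minimal sketch on a toy class. `ToyIndex` and the `validate` keyword are hypothetical names chosen for illustration; this is not the signature of pandas' `_set_names`.

```python
from collections.abc import Hashable


class ToyIndex:
    """Toy stand-in for an index; not the pandas implementation."""

    def __init__(self, values):
        self.values = list(values)
        self._names = [None]

    def _set_names(self, names, validate=True):
        # `validate` is a hypothetical fast-path switch for trusted callers.
        if validate:
            if not isinstance(names, list):
                raise TypeError("names must be a list")
            if len(names) != 1:
                raise ValueError("expected exactly one name")
            for name in names:
                if not isinstance(name, Hashable):
                    raise TypeError("names must be hashable")
        self._names = list(names)


idx = ToyIndex([1, 2, 3])
idx._set_names(["checked"])                  # validated path
idx._set_names(["trusted"], validate=False)  # internal path, checks skipped
```

The trade-off is the usual one: skipping validation is only safe when every caller of the fast path has already checked its input.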

jreback · 2019-07-05T20:50:44Z

pandas/core/reshape/reshape.py

@@ -658,7 +663,9 @@ def _convert_level_number(level_num, columns):
        new_names = this.columns.names[:-1]
        new_columns = MultiIndex.from_tuples(unique_groups, names=new_names)
    else:
-        new_columns = unique_groups = this.columns.levels[0]
+        new_columns = this.columns.levels[0]._shallow_copy()


use name= here

jreback · 2019-07-05T20:50:56Z

pandas/core/reshape/reshape.py

@@ -302,7 +305,9 @@ def get_new_index(self):
            lev, lab = self.new_index_levels[0], result_codes[0]
            if (lab == -1).any():
                lev = lev.insert(len(lev), lev._na_value)
-            return lev.take(lab)
+            new_index = lev.take(lab)


I would use .rename()

jreback · 2019-07-05T20:51:02Z

pandas/core/reshape/reshape.py


-            lev = self.removed_level
-            return lev.insert(0, lev._na_value)
+            lev = self.removed_level.insert(0, item=self.removed_level._na_value)


I would use .rename()

jreback · 2019-07-05T20:51:20Z

pandas/tests/frame/test_alter_axes.py

@@ -979,7 +979,7 @@ def test_reset_index(self, float_frame):
        ):
            values = lev.take(level_codes)
            name = names[i]
-            tm.assert_index_equal(values, Index(deleveled[name]))
+            tm.assert_index_equal(values, Index(deleveled[name]), check_names=False)


why is this changed?

lev.take(level_codes) doesn't provide a name any more, while a reset index does provide its Series with a name (as it should).

I've added a test assert values.name is None to make this more explicit.
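
For readers unfamiliar with the `check_names` flag, the sketch below shows its effect on two indexes that differ only in name. It uses `pd.testing.assert_index_equal`, which to my knowledge is the public counterpart of the `tm.assert_index_equal` helper used in the test suite; the sample values are made up.

```python
import pandas as pd

values = pd.Index([1, 2, 3])           # no name
named = pd.Index([1, 2, 3], name="A")  # same values, named

# The values match, so the comparison passes once names are ignored.
pd.testing.assert_index_equal(values, named, check_names=False)

# With the default check_names=True the name mismatch is reported.
try:
    pd.testing.assert_index_equal(values, named)
    names_differ = False
except AssertionError:
    names_differ = True
```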

topper-123 · 2019-07-06T20:52:12Z

I’ve changed how the name is set. This is a bit slower (many checks that are not needed), but that could be fixed separately in a later PR.

jreback · 2019-07-06T21:50:20Z

pandas/tests/test_multilevel.py

@@ -1609,12 +1607,12 @@ def test_constructor_with_tz(self):
        )

        result = MultiIndex.from_arrays([index, columns])
-        tm.assert_index_equal(result.levels[0], index)
-        tm.assert_index_equal(result.levels[1], columns)
+        tm.assert_index_equal(result.levels[0], index, check_names=False)


I now find these tests very confusing that we lose the names on the levels themselves. (I know that's the point of this PR).

Maybe add a .set_names(index.name) (for example) and remove the check_names arg (so it's the default of True).

I've made changes so we avoid check_names=False (so is implicitly True).
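
The suggested pattern can be sketched as follows: attach the expected name to the unnamed level first, then compare with the default `check_names=True`. The indexes here are simplified stand-ins (plain strings rather than the tz-aware data in the test).

```python
import pandas as pd

index = pd.Index(["a", "b"], name="first")
unnamed_level = pd.Index(["a", "b"])  # stand-in for a level without a name

# set_names returns a new Index; the unnamed level itself is not modified.
with_name = unnamed_level.set_names(index.name)

# Names now match, so the default (strict) comparison passes.
pd.testing.assert_index_equal(with_name, index)
```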

topper-123 · 2019-07-08T23:25:51Z

Is this ok? I'd like to get this in 0.25, as this is a breaking change. I'll add a 0.25 label as a reminder.

The rest of #27138 will be non-breaking, so can go in later, if needed.

TomAugspurger · 2019-07-09T00:19:48Z

I haven’t looked, but we shouldn’t merge breaking changes in the release candidate.

topper-123 · 2019-07-09T05:42:16Z

Yeah, I understand that, but the follow-ups to this PR will, in addition to the benefits mentioned in #27138, also allow some nice simplifications of MultiIndex (by delegating all single-level checks to Categorical). I also assume that the release of 0.25 will mean a stop to breaking changes for a while, because next up will be 1.0?

jreback · 2019-07-09T21:50:49Z

@TomAugspurger I don't believe we have held off on merging even breaking changes to an RC. I don't see this as a big deal and would merge as is.

TomAugspurger · 2019-07-10T04:11:16Z

I haven't had a chance to look (and won't this week), but if we're merging API changes in RC0 then we'll need a second RC.

topper-123 · 2019-07-24T23:00:56Z

Ping.

doc/source/whatsnew/v0.25.0.rst

topper-123 · 2019-07-26T14:05:50Z

I've moved the whatsnew section to 1.0.

doc/source/whatsnew/v1.0.0.rst

jreback · 2019-07-27T14:41:41Z

@topper-123 lgtm (tiny comment). @TomAugspurger any objections?

TomAugspurger · 2019-07-29T17:06:13Z

Question on the broad goal of #27138: IIUC, the motivation is for MultiIndex to be backed by a _data: List[Categorical]. I don't necessarily agree that that goal requires API breaking changes here.

Why can .levels not return something like

@property
def levels(self):
    return FrozenList(pd.Index(self._data[i], name=self.names[i]) for i in range(self.nlevels))

TomAugspurger · 2019-10-15T21:43:45Z

@topper-123 do you have thoughts on #27242 (comment)?

topper-123 · 2019-10-16T09:00:48Z

As I see it, this will for practical purposes keep the names in two locations

No, they'll be stored in one place and accessible from two locations. This is the same as on master, only the source of truth and the referring location are swapped.

But if someone does

>>> mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])
>>> lev = mi.levels[0]
>>> mi = mi.set_names('z', level=0)
# then
>>> mi.names[0], lev.name
('z', 'x')

So the names would be stored in two places (or users should not store individual levels separately, which they can't be expected to know). For this reason I think it's most practical to make a clean cut.

EDIT: Ok I got an idea: What if we deprecate levels and use the name categories instead? In that case we could do (after implementing #27138):

@property
def levels(self) -> FrozenList[Index]:
    warnings.warn(...)
    return FrozenList(pd.Index(lev.categories, name=name) for name, lev in zip(self.names, self._data))

@property
def categories(self) -> FrozenList[Index]:
    return FrozenList(lev.categories for lev in self._data)

This would also make the API for MultiIndex be more similar to CategoricalIndex.
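
A runnable toy version of that idea, to show how the two accessors would differ. `ToyMultiIndex` is a hypothetical stand-in, plain tuples replace `FrozenList`, and the warning text and category are assumptions; none of this is the pandas implementation.

```python
import warnings

import pandas as pd


class ToyMultiIndex:
    """Illustrative stand-in only; not the pandas MultiIndex."""

    def __init__(self, data, names):
        self._data = list(data)   # one Categorical per level
        self.names = list(names)  # names live on the container

    @property
    def categories(self):
        # Unnamed unique values of each level.
        return tuple(cat.categories for cat in self._data)

    @property
    def levels(self):
        # Deprecated accessor: wraps the categories with the container's names.
        warnings.warn(
            "levels is deprecated; use categories", FutureWarning, stacklevel=2
        )
        return tuple(
            pd.Index(cat.categories, name=name)
            for name, cat in zip(self.names, self._data)
        )


toy = ToyMultiIndex(
    [pd.Categorical([1, 2, 1]), pd.Categorical(["a", "b", "a"])],
    names=["x", "y"],
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    level_names = [lev.name for lev in toy.levels]

category_names = [cat.name for cat in toy.categories]
```

In this sketch the names have a single source of truth (`names` on the container); `levels` only derives named copies on access, while `categories` never carries names.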

TomAugspurger · 2019-10-16T13:05:07Z

Ahh, a new name for levels is a good idea for getting around this problem.

jreback · 2019-10-16T13:06:30Z

+1 on @topper-123 new idea.

TomAugspurger · 2019-10-16T13:08:21Z

On the new name, is .categories what we want?

I suspect that in the near-term we'll have a DictEncodedArray that's like a Categorical, except it won't have the same semantics around "unobserved" categories and new categories can be added implicitly. I'm not sure what we would call the set of unique values for that thing (perhaps categories too).
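
Since no such array exists in this thread, the following is only a toy sketch of what "new categories can be added implicitly" could mean for a dictionary-encoded container. `ToyDictEncoded` and its attribute names (`uniques`, `codes`) are invented for illustration.

```python
class ToyDictEncoded:
    """Toy dictionary-encoded container; purely illustrative."""

    def __init__(self):
        self.uniques = []  # unique values in first-seen order
        self.codes = []    # position of each element in `uniques`
        self._lookup = {}

    def append(self, value):
        # Unseen values are added to the dictionary implicitly.
        if value not in self._lookup:
            self._lookup[value] = len(self.uniques)
            self.uniques.append(value)
        self.codes.append(self._lookup[value])

    def decode(self):
        return [self.uniques[code] for code in self.codes]


arr = ToyDictEncoded()
for item in ["a", "b", "a", "c"]:
    arr.append(item)
```

By contrast, a Categorical has a fixed set of categories, and assigning a value outside that set raises an error rather than extending it.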

topper-123 · 2019-10-16T13:44:07Z

I'm not set on exact name for this, but would like consistency. So maybe if you make a suggestion on the attribute name for that new array type?

By the way, I don't know if I like the name DictEncodedArray. Can't it be just EncodedArray instead?

Edit: Or is the idea that what is now categories will in the new implementation be a dict instead of an Index? Then maybe call it mapping...?

TomAugspurger · 2019-10-16T13:53:56Z

I think `.categories` is fine for now. It's a bit unfortunate that it's a `List[Categorical]` rather than an Index like on CategoricalIndex, but that's probably OK.

topper-123 · 2019-10-16T14:00:07Z

Ok, if I can get this PR merged, I will start implementing MultiIndex._data and will deprecate levels and add categories.

TomAugspurger · 2019-10-16T14:30:41Z

👍 I'm fine with merging this as long as we also do the .levels / .categories for 1.0.

jreback · 2019-10-16T14:41:10Z

thanks @topper-123

no need to create an issue, you can just ref this PR.

TomAugspurger · 2019-10-16T16:15:25Z

I made #29032 so that this isn't dropped.

jorisvandenbossche · 2019-10-17T18:26:31Z

This is breaking pyarrow (https://issues.apache.org/jira/browse/ARROW-6922).

The above is a very long discussion (which I didn't really follow before, sorry for that), but what I somewhat understand is that for 1.0 we want to restore the .levels behaviour of being able to get the name? But why was this PR then merged?
Or why was the suggestion of Tom to wrap the categories in an Index with the name as the levels (#27242 (comment)) not implemented?

jorisvandenbossche · 2019-10-17T18:33:46Z

If the idea is to deprecate the .levels attribute of a MultiIndex, I think that deserves first some separate focused discussion (as that is a big change).

Short-term, can we add back the names to .levels? From the discussion above, it seems this does not need to conflict with the refactor of the MultiIndex internals (for getting the name).

TomAugspurger · 2019-10-17T18:44:56Z

The above is a very long discussion (which I didn't really follow before, sorry for that), but what I somewhat understand is that for 1.0 we want to restore the .levels behaviour of being able to get the name? But why was this PR then merged?

Right. Restoring getting is relatively straightforward. I can put a PR up with that later today.

xref https://issues.apache.org/jira/browse/ARROW-6922 / pandas-dev#27242 (comment) / pandas-dev#29032 No docs yet, since it isn't clear how this will eventually sort out. But we at least want to preserve this behavior for 1.0

topper-123 added MultiIndex API Design labels Jul 5, 2019

topper-123 force-pushed the MultiIndex._names branch 3 times, most recently from 7d93c89 to ab2fdf5 Compare July 5, 2019 10:52

jreback requested changes Jul 5, 2019

jreback requested changes Jul 6, 2019

topper-123 force-pushed the MultiIndex._names branch 5 times, most recently from 5bcf204 to efcfeac Compare July 8, 2019 22:16

topper-123 added this to the 0.25.0 milestone Jul 8, 2019

jreback removed this from the 0.25.0 milestone Jul 17, 2019

jreback requested changes Jul 25, 2019

jreback added this to the 1.0 milestone Jul 25, 2019

topper-123 force-pushed the MultiIndex._names branch from efcfeac to e7b8927 Compare July 26, 2019 00:26

jreback reviewed Jul 27, 2019

jreback approved these changes Jul 27, 2019

topper-123 force-pushed the MultiIndex._names branch from e7b8927 to c81a26e Compare July 27, 2019 18:24

jreback merged commit 46e89b0 into pandas-dev:master Oct 16, 2019

topper-123 deleted the MultiIndex._names branch October 16, 2019 14:47

TomAugspurger mentioned this pull request Oct 16, 2019

Deprecate getting the name from MultiIndex.levels #29032

Closed

TomAugspurger mentioned this pull request Oct 17, 2019

API: Restore getting name from MultiIndex level #29061

Merged

jorisvandenbossche mentioned this pull request Oct 31, 2019

PERF: regression in MultiIndex get_loc indexing performance #29311

Closed

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

Separate MultiIndex names from levels (pandas-dev#27242)

c659785

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

Separate MultiIndex names from levels (pandas-dev#27242)

91b7e1e

bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020

Separate MultiIndex names from levels (pandas-dev#27242)

66543fb

topper-123 mentioned this pull request Feb 4, 2020

PERF: Indexing a multi-index is a lot slower #31648

Closed

asfimport mentioned this pull request Oct 18, 2019

[Python] Pandas master build is failing (MultiIndex.levels change) apache/arrow#23245

Closed
