Add MultiIndex._data and MultiIndex.array #27138

Closed
topper-123 opened this issue Jun 30, 2019 · 3 comments
Labels
Closing Candidate (may be closeable, needs more eyeballs), Enhancement, ExtensionArray (extending pandas with custom dtypes or arrays), MultiIndex, Needs Discussion (requires discussion from core team before further action)

Comments

@topper-123
Contributor

I propose adding a MultiIndex._data attribute of type List[Categorical], where all the underlying data of a MultiIndex would be stored. A MultiIndex.array property that accesses _data would also be added.

This has the advantage of collecting the data underlying a MultiIndex into one human-readable data structure. It also makes zero-copy access to the data very easy: for example, mi.array[1] would return the data of the second level as a Categorical, in an easy-to-read form.

With the above changes, a MultiIndex could be explained as just "a container over a list of Categoricals", which is easier to explain than the current model. The MultiIndex could also be related to CategoricalIndex, which is "a container over a single Categorical".

This change means that MultiIndex.levels will become a property that returns a FrozenList(cat.categories for cat in self._data), and MultiIndex.codes will be a property that returns FrozenList(cat.codes for cat in self._data).

MultiIndex.array will be added and will simply be a property that returns a FrozenList of self._data.

Performance will not be affected, as most operations would still go through MultiIndex.codes and MultiIndex.levels.
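
To make the proposed layout concrete, here is a minimal pure-Python sketch. It does not use pandas: ToyCategorical and ToyMultiIndex are hypothetical stand-ins invented for illustration, and plain tuples take the place of FrozenList. It only shows how levels and codes could be derived from a single _data list.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ToyCategorical:
    """Hypothetical stand-in for a Categorical: unique categories plus integer codes."""

    categories: Tuple[str, ...]
    codes: Tuple[int, ...]

    def values(self) -> List[str]:
        # Decode each integer code back to its category label.
        return [self.categories[code] for code in self.codes]


class ToyMultiIndex:
    """Hypothetical container over a list of categoricals, one per level."""

    def __init__(self, data: Sequence[ToyCategorical]) -> None:
        self._data = list(data)

    @property
    def array(self) -> Tuple[ToyCategorical, ...]:
        # Immutable view of the per-level data (a tuple stands in for FrozenList).
        return tuple(self._data)

    @property
    def levels(self) -> Tuple[Tuple[str, ...], ...]:
        # Derived from _data instead of being stored separately.
        return tuple(cat.categories for cat in self._data)

    @property
    def codes(self) -> Tuple[Tuple[int, ...], ...]:
        # Derived from _data instead of being stored separately.
        return tuple(cat.codes for cat in self._data)


mi = ToyMultiIndex(
    [
        ToyCategorical(("a", "b"), (0, 0, 1)),
        ToyCategorical(("x", "y"), (0, 1, 0)),
    ]
)
second_level = mi.array[1].values()
```

In this toy model, mi.array[1] hands back the second level's data as a single object, which is the readability benefit described above.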

Moving names from MultiIndex.levels to MultiIndex._names

Currently, each level's name is stored in that level's name attribute. This is not very compatible with extracting the categories from _data: .categories is actually part of the dtype, which ideally should be immutable, so we shouldn't set or change its name attribute.

To make my suggestion practical, the level names should be stored in MultiIndex._names instead, and MultiIndex.names will become a property that reads from and writes to MultiIndex._names. I think this change simplifies the MultiIndex a bit, as data and names are dealt with separately. This is a small backward-incompatible change, though.

So, I suggest making two PRs:

  1. Separating the names from the levels (to be included in 0.25).
  2. Adding _data and array, and changing levels and codes into properties.
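
As a rough illustration of the first step, the sketch below keeps level names in a separate _names list behind a names property. ToyNamedIndex is a hypothetical class written for this example, not pandas code, and the length check is an assumption about reasonable behavior rather than something specified in the proposal.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


class ToyNamedIndex:
    """Hypothetical index that stores level names apart from the level data."""

    def __init__(
        self,
        levels: Sequence[Sequence[str]],
        names: Optional[Iterable[Optional[str]]] = None,
    ) -> None:
        self._levels: List[Tuple[str, ...]] = [tuple(level) for level in levels]
        self._names: List[Optional[str]] = []
        # Route through the setter so the length check lives in one place.
        self.names = names if names is not None else [None] * len(self._levels)

    @property
    def names(self) -> Tuple[Optional[str], ...]:
        # Read from the separate store, never from the level objects.
        return tuple(self._names)

    @names.setter
    def names(self, new_names: Iterable[Optional[str]]) -> None:
        new_names = list(new_names)
        if len(new_names) != len(self._levels):
            raise ValueError("Length of names must match the number of levels.")
        # Only _names changes; the level data is left untouched.
        self._names = new_names


idx = ToyNamedIndex([("a", "b"), ("x", "y")], names=["first", "second"])
idx.names = ["outer", "inner"]
```

Renaming here never mutates the level data, which is the separation the first PR aims for.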
@TomAugspurger
Contributor

I'm a little concerned with MultiIndex.array. The type of Index.array is an ExtensionArray.

I would be OK with an .arrays attribute that has your proposed type / behavior.

@topper-123
Contributor Author

I think .arrays is ok. I've started working on this.

@jbrockmendel
Member

.array is supposed to refer to the actual data backing an Index/Series. Until MI is backed by some kind of MultiArray (ProductArray for nicely behaved cases?), I don't think this makes sense.

@mroeschke added the Enhancement, ExtensionArray, and Needs Discussion labels and removed the API Design label on Jul 10, 2021
@jbrockmendel added the Closing Candidate label on Jul 27, 2023