Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-index repr #879

Closed
wants to merge 7 commits into from
Closed

Multi-index repr #879

wants to merge 7 commits into from

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Jun 11, 2016

Another item of #719.

An example:

>>> index = pd.MultiIndex.from_product((list('ab'), range(10)))
>>> index.names= ('a_long_level_name', 'level_1')
>>> data = xr.DataArray(range(20), [('x', index)])
>>> data
<xarray.DataArray (x: 20)>
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
Coordinates:
  * x                    (x) object MultiIndex
    - a_long_level_name  object 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'a' 'b' ...
    - level_1            int64 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9

To be consistent with the displayed coordinates and/or data variables, it displays the actual used level values. Using the pandas.MultiIndex.get_level_values method would be expensive for big indexes, so I re-implemented it in xarray so that we can truncate the computation to the first x values, which is very cheap.

It still needs testing.

Maybe it would be nice to align the level values.

unique = index.levels[level_num]
labels = index.labels[level_num]
size = min(max_size, labels.size)
filled = pd.core.algorithms.take_1d(unique.values, labels[:size],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to figure out how to implement this only using public API (nothing prefaced with an underscore). Otherwise, pandas will almost certainly break us in a future release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest simply using index.levels[level_num][:max_size]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases it might make sense to just use index.levels[level_num][:max_size] to show the (first) unique values for each level.

But in other cases I find this

>>> data.isel(x=range(3))
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
  * x                    (x) object MultiIndex
    - a_long_level_name  object 'a' 'a' 'a'
    - level_1            int64 0 1 2

much better than this:

>>> data.isel(x=range(3))
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
  * x                    (x) object MultiIndex
    - a_long_level_name  object 'a' 'b'
    - level_1            int64 0 1 2 3 4 5 6 7 8 9

What about just returning pd.core.algorithms.take_1d(unique.values, labels[:max_size]) or even np.take(unique.values, labels[:max_size])?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a good point. It's definitely more useful to display the particular index levels that are actually used at those points.

In that case I would use unique[labels[:max_size]]. It's better to avoid the NumPy methods like np.take on pandas.Index objects because they don't always always preserve dtypes properly.

@benbovy
Copy link
Member Author

benbovy commented Jun 15, 2016

I ended up refactoring most of the implementation.

It looks cleaner I think, although I'm still not fully satisfied with this version. It isn't as straightforward to implement as I thought, especially for checking (and getting) MultiIndex and for dealing with col_width and max_width.



def _maybe_summarize_multiindex(var, col_width, max_width):
if isinstance(var.variable._data, PandasIndexAdapter):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think isinstance(var, Coordinate) would be a safer check.

@benbovy
Copy link
Member Author

benbovy commented Aug 31, 2016

Closing this. See #947.

@benbovy benbovy closed this Aug 31, 2016
@benbovy benbovy deleted the multi-index_repr branch September 2, 2016 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants