Pass indexes directly to the DataArray and Dataset constructors #7214

benbovy · 2022-10-25T14:16:44Z

Closes Pass indexes to the Dataset and DataArray constructors #6392
Closes Opening dataset without loading any indexes? #6633 ?
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

I'm thinking of only accepting one or more instances of Indexes as indexes argument in the Dataset and DataArray constructors. The only exception is when fastpath=True a mapping can be given directly. Also, when an empty collection of indexes is passed this skips the creation of default pandas indexes for dimension coordinates.

It is much easier to handle: just check that keys returned by Indexes.variables do no conflict with the coordinate names in the coords argument
It is slightly safer: it requires the user to explicitly create an Indexes object, thus with less chance to accidentally provide coordinate variables and index objects that do not relate to each other (we could probably add some safe guards in the Indexes class itself)
It is more convenient: an Xarray Index may provide a factory method that returns an instance of Indexes that we just need to pass as indexes, and we could also do something like ds = xr.Dataset(indexes=other_ds.xindexes)

TomNicholas · 2022-10-25T17:25:48Z

Big moves!!

benbovy · 2022-10-25T19:50:57Z

Hmm I'm wondering what would be best between the options below regarding the types for the indexes argument:

Indexes[Index] | Sequence[Indexes[Index] | None
Indexes[Index] | None
Mapping[Any, Index] | None
Any other suggestion?

Option 1 is nice for passing multiple indexes, e.g.,

pd_midx1 = pd.MultiIndex.from_arrays(..., names=("one", "two"))
pd_midx2 = pd.MultiIndex.from_arrays(..., , names=("three", "four"))

indexes1 = PandasMultiIndex.from_pandas_index(pd_midx1, "x")
indexes2 = PandasMultiIndex.from_pandas_index(pd_midx2, "y")

ds = xr.Dataset(indexes=[indexes1, indexes2])

With option 1 it feels odd passing an empty list in order to avoid creating default indexes: ds = xr.Dataset(indexes=[]). Not really better in this regard with option 2: ds = xr.Dataset(indexes=Indexes()). Option 3 is better IMO: ds = xr.Dataset(indexes={}).

Option 3 actually works in all cases since Indexes[Index] is a sub-type of Mapping[Any, Index]. However, it is not clear from this generic type that any non-empty mapping must be an instance of Indexes (because the latter also contains the coordinate variables).

I'm leaning towards option 3. For passing multiple indexes at once we could probably expand the Indexes API, e.g., with an .update() method.

What do people think?

benbovy · 2022-10-26T07:52:35Z

For passing multiple indexes at once we could probably expand the Indexes API, e.g., with an .update() method.

Maybe with something else than .update() (let's keep Indexes an immutable collection?)

- easily create an empty Indexes collection - check consistency between indexes and variables

TODO: check indexes shapes / dims for DataArray

benbovy · 2022-10-26T11:47:57Z

I implemented option 3. We can still change or revert it later if it's not the best one.

A few examples:

import pandas as pd
import xarray as xr
from xarray.indexes import wrap_pandas_multiindex

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two"))

It is now possible to pass a pandas multi-index to a Dataset like this:

# this returns an `Indexes` object (indexes + coordinates)
indexes = wrap_pandas_multiindex(midx, "x")

ds = xr.Dataset(indexes=indexes)
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2
# Data variables:
#     *empty*

IMO the above should be preferred over passing it as a coordinate (should we deprecate it now?):

ds_deprecated = xr.Dataset(coords={"x": midx})

ds_deprecated.identical(ds)
# True

# eventually this would behave like this:
ds_midx_as_array = xr.Dataset(coords={"x": midx})
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object ('a', 1) ('a', 2) ('b', 1) ('b', 2)
# Data variables:
#     *empty*

We can pass indexes around from one Xarray object to another, e.g.,

da = xr.DataArray([1, 2, 3, 4], dims="x", indexes=ds.xindexes)
# <xarray.DataArray (x: 4)>
# array([1, 2, 3, 4])
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2

Skip creating pandas indexes for dimension coordinates:

ds_noindex = xr.Dataset(coords={"x": [0, 1, 2]}, indexes={})
# <xarray.Dataset>
# Dimensions:  (x: 3)
# Coordinates:
#     x        (x) int64 0 1 2
# Data variables:
#     *empty*

ds_noindex.xindexes
# Indexes:
#     *empty*

benbovy · 2022-10-26T13:54:22Z

Passing multiple indexes:

midx1 = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two"))
midx2 = pd.MultiIndex.from_product([["c", "d"], [3, 4]], names=("three", "four"))

indexes1 = wrap_pandas_multiindex(midx1, "x")
indexes2 = wrap_pandas_multiindex(midx2, "y")

indexes = Indexes(
    indexes=dict(**indexes1, **indexes2),
    variables=dict(**indexes1.variables, **indexes2.variables)
)

ds = xr.Dataset(indexes=indexes)
# <xarray.Dataset>
# Dimensions:  (x: 4, y: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2
#   * y        (y) object MultiIndex
#   * three    (y) object 'c' 'c' 'd' 'd'
#   * four     (y) int64 3 4 3 4
# Data variables:
#     *empty*

That's not looking super nice, but probably we can add some convenience function or Indexes method.

Since its constructor can now be used publicly. Copy input mappings and check the type of input indexes.

benbovy · 2022-10-27T13:31:24Z

I also added an .assign_indexes() method that may be quite convenient. Like for the constructors, it only accepts an Indexes instance.

ds = xr.Dataset(coords={"x": [4, 5, 6, 7]})
ds2 = xr.Dataset(coords={"x": [1, 2, 3, 4]})

ds.assign_indexes(ds2.xindexes)
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) int64 1 2 3 4
# Data variables:
#     *empty*

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two"))
indexes = wrap_pandas_multiindex(midx, "x")

ds.assign_indexes(indexes)
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2
# Data variables:
#     *empty*

benbovy · 2022-10-27T13:41:50Z

@pydata/xarray I'd be very happy if you could share your thoughts about the examples shown in the last three comments. If you think the API looks good like that, then I will work on adding some tests and on the documentation.

dcherian · 2022-10-27T16:21:53Z

How about Indexes.from_pandas_multi_index() classmethod?

I like the last spelling: ds.assign_indexes(indexes) though that raises a few questioins:

does indexes get merged with existing ._indexes?
If not, how does a user merge manually? Can we extract enough information from Index to have xr.merge(Indexes) -> Indexes work?

benbovy · 2022-10-27T18:21:02Z

How about Indexes.from_pandas_multi_index() classmethod?

Yes that would make sense. However, it would be adding another pandas.MultiIndex special case while we'd like to remove them in Xarray. Maybe a more generic Indexes class method that could be reused by 3rd-party indexes too? E.g., via some kind of hook or entrypoint... The tricky thing is that arguments would probably differ much from one index type to another.

does indexes get merged with existing ._indexes?

Indexes are not merged together but the new / replaced coordinate variables must be compatible with the other variables of the dataset. Dataset.assign_indexes(indexes) is actually implemented like this:

def assign_indexes(self, indexes: Indexes[Index]):
    ds_indexes = Dataset(indexes=indexes)
    return (
        self
            # prepare drop-in index / coordinate replacement
            .drop_vars(indexes, errors="ignore")
            # ensure the new indexes / coordinates are compatible with the Dataset
            .merge(
                ds_indexes,
                compat="minimal",   # probably not the right option?
                join="override",    # fastest option? (no real effect because of `drop_vars`)
                combine_attrs="no_conflicts",
            )
    )

Can we extract enough information from Index to have xr.merge(Indexes) -> Indexes work?

That is actually a good idea for #7214 (comment)! Not sure I would reuse xr.merge() for this as it would make the API messy, but why not an xr.merge_indexes() top-level function or an Indexes.merge() method?

shoyer · 2022-10-27T18:21:51Z

xarray/core/dataset.py

+        elif len(indexes) == 0:
+            create_default_indexes = False
+            indexes = Indexes()
+        else:
+            create_default_indexes = True
+            if not isinstance(indexes, Indexes):
+                raise TypeError("non-empty indexes must be an instance of `Indexes`")
+            elif indexes._index_type != Index:
+                raise TypeError("indexes must only contain Xarray `Index` objects")


I don't like the special case for size 0 indexes. These sort of special cases that violate type stability can lead to very hard to debug bugs.

Instead, how about never creating default pandas indexes if indexes is provided? Maybe there could also be a special method (e.g., assign_default_indexes()) for manually adding in pandas indexes later.

Yes I like that!

I also like assign_default_indexes() or maybe assign_indexes(indexes=None) if we don't want to add another method.

One concern I have is when a user wants to explicitly provide a pandas multi-index or a custom single coordinate index for a dimension and still wants the default pandas indexes for the other dimension coordinates.

In this case, calling assign_default_indexes() or assign_indexes(indexes=None) later will likely overwrite the explicitly provided indexes?

After removing the pandas multi-index dimension coordinate this won't be an issue anymore, but the issue remains for any custom 1-d dimension coordinate index.

Should we add an overwrite=True keyword argument to assign_(default_)indexes?

assign_default_indexes() should probably include an optional list of dimension names for which to assign default indexes.

By default, we might make it only add indexes for dimensions that doesn't already have an index.

shoyer · 2022-10-27T18:23:38Z

xarray/core/dataset.py

+        both_indexes_and_coords = set(indexes) & coord_names
+        if both_indexes_and_coords:
+            raise ValueError(
+                f"{both_indexes_and_coords} are found in both indexes and coords"
+            )


This check surprises me. As a user, I would expect that I can write something like xarray.Dataset(data_vars=ds.data_vars, coords=ds.coords, attrs=ds.attrs, indexes=ds.xindexes) to make a new object.

Good point. Would it be reasonable that the coordinates provided via indexes=... (if we use it) override the coordinates provided via coords=...?

Would it be reasonable that the coordinates provided via indexes=... (if we use it) override the coordinates provided via coords=...?

Can you give an example of how this would happen? Do indexes already provide enough information to create a coordinate (this would slightly surprise me)?

I think an error in these cases or creating a Dataset with inconsistent coordinates/indexes (if it is hard to check) would be fine. I would not silently override coordinates.

shoyer · 2022-10-27T18:28:40Z

I'm thinking of only accepting one or more instances of Indexes as indexes argument in the Dataset and DataArray constructors

I would lean against this, only because it's easier to explicitly manipulate indexes in the form of a dict than an xarray.Indexes object.

Explicitly providing indexes is an advanced user feature. I think it's OK to require users to do a bit more work in this case and to not necessarily do consistency checks (beyond verifying that the coordinate variables exist).

benbovy · 2022-10-27T19:04:19Z

Explicitly providing indexes is an advanced user feature.

Agreed. However, xr.Dataset(coords={"x": pandas_midx}) is something that presumably a lot of users rely on (it is used extensively in Xarray's tests) and that we should really deprecate IMO. If we don't provide a convenient alternative, I expect many of those users will complain.

it's easier to explicitly manipulate indexes in the form of a dict

While generally I also prefer handling plain dict objects over custom dict-like objects, here I don't see much reasons of manipulating Xarray index objects independently of their coordinate variables. Indexes allows keeping them tied together, and it is already returned by .xindexes.

EDIT -- For more context: initially an Indexes object was almost equivalent to a Frozen(obj._indexes). In #5692 I tried hard and struggled to keep dealing with separate dicts of indexes and indexed variables, but in the end it made things much easier to encapsulate the variables in Indexes, which is also used internally in different places.

benbovy · 2022-10-28T09:38:22Z

Maybe a more generic Indexes class method that could be reused by 3rd-party indexes too? E.g., via some kind of hook or entrypoint...

An Indexes accessor? Or this is going too far? 🙂

benbovy · 2022-10-28T09:49:02Z

not necessarily do consistency checks (beyond verifying that the coordinate variables exist).

I'd just want to add that, from my experience with debugging multi-index issues, it is hard even for advanced users to see what's going wrong when coordinates and indexes are not consistent.

dcherian · 2022-10-28T16:41:19Z

t would be adding another pandas.MultiIndex special case while we'd like to remove them in Xarray.

True, but its neatly encapsulated in one place. And if we are going to support many Index classes that build on numpy, scipy, then support for pandas indexes seems entirely in-scope.

I'd just want to add that, from my experience with debugging multi-index issues, it is hard even for advanced users to see what's going wrong when coordinates and indexes are not consistent.

Should we expose xr.tests.assert_internal_invariants for debugging?

shoyer · 2022-10-28T17:49:10Z

Explicitly providing indexes is an advanced user feature.

Agreed. However, xr.Dataset(coords={"x": pandas_midx}) is something that presumably a lot of users rely on (it is used extensively in Xarray's tests) and that we should really deprecate IMO. If we don't provide a convenient alternative, I expect many of those users will complain.

I agree -- we should support this for backwards compatibility (even if we deprecate it).

it's easier to explicitly manipulate indexes in the form of a dict

While generally I also prefer handling plain dict objects over custom dict-like objects, here I don't see much reasons of manipulating Xarray index objects independently of their coordinate variables. Indexes allows keeping them tied together, and it is already returned by .xindexes.

EDIT -- For more context: initially an Indexes object was almost equivalent to a Frozen(obj._indexes). In #5692 I tried hard and struggled to keep dealing with separate dicts of indexes and indexed variables, but in the end it made things much easier to encapsulate the variables in Indexes, which is also used internally in different places.

OK, this totally makes sense.

I don't love that it is possible to express invalid states in Xarray's data model. This motivated the creation of assert_internal_invariants and currently mostly is a concern for Xarray's own developers, but when we exposes the indexes argument, it will be easier for users to make the same sort of errors.

I wonder if we should consider the broader refactor of merging the Indexes and Coordinates objects, and expose the constructor as a public API. For clarity, I'll call it CoordinatesAndIndexes for now, but it could likely reuse the public name of Coordinates.

This would have a number of benefits:

It's impossible to provide inconsistent coords and indexes, because there is no separate indexes argument.
Likewise, it is impossible to create inconsistent coordinates and indexes on an existing Xarray object.
All the logic for verifying consistent coords and indexes can go in one place, shared between Dataset/DataArray. (Yes, it would be annoying to refactor Dataset to merge in variables from CoordinatesAndIndexes rather than the current separate Dataset._variables)
The public API also becomes clearer: if users want default indexes, they can pass a dict of variables into coords. If they want to copy indexes from another object, they can pass in a CoordinatesAndIndexes object (either from another Xarray object or constructed directly).

benbovy · 2022-10-31T12:54:50Z

Thanks for the suggestion @shoyer, in general I like it very much! "Coordinates possibly baked by one or more indexes" feels much more natural than "indexes and their corresponding coordinates". Even though indexes have been promoted as 1st class citizens in the data model, their right place should still be in the background compared to coordinates. So having a Coordinates object that encapsulates the indexes makes a lot of sense to me.

My main concern is about the timing, as such a broader refactor might postpone some work in progress on the public API and the documentation. Ideally this shouldn't discourage users to start experimenting with custom indexes and building an ecosystem around it, as soon as possible.

There might be a fast path towards your suggestion, at least regarding the public facing API (your points 1 and 4):

Keep "private" the constructor of Indexes and keep it immutable.
Add a new IndexedCoordinates(Coordinates) class. Unlike DatasetCoordinates and DataArrayCoordinates, it would have a public constructor and/or alternative class methods (e.g., .from_pandas_multi_index() suggested by @dcherian)
In general, passing any Coordinates object to coords would assign both the coordinates and the indexes.

This would let us the possibility to achieve a broader (mostly internal) refactor of Indexes and Coordinates objects later without the risk of introducing too much breaking changes.

Alternatively, we could just wait for that refactor to finish before implementing explicit assignment of coordinates and indexes. We already have .set_xindex() and .drop_indexes() that are relevant and we could wait before deprecating xr.Dataset(coords={"x": pandas_midx}). Not sure when such big refactor will happen, though, the wait could be long.

benbovy · 2023-07-18T11:52:11Z

Superseded by #7368.

benbovy added 2 commits October 25, 2022 16:11

add indexes argument to Dataset.__init__

41f4fd8

make indexes arg public for DataArray.__init__

4baa8af

TomNicholas added the topic-indexing label Oct 25, 2022

benbovy mentioned this pull request Oct 25, 2022

Explicit indexes: next steps #6293

Open

49 tasks

benbovy added 5 commits October 26, 2022 11:55

Indexes constructor updates

dbc058a

- easily create an empty Indexes collection - check consistency between indexes and variables

use the generic Mapping[Any, Index] for indexes

16a9983

add wrap_pandas_multiindex function

3c076d5

do not create default indexes when not desired

70e7a5d

fix Dataset dimensions

00e1766

TODO: check indexes shapes / dims for DataArray

benbovy added 6 commits October 26, 2022 17:18

copy the coordinate variables of passed indexes

3bf92cd

DataArray: check dimensions/shape of index coords

c9b6363

docstrings tweaks

82dc5cc

more Indexes safety

a58c9d0

Since its constructor can now be used publicly. Copy input mappings and check the type of input indexes.

ensure input indexes are Xarray indexes

9beeea7

add .assign_indexes() method

c6e94b4

shoyer reviewed Oct 27, 2022

View reviewed changes

ameraner mentioned this pull request Nov 15, 2022

Add LI L2 reader pytroll/satpy#2271

Merged

4 tasks

Merge branch 'main' into indexes-arg-constructors

ddd505e

benbovy mentioned this pull request Dec 8, 2022

Expose "Coordinates" as part of Xarray's public API #7368

Merged

8 tasks

benbovy closed this Jul 18, 2023

benbovy deleted the indexes-arg-constructors branch August 30, 2023 09:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pass indexes directly to the DataArray and Dataset constructors #7214

Pass indexes directly to the DataArray and Dataset constructors #7214

benbovy commented Oct 25, 2022 •

edited

Loading

TomNicholas commented Oct 25, 2022

benbovy commented Oct 25, 2022

benbovy commented Oct 26, 2022

benbovy commented Oct 26, 2022 •

edited

Loading

benbovy commented Oct 26, 2022

benbovy commented Oct 27, 2022 •

edited

Loading

benbovy commented Oct 27, 2022

dcherian commented Oct 27, 2022

benbovy commented Oct 27, 2022

shoyer Oct 27, 2022

benbovy Oct 27, 2022

benbovy Oct 28, 2022

shoyer Oct 28, 2022

shoyer Oct 27, 2022

benbovy Oct 27, 2022 •

edited

Loading

shoyer Oct 28, 2022

shoyer commented Oct 27, 2022

benbovy commented Oct 27, 2022 •

edited

Loading

benbovy commented Oct 28, 2022

benbovy commented Oct 28, 2022

dcherian commented Oct 28, 2022 •

edited

Loading

shoyer commented Oct 28, 2022

benbovy commented Oct 31, 2022

benbovy commented Jul 18, 2023

Pass indexes directly to the DataArray and Dataset constructors #7214

Pass indexes directly to the DataArray and Dataset constructors #7214

Conversation

benbovy commented Oct 25, 2022 • edited Loading

TomNicholas commented Oct 25, 2022

benbovy commented Oct 25, 2022

benbovy commented Oct 26, 2022

benbovy commented Oct 26, 2022 • edited Loading

benbovy commented Oct 26, 2022

benbovy commented Oct 27, 2022 • edited Loading

benbovy commented Oct 27, 2022

dcherian commented Oct 27, 2022

benbovy commented Oct 27, 2022

shoyer Oct 27, 2022

Choose a reason for hiding this comment

benbovy Oct 27, 2022

Choose a reason for hiding this comment

benbovy Oct 28, 2022

Choose a reason for hiding this comment

shoyer Oct 28, 2022

Choose a reason for hiding this comment

shoyer Oct 27, 2022

Choose a reason for hiding this comment

benbovy Oct 27, 2022 • edited Loading

Choose a reason for hiding this comment

shoyer Oct 28, 2022

Choose a reason for hiding this comment

shoyer commented Oct 27, 2022

benbovy commented Oct 27, 2022 • edited Loading

benbovy commented Oct 28, 2022

benbovy commented Oct 28, 2022

dcherian commented Oct 28, 2022 • edited Loading

shoyer commented Oct 28, 2022

benbovy commented Oct 31, 2022

benbovy commented Jul 18, 2023

benbovy commented Oct 25, 2022 •

edited

Loading

benbovy commented Oct 26, 2022 •

edited

Loading

benbovy commented Oct 27, 2022 •

edited

Loading

benbovy Oct 27, 2022 •

edited

Loading

benbovy commented Oct 27, 2022 •

edited

Loading

dcherian commented Oct 28, 2022 •

edited

Loading