Invisible differences between arrays using IntervalIndex #4579

Open · gerritholl opened this issue Nov 12, 2020 · 2 comments

@gerritholl (Contributor)

What happened:

I have two DataArrays that each have a coordinate constructed with pandas.interval_range. In one case I pass the interval_range directly; in the other I call .to_numpy() first. The two DataArrays look identical but aren't, which can lead to hard-to-find bugs because their behaviour is not identical: the former supports label-based indexing whereas the latter doesn't.

What you expected to happen:

I expect two arrays that appear identical to behave identically. If they don't behave identically then there should be some way to tell the difference (apart from equals, which tells me they are different but not how).

Minimal Complete Verifiable Example:

import xarray
import pandas

da1 = xarray.DataArray([0, 1, 2], dims=("x",), coords={"x":
    pandas.interval_range(0, 2, 3)})
da2 = xarray.DataArray([0, 1, 2], dims=("x",), coords={"x":
    pandas.interval_range(0, 2, 3).to_numpy()})

print(repr(da1) == repr(da2))
print(repr(da1.x) == repr(da2.x))
print(da1.x.dtype == da2.x.dtype)

# identical?  No:
print(da1.equals(da2))
print(da1.x.equals(da2.x))

# in particular:
da1.sel(x=1)  # works
da2.sel(x=1)  # fails

Results in:

True
True
True
False
False
Traceback (most recent call last):
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "mwe105.py", line 19, in <module>
    da2.sel(x=1)  # fails
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/dataarray.py", line 1143, in sel
    ds = self._to_temp_dataset().sel(
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/dataset.py", line 2105, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/coordinates.py", line 397, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/indexing.py", line 275, in remap_label_indexers
    idxr, new_idx = convert_label_indexer(index, label, dim, method, tolerance)
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/indexing.py", line 196, in convert_label_indexer
    indexer = index.get_loc(label_value, method=method, tolerance=tolerance)
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 1

Additional context

I suppose this happens because, under the hood, xarray does something clever to support pandas-style indexing even though the coordinate variable looks like a numpy array with object dtype, and this cleverness is lost if the object has already been converted to a numpy array. But there is, as far as I can see, no way to tell the difference once the objects have been created.

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.82-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.1
pandas: 1.1.4
numpy: 1.19.4
scipy: 1.5.3
netCDF4: 1.5.4
pydap: None
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: 2.5.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.7
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: 2.30.1
matplotlib: 3.3.2
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.2.4
conda: installed
pytest: 6.1.2
IPython: 7.19.0
sphinx: 3.3.0

@max-sixty (Collaborator)

Thanks for the clear issue @gerritholl. I agree — it's confusing if those two look the same.

Currently, one way of distinguishing them:

In [6]: da1.indexes['x']
Out[6]:
IntervalIndex([(0.0, 0.6666666666666666], (0.6666666666666666, 1.3333333333333333], (1.3333333333333333, 2.0]],
              closed='right',
              name='x',
              dtype='interval[float64]')

In [7]: da2.indexes['x']
Out[7]:
Index([               (0.0, 0.6666666666666666],
       (0.6666666666666666, 1.3333333333333333],
                      (1.3333333333333333, 2.0]],
      dtype='object', name='x')
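
For a programmatic check along the same lines, one could inspect the class of the underlying pandas index directly. A minimal sketch, reusing da1 and da2 from the example above:

import pandas

# The reprs and dtypes of the coordinates match, but the underlying pandas
# index classes do not, so this tells the two DataArrays apart.
print(type(da1.indexes["x"]))  # pandas IntervalIndex
print(type(da2.indexes["x"]))  # plain object-dtype pandas Index
print(isinstance(da1.indexes["x"], pandas.IntervalIndex))  # True
print(isinstance(da2.indexes["x"], pandas.IntervalIndex))  # False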

One option is to push the dtype ('interval[float64]' vs 'object') or the Index type (IntervalIndex vs Index) into the repr of the array:

In [8]: da1
Out[8]:
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
  * x        (x) object (0.0, 0.6666666666666666] ... (1.3333333333333333, 2.0]

Could be:

  * x        (x) interval[float64] (0.0, 0.6666666666666666] ... (1.3333333333333333, 2.0]

What are others' thoughts?

And ref https://github.com/pydata/xarray/projects/1

@benbovy (Member) commented Sep 27, 2022

Perhaps Xarray has been too clever so far regarding how it handles pandas objects passed directly as coordinate data? pandas.MultiIndex objects are handled in a specific way too, which is often hard to deal with.

Expanding on @max-sixty's suggestion, we could:

  • treat all coordinate data as duck arrays, i.e., in the example above handle da1 just like da2 (no more special cases for pandas objects)
  • provide an xarray.indexes.PandasIntervalIndex wrapper, which would inherit from xarray.indexes.PandasIndex with a few additional options and features, e.g., like the ones @dcherian suggests in SciPy Sprint: Creating Flexible Indexes #6783 (comment)
  • build an interval index from an existing coordinate using, e.g., da.set_xindex("x", PandasIntervalIndex, closed="right") (see the sketch after this list)
  • figure out how to assign both a coordinate and an index from an existing pandas.IntervalIndex object in a convenient but more explicit way
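
As a rough sketch of what that more explicit workflow could look like (hypothetical: PandasIntervalIndex is not an existing class, and the set_xindex call is only the proposed entry point from the list above):

import pandas
import xarray

intervals = pandas.interval_range(0, 2, 3)

# Attach the intervals as plain (object-dtype) coordinate data, with no
# special-casing of the pandas object.
da = xarray.DataArray([0, 1, 2], dims=("x",), coords={"x": intervals.to_numpy()})

# Then explicitly opt in to interval-aware label selection on that coordinate.
# Hypothetical API, does not run against current xarray:
# da = da.set_xindex("x", PandasIntervalIndex, closed="right")
# da.sel(x=1)  # would select the interval containing the label 1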
