Invisible differences between arrays using IntervalIndex #4579

Open · gerritholl opened this issue Nov 12, 2020 · 2 comments

@gerritholl (Contributor)

What happened:

I have two DataArrays that each have a coordinate constructed with pandas.interval_range. In one case I pass the interval_range directly; in the other I call .to_numpy() first. The two DataArrays look identical but aren't, which can lead to hard-to-find bugs because their behaviour is not identical: the former supports label-based indexing whereas the latter doesn't.

What you expected to happen:

I expect two arrays that appear identical to behave identically. If they don't behave identically then there should be some way to tell the difference (apart from equals, which tells me they are different but not how).

Minimal Complete Verifiable Example:

import xarray
import pandas

da1 = xarray.DataArray([0, 1, 2], dims=("x",), coords={"x":
    pandas.interval_range(0, 2, 3)})
da2 = xarray.DataArray([0, 1, 2], dims=("x",), coords={"x":
    pandas.interval_range(0, 2, 3).to_numpy()})

print(repr(da1) == repr(da2))
print(repr(da1.x) == repr(da2.x))
print(da1.x.dtype == da2.x.dtype)

# identical?  No:
print(da1.equals(da2))
print(da1.x.equals(da2.x))

# in particular:
da1.sel(x=1)  # works
da2.sel(x=1)  # fails

Results in:

True
True
True
False
False
Traceback (most recent call last):
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "mwe105.py", line 19, in <module>
    da2.sel(x=1)  # fails
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/dataarray.py", line 1143, in sel
    ds = self._to_temp_dataset().sel(
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/dataset.py", line 2105, in sel
    pos_indexers, new_indexes = remap_label_indexers(
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/coordinates.py", line 397, in remap_label_indexers
    pos_indexers, new_indexes = indexing.remap_label_indexers(
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/indexing.py", line 275, in remap_label_indexers
    idxr, new_idx = convert_label_indexer(index, label, dim, method, tolerance)
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/xarray/core/indexing.py", line 196, in convert_label_indexer
    indexer = index.get_loc(label_value, method=method, tolerance=tolerance)
  File "/data/gholl/miniconda3/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 1

Additional context

I suppose this happens because, under the hood, xarray does something clever to support pandas-style indexing even though the coordinate variable looks like a numpy array with object dtype, and this cleverness is lost if the object has already been converted to a numpy array. But there is, as far as I can see, no way to tell the difference once the objects have been created.

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.82-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.1
pandas: 1.1.4
numpy: 1.19.4
scipy: 1.5.3
netCDF4: 1.5.4
pydap: None
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: 2.5.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.7
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: 2.30.1
matplotlib: 3.3.2
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.2.4
conda: installed
pytest: 6.1.2
IPython: 7.19.0
sphinx: 3.3.0

@max-sixty (Collaborator)

Thanks for the clear issue @gerritholl. I agree — it's confusing if those two look the same.

Currently, one way of distinguishing them:

In [6]: da1.indexes['x']
Out[6]:
IntervalIndex([(0.0, 0.6666666666666666], (0.6666666666666666, 1.3333333333333333], (1.3333333333333333, 2.0]],
              closed='right',
              name='x',
              dtype='interval[float64]')

In [7]: da2.indexes['x']
Out[7]:
Index([               (0.0, 0.6666666666666666],
       (0.6666666666666666, 1.3333333333333333],
                      (1.3333333333333333, 2.0]],
      dtype='object', name='x')
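
For a programmatic check along the same lines, one could inspect the class of the underlying pandas index directly. A minimal sketch, reusing da1 and da2 from the example above:

import pandas

# The reprs and dtypes of the coordinates match, but the underlying pandas
# index classes do not, so this tells the two DataArrays apart.
print(type(da1.indexes["x"]))  # pandas IntervalIndex
print(type(da2.indexes["x"]))  # plain object-dtype pandas Index
print(isinstance(da1.indexes["x"], pandas.IntervalIndex))  # True
print(isinstance(da2.indexes["x"], pandas.IntervalIndex))  # False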

One option is to push the dtype ('interval[float64]' vs 'object') or the Index type (IntervalIndex vs Index) into the repr of the array:

In [8]: da1
Out[8]:
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
  * x        (x) object (0.0, 0.6666666666666666] ... (1.3333333333333333, 2.0]

Could be:

  * x        (x) interval[float64] (0.0, 0.6666666666666666] ... (1.3333333333333333, 2.0]

What are others' thoughts?

And ref https://github.com/pydata/xarray/projects/1

@benbovy (Member) commented Sep 27, 2022

Perhaps Xarray has been too clever so far regarding how it handles pandas objects passed directly as coordinate data? pandas.MultiIndex objects are handled in a specific way too, which is often hard to deal with.

Expanding on @max-sixty's suggestion, we could:

  • treat all coordinate data as duck arrays, i.e., in the example above handle da1 just like da2 (no more special cases for pandas objects)
  • provide an xarray.indexes.PandasIntervalIndex wrapper, which would inherit from xarray.indexes.PandasIndex with a few additional options and features, e.g., like the ones @dcherian suggests in SciPy Sprint: Creating Flexible Indexes #6783 (comment)
  • build an interval index from an existing coordinate using, e.g., da.set_xindex("x", PandasIntervalIndex, closed="right") (see the sketch after this list)
  • figure out how to assign both a coordinate and an index from an existing pandas.IntervalIndex object in a convenient but more explicit way
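
As a rough sketch of what that more explicit workflow could look like (hypothetical: PandasIntervalIndex is not an existing class, and the set_xindex call is only the proposed entry point from the list above):

import pandas
import xarray

intervals = pandas.interval_range(0, 2, 3)

# Attach the intervals as plain (object-dtype) coordinate data, with no
# special-casing of the pandas object.
da = xarray.DataArray([0, 1, 2], dims=("x",), coords={"x": intervals.to_numpy()})

# Then explicitly opt in to interval-aware label selection on that coordinate.
# Hypothetical API, does not run against current xarray:
# da = da.set_xindex("x", PandasIntervalIndex, closed="right")
# da.sel(x=1)  # would select the interval containing the label 1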
