
Allow xr.combine_by_coords to work with point coordinates? #3774

Closed
kazimuth opened this issue Feb 17, 2020 · 13 comments


kazimuth commented Feb 17, 2020

MCVE Code Sample

import xarray as xr

data_0 = xr.Dataset({'temperature': ('time', [10,20,30])}, coords={'time': [0,1,2]})
data_0.coords['trial'] = 0

data_1 = xr.Dataset({'temperature': ('time', [50,60,70])}, coords={'time': [0,1,2]})
data_1.coords['trial'] = 1

# the following fails with a ValueError:
# all_trials = xr.combine_by_coords([data_0, data_1])

# but this works:
all_trials = xr.combine_by_coords([data_0.expand_dims('trial'), data_1.expand_dims('trial')])

Expected output

Both work and produce the same result.

Problem Description

I found this behavior somewhat unintuitive -- I didn't understand that I needed to call expand_dims to get combine_by_coords to work. It might make things a little easier to understand if both behaved the same way -- maybe the code that concatenates dimensions should look for full dimensions first, and point coordinates second.

This would also make code that selects dimensions with sel and then calls combine_by_coords easier to write.

On the other hand, this could be seen as "magic", and it might also be a breaking change to combine_by_coords; I'm not sure. So I'd understand if it isn't implemented.
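The sel-then-recombine pattern mentioned above can be sketched as follows (assuming xarray is installed; the dataset and variable names are illustrative):

```python
import xarray as xr

# A dataset with 'trial' as a proper dimension coordinate.
ds = xr.Dataset(
    {'temperature': (('trial', 'time'), [[10, 20, 30], [50, 60, 70]])},
    coords={'trial': [0, 1], 'time': [0, 1, 2]},
)

# Selecting a single label drops 'trial' to a 0-D (scalar) coordinate.
part_0 = ds.sel(trial=0)
part_1 = ds.sel(trial=1)
assert part_0.coords['trial'].ndim == 0

# combine_by_coords([part_0, part_1]) would now fail with a ValueError;
# the scalar coord must first be promoted back to a dimension:
recombined = xr.combine_by_coords(
    [part_0.expand_dims('trial'), part_1.expand_dims('trial')]
)
assert recombined['temperature'].shape == (2, 3)
```

So round-tripping through sel currently requires the explicit expand_dims step that this issue proposes to make automatic.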

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 (default, Jan  8 2020, 19:13:59) 
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.0-115.10.1.el7a.ppc64le
machine: ppc64le
processor: ppc64le
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.2
libnetcdf: None

xarray: 0.15.0
pandas: 0.25.3
numpy: 1.16.5
scipy: 1.3.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.10.0
distributed: 2.10.0
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 44.0.0.post20200106
pip: 19.3.1
conda: None
pytest: 5.3.2
IPython: 7.11.1
sphinx: None

@kazimuth kazimuth changed the title Allow xr.combine_by_coords to work with non-expanded dimensions? Allow xr.combine_by_coords to work with point coordinates? Feb 17, 2020

TomNicholas commented Feb 19, 2020

Hi @kazimuth , thanks for highlighting this!

This is a case which I can see the argument for handling, but we apparently didn't think of when implementing combine_by_coords.

What this really boils down to is "should combine_by_coords combine by 0D non-dimension coordinates as well as 1D dimension coordinates?". I think yes, that makes sense.

On the other hand, this could be seen as "magic"

This function is supposed to be "magic". But the behaviour still needs to be clearly-defined.

maybe the code that concatenates dimensions should look for full dimensions first, and point coordinates second

Yes - I actually got your example to work with just a few extra lines. Change the top of _infer_concat_order_from_coords to read like this:

def _infer_concat_order_from_coords(datasets):

    # All datasets have same variables because they've been grouped as such
    ds0 = datasets[0]

    concat_dim_candidates = list(ds0.dims.keys())

    # Look for 0D coordinates which can be concatenated over but aren't dims
    non_dimension_0d_coords = [coord for coord in ds0.coords
                               if coord not in ds0.dims
                               and len(ds0[coord].dims) == 0]
    for coord in non_dimension_0d_coords:
        if all(coord in ds.coords for ds in datasets):
            datasets = [ds.expand_dims(coord) for ds in datasets]
            concat_dim_candidates.append(coord)

    concat_dims = []
    tile_ids = [() for ds in datasets]

    for dim in concat_dim_candidates:
        ...   # The rest of the function is unchanged
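The key move in this sketch is selecting shared 0-D coordinates as extra concat-dim candidates. That selection logic can be illustrated independently of xarray internals with plain dicts (find_promotable_coords is a hypothetical helper mirroring the logic above, not xarray API):

```python
def find_promotable_coords(datasets_coords, dims):
    """Return names of 0-D (scalar) coordinates that are present in every
    dataset and are not already dimensions.

    datasets_coords: list of dicts mapping coord name -> value, where a
    scalar value stands for a 0-D coordinate and a list for a 1-D one.
    dims: set of dimension names of the first dataset.
    """
    first = datasets_coords[0]
    # 0-D coords of the first dataset that aren't dims (the candidates).
    candidates = [
        name for name, value in first.items()
        if name not in dims and not isinstance(value, (list, tuple))
    ]
    # Keep only candidates shared by every dataset.
    return [
        name for name in candidates
        if all(name in coords for coords in datasets_coords)
    ]

coords_0 = {'time': [0, 1, 2], 'trial': 0}
coords_1 = {'time': [0, 1, 2], 'trial': 1}
promotable = find_promotable_coords([coords_0, coords_1], dims={'time'})
assert promotable == ['trial']
```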

it might also be a breaking change to combine_by_coords; I'm not sure.

I don't think it would be breaking... the only cases which would behave differently are ones that previously would have thrown an error. I'd have to think about this, though.

Might you be interested in submitting a PR along these lines @kazimuth ? It would require some more robust input checks, some tests for different cases, and some minor changes to docstrings I think.


TomNicholas commented Feb 19, 2020

I did get one test failure, TestDask.test_preprocess_mfdataset, which failed with:

    def test_preprocess_mfdataset(self):
        original = Dataset({"foo": ("x", np.random.randn(10))})
        with create_tmp_file() as tmp:
            original.to_netcdf(tmp)
    
            def preprocess(ds):
                return ds.assign_coords(z=0)
    
            expected = preprocess(original)
            with open_mfdataset(
                tmp, preprocess=preprocess, combine="by_coords"
            ) as actual:
>               assert_identical(expected, actual)
E               AssertionError: Left and right Dataset objects are not identical
E               Differing dimensions:
E                   (x: 10) != (x: 10, z: 1)
E               Differing coordinates:
E               L   z        int64 0
E               R * z        (z) int64 0
E               Differing data variables:
E               L   foo      (x) float64 0.07341 -0.01756 0.6903 ... -0.08741 0.08472 -0.1905
E               R   foo      (z, x) float64 dask.array<chunksize=(1, 10), meta=np.ndarray>

xarray/tests/test_backends.py:2955: AssertionError

I think the problem here is that it's happily promoting the coord to a dimension coordinate even though there is only one dataset. We should probably only expand dims which are actually going to be used in the concatenation...
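One way to express that guard is to promote only scalar coords whose values actually differ across the datasets, so a coordinate like z=0 that is identical everywhere stays 0-D (coords_worth_promoting is a hypothetical sketch, not xarray code):

```python
def coords_worth_promoting(scalar_coords_per_dataset):
    """Keep only scalar coords whose values differ across datasets.

    A coord that is equal everywhere (like z=0 in
    TestDask.test_preprocess_mfdataset) carries no ordering
    information and should stay 0-D.
    """
    # Coords present in every dataset.
    names = set.intersection(*(set(c) for c in scalar_coords_per_dataset))
    # Keep those with more than one distinct value across datasets.
    return sorted(
        name for name in names
        if len({c[name] for c in scalar_coords_per_dataset}) > 1
    )

assert coords_worth_promoting([{'z': 0}]) == []            # single dataset
assert coords_worth_promoting([{'z': 0}, {'z': 0}]) == []  # identical values
assert coords_worth_promoting([{'trial': 0}, {'trial': 1}]) == ['trial']
```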

@kazimuth (Author)

is _infer_concat_order_from_coords used in other places than combine_by_coords? I haven't really dug into xarray's source yet.

@TomNicholas (Member)

is _infer_concat_order_from_coords used in other places than combine_by_coords?

combine_by_coords can be called by open_mfdataset, but other than that no.


By the way, there is an additional problem to consider in addition to the above: ambiguity with multiple coordinates for the same dimension.

data0 = xr.Dataset({'temperature': ('time', [10,20,30])}, coords={'time': [0,1,2]})
data0.coords['trial'] = 0
data0.coords['day'] = 5


data1 = xr.Dataset({'temperature': ('time', [50,60,70])}, coords={'time': [0,1,2]})
data1.coords['trial'] = 1
data1.coords['day'] = 3

In this example, combine_by_coords won't know which coordinate to use to order along the new dimension -- should it be 'trial' or 'day'? They would give differently-ordered results.

We could pass extra optional arguments to deal with this, but for a "magic" function that doesn't seem desirable. It might be better to have some logic like:

  1. Is there a dimension coordinate for this dimension? If yes just use that.
  2. Otherwise, can the datasets be unambiguously ordered along the dim by looking at non-dimension coords?
  3. If yes, do that, if no, throw informative error.

That would be backwards-compatible (currently all magic ordering is being done for dimension coords), and not make any arbitrary choices.
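Steps 2-3 of this logic could look something like the following sketch (pure Python, hypothetical helper; each dataset is represented by a dict of its scalar coordinate values):

```python
def order_datasets(scalar_coords_per_dataset):
    """Order datasets by their scalar coordinate values, but only if every
    candidate coordinate induces the same order; otherwise raise an
    informative error (step 3)."""
    n = len(scalar_coords_per_dataset)
    names = scalar_coords_per_dataset[0].keys()
    # The dataset ordering each candidate coordinate would produce.
    orderings = {
        name: tuple(sorted(range(n),
                           key=lambda i: scalar_coords_per_dataset[i][name]))
        for name in names
    }
    unique = set(orderings.values())
    if len(unique) > 1:
        raise ValueError(
            f"ambiguous ordering: coordinates {sorted(names)} "
            "disagree on the dataset order"
        )
    return unique.pop()

# Unambiguous: a single candidate coordinate.
assert order_datasets([{'trial': 0}, {'trial': 1}]) == (0, 1)

# Ambiguous: 'trial' orders the datasets (0, 1) but 'day' orders them (1, 0).
try:
    order_datasets([{'trial': 0, 'day': 5}, {'trial': 1, 'day': 3}])
    ambiguous_raised = False
except ValueError:
    ambiguous_raised = True
assert ambiguous_raised
```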

@dcherian (Contributor)

@TomNicholas I think this is the case I ran into.

Does the old auto_combine handle this case? If so, we should get your fix in before releasing a version without _auto_combine

@TomNicholas (Member)

@dcherian it looks like the old auto_combine does handle this case, but so does the new combine_nested (which is intentional, one of the purposes of combine_nested is to make it easier to explicitly combine along new dimensions):

import xarray as xr
from xarray.core.combine import _old_auto_combine

data_0 = xr.Dataset({'temperature': ('time', [10,20,30])}, coords={'time': [0,1,2]})
data_0.coords['trial'] = 0

data_1 = xr.Dataset({'temperature': ('time', [50,60,70])}, coords={'time': [0,1,2]})
data_1.coords['trial'] = 1

all_trials = _old_auto_combine([data_0, data_1], concat_dim='trial')
print(all_trials)
<xarray.Dataset>
Dimensions:      (time: 3, trial: 2)
Coordinates:
  * time         (time) int64 0 1 2
  * trial        (trial) int64 0 1
Data variables:
    temperature  (trial, time) int64 10 20 30 50 60 70

but you can get the same result with

all_trials = xr.combine_nested([data_0, data_1], concat_dim='trial')
print(all_trials)
<xarray.Dataset>
Dimensions:      (time: 3, trial: 2)
Coordinates:
  * time         (time) int64 0 1 2
  * trial        (trial) int64 0 1
Data variables:
    temperature  (trial, time) int64 10 20 30 50 60 70

So technically this issue doesn't need to be closed before completely removing the old auto_combine, but we probably should fix this anyway.

@dcherian (Contributor)

Ok sounds good. Thanks

@TomNicholas (Member)

I opened a PR to fix this.

Also, I realised that what I said above about dealing with ambiguities of multiple coordinates for the same dimension doesn't make sense. In that situation there is no way to know that the user wanted the 'trial' and 'day' coords to correspond to the same dim. If you give them each their own dim, it's a well-defined combination operation -- you just end up combining along an extra dim 'day' and padding with a lot of NaNs. I don't think there's a problem with that behaviour, so I've left it without any special treatment.

(I did make sure it handles the case which caused TestDask.test_preprocess_mfdataset to fail though, because that's to do with checking that the coordinates aren't just the same on every dataset.)


shoyer commented Apr 19, 2020

Suppose there were multiple scalar coordinates that are unique for each variable. How would combine_by_coords pick a dimension to stack along?

@TomNicholas (Member)

Suppose there were multiple scalar coordinates that are unique for each variable. How would combine_by_coords pick a dimension to stack along?

@shoyer it would expand and stack along both, filling the (many) gaps created with NaNs.

import xarray as xr

data_0 = xr.Dataset({'temperature': ('time', [10,20,30])}, coords={'time': [0,1,2]})
data_0.coords['trial'] = 0  # scalar coords
data_0.coords['day'] = 1

data_1 = xr.Dataset({'temperature': ('time', [50,60,70])}, coords={'time': [0,1,2]})
data_1.coords['trial'] = 1
data_1.coords['day'] = 0

# both scalar coords will be promoted to dims
all_trials = xr.combine_by_coords([data_0, data_1])
print(all_trials)
<xarray.Dataset>
Dimensions:      (day: 2, time: 3, trial: 2)
Coordinates:
  * time         (time) int64 0 1 2
  * trial        (trial) int64 0 1
  * day          (day) int64 0 1
Data variables:
    temperature  (day, trial, time) float64 nan nan nan 50.0 ... nan nan nan

The gaps created will be filled in with NaNs

print(all_trials['temperature'].data)
[[[nan nan nan]
  [50. 60. 70.]]

 [[10. 20. 30.]
  [nan nan nan]]]

This gap-filling isn't new though - without this PR the same thing already happens with length-1 dimension coords (since PR #3649 - see my comment there)

data_0 = xr.Dataset({'temperature': ('time', [10,20,30])}, coords={'time': [0,1,2]})
data_0.coords['trial'] = [0]   # 1D dimension coords
data_0.coords['day'] = [1]

data_1 = xr.Dataset({'temperature': ('time', [50,60,70])}, coords={'time': [0,1,2]})
data_1.coords['trial'] = [1]
data_1.coords['day'] = [0]

all_trials = xr.combine_by_coords([data_0, data_1])
print(all_trials)
<xarray.Dataset>
Dimensions:      (day: 2, time: 3, trial: 2)
Coordinates:
  * time         (time) int64 0 1 2
  * day          (day) int64 0 1
  * trial        (trial) int64 0 1
Data variables:
    temperature  (trial, day, time) float64 nan nan nan 10.0 ... nan nan nan
# gaps will again be filled in with NaNs
print(all_trials['temperature'].data)
[[[nan nan nan]
  [10. 20. 30.]]

 [[50. 60. 70.]
  [nan nan nan]]]

So all my PR is doing is promoting all scalar coordinates (those which aren't equal across all datasets) to dimension coordinates before combining.

There is a chance this could unwittingly increase the overall size of people's datasets (when they have different scalar coordinates in different datasets), but that could already happen since #3649.


shoyer commented Apr 21, 2020

I don't think it's a great idea to automatically turn scalar coords into 1d arrays. It's not uncommon to have datasets with a whole handful of scalar coordinates, which could potentially become very high dimensional due to this change.

I also don't like fallback logic that tries to do matching by coordinates with dimensions first, and then falls back to using scalars. These types of heuristics look very convenient at first (and are very convenient much of the time), but then have a tendency to fail in unexpected, unpredictable ways.
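The scale of that dimensionality concern can be made concrete with some back-of-the-envelope arithmetic (the numbers below are illustrative, assuming every scalar coordinate takes a unique value in each dataset):

```python
# Promoting every scalar coordinate to a dimension grows the combined
# array multiplicatively: with n datasets and k scalar coords, each
# promoted coord becomes a length-n dimension, so the dense result
# holds n**k blocks while only n of them contain data -- the rest
# are NaN padding.
n_datasets, n_scalar_coords, base_size = 10, 4, 1000

dense_elements = (n_datasets ** n_scalar_coords) * base_size
useful_elements = n_datasets * base_size
padding_factor = dense_elements // useful_elements
print(padding_factor)  # -> 1000, i.e. 1000x more storage than data
```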

The other choice for situations like this would be to encourage switching to combine_nested for use cases like this, e.g.,

>>> xr.combine_nested([data_0, data_1], concat_dim='trial')
<xarray.Dataset>
Dimensions:      (time: 3, trial: 2)
Coordinates:
  * time         (time) int64 0 1 2
  * trial        (trial) int64 0 1
Data variables:
    temperature  (trial, time) int64 10 20 30 50 60 70

We could even put a reference to combine_nested in this error raised by combine_by_coords.

@TomNicholas (Member)

I don't think it's a great idea to automatically turn scalar coords into 1d arrays.

Fair enough. Those are good reasons.

The other choice for situations like this would be to encourage switching to combine_nested for use cases like this

The fact that you can solve this use case with combine_nested does make this feature a lot less important.

We could even put a reference to combine_nested in this error raised by combine_by_coords.

The error you get is

ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation

so we could try to automatically detect the coordinates the user might have wanted to combine along, or we could just add something like

ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation.
To specify dimensions or coordinates to concatenate along, use the `concat_dim` argument to `xarray.combine_nested`.

@TomNicholas (Member)

Looking back at @shoyer 's comments above I think I'm just going to close this (and the corresponding PR #3982) for now.
