Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor concat to use merge for non-concatenated variables #3239

Merged
merged 38 commits into from
Sep 16, 2019

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Aug 22, 2019

This PR adds two changes:

  1. First, it refactors concat so that non-concatenated variables are passed to merge. concat gains a compat kwarg which tells merge how to merge things. concat's previous behaviour was effectively compat='equals' which is now the default.
  2. Also adds compat="override" to skip equality checking in merge. concat's data_vars='minimal' and coords='minimal' options are now more useful because you can specify compat='override' to skip equality checking and just pick the first one. Previously this would raise an error if any of the variables differed by floating point noise.

xarray/core/concat.py Outdated Show resolved Hide resolved
@dcherian dcherian force-pushed the compat-override branch 2 times, most recently from da23a33 to d3191d9 Compare August 29, 2019 17:13
@dcherian dcherian changed the title [WIP] compat="override", data_vars/coords="sensible" [WIP] Refactor concat to use merge for non-concatenated variables Aug 29, 2019
Copy link
Contributor Author

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready for review!

@@ -567,8 +567,8 @@ def test_combine_concat_over_redundant_nesting(self):

def test_combine_nested_but_need_auto_combine(self):
objs = [Dataset({"x": [0, 1]}), Dataset({"x": [2], "wall": [0]})]
with raises_regex(ValueError, "cannot be combined"):
combine_nested(objs, concat_dim="x")
# with raises_regex(ValueError, "cannot be combined"):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works now. With this PR, I think concat is very similar to auto_combine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this works because wall and x are both indexes and since combine calls align, this will work.

xarray/tests/test_concat.py Show resolved Hide resolved
@dcherian dcherian changed the title [WIP] Refactor concat to use merge for non-concatenated variables Refactor concat to use merge for non-concatenated variables Aug 30, 2019
@dcherian
Copy link
Contributor Author

dcherian commented Sep 4, 2019

Now with docs! I updated the mfdataset and parallel reading sections in dask.rst and io.rst.

@dcherian dcherian mentioned this pull request Sep 5, 2019
doc/dask.rst Outdated Show resolved Hide resolved
if variables_to_merge:
to_merge = []
for ds in datasets:
if variables_to_merge - set(ds.variables):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that set difference is not symmetric. I think you might want:

Suggested change
if variables_to_merge - set(ds.variables):
if variables_to_merge != set(ds.variables):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is right. We are looking for variables that need to be merged but aren't present in ds.

xarray/core/concat.py Outdated Show resolved Hide resolved
xarray/core/concat.py Outdated Show resolved Hide resolved
xarray/core/concat.py Outdated Show resolved Hide resolved
xarray/core/concat.py Outdated Show resolved Hide resolved
# TODO: add preservation of attrs into fillna
out = getattr(out, combine_method)(var)
out.attrs = var.attrs
equals_list.append(getattr(out, compat)(var.compute()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you're now explicitly computing everything ahead of time? In the previous implementation, at least we would not complete everything if there was a failure in comparison between the first two variables.

It seems like the ideal thing to do (in the case of dask) might be to build up a single computation graph and call compute once? Certainly we can save that for later, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you're now explicitly computing everything ahead of time?

This was a mistake. I've fixed it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any changes to this logic?

We do have inconsistent behavior here currently (on master), which your PR would resolve:

  • Inside merge(), variables are computed after doing the equality comparison.
  • Inside concat(), variables are computed before doing the equality comparison.

We could certainly change this for consistency, but I'm not 100% sure which is the better option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah now I understand, we want to check attrs, shapes etc. before comparing values. How about

    if equals is None:
        out = out.compute()
        for var in variables[1:]:
            equals = getattr(out, compat)(var)
            if not equals:
                break

This still passes the dask test (test_concat_loads_variables in test_dask.py) but out is still being computed before doing any checks. Not sure how to get around this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty reasonable to me. @crusaderky any thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the proper solution (for a different PR) is to have a function that loops through a list of variables and compares everything but values; and then if needed, loops again and compares values.

xarray/core/concat.py Outdated Show resolved Hide resolved
xarray/tests/test_concat.py Show resolved Hide resolved
@pep8speaks
Copy link

pep8speaks commented Sep 10, 2019

Hello @dcherian! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-16 14:47:29 UTC

@dcherian
Copy link
Contributor Author

Test failure is #3300

* upstream/master:
  ignore h5py 2.10.0 warnings and fix invalid_netcdf warning test. (pydata#3301)
  Update why-xarray.rst with clearer expression (pydata#3307)
  Updater to testing environment name (pydata#3253)
xarray/core/merge.py Outdated Show resolved Hide resolved
xarray/core/concat.py Outdated Show resolved Hide resolved
dcherian and others added 3 commits September 15, 2019 00:58
Co-Authored-By: Stephan Hoyer <shoyer@google.com>
…pat-override

* 'compat-override' of github.com:dcherian/xarray:
  Update xarray/core/merge.py
@dcherian
Copy link
Contributor Author

Thanks for the review @shoyer

@dcherian dcherian merged commit 756c941 into pydata:master Sep 16, 2019
@dcherian dcherian deleted the compat-override branch September 16, 2019 14:49
dcherian added a commit to dcherian/xarray that referenced this pull request Sep 19, 2019
* master:
  Fix whats-new date :/
  Revert to dev version
  Release v0.13.0
  auto_combine deprecation to 0.14 (pydata#3314)
  Deprecation: groupby, resample default dim. (pydata#3313)
  Raise error if cmap is list of colors (pydata#3310)
  Refactor concat to use merge for non-concatenated variables (pydata#3239)
  Honor `keep_attrs` in DataArray.quantile (pydata#3305)
  Fix DataArray api doc (pydata#3309)
  Accept int value in head, thin and tail (pydata#3298)
  ignore h5py 2.10.0 warnings and fix invalid_netcdf warning test. (pydata#3301)
  Update why-xarray.rst with clearer expression (pydata#3307)
  Compat and encoding deprecation to 0.14 (pydata#3294)
  Remove deprecated concat kwargs. (pydata#3288)
  allow np-array levels and colors in 2D plots (pydata#3295)
  Remove some deprecations (pydata#3292)
  Make argmin/max work lazy with dask (pydata#3244)
  Add head, tail and thin methods (pydata#3278)
  Updater to testing environment name (pydata#3253)
dcherian added a commit that referenced this pull request Sep 24, 2019
* upstream/master: (43 commits)
  Add hypothesis support to related projects (#3335)
  More doc fixes (#3333)
  Improve the documentation of swap_dims (#3331)
  fix the doc names of the return value of swap_dims (#3329)
  Fix isel performance regression (#3319)
  Allow weakref (#3318)
  Clarify that "scatter" is a plotting method in what's new. (#3316)
  Fix whats-new date :/
  Revert to dev version
  Release v0.13.0
  auto_combine deprecation to 0.14 (#3314)
  Deprecation: groupby, resample default dim. (#3313)
  Raise error if cmap is list of colors (#3310)
  Refactor concat to use merge for non-concatenated variables (#3239)
  Honor `keep_attrs` in DataArray.quantile (#3305)
  Fix DataArray api doc (#3309)
  Accept int value in head, thin and tail (#3298)
  ignore h5py 2.10.0 warnings and fix invalid_netcdf warning test. (#3301)
  Update why-xarray.rst with clearer expression (#3307)
  Compat and encoding deprecation to 0.14 (#3294)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

We need a fast path for open_mfdataset
3 participants