Optimize dask array equality checks. #3453

Merged
dcherian merged 11 commits into pydata:master from fix/dask-computes on Nov 5, 2019

Conversation

@dcherian (Contributor) commented Oct 28, 2019

Dask arrays with the same graph have the same name. We can use this to quickly
compare dask-backed variables without computing. (see #3068 (comment))

I will work on adding extra tests but could use feedback on the approach.

@djhoese, thanks for the great example code!

Fixes pydata#3068 and pydata#3311
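For context, a minimal illustration of the trick being exploited here (a hedged sketch, not code from this PR; only numpy and dask are assumed):

```python
# Illustration only: handles backed by the same dask graph share a
# deterministic `.name`, so equality can sometimes be decided without
# computing anything.
import numpy as np
import dask.array as da

x = da.from_array(np.arange(6).reshape(2, 3), chunks=(1, 3))
y = x        # same graph  -> same name
z = x + 0    # new graph   -> different name

print(x.name == y.name)  # True: provably equal, no compute needed
print(x.name == z.name)  # False: inconclusive, fall back to a real comparison
```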
xarray/core/duck_array_ops.py
    ):
        # GH3068
        if arr1.name == arr2.name:
            return True
Member

This is looking a little repetitive. Can you make a helper function for doing these checks?

Also note the similar logic (checking object identity) inside Variable.equals

Contributor Author

Done. We can just call lazy_array_equiv in all these before doing the actual value comparison. Then only the asarray cast is duplicated.
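In outline, the pattern described here might look like the following sketch (not the exact xarray code; it assumes the lazy_array_equiv helper sketched a little further down and glosses over details such as NaN handling for non-float dtypes):

```python
import numpy as np

def array_equiv(arr1, arr2):
    """Sketch: NaN-tolerant equality that tries the cheap lazy check first
    and only computes when that check is inconclusive."""
    lazy_equal = lazy_array_equiv(arr1, arr2)  # True, False, or None
    if lazy_equal is not None:
        return lazy_equal
    # Only this asarray cast and the elementwise comparison remain duplicated
    # across the various *_equiv helpers.
    arr1, arr2 = np.asarray(arr1), np.asarray(arr2)
    both_nan = np.zeros(arr1.shape, dtype=bool)
    if arr1.dtype.kind == "f" and arr2.dtype.kind == "f":
        both_nan = np.isnan(arr1) & np.isnan(arr2)
    return bool(((arr1 == arr2) | both_nan).all())
```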

Re Variable.equals:

xarray/xarray/core/variable.py

Lines 1560 to 1576 in fb0cf7b

def equals(self, other, equiv=duck_array_ops.array_equiv):
    """True if two Variables have the same dimensions and values;
    otherwise False.

    Variables can still be equal (like pandas objects) if they have NaN
    values in the same locations.

    This method is necessary because `v1 == v2` for Variables
    does element-wise comparisons (like numpy.ndarrays).
    """
    other = getattr(other, "variable", other)
    try:
        return self.dims == other.dims and (
            self._data is other._data or equiv(self.data, other.data)
        )
    except (TypeError, AttributeError):
        return False

equiv=duck_array_ops.array_equiv is the default, and that now calls duck_array_ops.lazy_array_equiv first, so things should be good?

Contributor Author

OK I've added the identity check to lazy_array_equiv too
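For readers following along, a hedged sketch of what such a helper could look like with the identity check included; the real implementation lives in xarray/core/duck_array_ops.py and differs in its details:

```python
def lazy_array_equiv(arr1, arr2):
    """Three-state lazy comparison (sketch).

    Returns True if equality is provable without computing, False if
    inequality is provable, and None if a real comparison is needed.
    """
    if arr1 is arr2:
        return True  # the identity check mentioned above
    if hasattr(arr1, "shape") and hasattr(arr2, "shape") and arr1.shape != arr2.shape:
        return False
    try:
        from dask.array import Array as DaskArray
    except ImportError:
        return None
    if isinstance(arr1, DaskArray) and isinstance(arr2, DaskArray):
        if arr1.name == arr2.name:  # same graph, same values (GH3068)
            return True
    return None
```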

* upstream/master:
  Remove deprecated behavior from dataset.drop docstring (pydata#3451)
  jupyterlab dark theme (pydata#3443)
  Drop groups associated with nans in group variable (pydata#3406)
  Allow ellipsis (...) in transpose (pydata#3421)
  Another groupby.reduce bugfix. (pydata#3403)
  add icomoon license (pydata#3448)
@dcherian changed the title from "[WIP] Optimize dask array equality checks." to "Optimize dask array equality checks." on Oct 29, 2019
@dcherian (Contributor Author)

Ready for final review and merge.

@crusaderky crusaderky mentioned this pull request Oct 29, 2019
* upstream/master:
  upgrade black verison to 19.10b0 (pydata#3456)
  Remove outdated code related to compatibility with netcdftime (pydata#3450)
@pep8speaks commented Oct 31, 2019

Hello @dcherian! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-01 03:20:32 UTC

     for var in variables[1:]:
-        equals = getattr(out, compat)(var)
+        equals = getattr(out, compat)(var, equiv=lazy_array_equiv)
         if not equals:
Member

Should this be if equals is not None?

Contributor Author

I think this is right (as above) but the next one is wrong. I've fixed that and added a test.

    equals[k] = getattr(variables[0], compat)(
        var, equiv=lazy_array_equiv
    )
    if not equals[k]:
Member

same concern as below -- should this be checking against None explicitly, which I believe lazy_array_equiv can return?

Contributor Author

I think this is right. We are exiting early because equals[k] is not True, i.e. either the shapes are not equal, one (or both) of the arrays is a numpy array, or the dask names are not equal.
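To spell out why the None case keeps coming up in these threads (illustrative only; compare_lazily is a hypothetical stand-in for a compat check routed through lazy_array_equiv):

```python
def compare_lazily(a, b):
    # Hypothetical stand-in: like a compat check with equiv=lazy_array_equiv,
    # it can return True, False, or None ("cannot decide without computing").
    return None

equals = compare_lazily(object(), object())

if equals is False:
    print("provably different: safe to stop comparing early")
elif equals is None:
    print("undecided: a computed comparison is still required")

# `if not equals:` lumps the two branches above together because None is
# falsy -- fine where anything short of a proven match should end the fast
# path, wrong where None must instead trigger a real (computed) comparison.
```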

@dcherian (Contributor Author) left a comment

Thanks @shoyer.

I've added more comments and made the if conditions more explicit so the logic is (hopefully) clearer now. I did find a bug in merge.py and fixed it. I've added better tests too.


@shoyer (Member) left a comment

This looks good to me.

Thanks for adding the explicit checks is not True, etc. It's definitely good to be explicit about exactly what conditions we are checking rather than relying upon "implicit false" here (even though that's the normal Python idiom).

@dcherian (Contributor Author) commented Nov 5, 2019

Thanks for the review @shoyer

> Thanks for adding the explicit checks is not True, etc. It's definitely good to be explicit

I myself was confused when I went through it the second time :)

@dcherian dcherian merged commit af28c6b into pydata:master Nov 5, 2019
@dcherian dcherian deleted the fix/dask-computes branch November 5, 2019 15:41
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 7, 2019
…tests

* upstream/master:
  drop_vars; deprecate drop for variables (pydata#3475)
  uamiv test using only raw uamiv variables (pydata#3485)
  Optimize dask array equality checks. (pydata#3453)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 8, 2019
* upstream/master: (27 commits)
  drop_vars; deprecate drop for variables (pydata#3475)
  uamiv test using only raw uamiv variables (pydata#3485)
  Optimize dask array equality checks. (pydata#3453)
  Propagate indexes in DataArray binary operations. (pydata#3481)
  python 3.8 tests (pydata#3477)
  __dask_tokenize__ (pydata#3446)
  Type check sentinel values (pydata#3472)
  Fix typo in docstring (pydata#3474)
  fix test suite warnings re `drop` (pydata#3460)
  Fix integrate docs (pydata#3469)
  Fix leap year condition in monthly means example (pydata#3464)
  Hypothesis tests for roundtrip to & from pandas (pydata#3285)
  unpin cftime (pydata#3463)
  Cleanup whatsnew (pydata#3462)
  enable xr.ALL_DIMS in xr.dot (pydata#3424)
  Merge stable into master (pydata#3457)
  upgrade black verison to 19.10b0 (pydata#3456)
  Remove outdated code related to compatibility with netcdftime (pydata#3450)
  Remove deprecated behavior from dataset.drop docstring (pydata#3451)
  jupyterlab dark theme (pydata#3443)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 12, 2019
* upstream/master:
  add missing pint integration tests (pydata#3508)
  DOC: update bottleneck repo url (pydata#3507)
  add drop_sel, drop_vars, map to api.rst (pydata#3506)
  remove syntax warning (pydata#3505)
  Dataset.map, GroupBy.map, Resample.map (pydata#3459)
  tests for datasets with units (pydata#3447)
  fix pandas-dev tests (pydata#3491)
  unpin pseudonetcdf (pydata#3496)
  whatsnew corrections (pydata#3494)
  drop_vars; deprecate drop for variables (pydata#3475)
  uamiv test using only raw uamiv variables (pydata#3485)
  Optimize dask array equality checks. (pydata#3453)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 13, 2019
* upstream/master:
  format indexing.rst code with black (pydata#3511)
  add missing pint integration tests (pydata#3508)
  DOC: update bottleneck repo url (pydata#3507)
  add drop_sel, drop_vars, map to api.rst (pydata#3506)
  remove syntax warning (pydata#3505)
  Dataset.map, GroupBy.map, Resample.map (pydata#3459)
  tests for datasets with units (pydata#3447)
  fix pandas-dev tests (pydata#3491)
  unpin pseudonetcdf (pydata#3496)
  whatsnew corrections (pydata#3494)
  drop_vars; deprecate drop for variables (pydata#3475)
  uamiv test using only raw uamiv variables (pydata#3485)
  Optimize dask array equality checks. (pydata#3453)
  Propagate indexes in DataArray binary operations. (pydata#3481)
  python 3.8 tests (pydata#3477)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 13, 2019
commit d430ae0
Author: dcherian <deepak@cherian.net>
Date:   Wed Nov 13 08:27:04 2019 -0700

    proper fix.

commit 7fd69be
Author: dcherian <deepak@cherian.net>
Date:   Wed Nov 13 08:03:26 2019 -0700

    fix whats-new merge.

commit 4489394
Merge: 279ff1d b74f80c
Author: dcherian <deepak@cherian.net>
Date:   Wed Nov 13 08:03:06 2019 -0700

    Merge remote-tracking branch 'upstream/master' into fix/plot-broadcast

    * upstream/master:
      format indexing.rst code with black (pydata#3511)
      add missing pint integration tests (pydata#3508)
      DOC: update bottleneck repo url (pydata#3507)
      add drop_sel, drop_vars, map to api.rst (pydata#3506)
      remove syntax warning (pydata#3505)
      Dataset.map, GroupBy.map, Resample.map (pydata#3459)
      tests for datasets with units (pydata#3447)
      fix pandas-dev tests (pydata#3491)
      unpin pseudonetcdf (pydata#3496)
      whatsnew corrections (pydata#3494)
      drop_vars; deprecate drop for variables (pydata#3475)
      uamiv test using only raw uamiv variables (pydata#3485)
      Optimize dask array equality checks. (pydata#3453)
      Propagate indexes in DataArray binary operations. (pydata#3481)
      python 3.8 tests (pydata#3477)

commit 279ff1d
Author: dcherian <deepak@cherian.net>
Date:   Wed Nov 13 08:02:44 2019 -0700

    Undo the transpose change and add test to make sure transposition is right.

commit c9cc698
Author: dcherian <deepak@cherian.net>
Date:   Wed Nov 13 08:01:39 2019 -0700

    Test to make sure transpose is right

commit 9b35ecf
Author: dcherian <deepak@cherian.net>
Date:   Sat Nov 2 15:49:08 2019 -0600

    Additional test.

commit 7aed950
Author: dcherian <deepak@cherian.net>
Date:   Sat Nov 2 15:20:07 2019 -0600

    make plotting work with transposed nondim coords.