Large coordinate arrays trigger computation #3454

Closed
jbusecke opened this issue Oct 29, 2019 · 2 comments · Fixed by #3453

Comments

@jbusecke
Contributor

I want to bring up an issue that has tripped up my workflow with large climate models many times. I am dealing with large data arrays of vertical cell thickness. These are 4D arrays (x, y, z, time), but I would define them as coordinates, not data_variables, in the xarray data model (e.g. they should not be multiplied by a value if a dataset is multiplied).
These sorts of coordinates might become more prevalent with newer ocean models like MOM6.

Whenever I assign these arrays as coordinates, operations on the arrays seem to trigger computation, whereas they don't if I set them up as data_variables. The example below shows this behavior.
Is this a bug or done on purpose? Is there a workaround to keep these vertical thicknesses as coordinates?

import xarray as xr
import numpy as np
import dask.array as dsa

# create dataset with vertical thickness `dz` as data variable
data = xr.DataArray(dsa.random.random([30, 50, 200, 1000]), dims=['x', 'y', 'z', 't'])
dz = xr.DataArray(dsa.random.random([30, 50, 200, 1000]), dims=['x', 'y', 'z', 't'])
ds = xr.Dataset({'data': data, 'dz': dz})

# another dataset with `dz` as coordinate
ds_new = xr.Dataset({'data': data})
ds_new.coords['dz'] = dz

%%time
test = ds['data'] * ds['dz']

CPU times: user 1.94 ms, sys: 19.1 ms, total: 21 ms
Wall time: 21.6 ms

%%time
test = ds_new['data'] * ds_new['dz']

CPU times: user 17.4 s, sys: 1.98 s, total: 19.4 s
Wall time: 12.5 s
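
Wall-clock timings are only an indirect sign that something was computed. A more direct check, sketched below, is to count how often dask is asked to compute during the multiplication. `CountingScheduler` is a hypothetical helper written for this illustration and is not part of the report; the arrays are made much smaller than the ones above so the check runs quickly. On an affected version, the counter would be expected to be non-zero for the coordinate case, and it should stay at zero once the comparison of coordinates is done lazily.

```python
import dask
import dask.array as dsa
import xarray as xr


class CountingScheduler:
    """Hypothetical helper: a dask scheduler that counts how often it is invoked."""

    def __init__(self):
        self.total_computes = 0

    def __call__(self, dsk, keys, **kwargs):
        self.total_computes += 1
        # Delegate to dask's synchronous scheduler to do the actual work.
        return dask.get(dsk, keys, **kwargs)


# Assumption: much smaller arrays than in the report, one chunk each.
shape = (3, 5, 20, 10)
dims = ["x", "y", "z", "t"]
data = xr.DataArray(dsa.random.random(shape, chunks=shape), dims=dims)
dz = xr.DataArray(dsa.random.random(shape, chunks=shape), dims=dims)

# Same setup as `ds_new` above: `dz` is a coordinate, not a data variable.
ds_new = xr.Dataset({"data": data})
ds_new.coords["dz"] = dz

counter = CountingScheduler()
with dask.config.set(scheduler=counter):
    test = ds_new["data"] * ds_new["dz"]

print("computes triggered by the multiplication:", counter.total_computes)
print("result is still dask-backed:", isinstance(test.data, dsa.Array))
```

Passing a callable as `scheduler` routes every compute through the counter, so the number does not depend on machine speed or array size.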

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.13.0+24.g4254b4af
pandas: 0.25.1
numpy: 1.17.2
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.5.0
distributed: 2.5.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: 5.2.0
IPython: 7.8.0
sphinx: None

@dcherian
Contributor

Totally fixed by #3453!!!
Both statements take the same time on that branch.

@jbusecke
Contributor Author

jbusecke commented Oct 29, 2019 via email
