nbytes does not return the true size for sparse variables #4842

Closed
Huite opened this issue Jan 25, 2021 · 2 comments · Fixed by #6797

Comments

@Huite
Contributor

Huite commented Jan 25, 2021

This wasn't entirely surprising to me, but nbytes currently doesn't return the right value for sparse data -- at least, I think nbytes should show the actual size in memory?

The cause is that nbytes is computed from size here:

return self.size * self.dtype.itemsize

rather than from something like data.nnz, which of course only exists for sparse arrays.
I'm not sure whether there's a sparse flag to check, or whether you'd have to do a typecheck.
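To make the gap concrete, here is a minimal sketch using plain NumPy arrays as a stand-in for a sparse array. It assumes a COO-style layout (the stored values plus one integer index per dimension for each value). The shape, the number of stored elements, and the variable names are all made up for illustration; this is not the sparse library's own accounting.

```python
import math

import numpy as np

# Hypothetical stand-in for a COO-style sparse array: only the stored values
# and their integer coordinates are kept, not the full dense grid.
shape = (10, 1_000, 1_000)
nnz = 1_000                                            # stored elements (assumed)
data = np.full(nnz, 10.0)                              # float64 values
coords = np.zeros((len(shape), nnz), dtype=np.int64)   # one index row per dimension

# What the formula quoted above computes: the size of the equivalent dense array.
dense_nbytes = math.prod(shape) * data.dtype.itemsize

# What is actually held in memory under the assumed layout.
stored_nbytes = data.nbytes + coords.nbytes

print(dense_nbytes)   # 80000000
print(stored_nbytes)  # 32000
```

Under these assumptions the dense formula overstates the footprint by a factor of 2500.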

Minimal Complete Verifiable Example:

import pandas as pd
import numpy as np
import xarray as xr


df = pd.DataFrame()
df["x"] = np.repeat(np.random.rand(10_000), 10)
df["y"] = np.repeat(np.random.rand(10_000), 10)
df["time"] = np.tile(pd.date_range("2000-01-01", "2000-03-10", freq="W"), 10_000)
df["rate"] = 10.0
df = df.set_index(["time", "y", "x"])

sparse_ds = xr.Dataset.from_dataframe(df, sparse=True)
print(sparse_ds["rate"].nbytes)

Output:

8000000000
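The reported figure can be reproduced by hand as a rough check. The dimension lengths below are read off the example: date_range with freq="W" yields 10 weekly timestamps in that interval, and the 10_000 random x and y values are assumed to be unique. The variable names are illustrative only.

```python
# Dimension lengths after set_index(["time", "y", "x"]) and from_dataframe.
n_time, n_y, n_x = 10, 10_000, 10_000
itemsize = 8  # bytes per float64 value

# Dense size, i.e. what size * dtype.itemsize evaluates to.
dense_elements = n_time * n_y * n_x
dense_nbytes = dense_elements * itemsize

# Number of values that actually exist: one per row of the DataFrame.
n_stored = 10_000 * 10

print(dense_nbytes)                 # 8000000000
print(n_stored)                     # 100000
print(dense_elements // n_stored)   # 10000
```

So the 8 GB figure describes a dense grid with 10,000 times more cells than there are stored values.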

Anything else we need to know?:

Environment:

Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.9 (default, Aug 31 2020, 17:10:11) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.16.1
pandas: 1.1.2
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.2
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.27.0
distributed: 2.30.1
matplotlib: 3.3.1
cartopy: None
seaborn: 0.11.0
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.3.3
conda: None
pytest: 6.1.0
IPython: 7.19.0
sphinx: 3.2.1
@benbovy
Member

benbovy commented Jan 25, 2021

I guess it could be possible to do a typecheck for this case. It's tricky to have a smart nbytes that works in all cases, though. For example, it's not really possible to get this information for dask arrays with sparse chunks (dask/dask#5313).

dcherian added the "topic-arrays" label (related to flexible array support) on Jan 25, 2021
@shoyer
Member

shoyer commented Jan 28, 2021

We should probably be pulling xarray's nbytes from the nbytes attribute on the underlying arrays. Then at least we can leave this calculation up to the backend array.
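A minimal sketch of that idea, assuming the wrapped array may or may not expose nbytes: prefer the array's own value and fall back to the dense formula otherwise. FakeSparseArray and array_nbytes are hypothetical names used only for illustration; this is not presented as xarray's actual implementation.

```python
import math

import numpy as np


class FakeSparseArray:
    """Hypothetical duck array that stores a few values and reports its own nbytes."""

    def __init__(self, shape, values, coords):
        self.shape = shape
        self.dtype = values.dtype
        self._values = values
        self._coords = coords

    @property
    def size(self):
        # Logical number of elements, as for a dense array of the same shape.
        return math.prod(self.shape)

    @property
    def nbytes(self):
        # Bytes actually stored: the values plus their coordinates.
        return self._values.nbytes + self._coords.nbytes


def array_nbytes(data):
    """Use the array's own nbytes if it has one, else the dense estimate."""
    nbytes = getattr(data, "nbytes", None)
    if nbytes is not None:
        return nbytes
    return data.size * data.dtype.itemsize


dense = np.zeros((100, 100))
sparse_like = FakeSparseArray(
    shape=(100, 100),
    values=np.ones(5),
    coords=np.zeros((2, 5), dtype=np.int64),
)

print(array_nbytes(dense))        # 80000
print(array_nbytes(sparse_like))  # 120
```

With this approach the wrapper needs no typecheck; each array type stays responsible for reporting its own footprint, and arrays that cannot report one keep the previous behaviour.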
