nbytes does not return the true size for sparse variables #4842

Huite · 2021-01-25T10:17:56Z

This wasn't entirely surprising to me, but nbytes currently doesn't return the right value for sparse data -- at least, I think nbytes should show the actual size in memory?

Since it uses size here:

xarray/xarray/core/variable.py

Line 349 in a0c71c1

return self.size * self.dtype.itemsize

Rather than something like data.nnz, which of course only exists for sparse arrays...
I'm not sure if there's a sparse flag or something, or whether you'd have to do a typecheck?

Minimal Complete Verifiable Example:

import pandas as pd
import numpy as np
import xarray as xr


df = pd.DataFrame()
df["x"] = np.repeat(np.random.rand(10_000), 10)
df["y"] = np.repeat(np.random.rand(10_000), 10)
df["time"] = np.tile(pd.date_range("2000-01-01", "2000-03-10", freq="W"), 10_000)
df["rate"] = 10.0
df = df.set_index(["time", "y", "x"])

sparse_ds = xr.Dataset.from_dataframe(df, sparse=True)
print(sparse_ds["rate"].nbytes)

8000000000
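For reference, the reported number can be reproduced by hand. The sketch below is plain arithmetic on the shapes from the example above. The 8-byte index width used for the sparse estimate is an assumption about a COO-style layout, not something taken from the sparse library, so treat the second figure as a rough order of magnitude only.

```python
# Shape of the dense-equivalent array built by from_dataframe:
# 10 weekly timestamps x 10_000 unique y values x 10_000 unique x values.
n_time, n_y, n_x = 10, 10_000, 10_000
itemsize = 8  # bytes per float64 element

# What size * dtype.itemsize reports.
dense_nbytes = n_time * n_y * n_x * itemsize

# Rough estimate of what a coordinate-format array would hold:
# one value plus one index per dimension for each stored element.
nnz = 100_000        # one stored element per row of the DataFrame
ndim = 3
index_itemsize = 8   # ASSUMPTION: 64-bit integer coordinates
coo_nbytes = nnz * itemsize + nnz * ndim * index_itemsize

print(dense_nbytes)  # 8000000000
print(coo_nbytes)    # 3200000
```

Under these assumptions the reported size is about 2500 times larger than what is actually stored.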

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.9 (default, Aug 31 2020, 17:10:11) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.16.1
pandas: 1.1.2
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.2
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.27.0
distributed: 2.30.1
matplotlib: 3.3.1
cartopy: None
seaborn: 0.11.0
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.3.3
conda: None
pytest: 6.1.0
IPython: 7.19.0
sphinx: 3.2.1

benbovy · 2021-01-25T10:56:05Z

I guess it could be possible to do a typecheck for this case. It's tricky to have a smart nbytes that works in all cases, though. For example, it's not really possible to get this information for dask arrays with sparse chunks (dask/dask#5313).
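A minimal sketch of what such a typecheck could look like. `ToyCOO` and `nbytes_with_typecheck` are hypothetical names used only for illustration; `ToyCOO` is a stand-in for a real sparse array type and is not the sparse library's API.

```python
import numpy as np


class ToyCOO:
    """Hypothetical coordinate-format array (illustration only)."""

    def __init__(self, shape, coords, data):
        self.shape = tuple(shape)
        self.coords = np.asarray(coords)  # shape (ndim, nnz)
        self.data = np.asarray(data)      # shape (nnz,)

    @property
    def nnz(self):
        # Number of explicitly stored elements.
        return self.data.shape[0]


def nbytes_with_typecheck(arr):
    """Return the bytes actually held in memory, branching on array type."""
    if isinstance(arr, ToyCOO):
        # Stored values plus their coordinates.
        return arr.data.nbytes + arr.coords.nbytes
    if isinstance(arr, np.ndarray):
        # Dense case: the existing size * itemsize calculation.
        return arr.size * arr.dtype.itemsize
    raise TypeError(f"unsupported array type: {type(arr).__name__}")


dense = np.zeros((2, 3))
sparse_like = ToyCOO(
    shape=(10, 100, 100),
    coords=np.zeros((3, 5), dtype=np.int64),
    data=np.full(5, 10.0),
)
print(nbytes_with_typecheck(dense))        # 48
print(nbytes_with_typecheck(sparse_like))  # 160
```

The sketch also shows why this is tricky in general: every supported array type needs its own branch, and a lazily evaluated array has no branch that could give the right answer without computing its chunks.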

shoyer · 2021-01-28T16:18:16Z

We should probably be pulling xarray's nbytes from the nbytes attribute on the underlying arrays. Then at least we can leave this calculation up to the backend array.
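A rough sketch of that delegation, not necessarily how xarray ends up implementing it. `backend_nbytes` and `ToySparse` are hypothetical names; `ToySparse` only mimics a backend array that knows its own memory footprint.

```python
import numpy as np


class ToySparse:
    """Hypothetical backend array that reports its own nbytes."""

    def __init__(self, shape, coords, values):
        self.shape = tuple(shape)
        self.coords = np.asarray(coords, dtype=np.int64)
        self.values = np.asarray(values, dtype=np.float64)
        self.dtype = self.values.dtype

    @property
    def size(self):
        # Number of elements in the dense-equivalent array.
        return int(np.prod(self.shape))

    @property
    def nbytes(self):
        # Only what is actually stored: coordinates plus values.
        return self.coords.nbytes + self.values.nbytes


def backend_nbytes(data):
    """Prefer the array's own nbytes; fall back to size * itemsize."""
    nbytes = getattr(data, "nbytes", None)
    if nbytes is not None:
        return int(nbytes)
    return data.size * data.dtype.itemsize


dense = np.zeros((10, 100, 100))
toy = ToySparse((10, 100, 100), coords=np.zeros((3, 5)), values=np.ones(5))
print(backend_nbytes(dense))  # 800000
print(backend_nbytes(toy))    # 160
```

Compared with a typecheck, this keeps the wrapper ignorant of specific array types: any duck array that exposes nbytes is handled, and the old calculation remains as a fallback.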

dcherian added the topic-arrays related to flexible array support label Jan 25, 2021

dcherian mentioned this issue May 3, 2022

DataArray.nbytes listed twice in API doc block #6565

Closed

dcherian added contrib-help-wanted contrib-good-first-issue labels Jul 14, 2022

maxrjones mentioned this issue Jul 16, 2022

Pull xarray's nbytes from nbytes attribute on arrays #6797

Merged

dcherian closed this as completed in #6797 Jul 22, 2022

hmaarrfk mentioned this issue Dec 5, 2022

Avoid loading entire dataset by getting the nbytes in an array #7356

Merged

