Rolling() gives values different from pd.rolling() #5877

Open
chiaral opened this issue Oct 19, 2021 · 4 comments

@chiaral
Contributor

chiaral commented Oct 19, 2021

I am not sure this is a bug, but it clearly doesn't give the results the user would expect.

The rolling sum of zeros gives me values that are not zero:

import numpy as np
import xarray as xr

var = np.array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.31      , 0.91999996, 8.3       ,
       1.42      , 0.03      , 1.22      , 0.09999999, 0.14      ,
       0.13      , 0.        , 0.12      , 0.03      , 2.53      ,
       0.        , 0.19999999, 0.19999999, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ],
               dtype='float32')

timet = np.array([  43200000000000,  129600000000000,  216000000000000,  302400000000000,
        388800000000000,  475200000000000,  561600000000000,  648000000000000,
        734400000000000,  820800000000000,  907200000000000,  993600000000000,
       1080000000000000, 1166400000000000, 1252800000000000, 1339200000000000,
       1425600000000000, 1512000000000000, 1598400000000000, 1684800000000000,
       1771200000000000, 1857600000000000, 1944000000000000, 2030400000000000,
       2116800000000000, 2203200000000000, 2289600000000000, 2376000000000000,
       2462400000000000, 2548800000000000, 2635200000000000, 2721600000000000,
       2808000000000000, 2894400000000000, 2980800000000000],
      dtype='timedelta64[ns]')

ds_ex = xr.Dataset(
    data_vars=dict(pr=(["time"], var)),
    coords=dict(time=("time", timet)),
)

ds_ex.rolling(time=3).sum().pr.values

It gives me this result:

array([ nan, nan, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 3.1000000e-01,
1.2300000e+00, 9.5300007e+00, 1.0640000e+01, 9.7500000e+00,
2.6700001e+00, 1.3500001e+00, 1.4600002e+00, 3.7000012e-01,
2.7000013e-01, 2.5000012e-01, 1.5000013e-01, 2.6800001e+00,
2.5600002e+00, 2.7300003e+00, 4.0000033e-01, 4.0000033e-01,
2.0000035e-01, 3.5762787e-07, 3.5762787e-07, 3.5762787e-07,
3.5762787e-07, 3.5762787e-07, 3.5762787e-07, 3.5762787e-07,
3.5762787e-07, 3.5762787e-07, 3.5762787e-07
], dtype=float32)

Note the non-zero values: the non-zero value changes depending on whether I use float64 or float32 as the precision of my data, so this seems to be a precision-related issue (although the first values are correctly set to zero). In fact, other sums of values are also not exactly what they should be.

The small difference at the 8th/9th decimal position can be expected due to precision, but the fact that the 0s become non-zero is problematic imho, especially if not documented. Oftentimes zero in geoscience data means something very specific (e.g. zero rainfall is characterized differently than non-zero rainfall).

In pandas, by contrast, this works:

df_ex = ds_ex.to_dataframe()
df_ex.rolling(window=3).sum().values.T

gives me

array([[ nan, nan, 0. , 0. , 0. ,
0. , 0. , 0.31 , 1.22999996, 9.53000015,
10.6400001 , 9.75000015, 2.66999999, 1.35000001, 1.46000002,
0.36999998, 0.27 , 0.24999999, 0.15 , 2.67999997,
2.55999997, 2.72999996, 0.39999998, 0.39999998, 0.19999999,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ]])

What you expected to happen:

The sum of zeros should be zero.
If this cannot be achieved because of precision issues, it should be documented.

Anything else we need to know?:

I discovered this behavior in my old environments, but I created a new ad hoc environment with the latest versions, and it does the same thing.
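
A possible workaround, sketched under the assumption that xarray's ordinary reductions go through numpy rather than bottleneck: materialize the windows with Rolling.construct and sum over the window dimension.

# Workaround sketch: build an explicit window dimension, then reduce it
# with xarray's regular sum (which, as far as I can tell, uses numpy
# rather than bottleneck's moving-window sum).
windowed = ds_ex.rolling(time=3).construct("window")

# skipna=False keeps NaN for the incomplete leading windows, matching
# rolling(time=3).sum() with the default min_periods.
result = windowed.sum("window", skipna=False)
print(result.pr.values)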

Environment:

INSTALLED VERSIONS

commit: None
python: 3.9.7 (default, Sep 16 2021, 08:50:36)
[Clang 10.0.0 ]
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 0.19.0
pandas: 1.3.3
numpy: 1.21.2
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 58.0.4
pip: 21.2.4
conda: None
pytest: None
IPython: 7.28.0
sphinx: None

@chiaral
Contributor Author

chiaral commented Oct 20, 2021

Adding a few extra observations:

ds_ex.rolling(time=3).mean().pr.values
df_ex.rolling(window=3).mean().values.T

have a similar behaviour: once again xr.rolling() doesn't produce zero where it should, but pd.rolling() does.

But when I switch to other operations, like var or std, the behaviour is the opposite:

ds_ex.rolling(time=3).std().pr.values

array([ nan, nan, 0. , 0. , 0. ,
0. , 0. , 0.1461354 , 0.38218665, 3.631293 ,
3.367307 , 3.6156974 , 0.61356837, 0.54522127, 0.5188016 ,
0.01698606, 0.06376763, 0.05906381, 0.05098677, 1.157881 ,
1.1856455 , 1.148419 , 0.09427918, 0.09427918, 0.09427926,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ],
dtype=float32)

whereas

df_ex.rolling(window=3).std().values.T

gives

array([[ nan, nan, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.78978585e-01,
4.68081166e-01, 4.44740760e+00, 4.12409195e+00, 4.42830679e+00,
7.51465227e-01, 6.67757461e-01, 6.35400157e-01, 2.08166670e-02,
7.81024957e-02, 7.23417792e-02, 6.24499786e-02, 1.41810905e+00,
1.45211339e+00, 1.40652052e+00, 1.15470047e-01, 1.15470047e-01,
1.15470047e-01, 9.60572442e-08, 9.60572442e-08, 9.60572442e-08,
9.60572442e-08, 9.60572442e-08, 9.60572442e-08, 9.60572442e-08,
9.60572442e-08, 9.60572442e-08, 9.60572442e-08]])

@mathause
Collaborator

Thanks for the report. Without testing anything, I suspect that this is due to the use of float32 data and/or bottleneck; see also #1346. You can test this by uninstalling bottleneck (there is an option to disable bottleneck, but it's not yet released; see #5560).
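
For reference, once that option is released, disabling bottleneck should look roughly like the sketch below; the use_bottleneck option name is taken from #5560 and should be treated as an assumption until it lands in a release:

import xarray as xr

# Assumed API from pydata/xarray#5560 (unreleased at the time of writing):
# make rolling reductions fall back to numpy instead of bottleneck.
with xr.set_options(use_bottleneck=False):
    print(ds_ex.rolling(time=3).sum().pr.values)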

@chiaral
Contributor Author

chiaral commented Oct 20, 2021

Yup, I just followed your suggestion:

  1. conda remove bottleneck (which removed xarray and pandas as well)
  2. conda install xarray (which installed xarray, pandas, and pytz)

and now xr.rolling(time=3).sum() yields:

array([ nan, nan, 0. , 0. , 0. ,
0. , 0. , 0.31 , 1.23 , 9.530001 ,
10.64 , 9.75 , 2.67 , 1.35 , 1.46 ,
0.36999997, 0.26999998, 0.25 , 0.14999999, 2.68 ,
2.56 , 2.73 , 0.39999998, 0.39999998, 0.19999999,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ],
dtype=float32)

Could you elaborate more on the issue? Is it caused by some bouncing between precisions across packages? And why do I get zeros at the beginning of the rolling sum but non-zeros after a sum has been computed? The behaviour is not consistent.

Thanks tho!

@mathause
Collaborator

AFAIK bottleneck uses a less precise algorithm for sums than numpy (pydata/bottleneck#379). However, I don't know why this yields 0 at the beginning but not at the end.

A slightly more minimal example:

import bottleneck as bn
import numpy as np
import pandas as pd

data = np.array(
    [
        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
        0.31, 0.91999996, 8.3, 1.42, 0.03, 1.22, 0.09999999,
        0.14, 0.13, 0.0, 0.12, 0.03, 2.53, 0.0,
        0.19999999, 0.19999999, 0.0, 0.0, 0.0, 0.0, 0.0,
        0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
    ],
    dtype="float32",
)

bn.move_sum(data, window=3)
pd.Series(data).rolling(3).sum()
np.convolve(data, np.ones(3), 'valid')
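
For what it's worth, here is a minimal sketch of why a running-accumulator moving sum stays exactly zero at the start but not at the end. Assuming (per pydata/bottleneck#379) that bottleneck keeps a single running total, the loop below illustrates the technique; it is not bottleneck's actual source, and naive_move_sum is a hypothetical helper:

import numpy as np

def naive_move_sum(a, window):
    # Keep one running total: add the value entering the window and
    # subtract the value leaving it. Any float32 rounding error picked up
    # along the way stays in the accumulator instead of cancelling.
    out = np.full(a.shape, np.nan, dtype=a.dtype)
    acc = a.dtype.type(0)
    for i in range(len(a)):
        acc += a[i]                  # value entering the window
        if i >= window:
            acc -= a[i - window]     # value leaving the window
        if i >= window - 1:
            out[i] = acc
    return out

data = np.array([0, 0, 0, 0.31, 0.91999996, 8.3, 0, 0, 0, 0], dtype="float32")
print(naive_move_sum(data, 3))
# The leading zeros are exact because nothing non-zero has entered the
# accumulator yet; the trailing "zeros" typically carry the leftover
# rounding residue (on the order of 1e-7 for float32).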
