
bottleneck : Wrong mean for float32 array #1346

Open
matteodefelice opened this issue Mar 31, 2017 · 19 comments
@matteodefelice

I think it is better to have this discussion here instead of on the dask page: dask/dask#2095

This is the reproducible "bug":

ds = xarray.open_dataset('/opt/data/ERAIN/ERAIN-t2m-1983-2012.seasmean.nc')
ds.var167.mean()
Out[14]: 
<xarray.DataArray 'var167' ()>
array(261.6441345214844)
ds.var167.data.mean()
Out[15]: 278.62466

The dataset is ~65 MB; here is the file: https://www.dropbox.com/s/xtj3fm7ihtbwd5r/ERAIN-t2m-1983-2012.seasmean.nc?dl=0
It is a quite normal NetCDF file (no NaNs), just processed with CDO, as you can see in the dask issue.

@fmaussion
Member

I can't reproduce this:

In [6]: ds = xr.open_dataset('./Downloads/ERAIN-t2m-1983-2012.seasmean.nc')

In [7]: ds.var167.mean()
Out[7]: 
<xarray.DataArray 'var167' ()>
array(278.6246643066406, dtype=float32)

In [8]: ds.var167.data.mean()
Out[8]: 278.62466

Which versions of xarray, dask, and Python are you using?

@matteodefelice
Author

OK, I am on macOS:

  • Python 2.7.13 from Macports
  • Dask 0.14.1 from Macports
  • xarray from GitHub

@andrew-c-ross
Contributor

andrew-c-ross commented Mar 31, 2017

Also on macOS, and I can reproduce.

Using python 2.7.11, xarray 0.9.1, dask 0.14.1 installed through Anaconda. I get the same results with xarray 0.9.1-38-gc0178b7 from GitHub.

In [3]: ds = xarray.open_dataset('ERAIN-t2m-1983-2012.seasmean.nc')

In [4]: ds.var167.mean()
Out[4]:
<xarray.DataArray 'var167' ()>
array(261.6441345214844, dtype=float32)

Curiously, I get the right results with skipna=False...

In [10]: ds.var167.mean(skipna=False)
Out[10]:
<xarray.DataArray 'var167' ()>
array(278.6246643066406, dtype=float32)

... or by specifying the dimensions to average over:

In [5]: ds.var167.mean(('time', 'lat', 'lon'))
Out[5]:
<xarray.DataArray 'var167' ()>
array(278.6246643066406, dtype=float32)

@fmaussion
Member

Does it make a difference if you load the data first (ds.var167.load().mean())? Or if you use Python 3?

@andrew-c-ross
Contributor

I think this might be a problem with bottleneck. My reading of _create_nan_agg_method in xarray/core/ops.py is that it may use bottleneck to compute the mean unless you pass skipna=False or specify multiple axes. And,

In [2]: import bottleneck
In [3]: bottleneck.__version__
Out[3]: '1.2.0'

In [6]: bottleneck.nanmean(ds.var167.data)
Out[6]: 261.6441345214844

Forgive me if I'm wrong, I'm still a bit new.

@shoyer
Member

shoyer commented Mar 31, 2017

Yes, this is probably related to the fact that .mean() in xarray uses bottleneck if available, and bottleneck has a slightly different mean implementation, quite possibly with a less numerically stable algorithm.

The fact that the dtype is float32 is a sign that this is probably a numerical precision issue. Try casting with .astype(np.float64) and see if the problem goes away.

If you really care about performance with float32, the other thing you can do to improve conditioning is to subtract and then add back a number close to the mean, e.g., (ds.var167 - 270).mean() + 270.
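
A minimal sketch of that shift trick against bottleneck directly (the constant 300 K float32 array is illustrative, not the ERA-Interim data):

import numpy as np
import bottleneck as bn

# Illustrative float32 array of constant "temperatures" (assumed values).
a = np.full(800 ** 2, 300.0, dtype=np.float32)

print(bn.nanmean(a))                          # drifts above 300 as float32 round-off accumulates
print(bn.nanmean(a - np.float32(300)) + 300)  # the shifted sum is exact here: 300.0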

@matteodefelice
Author

Thanks, everyone, for the replies.
@Aegaeon I get the same results as you with bottleneck...
@shoyer The point is that I never chose float32, and, yes, using .astype(np.float64) solves the issue... but this is not expected behaviour: with such a standard dataset I would not expect any problems related to numerical precision...

@shoyer
Member

shoyer commented Mar 31, 2017

@matteodefelice you didn't decide on float32, but your data is stored that way. It's really hard to make choices about numerical precision for computations automatically: if we converted automatically to float64, somebody else would be complaining about unexpected memory usage :).

Looking at our options, we could:

  1. Stop using bottleneck on float32 arrays, or provide a flag or option to disable bottleneck. This is not ideal, because bottleneck is much faster.
  2. Automatically convert float32 arrays to float64 before doing aggregations. This is not ideal, because it could significantly increase memory requirements.
  3. Add a dtype option for aggregations (like NumPy) and consider defaulting to dtype=np.float64 when aggregating float32 arrays. I would generally be happy with this, but bottleneck doesn't currently provide the option.
  4. Write a higher-precision algorithm for bottleneck's mean.

@leifdenby
Contributor

Sorry to unearth this issue again, but I just got bitten by this quite badly. I'm looking at absolute temperature perturbations, and bottleneck's implementation, combined with my data being loaded as float32 (correctly, as that is how it is stored), causes an error comparable in size to the perturbations I'm looking for.

Example:

In [1]: import numpy as np
   ...: import bottleneck

In [2]: a = 300*np.ones((800**2,), dtype=np.float32)

In [3]: np.mean(a)
Out[3]: 300.0

In [4]: bottleneck.nanmean(a)
Out[4]: 302.6018981933594

Would it be worth adding a warning (until the right solution is found) if someone is doing .mean() on a DataArray which is float32?

Based on a little experimentation (https://gist.github.com/leifdenby/8e874d3440a1ac96f96465a418f158ab), bottleneck's mean function builds up significant errors even with moderately sized arrays if they are float32, so from now on I'm going to stop using .mean() as-is and always pass dtype=np.float64.
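
A sketch of that workaround (as far as I can tell, passing dtype sends the reduction down the NumPy path, since bottleneck has no dtype argument):

import numpy as np
import xarray as xr

# Illustrative float32 DataArray (assumed values).
da = xr.DataArray(300 * np.ones((800 ** 2,), dtype=np.float32))

print(da.mean(dtype=np.float64))  # accumulates in float64 and returns 300.0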

@shoyer
Member

shoyer commented Jan 21, 2019

Would it be worth adding a warning (until the right solution is found) if someone is doing .mean() on a DataArray which is float32?

I would rather pick option (1) above, that is, "Stop using bottleneck on float32 arrays"

@aquasync

Is it worth changing bottleneck to use double for single-precision reductions? AFAICT this is a matter of changing npy_DTYPE0 to double in the float{64,32} versions of the functions in reduce_template.c.
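
In Python terms, the proposed change amounts to the following (a sketch of the idea only; the real fix would live in bottleneck's C template):

import numpy as np

def nanmean_double(a):
    # Upcast to float64 before reducing -- the Python analogue of using
    # double intermediates in the float32 loops of reduce_template.c.
    return float(np.nanmean(np.asarray(a, dtype=np.float64)))

print(nanmean_double(np.ones(2**25, dtype=np.float32)))  # 1.0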

@lumbric
Contributor

lumbric commented Feb 13, 2019

I think (!) xarray is no longer affected, but pandas is. Bisecting the Git history leads to commit 0b9ab2d, which means that xarray >= v0.10.9 should not be affected. Uninstalling bottleneck is also a valid workaround.

Bottleneck's documentation explicitly mentions that no error is raised in case of an overflow. But it seems to be very evil behavior, so it might be worth reporting upstream. What do you think? (I think pydata/bottleneck#164 is something different, isn't it?)
Edit: this is not an overflow. It's a numerical error from not applying pairwise summation.

A couple of minimal examples:

>>> import numpy as np
>>> import pandas as pd
>>> import xarray as xr
>>> import bottleneck as bn
>>> bn.nanmean(np.ones(2**25, dtype=np.float32))                                                                 
0.5
>>> pd.Series(np.ones(2**25, dtype=np.float32)).mean()                                                           
0.5
>>> xr.DataArray(np.ones(2**25, dtype=np.float32)).mean()  # not affected for this version
<xarray.DataArray ()>
array(1., dtype=float32)

Done with the following versions:

$ pip3 freeze 
Bottleneck==1.2.1
numpy==1.16.1
pandas==0.24.1
xarray==0.11.3
...

@aquasync

Ah, OK, I suppose bottleneck is indeed now avoided for float32 in xarray. Yeah, that issue is for a different function, but the source of the problem and the proposed solution in the thread are the same: use higher-precision intermediates for float32 (double arithmetic); a small speed vs. accuracy trade-off.

@lumbric
Contributor

lumbric commented Feb 15, 2019

Oh hm, I think I didn't really understand what happens in bottleneck.nanmean(). I understand that integers can overflow and that float32 values have varying absolute precision. The float32 maximum of 3.4E+38 is not hit here, so how can the mean of a list of ones be 0.5?

Isn't this what bottleneck is doing? Summing up a bunch of float32 values and then dividing by the length?

>>> d = np.ones(2**25, dtype=np.float32)
>>> d.sum()/np.float32(len(d))
1.0

@shoyer
Member

shoyer commented Feb 15, 2019

The difference is that Bottleneck does the sum in the naive way, whereas NumPy uses the more numerically stable pairwise summation.
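
To make that concrete: float32 has a 24-bit significand, so once the running sum reaches 2**24, adding 1.0 is rounded away, and a naive sum of 2**25 ones stalls at 2**24, giving a mean of exactly 0.5:

import numpy as np

s = np.float32(2 ** 24)
print(np.spacing(s))             # 2.0: neighbouring float32 values at this magnitude are 2 apart
print(s + np.float32(1.0) == s)  # True: adding 1.0 rounds back to s

# A naive float32 running sum of 2**25 ones therefore saturates at 2**24:
print(np.float32(2 ** 24) / np.float32(2 ** 25))  # 0.5, bottleneck's answer above

# NumPy's pairwise summation keeps the partial sums small enough to stay exact:
print(np.ones(2 ** 25, dtype=np.float32).sum())   # 33554432.0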

@lumbric
Copy link
Contributor

lumbric commented Feb 16, 2019

Oh yes, of course! I had underestimated the low precision of float32 values above 2**24. Thanks for the hint.

@andersy005
Member

@dcherian
Contributor

dcherian commented May 6, 2022

Yes that sounds right. Thanks!

dcherian closed this as completed May 6, 2022
dcherian changed the title Wrong mean for float32 array → bottleneck : Wrong mean for float32 array May 6, 2022
@dcherian
Contributor

dcherian commented May 6, 2022

On second thought we should add this to a FAQ page.

dcherian reopened this May 6, 2022
dcherian removed the bug label May 6, 2022