Inaccuracy of some operations (e.g. stacked average) when underlying data is float32 #217

Closed
chuaxr opened this issue Oct 14, 2017 · 10 comments


@chuaxr

chuaxr commented Oct 14, 2017

If no data is missing in the underlying dataset, then the stacked average (#208) and reg.av should yield the same answer. (In my case, doing a reg.av automatically results in a time average, since all data is output within the same year).

However, this is not the case.

Using the full data file (1000 time steps at 20s, ~5h on a 96x96 horizontal grid), I obtain the following:
[plot: regav_sav]

With only the first hour of data, the difference isn't as large:
[plot: regav_sav_1h]

Adding a typecast in stack_mean (see code below) appears to fix this (see the final plot), presumably because the model output is float32 and reg.av does a typecast somewhere whereas the stacked average did not. Since any errors here contaminate all of my analyses, I would very much appreciate knowing whether there are other fixes I should make. For example, I may need to average over days at a very high output frequency (e.g. 20 s); in that case, would numpy float precision be enough, or would I have to do the averaging within the model to keep the data size manageable?

def stack_mean(self, var):
    """Take the mean of a variable over stacked dimensions.

    Meant for masked data.

    Parameters
    ----------
    var : DataArray
        The variable to take the mean of.
    """
    # Cast up from float32 so the reduction accumulates in float64.
    var = var.astype(float)
    var = var.stack(sdim=[internal_names.TIME_STR, internal_names.LAT_STR,
                          internal_names.LON_STR]).mean('sdim')
    return var

Same as the first plot, except with modified stack_mean
[plot: regav_sav_5h_float]

@spencerkclark
Collaborator

@chuaxr we do not do any explicit typecasting in aospy, so whatever is happening must be going on implicitly somewhere. Am I interpreting things correctly that you would like to use the float32 datatype throughout your calculations (and that currently values are being cast up to float64 somewhere in the computation of regional means)?

@chuaxr
Author

chuaxr commented Oct 15, 2017

@spencerkclark To follow up on our discussion yesterday (see /home/xrc/WRF/aospy_code/float_issue.ipynb for the actual calculations):

  1. The issue is reproducible outside aospy.

  2. It's mentioned in the documentation for numpy.mean, but using randomly generated data did not give differences of that order of magnitude.

  3. Taking the mean of the values of the DataArray after stacking gives different results from using .mean, so perhaps it has something to do with xarray. Wrapping randomly generated data in a DataArray did not yield such differences either.

Either way, it seems that casting all relevant variables to float would circumvent these problems. However, casting the entire dataset is inefficient, since every variable in the dataset would be cast when only one might be used. Perhaps variables could be cast to float in _sel_var in data_loader.py?
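
For illustration, a minimal sketch of that idea (the helper name below is made up, not aospy's actual _sel_var, and where exactly it would hook into data_loader.py is an open question):

import numpy as np
import xarray as xr


def _select_var_as_float64(ds, name):
    """Hypothetical helper: select a single variable and cast it if needed.

    Casting only the selected DataArray avoids copying every variable in
    the Dataset to float64 when just one is used.
    """
    da = ds[name]
    if da.dtype == np.float32:
        da = da.astype(np.float64)
    return da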

Re your comment: my main aim is to avoid my results being contaminated by floating point issues, not that I would like to use float32 throughout. Separately, the output of reg.av is float64, although it is not clear when exactly the typecasting occurs.

@spencerkclark
Collaborator

@chuaxr thanks, this seems much clearer now. The note in the numpy documentation is particularly helpful. It does seem that one way to address this would be in DataLoader.load_variable; another possibility would be to specify an np.float64 accumulator whenever we do reduction operations. For example, from the numpy docs:

In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: a = np.zeros((2, 512*512), dtype=np.float32)

In [4]: a[0, :] = 1.0; a[1, :] = 0.1

In [5]: np.mean(a)
Out[5]: 0.54999924

In [6]: np.mean(a, dtype=np.float64)
Out[6]: 0.55000000074505806

In my reading of the xarray documentation for mean, it seems like one should be able to pass a dtype keyword argument to mean and things should work like they do above:

**kwargs : dict

Additional keyword arguments passed on to the appropriate array function for calculating mean on this object’s data.

However, that does not seem to be the case:

In [7]: da = xr.DataArray(a, dims=['x', 'y'])

In [8]: da.mean(['x', 'y'])
Out[8]:
<xarray.DataArray ()>
array(0.5499992370605469, dtype=float32)

In [9]: da.mean(['x', 'y'], dtype=np.float64)
Out[9]:
<xarray.DataArray ()>
array(0.5499992370605469, dtype=float32)

So it seems that the only immediate fix would indeed be to cast all variables to float64 upon loading them into memory:

In [10]: da.astype(np.float64).mean(['x', 'y'])
Out[10]:
<xarray.DataArray ()>
array(0.5500000007450581)

It's remarkable that it makes such a difference in your case (even in this numpy example the difference between the means is really small), but your notebook makes it pretty apparent.

@spencerahill do you have any thoughts here?

@chuaxr
Author

chuaxr commented Oct 15, 2017

@spencerkclark My issue can be reproduced as follows:

import numpy as np
import xarray as xr


def f(size):
    y = np.random.random((size, size, size, size))
    y32 = y.astype('float32')
    day = xr.DataArray(y32)
    stack_mean = day.stack(sdim=['dim_0', 'dim_1', 'dim_2', 'dim_3']).mean('sdim')
    stack_value_mean = day.stack(sdim=['dim_0', 'dim_1', 'dim_2', 'dim_3']).values.mean()
    return stack_mean, stack_value_mean

f(10)  # a small data set
(<xarray.DataArray ()>
 array(0.5006948113441467), 0.50069541)

f(100)  # a large data set of the order of magnitude I am using
(<xarray.DataArray ()>
 array(0.16777215898036957), 0.49996477)

@spencerkclark
Collaborator

spencerkclark commented Oct 15, 2017

@chuaxr I just remembered an issue in xarray directly related to this: pydata/xarray#1346

The issue is that xarray uses another library, called bottleneck, which provides a set of optimized functions for computing things like means (numpy does not use this library), and bottleneck has known issues with the float32 datatype.

So long story short, I think for our purposes casting up to float64 is the way to go.
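
For intuition on the size of the error in your f(100) example: if the summation is done naively in float32 (which appears to be what bottleneck does), the running sum stops changing once it reaches 2^24 = 16,777,216, since adding values of order 0.5 falls below the rounding threshold at that magnitude. The sum of 10^8 such values therefore stalls near 2^24, and the mean comes out around 16,777,216 / 10^8 ≈ 0.1678, which matches the value you got. A minimal sketch of the cast-first workaround, reusing the small array from the example above:

import numpy as np
import xarray as xr

# Same float32 data as in the earlier example.
a = np.zeros((2, 512 * 512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
da = xr.DataArray(a, dims=['x', 'y'])

print(da.mean(['x', 'y']).values)                     # ~0.5499992 (float32 accumulation, cf. Out[8])
print(da.astype(np.float64).mean(['x', 'y']).values)  # ~0.5500000 (float64 accumulation, cf. Out[10])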

@spencerahill
Owner

spencerahill commented Oct 16, 2017

@chuaxr @spencerkclark thanks for taking what looks like a pretty deep dive to get this sorted out.

Admittedly I have only skimmed the above discussion, but casting to float64 does indeed seem to be the best solution for this particular case, especially in light of the xarray/bottleneck problem that @spencerkclark nicely remembered.

The question in my mind, then, is whether we want to do this automatically for all calculations, i.e. not rely on users including it themselves within their Var functions. If we were to go that route, I would say we should do so only for float32 (i.e. not integers), include a log message noting the automatic casting, and include a kwarg flag that could be used to turn it off if the user desires. I'm leaning toward 'yes', but as mentioned I frankly haven't thought super carefully about this yet.
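
Something along these lines, perhaps (a rough sketch only; the function name, keyword name, and call site are placeholders, not existing aospy code):

import logging

import numpy as np


def maybe_upcast_float32(arr, upcast_float32=True):
    """Placeholder sketch: upcast float32 data to float64 before reductions.

    Integer and float64 data pass through unchanged; pass
    upcast_float32=False to disable the behavior.
    """
    if upcast_float32 and arr.dtype == np.float32:
        logging.info("Casting '%s' from float32 to float64 to avoid loss of "
                     "precision in reduction operations.",
                     getattr(arr, 'name', 'array'))
        return arr.astype(np.float64)
    return arr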

Either way, we should add a Warning in our docs about the potential for this problem with float32 data.

@spencerahill changed the title from "Typecasting in stacked average" to "Inaccuracy of some operations (e.g. stacked average) when underlying data is float32" on Oct 16, 2017
@spencerkclark
Collaborator

> If we were to go that route, I would say we should do so only for float32 (i.e. not integers), include a log message noting the automatic casting, and include a kwarg flag that could be used to turn it off if the user desires.

Thanks for weighing in @spencerahill; I think this is the way to go. Considering that this impacts reduction operations, which can be used on any variable (including model-native ones), dealing with this within Var functions would be pretty inconvenient for a user impacted by it.

> Either way, we should add a Warning in our docs about the potential for this problem with float32 data.

Very much agreed. This is a tricky issue to diagnose!
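
For what it's worth, a possible wording for that docs warning, assuming the docs stay in Sphinx/reStructuredText (the exact phrasing is just a suggestion):

.. warning::

   Reductions (e.g. means over time and/or space) of ``float32`` data can
   suffer significant round-off error, because the underlying libraries may
   accumulate in ``float32``.  If your model output is stored as
   ``float32``, consider casting it to ``float64`` before computing
   averages.  See GitHub issue #217 for details.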

@spencerkclark
Collaborator

By the way @chuaxr, thanks for reporting this. I'm sorry that this puzzling issue impacted some of the figures you showed Yi, but I'm glad we got to the bottom of it (and it's something to keep in mind even outside of using aospy).

@chuaxr
Author

chuaxr commented Oct 16, 2017

Thanks @spencerkclark for your help in getting to the bottom of this; it makes sense that my problem is consistent with what other xarray users have encountered. I never meant to imply that this was a fault of aospy per se. Rather, I raised the issue here because I figured that a workaround would likely need to be placed within aospy. It's quite likely that this problem would have (or perhaps already has) manifested at some point, aospy or no.

@spencerahill
Owner

@chuaxr no worries even if you did imply that! We want to get to the bottom of any issue affecting our users, even if, as in this case, it technically doesn't originate in our code; regardless, we need to develop a solution. We wouldn't have caught this one without your help.

So keep the bug reports/feature requests/usage questions/pretty much anything else coming! 😄
