
Different data values from xarray open_mfdataset when using chunks #3686

Closed
abarciauskas-bgse opened this issue Jan 11, 2020 · 7 comments
Comments

@abarciauskas-bgse

abarciauskas-bgse commented Jan 11, 2020

MCVE Code Sample

You will first need to download the files from PO.DAAC (or mount PO.DAAC's drive), using your PO.DAAC Drive credentials:

curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

Then run the following code:

from datetime import datetime
import xarray as xr
import glob

def generate_file_list(start_doy, end_doy):
    """
    Given a start day of year and an end day of year, generate a list of file locations.
    Assumes the 'prefix' and 'year' variables have already been defined.
    'prefix' should be a local directory or an HTTP URL and path.
    'year' should be a 4-digit year.
    """
    days_of_year = list(range(start_doy, end_doy))
    fileObjs = []
    for doy in days_of_year:
        # zero-pad the day of year to three digits to match the directory names
        doy = f"{doy:03d}"
        file = glob.glob(f"{prefix}/{doy}/*.nc")[0]
        fileObjs.append(file)
    return fileObjs

# Invariants - but could be made configurable
year = 2002
prefix = f"data/mursst_netcdf"
chunks = {'time': 1, 'lat': 1799, 'lon': 3600}
# Create a list of files
start_doy = 152
num_days = 5
end_doy = start_doy + num_days
fileObjs = generate_file_list(start_doy, end_doy)
# will use this timeslice in query later on
time_slice = slice(datetime.strptime(f"{year}-06-02", '%Y-%m-%d'), datetime.strptime(f"{year}-06-04", '%Y-%m-%d'))

print("results from unchunked dataset")
ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords')
print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_unchunked.analysed_sst.sel(time=time_slice).mean().values)

print(f"results from chunked dataset using {chunks}")
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)

print("results from chunked dataset using 'auto'")
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'})
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)

Note: these are just a few examples, but I tried a variety of other chunk options and got similar discrepancies between the unchunked and chunked datasets.

Output:

results from unchunked dataset
290.13754
286.7869
results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600}
290.13757
286.81107
results from chunked dataset using 'auto'
290.1377
286.8118

Expected Output

Values output from queries of chunked and unchunked xarray dataset are equal.

Problem Description

I want to understand how to chunk or query the data so I can verify that data opened using chunks produces the same output as data opened without chunking. I would ultimately like to store the data in Zarr, but verifying data integrity is critical.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.1 | packaged by conda-forge | (default, Jan 5 2020, 20:58:18) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.154-99.181.amzn1.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.9.1
distributed: 2.9.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 44.0.0.post20200102
pip: 19.3.1
conda: None
pytest: None
IPython: 7.11.1
sphinx: None

@dmedv

dmedv commented Jan 12, 2020

Actually, that's true not just for open_mfdataset, but even for open_dataset with a single file. I've tried it with one of those files from PO.DAAC, and got similar results - slightly different values depending on the chunking strategy.

Just a guess, but I think the problem here is that the calculations are done in floating-point arithmetic (probably float32...), and you get accumulated precision errors depending on the number of chunks.
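
Just to illustrate that point (not from the thread; synthetic values, with dask used only for the demonstration): summing the same float32 values split into different chunks accumulates rounding error differently, so the means can disagree in the last digits.

import numpy as np
import dask.array as da

# One million synthetic float32 values; the exact mean depends on summation order.
values = np.random.default_rng(0).random(1_000_000).astype(np.float32)

mean_single_chunk = da.from_array(values, chunks=1_000_000).mean().compute()
mean_many_chunks = da.from_array(values, chunks=1_000).mean().compute()

print(mean_single_chunk, mean_many_chunks)  # may differ slightly in the last digits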

Internally, the NetCDF file stores the analysed_sst values as int16 together with scale and offset values, so the correct way to calculate the mean would be to do it on the original int16 data and then apply the scale and offset to the result. Automatic scaling is on by default (i.e. xarray replaces the original array values with new scaled values), but you can turn it off in open_dataset with the mask_and_scale=False option: http://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html I tried doing this, and then I got identical results with the chunked and unchunked versions. You can pass this option to open_mfdataset as well via **kwargs.
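
A rough sketch of that suggestion, reusing fileObjs and chunks from the MCVE above (names assumed from the original code):

import xarray as xr

# Open without automatic masking/scaling: analysed_sst keeps its on-disk int16 dtype,
# so the reduction is no longer sensitive to float32 rounding across chunks.
ds_raw = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks, mask_and_scale=False)
print(ds_raw.analysed_sst.dtype)  # int16

# Caveat: with mask_and_scale=False the _FillValue is NOT applied, so fill values
# must be masked manually before computing statistics (discussed further below).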

I'm basically just starting to use xarray myself, so please someone correct me if any of the above is wrong.

@abarciauskas-bgse
Author

@dmedv Thanks for this, it all makes sense to me and I see the same results; however, I wasn't able to "convert back" using scale_factor and add_offset:

from netCDF4 import Dataset
import xarray as xr

d = Dataset(fileObjs[0])
v = d.variables['analysed_sst']

print("Result with mask_and_scale=True")
ds_unchunked = xr.open_dataset(fileObjs[0])
print(ds_unchunked.analysed_sst.sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print("Result with mask_and_scale=False")
ds_unchunked = xr.open_dataset(fileObjs[0], mask_and_scale=False)
scaled = ds_unchunked.analysed_sst * v.scale_factor + v.add_offset
print(scaled.sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

^^ That returns a different result than I expect. I wonder if this is because the _FillValue is not handled when converting back.

However this led me to another seemingly related issue: #2304

Loss of precision seems to be the key here: coercing the float32s to float64s appears to give the same results from both the chunked and unchunked versions - though still not identical to the float32 results above.

print("results from unchunked dataset")
ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords')
ds_unchunked['analysed_sst'] = ds_unchunked['analysed_sst'].astype(np.float64)
print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print(f"results from chunked dataset using {chunks}")
ds_chunked = xr.open_mfdataset(fileObjs, chunks=chunks, combine='by_coords')
ds_chunked['analysed_sst'] = ds_chunked['analysed_sst'].astype(np.float64)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print("results from chunked dataset using 'auto'")
ds_chunked = xr.open_mfdataset(fileObjs, chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'}, combine='by_coords')
ds_chunked['analysed_sst'] = ds_chunked['analysed_sst'].astype(np.float64)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

returns:

results from unchunked dataset
290.1375818862207
results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600}
290.1375818862207
results from chunked dataset using 'auto'
290.1375818862207

@dmedv

dmedv commented Jan 12, 2020

@abarciauskas-bgse Yes, indeed, I forgot about _FillValue. That would mess up the mean calculation with mask_and_scale=False. I think it would be nice if it were possible to control the mask application in open_dataset separately from scale/offset.

@rabernat
Contributor

Thanks for the useful issue @abarciauskas-bgse and valuable test @dmedv.

I believe this is fundamentally a Dask issue. In general, Dask's algorithms do not guarantee numerically identical results for different chunk sizes. Roundoff errors accrue slightly differently based on how the array is split up. These errors are usually acceptable to users. For example, for 290.13754 vs. 290.13757 the error is in the 8th significant digit, about 1 part in 10,000,000. Since there are only 65,536 16-bit integers (the original data type in the netCDF file), this seems more than adequate precision to me.

Calling .mean() on a dask array is not the same as a checksum. As with all numerical calculations, equality should be verified with a precision appropriate to the data type and algorithm, e.g. using assert_allclose.
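
For example (a sketch using the dataset and slice names from the MCVE above; the tolerance here is just illustrative and should be chosen for your data type and algorithm):

import numpy as np
import xarray as xr

# Assert that the chunked and unchunked reductions agree to a relative tolerance.
np.testing.assert_allclose(
    ds_unchunked.analysed_sst[1, :, :].sel(lat=slice(20, 50), lon=slice(-170, -110)).mean().values,
    ds_chunked.analysed_sst[1, :, :].sel(lat=slice(20, 50), lon=slice(-170, -110)).mean().values,
    rtol=1e-6,
)

# Or compare whole DataArrays element-wise to a tolerance.
xr.testing.assert_allclose(ds_unchunked.analysed_sst, ds_chunked.analysed_sst, rtol=1e-5)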

There appears to be a second issue here related to fill values, but I haven't quite grasped whether we think there is a bug.

I think it would be nice if it were possible to control the mask application in open_dataset separately from scale/offset.

There may be a reason why these operations are coupled. Would have to look more closely at the code to know for sure.

@dmedv
Copy link

dmedv commented Jan 12, 2020

Actually, there is no need to separate them. One can simply do something like this to apply the mask:

ds.analysed_sst.where(ds.analysed_sst != fill_value).mean() * scale_factor + offset

It's not a bug, but if we set mask_and_scale=False, it's left up to us to apply the mask manually.
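
Something like this, for example (a sketch reusing fileObjs from the MCVE; with mask_and_scale=False the fill value, scale factor, and offset should remain in the variable's attrs, otherwise read them from the netCDF variable directly as in the earlier comment):

import xarray as xr

ds_raw = xr.open_dataset(fileObjs[0], mask_and_scale=False)
sst = ds_raw.analysed_sst

fill_value = sst.attrs['_FillValue']
scale_factor = sst.attrs['scale_factor']
add_offset = sst.attrs['add_offset']

# Mask out the fill values first, then scale the reduced result back to physical units.
masked_mean = sst.where(sst != fill_value).mean() * scale_factor + add_offset
print(masked_mean.values)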

@abarciauskas-bgse
Author

Thanks @rabernat. I would like to use assert_allclose to test the output, but at first pass it seems that might be prohibitively slow for large datasets. Do you recommend sampling or other good testing strategies (e.g. to assert that the xarray datasets are equal to some precision)?

@abarciauskas-bgse
Author

Closing, as using mask_and_scale=False produced precise results.
