
Different data values from xarray open_mfdataset when using chunks #3686

Closed
abarciauskas-bgse opened this issue Jan 11, 2020 · 7 comments
Comments

@abarciauskas-bgse

abarciauskas-bgse commented Jan 11, 2020

MCVE Code Sample

You will first need to download the files from PO.DAAC (or mount PO.DAAC's drive), using your PO.DAAC Drive credentials:

curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/153/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/154/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/155/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

curl -u USERNAME:PASSWORD --create-dirs -o data/mursst_netcdf/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/156/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

Then run the following code:

from datetime import datetime
import xarray as xr
import glob

def generate_file_list(start_doy, end_doy):
    """
    Given a start day of year and an end day of year, generate a list of file locations.
    Assumes the 'prefix' and 'year' variables have already been defined.
    'prefix' should be a local directory or an HTTP URL and path.
    'year' should be a 4-digit year.
    """
    days_of_year = list(range(start_doy, end_doy))
    fileObjs = []
    for doy in days_of_year:
        # zero-pad the day of year to three digits to match the directory names
        doy = f"{doy:03d}"
        file = glob.glob(f"{prefix}/{doy}/*.nc")[0]
        fileObjs.append(file)
    return fileObjs

# Invariants - but could be made configurable
year = 2002
prefix = f"data/mursst_netcdf"
chunks = {'time': 1, 'lat': 1799, 'lon': 3600}
# Create a list of files
start_doy = 152
num_days = 5
end_doy = start_doy + num_days
fileObjs = generate_file_list(start_doy, end_doy)
# will use this timeslice in query later on
time_slice = slice(datetime.strptime(f"{year}-06-02", '%Y-%m-%d'), datetime.strptime(f"{year}-06-04", '%Y-%m-%d'))

print("results from unchunked dataset")
ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords')
print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_unchunked.analysed_sst.sel(time=time_slice).mean().values)

print(f"results from chunked dataset using {chunks}")
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)

print("results from chunked dataset using 'auto'")
ds_chunked = xr.open_mfdataset(fileObjs, combine='by_coords', chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'})
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
print(ds_chunked.analysed_sst.sel(time=time_slice).mean().values)

Note: these are just a few examples, but I tried a variety of other chunk options and got similar discrepancies between the unchunked and chunked datasets.

Output:

results from unchunked dataset
290.13754
286.7869
results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600}
290.13757
286.81107
results from chunked dataset using 'auto'
290.1377
286.8118

Expected Output

Values output from queries of chunked and unchunked xarray dataset are equal.

Problem Description

I want to understand how to chunk or query the data so I can verify that data opened using chunks produces the same output as data opened without chunking. I would ultimately like to store the data in Zarr, but verifying data integrity is critical.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.1 | packaged by conda-forge | (default, Jan 5 2020, 20:58:18) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.154-99.181.amzn1.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.9.1
distributed: 2.9.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 44.0.0.post20200102
pip: 19.3.1
conda: None
pytest: None
IPython: 7.11.1
sphinx: None

@dmedv

dmedv commented Jan 12, 2020

Actually, that's true not just for open_mfdataset, but even for open_dataset with a single file. I've tried it with one of those files from PO.DAAC, and got similar results - slightly different values depending on the chunking strategy.

Just a guess, but I think the problem here is that the calculations are done in floating-point arithmetic (probably float32...), and you get accumulated precision errors depending on the number of chunks.
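
Just to illustrate that point (not from the thread; synthetic values, with dask used only for the demonstration): summing the same float32 values split into different chunks accumulates rounding error differently, so the means can disagree in the last digits.

import numpy as np
import dask.array as da

# One million synthetic float32 values; the exact mean depends on summation order.
values = np.random.default_rng(0).random(1_000_000).astype(np.float32)

mean_single_chunk = da.from_array(values, chunks=1_000_000).mean().compute()
mean_many_chunks = da.from_array(values, chunks=1_000).mean().compute()

print(mean_single_chunk, mean_many_chunks)  # may differ slightly in the last digits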

Internally, the NetCDF file stores the analysed_sst values as int16 together with scale and offset values, so the correct way to calculate the mean would be to do it on the original int16 data and then apply the scale and offset to the result. Automatic scaling is on by default (i.e. xarray replaces the original array values with new scaled values), but you can turn it off in open_dataset with the mask_and_scale=False option: http://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html I tried doing this, and then I got identical results with the chunked and unchunked versions. You can pass this option to open_mfdataset as well via **kwargs.
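
A rough sketch of that suggestion, reusing fileObjs and chunks from the MCVE above (names assumed from the original code):

import xarray as xr

# Open without automatic masking/scaling: analysed_sst keeps its on-disk int16 dtype,
# so the reduction is no longer sensitive to float32 rounding across chunks.
ds_raw = xr.open_mfdataset(fileObjs, combine='by_coords', chunks=chunks, mask_and_scale=False)
print(ds_raw.analysed_sst.dtype)  # int16

# Caveat: with mask_and_scale=False the _FillValue is NOT applied, so fill values
# must be masked manually before computing statistics (discussed further below).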

I'm basically just starting to use xarray myself, so please someone correct me if any of the above is wrong.

@abarciauskas-bgse
Author

@dmedv Thanks for this, it all makes sense to me and I see the same results; however, I wasn't able to "convert back" using scale_factor and add_offset:

from netCDF4 import Dataset
import xarray as xr

d = Dataset(fileObjs[0])
v = d.variables['analysed_sst']

print("Result with mask_and_scale=True")
ds_unchunked = xr.open_dataset(fileObjs[0])
print(ds_unchunked.analysed_sst.sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print("Result with mask_and_scale=False")
ds_unchunked = xr.open_dataset(fileObjs[0], mask_and_scale=False)
scaled = ds_unchunked.analysed_sst * v.scale_factor + v.add_offset
print(scaled.sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

^^ That returns a different result than I expect. I wonder if this is because the _FillValue is not handled when converting back.

However this led me to another seemingly related issue: #2304

Loss of precision seems to be the key here: coercing the float32s to float64s appears to give the same results from both the chunked and unchunked versions - though still not identical to the float32 results above.

print("results from unchunked dataset")
ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords')
ds_unchunked['analysed_sst'] = ds_unchunked['analysed_sst'].astype(np.float64)
print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print(f"results from chunked dataset using {chunks}")
ds_chunked = xr.open_mfdataset(fileObjs, chunks=chunks, combine='by_coords')
ds_chunked['analysed_sst'] = ds_chunked['analysed_sst'].astype(np.float64)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print("results from chunked dataset using 'auto'")
ds_chunked = xr.open_mfdataset(fileObjs, chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'}, combine='by_coords')
ds_chunked['analysed_sst'] = ds_chunked['analysed_sst'].astype(np.float64)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

returns:

results from unchunked dataset
290.1375818862207
results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600}
290.1375818862207
results from chunked dataset using 'auto'
290.1375818862207

@dmedv

dmedv commented Jan 12, 2020

@abarciauskas-bgse Yes, indeed, I forgot about _FillValue. That would mess up the mean calculation with mask_and_scale=False. I think it would be nice if it were possible to control the mask application in open_dataset separately from scale/offset.

@rabernat
Contributor

Thanks for the useful issue @abarciauskas-bgse and valuable test @dmedv.

I believe this is fundamentally a Dask issue. In general, Dask's algorithms do not guarantee numerically identical results for different chunk sizes. Roundoff errors accrue slightly differently based on how the array is split up. These errors are usually acceptable to users. For example, for 290.13754 vs. 290.13757 the error is in the 8th significant digit, about 1 part in 10,000,000. Since there are only 65,536 16-bit integers (the original data type in the netCDF file), this seems more than adequate precision to me.

Calling .mean() on a dask array is not the same as a checksum. As with all numerical calculations, equality should be verified with a precision appropriate to the data type and algorithm, e.g. using assert_allclose.
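
For example (a sketch using the dataset and slice names from the MCVE above; the tolerance here is just illustrative and should be chosen for your data type and algorithm):

import numpy as np
import xarray as xr

# Assert that the chunked and unchunked reductions agree to a relative tolerance.
np.testing.assert_allclose(
    ds_unchunked.analysed_sst[1, :, :].sel(lat=slice(20, 50), lon=slice(-170, -110)).mean().values,
    ds_chunked.analysed_sst[1, :, :].sel(lat=slice(20, 50), lon=slice(-170, -110)).mean().values,
    rtol=1e-6,
)

# Or compare whole DataArrays element-wise to a tolerance.
xr.testing.assert_allclose(ds_unchunked.analysed_sst, ds_chunked.analysed_sst, rtol=1e-5)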

There appears to be a second issue here related to fill values, but I haven't quite grasped whether we think there is a bug.

I think it would be nice if it were possible to control the mask application in open_dataset separately from scale/offset.

There may be a reason why these operations are coupled. Would have to look more closely at the code to know for sure.

@dmedv
Copy link

dmedv commented Jan 12, 2020

Actually, there is no need to separate them. One can simply do something like this to apply the mask:

ds.analysed_sst.where(ds.analysed_sst != fill_value).mean() * scale_factor + offset

It's not a bug, but if we set mask_and_scale=False, it's left up to us to apply the mask manually.
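
Something like this, for example (a sketch reusing fileObjs from the MCVE; with mask_and_scale=False the fill value, scale factor, and offset should remain in the variable's attrs, otherwise read them from the netCDF variable directly as in the earlier comment):

import xarray as xr

ds_raw = xr.open_dataset(fileObjs[0], mask_and_scale=False)
sst = ds_raw.analysed_sst

fill_value = sst.attrs['_FillValue']
scale_factor = sst.attrs['scale_factor']
add_offset = sst.attrs['add_offset']

# Mask out the fill values first, then scale the reduced result back to physical units.
masked_mean = sst.where(sst != fill_value).mean() * scale_factor + add_offset
print(masked_mean.values)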

@abarciauskas-bgse
Author

Thanks @rabernat. I would like to use assert_allclose to test the output, but at first pass it seems that might be prohibitively slow for large datasets. Do you recommend sampling or other good testing strategies (e.g. to assert that the xarray datasets are equal to some precision)?

@abarciauskas-bgse
Author

Closing, as using mask_and_scale=False produced precise results.
