Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need better user control of _FillValue attribute in NetCDF files #1598

Closed
mmartini-usgs opened this issue Sep 28, 2017 · 9 comments
Closed

Comments

@mmartini-usgs
Copy link

This issue is under discussion here: #1165

It is not desirable for us to have _FillValue = NaN for dimensions and coordinate variables.

In trying to use xarray, _FillValue was carefully kept from these variables and dimensions during the creation of the un-resampled file and then were found to appear during the to_netcdf operation. This happens in spite of mask_and_scale=False is being used with xr.open_dataset

I would hope that downstream code would have trouble with coordinates that don't make logical sense (time or place being NaN, for instance). We would prefer NOT to instantiate coordinate variable data with any fill value. Keeping NaNs out of coordinate variables, dimensions and minima and maxima is part of our QA/QC process to avoid downstream issues.

@shoyer shoyer changed the title Need better user control of _FillValue Need better user control of _FillValue attribute in NetCDF files Sep 28, 2017
@shoyer
Copy link
Member

shoyer commented Sep 28, 2017

cc @thenaomig @laliberte

There are at least two ways to fix this:

  1. Support a flag of some sort in encoding (e.g., _FillValue = False) to indicate that fill value shouldn't be added. This would be easy to add, but is somewhat inelegant.
  2. Check for the presence of NaNs before setting _FillValue = NaN. This would be easy to add for dimension coordinates because they are already guaranteed to be in memory, but could cause performance trouble if any inputs are loaded as dask arrays. I don't know a satisfactory way to handle dask arrays with our current design, since we don't want to add another pass over the data to check for NaNs. I suppose one option would be to refactor our backend classes to write data before writing attributes and then make some sort of dask array operation that checks for NaNs as the data is written. But I'm not even sure this would work with the standard dask task schedulers.

@mmartini-usgs
Copy link
Author

There is also the philosophical problem of fill values for coordinate variables. To be true to reality, one really would want to add an interpolated value that fills whatever gap or bad value exists. That seems to be out of the scope of xarray though.

I'm fine with a flag that controls only the coordinate data. That said, for the rest of the variables, we avoid NaN in _FillValue. We use 1E35. So there you could give the user a choice in default fill value. It seems pythonic to give the user flexibility. And the minute you satisfy us, there will be another use case that comes along with conflicting requirements. So you could use a flag and make it the user's choice, and not xarray's concern.

It also depends on where in the process one cleans up one's data - reduce first, then QA/QC, or QA/QC first, then reduce. We do both, it depends on the instrument.

@shoyer
Copy link
Member

shoyer commented Sep 28, 2017

There is also the philosophical problem of fill values for coordinate variables.

Indeed, this is prohibited by CF conventions -- but xarray (like pandas) takes a more flexible approach here, allowing for missing values for all variables.

You can already specify an explicit choice for _FillValue, e.g., ds.to_netcdf(..., encoding={'my_variable': {'_FillValue': 1e35}}). Allowing {'_FillValue': False} to indicate that _FillValue should not be included would be a simple, easy fix, so we should probably do that regardless.

(There is no need worry about False conflicting with a legitimate fill value since netCDF does not have a boolean dtype.)

@jhamman
Copy link
Member

jhamman commented Sep 28, 2017

I actually think we should use None as the _FillValue sentinel value. We do (sort of) support boolean arrays (#849).

@shoyer
Copy link
Member

shoyer commented Sep 28, 2017 via email

@dnowacki-usgs
Copy link
Contributor

Allowing {'_FillValue': False} to indicate that _FillValue should not be included would be a simple, easy fix, so we should probably do that regardless.

Correct me if you're talking about something different, but xarray already supports setting _FillValue to False to turn off filling. (Is there any use case where filling remains on but without a valid _FillValue?) For example, I have a netCDF processing routine using xarray. In the code I have the line for the lon dimension:

ds.lon.encoding['_FillValue'] = False

Which, for the relevant dimension, yields in ncinfo:

$ ncinfo -v lon yesfalse.nc 
<type 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
    units: degrees_east
    long_name: Longitude
    epic_code: 502
unlimited dimensions: 
current shape = (1,)
filling off

If I comment out that line in my processing routine, I get the following:

$ ncinfo -v lon nofalse.nc 
<type 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
    _FillValue: nan
    units: degrees_east
    long_name: Longitude
    epic_code: 502
unlimited dimensions: 
current shape = (1,)
filling on

I agree that changing from False to None does make better semantic sense.

@jhamman
Copy link
Member

jhamman commented Sep 29, 2017

@dnowacki-usgs - you've made a good point. At least for the netCDF4 backend, this seems to work out of the box with None/False. Can someone check that this works for the scipy/h5netcdf backends?

https://github.com/Unidata/netcdf4-python/blob/366debfff8b0bc53999c9e1ce9f4818bf7cf079a/netCDF4/_netCDF4.pyx#L3455-L3457

@dnowacki-usgs
Copy link
Contributor

dnowacki-usgs commented Sep 29, 2017

@jhamman In brief, it's weird.

Engine encoding['_FillValue'] = False Do nothing
netCDF4 Filling off Filling on
scipy Filling off Filling off
h5netcdf Filling on Filling off

So, this is some peculiar behavior. Setting _FillValue to False works for netCDF4 (as we have seen), has no effect using the scipy engine, and seems to do the opposite of intended for h5netcdf.

Code below:

import xarray as xr
import numpy as np
import pandas as pd

ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
                 coords={'x': [10, 20, 30, 40],
                         'y': pd.date_range('2000-01-01', periods=5),
                        'z': ('x', list('abcd'))})

ds.to_netcdf('notset_scipy.nc', engine='scipy')
ds.to_netcdf('notset_netcdf4.nc', engine='netcdf4')
ds.to_netcdf('notset_h5netcdf.nc', engine='h5netcdf')

ds.y.encoding['_FillValue'] = False

ds.to_netcdf('False_scipy.nc', engine='scipy')
ds.to_netcdf('False_netcdf4.nc', engine='netcdf4')
ds.to_netcdf('False_h5netcdf.nc', engine='h5netcdf')

netCDF4

$ ncinfo -v y notset_netcdf4.nc 
<type 'netCDF4._netCDF4.Variable'>
int64 y(y)
    units: days since 2000-01-01 00:00:00
    calendar: proleptic_gregorian
unlimited dimensions: 
current shape = (5,)
filling on, default _FillValue of -9223372036854775806 used
$ ncinfo -v y False_netcdf4.nc 
<type 'netCDF4._netCDF4.Variable'>
int64 y(y)
    units: days since 2000-01-01 00:00:00
    calendar: proleptic_gregorian
unlimited dimensions: 
current shape = (5,)
filling off

scipy

$ ncinfo -v y notset_scipy.nc 
<type 'netCDF4._netCDF4.Variable'>
int32 y(y)
    units: days since 2000-01-01 00:00:00
    calendar: proleptic_gregorian
unlimited dimensions: 
current shape = (5,)
filling off
$ ncinfo -v y False_scipy.nc 
<type 'netCDF4._netCDF4.Variable'>
int32 y(y)
    units: days since 2000-01-01 00:00:00
    calendar: proleptic_gregorian
    _FillValue: 0
unlimited dimensions: 
current shape = (5,)
filling off

h5netcdf

$ ncinfo -v y notset_h5netcdf.nc 
<type 'netCDF4._netCDF4.Variable'>
int64 y(y)
    units: days since 2000-01-01 00:00:00
    calendar: proleptic_gregorian
unlimited dimensions: 
current shape = (5,)
filling off
$ ncinfo -v y False_h5netcdf.nc 
<type 'netCDF4._netCDF4.Variable'>
int64 y(y)
    _FillValue: 0
    units: days since 2000-01-01 00:00:00
    calendar: proleptic_gregorian
unlimited dimensions: 
current shape = (5,)
filling on

@shoyer
Copy link
Member

shoyer commented Sep 29, 2017

It sounds like we should control this in xarray to ensure consistent behavior.

tsjackson-noaa added a commit to tsjackson-noaa/MDTF-diagnostics that referenced this issue Jan 18, 2021
First condition: unset _FillValue attribute for all independent
variables (coordinates and their bounds) as per CF convention but
contrary to xarray default; see
pydata/xarray#1598.

Second condition: 'NaN' not a valid _FillValue in NCL for any
variable; see
https://www.ncl.ucar.edu/Support/talk_archives/2012/1689.html
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants