Long term solution for limited permissible dates due to np.datetime64 nanosecond resolution #98
It seems that this will likely not change anytime soon in pandas. In the meantime, their suggested workaround in the docs is to use a PeriodIndex.

Xarray's Workaround

As we know, xarray doesn't officially use the pandas-suggested workaround; instead, for out-of-range dates, it falls back to netcdftime datetime objects:

In [1]: import xarray as xr
In [2]: da = xr.DataArray([15, 45, 74], coords=[[15, 45, 74]], dims=['time'], name='a')
In [3]: da['time'].attrs['units'] = 'days since 0001-01-01'
In [4]: da = da.to_dataset()
In [5]: da = xr.decode_cf(da)
//anaconda/lib/python2.7/site-packages/xarray/conventions.py:377: RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
result = decode_cf_datetime(example_value, units, calendar)
In [6]: da.groupby('time.month').mean('time')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-b8680c841459> in <module>()
----> 1 da.groupby('time.month').mean('time')
//anaconda/lib/python2.7/site-packages/xarray/core/common.pyc in groupby(self, group, squeeze)
342 """
343 if isinstance(group, basestring):
--> 344 group = self[group]
345 return self.groupby_cls(self, group, squeeze=squeeze)
346
//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in __getitem__(self, key)
518
519 if hashable(key):
--> 520 return self._construct_dataarray(key)
521 else:
522 return self._copy_listed(np.asarray(key))
//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in _construct_dataarray(self, name)
467 variable = self._variables[name]
468 except KeyError:
--> 469 _, name, variable = _get_virtual_variable(self._variables, name)
470
471 coords = OrderedDict()
//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in _get_virtual_variable(variables, key)
60 data = seasons[(month // 3) % 4]
61 else:
---> 62 data = getattr(date, var_name)
63 return ref_name, var_name, Variable(ref_var.dims, data)
64
AttributeError: 'Index' object has no attribute 'month'

This is critical in our use case, because we would like to be able to compute monthly and seasonal averages. That said, subsetting data using time slices works great under the netcdftime-based index:

In [7]: from netcdftime._datetime import datetime
In [8]: da.sel(time=slice(datetime(1, 1, 1), datetime(1, 2, 1)))
Out[8]:
<xarray.Dataset>
Dimensions: (time: 1)
Coordinates:
* time (time) object 1-01-16 00:00:00
Data variables:
a (time) int64 15

Serialization fails when a Dataset is indexed by netcdftime datetimes:

In [9]: da.to_netcdf('test_netcdftime.nc')
//anaconda/lib/python2.7/site-packages/xarray/conventions.py:396: RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
calendar=self.calendar)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-9-efe2b453be6e> in <module>()
----> 1 da.to_netcdf('test_netcdftime.nc')
//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in to_netcdf(self, path, mode, format, group, engine, encoding)
780 from ..backends.api import to_netcdf
781 return to_netcdf(self, path, mode, format=format, group=group,
--> 782 engine=engine, encoding=encoding)
783
784 dump = utils.function_alias(to_netcdf, 'dump')
//anaconda/lib/python2.7/site-packages/xarray/backends/api.pyc in to_netcdf(dataset, path, mode, format, group, engine, writer, encoding)
352 store = store_cls(path, mode, format, group, writer)
353 try:
--> 354 dataset.dump_to_store(store, sync=sync, encoding=encoding)
355 if isinstance(path, BytesIO):
356 return path.getvalue()
//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in dump_to_store(self, store, encoder, sync, encoding)
726 variables, attrs = encoder(variables, attrs)
727
--> 728 store.store(variables, attrs, check_encoding)
729 if sync:
730 store.sync()
//anaconda/lib/python2.7/site-packages/xarray/backends/common.pyc in store(self, variables, attributes, check_encoding_set)
230 # All NetCDF files get CF encoded by default, without this attempting
231 # to write times, for example, would fail.
--> 232 cf_variables, cf_attrs = cf_encoder(variables, attributes)
233 AbstractWritableDataStore.store(self, cf_variables, cf_attrs,
234 check_encoding_set)
//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in cf_encoder(variables, attributes)
1058 """
1059 new_vars = OrderedDict((k, encode_cf_variable(v, name=k))
-> 1060 for k, v in iteritems(variables))
1061 return new_vars, attributes
//anaconda/python.app/Contents/lib/python2.7/collections.pyc in __init__(*args, **kwds)
67 root[:] = [root, root, None]
68 self.__map = {}
---> 69 self.__update(*args, **kwds)
70
71 def __setitem__(self, key, value, dict_setitem=dict.__setitem__):
//anaconda/python.app/Contents/lib/python2.7/_abcoll.pyc in update(*args, **kwds)
569 self[key] = other[key]
570 else:
--> 571 for key, value in other:
572 self[key] = value
573 for key, value in kwds.items():
//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in <genexpr>((k, v))
1058 """
1059 new_vars = OrderedDict((k, encode_cf_variable(v, name=k))
-> 1060 for k, v in iteritems(variables))
1061 return new_vars, attributes
//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in encode_cf_variable(var, needs_copy, name)
712 var, needs_copy = maybe_encode_offset_and_scale(var, needs_copy)
713 var, needs_copy = maybe_encode_fill_value(var, needs_copy)
--> 714 var = maybe_encode_dtype(var, name)
715 var = maybe_encode_bools(var)
716 var = ensure_dtype_not_object(var)
//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in maybe_encode_dtype(var, name)
621 'any _FillValue to use for NaNs' % name,
622 RuntimeWarning, stacklevel=3)
--> 623 data = ops.around(data)[...]
624 if dtype == 'S1' and data.dtype != 'S1':
625 data = string_to_char(np.asarray(data, 'S'))
//anaconda/lib/python2.7/site-packages/xarray/core/ops.pyc in f(*args, **kwargs)
62 else:
63 module = eager_module
---> 64 return getattr(module, name)(*args, **kwargs)
65 else:
66 def f(data, *args, **kwargs):
//anaconda/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in around(a, decimals, out)
2766 except AttributeError:
2767 return _wrapit(a, 'round', decimals, out)
-> 2768 return round(decimals, out)
2769
2770
AttributeError: 'netcdftime._datetime.datetime' object has no attribute 'rint'

Using a PeriodIndex in xarray

This raises the question: what happens if we use the pandas-recommended workaround in xarray? Does the groupby issue persist? Interestingly, the answer is no -- we can indeed do time-related groupby operations on dates outside the Timestamp-valid range using a PeriodIndex in xarray:

In [1]: import xarray as xr
In [2]: import pandas as pd
In [3]: da = xr.DataArray([1, 3, 2, 5], coords=[[15, 16, 45, 74]], dims=['time'], name='a')
In [4]: da['time'].attrs['units'] = 'days since 0001-01-01'
In [5]: da = da.to_dataset()
In [6]: da = xr.decode_cf(da)
//anaconda/lib/python2.7/site-packages/xarray/conventions.py:377: RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
result = decode_cf_datetime(example_value, units, calendar)
In [7]: periods = []
In [8]: for date in da['time']:
...: raw_date = date.values.item()
...: periods.append(pd.Period(year=raw_date.year, month=raw_date.month, day=raw_date.day, freq='D'))
...:
In [9]: da['time'] = pd.PeriodIndex(periods)
In [10]: da.groupby('time.month').mean('time')
Out[10]:
<xarray.Dataset>
Dimensions: (month: 3)
Coordinates:
* month (month) int64 1 2 3
Data variables:
a (month) float64 2.0 2.0 5.0
Critically though, subsetting based on time slices no longer works.

In [11]: da.sel(time=slice(pd.Period(year=1, month=1, day=1, freq='D'), pd.Period(year=1, month=2, day=1, freq='D')))
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-11-123d7c0d656c> in <module>()
----> 1 da.sel(time=slice(pd.Period(year=1, month=1, day=1, freq='D'), pd.Period(year=1, month=2, day=1, freq='D')))
//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in sel(self, method, tolerance, **indexers)
967 """
968 pos_indexers, new_indexes = indexing.remap_label_indexers(
--> 969 self, indexers, method=method, tolerance=tolerance
970 )
971 return self.isel(**pos_indexers)._replace_indexes(new_indexes)
//anaconda/lib/python2.7/site-packages/xarray/core/indexing.pyc in remap_label_indexers(data_obj, indexers, method, tolerance)
227 index = data_obj[dim].to_index()
228 idxr, new_idx = convert_label_indexer(index, label,
--> 229 dim, method, tolerance)
230 pos_indexers[dim] = idxr
231 if new_idx is not None:
//anaconda/lib/python2.7/site-packages/xarray/core/indexing.pyc in convert_label_indexer(index, label, index_name, method, tolerance)
169 indexer = index.slice_indexer(_try_get_item(label.start),
170 _try_get_item(label.stop),
--> 171 _try_get_item(label.step))
172 if not isinstance(indexer, slice):
173 # unlike pandas, in xarray we never want to silently convert a slice
//anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in slice_indexer(self, start, end, step, kind)
2783 """
2784 start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 2785 kind=kind)
2786
2787 # return a slice
//anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in slice_locs(self, start, end, step, kind)
2962 start_slice = None
2963 if start is not None:
-> 2964 start_slice = self.get_slice_bound(start, 'left', kind)
2965 if start_slice is None:
2966 start_slice = 0
//anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_slice_bound(self, label, side, kind)
2901 # For datetime indices label may be a string that has to be converted
2902 # to datetime boundary according to its resolution.
-> 2903 label = self._maybe_cast_slice_bound(label, side, kind)
2904
2905 # we need to look up the label
//anaconda/lib/python2.7/site-packages/pandas/tseries/period.pyc in _maybe_cast_slice_bound(self, label, side, kind)
724
725 """
--> 726 assert kind in ['ix', 'loc', 'getitem']
727
728 if isinstance(label, datetime):
AssertionError:

Serialization also fails for Datasets indexed by a PeriodIndex:

In [12]: da.to_netcdf('test_periodindex.nc')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-318efb0010c7> in <module>()
----> 1 da.to_netcdf('test_periodindex.nc')
//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in to_netcdf(self, path, mode, format, group, engine, encoding)
780 from ..backends.api import to_netcdf
781 return to_netcdf(self, path, mode, format=format, group=group,
--> 782 engine=engine, encoding=encoding)
783
784 dump = utils.function_alias(to_netcdf, 'dump')
//anaconda/lib/python2.7/site-packages/xarray/backends/api.pyc in to_netcdf(dataset, path, mode, format, group, engine, writer, encoding)
352 store = store_cls(path, mode, format, group, writer)
353 try:
--> 354 dataset.dump_to_store(store, sync=sync, encoding=encoding)
355 if isinstance(path, BytesIO):
356 return path.getvalue()
//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in dump_to_store(self, store, encoder, sync, encoding)
726 variables, attrs = encoder(variables, attrs)
727
--> 728 store.store(variables, attrs, check_encoding)
729 if sync:
730 store.sync()
//anaconda/lib/python2.7/site-packages/xarray/backends/common.pyc in store(self, variables, attributes, check_encoding_set)
230 # All NetCDF files get CF encoded by default, without this attempting
231 # to write times, for example, would fail.
--> 232 cf_variables, cf_attrs = cf_encoder(variables, attributes)
233 AbstractWritableDataStore.store(self, cf_variables, cf_attrs,
234 check_encoding_set)
//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in cf_encoder(variables, attributes)
1058 """
1059 new_vars = OrderedDict((k, encode_cf_variable(v, name=k))
-> 1060 for k, v in iteritems(variables))
1061 return new_vars, attributes
//anaconda/python.app/Contents/lib/python2.7/collections.pyc in __init__(*args, **kwds)
67 root[:] = [root, root, None]
68 self.__map = {}
---> 69 self.__update(*args, **kwds)
70
71 def __setitem__(self, key, value, dict_setitem=dict.__setitem__):
//anaconda/python.app/Contents/lib/python2.7/_abcoll.pyc in update(*args, **kwds)
569 self[key] = other[key]
570 else:
--> 571 for key, value in other:
572 self[key] = value
573 for key, value in kwds.items():
//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in <genexpr>((k, v))
1058 """
1059 new_vars = OrderedDict((k, encode_cf_variable(v, name=k))
-> 1060 for k, v in iteritems(variables))
1061 return new_vars, attributes
//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in encode_cf_variable(var, needs_copy, name)
714 var = maybe_encode_dtype(var, name)
715 var = maybe_encode_bools(var)
--> 716 var = ensure_dtype_not_object(var)
717 return var
718
//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in ensure_dtype_not_object(var)
683 data[missing] = fill_value
684 else:
--> 685 data = data.astype(dtype=_infer_dtype(data))
686 var = Variable(dims, data, attrs, encoding)
687 return var
//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in _infer_dtype(array)
653 dtype = np.dtype(dtype.kind)
654 elif dtype.kind == 'O':
--> 655 raise ValueError('unable to infer dtype; xarray cannot '
656 'serialize arbitrary Python objects')
657 return dtype
ValueError: unable to infer dtype; xarray cannot serialize arbitrary Python objects

The constructor for pd.Period is also a little flaky for unusual dates (for instance, below it parses the year as 0016 when it should be 0001), which makes me a bit wary:

In [13]: pd.Period('0001-01-16')
Out[13]: Period('0016-01-01', 'D')

One can work around this, however, by being more explicit in the constructor (as I've done earlier):

In [14]: pd.Period(year=1, month=1, day=16, freq='D')
Out[14]: Period('0001-01-16', 'D')

Summary

To conclude, I do not believe there is an existing workaround, other than what we have currently been doing (i.e. offsetting dates to fall within the pandas Timestamp-valid range), that grants us both time-related groupby operations AND date-range subsetting. I think the lowest-hanging fruit upstream that would enable a cleaner workaround might be fixing time subsetting with a PeriodIndex. That being said, on the topic of upstream fixes I would prefer the
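For reference, the date-offsetting workaround mentioned in the summary can be sketched in plain pandas. This is only an illustration of the idea, not aospy's actual implementation; the offset value and variable names here are assumptions:

```python
import pandas as pd

# Hypothetical sketch: shift all years by a fixed offset so dates land
# inside the Timestamp-valid range (roughly 1677-2262), then undo the
# shift when labeling results with the true years.
YEAR_OFFSET = 1900  # an assumed choice; any shift into the valid range works

true_years = [1, 2, 3]
shifted = pd.to_datetime(['%04d-01-01' % (y + YEAR_OFFSET) for y in true_years])
recovered = shifted.year - YEAR_OFFSET
print(list(recovered))  # [1, 2, 3]
```

The cost of this approach is the bookkeeping: every label on the output carries the offset until it is undone, and leap-year alignment depends on the offset chosen.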
@spencerkclark, thank you! That is extremely useful. I agree with you. My hope is that one outcome of the upcoming "other aospy" workshop will be meaningful progress on this issue at the xarray level. So more wait-and-see.
An alternative to our existing workaround that comes to mind is that we could combine these two approaches; we could subset in time first (using the netcdftime datetime index), and then convert the time coordinate to a PeriodIndex for computations that need groupby.
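The combined idea can be sketched in plain pandas. The `Date` namedtuple below is a hypothetical stand-in for decoded netcdftime datetimes (anything exposing year/month/day attributes); this is not working xarray code:

```python
import pandas as pd
from collections import namedtuple

# Stand-in for decoded netcdftime datetimes.
Date = namedtuple('Date', ['year', 'month', 'day'])
dates = [Date(1, 1, 15), Date(1, 1, 16), Date(1, 2, 14), Date(1, 3, 15)]
values = [1, 3, 2, 5]

# Step 1 (not shown): subset in time while the index is still
# datetime-like.  Step 2: convert to a PeriodIndex so month-based
# grouping works even for dates outside the Timestamp-valid range.
idx = pd.PeriodIndex([pd.Period(year=d.year, month=d.month, day=d.day, freq='D')
                      for d in dates])
s = pd.Series(values, index=idx)
print(s.groupby(idx.month).mean())
```

Mirroring the In [10] example above, the January values average to 2.0 while February and March keep their single values.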
Very intrigued by this. So then would there be a post-calculation step that converts back from PeriodIndexes to the original type? Some of the functions I use rely on groupby internally, so for them to work the conversion would need to occur before passing the data to the user's functions. Other things that come to mind: Can we assume that groupby operations will work identically for PeriodIndexes as for the other index types? Less importantly (since we can easily modify our own code), is there code of ours within aospy (i.e. not aospy-obj-lib functions) that will break or behave differently with PeriodIndexes? If this did work, it seems like something that xarray might even consider adopting into their code base.
Upon a little more tinkering, it seems that one cannot serialize a Dataset indexed by either netcdftime datetimes or a PeriodIndex (see the tracebacks above).
As far as I can tell, yes (since, like DatetimeIndexes, PeriodIndexes have year, month, and day attributes).
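A quick check of that claim (a minimal sketch; nothing here is aospy-specific):

```python
import pandas as pd

# A Period far outside the Timestamp-valid range still exposes the
# year/month/day attributes that time-related groupby relies on.
p = pd.Period(year=1, month=3, day=15, freq='D')
print(p.year, p.month, p.day)  # 1 3 15
```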
The only thing I'm aware of at the moment is serialization (which is, admittedly, kind of important), but maybe we can find a workaround.
What are the prospects of upstream fixes on the netCDF4 side, either to (1) the lack of groupby support or (2) the inability to serialize?
So, if I'm understanding right, the procedure would be:

1. Subset the data in time using the netcdftime datetime index.
2. Store a copy of the raw time coordinate.
3. Swap in a PeriodIndex and perform the computations (including any groupby operations).
4. Restore the raw time coordinate before serializing.
Is that right? One potential snag is that, conceivably, a user's function could involve slicing on the time array (which, for out-of-range dates, requires datetime-like objects), groupby operations on the time array (which requires Periods), or both (for which there is no good solution). But I'm not sure how much this matters.
Contributors to netCDF4, xarray, and pandas would know better than me, but...
Yes, that's exactly the procedure I had in mind -- and if serialization gets fixed, we won't have to worry about copying / storing the raw time vector.
There is a messy workaround to this (though really not ideal...), which is to carry all three representations of the time coordinate with every DataArray, and use each for its appropriate purpose.
Ha, that's true! We'll keep this in our back pocket as our "nuclear option".
I guess what I'm wondering is: what is it about the netCDF datetime objects that precludes groupby operations? The workshop this weekend should shed light on this.
E.g. #96. This may be a wait-and-see for upstream solutions (at the xarray, pandas, or numpy level), but I wanted to have an issue open where we can track this and discuss alternatives.
Our current workarounds limit us to <585 years of data, which for many folks in the community would be wholly insufficient.
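The <585-year figure follows from the int64 nanosecond counter behind np.datetime64; a quick back-of-the-envelope check:

```python
import pandas as pd

# 2**63 nanoseconds on either side of the 1970 epoch is roughly
# 292 years each way, i.e. a bit over 584 years in total.
span_years = 2 * 2**63 / (1e9 * 60 * 60 * 24 * 365.25)
print(span_years)

# pandas exposes the resulting bounds directly:
print(pd.Timestamp.min)  # 1677-09-21 ...
print(pd.Timestamp.max)  # 2262-04-11 ...
```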