
assign_coords with datetime64[us] changes dtype to datetime64[ns] #4427

Closed
andrewpauling opened this issue Sep 16, 2020 · 3 comments · Fixed by #4454
@andrewpauling
Contributor

What happened:
When using xr.DataArray.assign_coords() to assign an array with dtype datetime64[us] as a new coordinate on the time dimension, the coordinate ends up with dtype datetime64[ns] after assignment. This results in the wrong dates, since the dates I am using lie outside the valid range for nanosecond precision.

What you expected to happen:
Preserve the dtype of the array when assigning it as a coordinate.

Minimal Complete Verifiable Example:

import numpy as np
import xarray as xr
import cftime

tmp = np.random.random(12)

da = xr.DataArray(tmp, dims='time')

times = []

for mth in np.arange(1, 13):
    times.append(cftime.DatetimeNoLeap(1250, mth, 1))

times64 = np.array([np.datetime64(t, 'us') for t in times])

da = da.assign_coords({'time': times64})

which gives for the original array:

In [49]: times64
Out[49]: 
array(['1250-01-01T00:00:00.000000', '1250-02-01T00:00:00.000000',
       '1250-03-01T00:00:00.000000', '1250-04-01T00:00:00.000000',
       '1250-05-01T00:00:00.000000', '1250-06-01T00:00:00.000000',
       '1250-07-01T00:00:00.000000', '1250-08-01T00:00:00.000000',
       '1250-09-01T00:00:00.000000', '1250-10-01T00:00:00.000000',
       '1250-11-01T00:00:00.000000', '1250-12-01T00:00:00.000000'],
      dtype='datetime64[us]')

and for the array after assigning:

In [51]: da.time
Out[51]: 
<xarray.DataArray 'time' (time: 12)>
array(['1834-07-22T23:34:33.709551616', '1834-08-22T23:34:33.709551616',
       '1834-09-19T23:34:33.709551616', '1834-10-20T23:34:33.709551616',
       '1834-11-19T23:34:33.709551616', '1834-12-20T23:34:33.709551616',
       '1835-01-19T23:34:33.709551616', '1835-02-19T23:34:33.709551616',
       '1835-03-22T23:34:33.709551616', '1835-04-21T23:34:33.709551616',
       '1835-05-22T23:34:33.709551616', '1835-06-21T23:34:33.709551616'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1834-07-22T23:34:33.709551616 ... 1835-06-...

Anything else we need to know?:

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:37:09)
[Clang 10.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.16.0
pandas: 1.1.0
numpy: 1.19.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: installed
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.21.0
distributed: 2.22.0
matplotlib: 3.1.2
cartopy: 0.17.0
seaborn: None
numbagg: None
pint: None
setuptools: 49.3.1.post20200810
pip: 20.2.2
conda: None
pytest: None
IPython: 7.17.0
sphinx: 3.2.0

@spencerkclark
Member

Thanks @andrewpauling -- I do think there's a bug here, but this issue happens to be more complicated than it might seem on the surface :).

Xarray standardizes on nanosecond precision for np.datetime64 dtypes, and casts any NumPy array of dtype datetime64 to nanosecond precision. This is mainly motivated by pandas, which requires nanosecond precision and which xarray relies on for time indexing and other time-related operations through things like pandas.DatetimeIndex or the pandas.Series.dt accessor. As you've noted, this is unfortunate since it limits the supported time range for np.datetime64 types (see, e.g., the discussion in #789).
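The garbled dates in the example come from unchecked int64 overflow when NumPy converts between datetime64 units; a minimal sketch (independent of xarray) of the same effect:

```python
import numpy as np

# Casting a datetime64[us] value outside the roughly 1678-2262 range
# supported by nanosecond precision silently overflows int64 rather
# than raising.
t = np.datetime64("1250-01-01", "us")
wrapped = t.astype("datetime64[ns]")

print(t)        # 1250-01-01T00:00:00.000000
print(wrapped)  # a wrapped-around date, nowhere near the year 1250
```

The wrapped value is off by a multiple of 2**64 nanoseconds (about 584 years), which is why the issue's output shows dates in 1834-1835.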

Fully addressing this would be a challenge (we've discussed it at times in the past). The conclusion was that cftime dates would be used for dates outside the representable range, and that over time we would build up infrastructure to enable, with cftime objects, some of the nice things you can do with np.datetime64 types. That functionality now largely exists, and a nice benefit of going through cftime is that we also gain compatibility with non-standard calendar types, e.g. DatetimeNoLeap. I encourage you to try to take advantage of that, and please let us know if there is something missing that you would like to see implemented or improved!

This is a long way of saying, without a fair amount of work (i.e. addressing this issue upstream in pandas) xarray is unlikely to relax its approach for the precision of np.datetime64 dtypes, and will continue casting to nanosecond precision.

However, the fact that your example silently produces nonsensical times should be considered a bug. Instead, following pandas, I would argue we should raise an error if the dates cannot be represented with nanosecond precision.

@andrewpauling
Contributor Author

Thanks for the response, @spencerkclark; that makes sense. I have been able to get what I needed working using cftime, it just seemed strange to me that xarray behaved as it did with datetime64.

I agree it would be nice if an error were raised when dates can't be represented. Would this be difficult to implement? I have been hoping to contribute to some open-source projects, so if it's not too complex I'd be happy to tackle it, and any advice on where to start with this problem would be great.

@spencerkclark
Member

spencerkclark commented Sep 21, 2020

That would be great @andrewpauling! I think this is the relevant code in xarray:

if isinstance(data, np.ndarray):
    if data.dtype.kind == "O":
        data = _possibly_convert_objects(data)
    elif data.dtype.kind == "M":
        data = np.asarray(data, "datetime64[ns]")
    elif data.dtype.kind == "m":
        data = np.asarray(data, "timedelta64[ns]")

Arguably we could use the _possibly_convert_objects function on datetime64 and timedelta64 data as well; you'll see it goes through a pandas.Series to do the casting, which has built-in logic to check that the values can be represented with nanosecond precision. But it's up to you how you ultimately want to go about things.
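As a rough illustration of the kind of bounds check involved, here is a hypothetical standalone helper (not xarray's or pandas' actual implementation) that validates the values against the int64 nanosecond range before casting:

```python
import numpy as np

def cast_us_to_ns(values):
    """Cast a datetime64[us] array to datetime64[ns], raising if any
    value cannot be represented with nanosecond precision.

    Hypothetical sketch only; the real fix would route through pandas,
    which performs an equivalent check when casting.
    """
    i64 = np.iinfo(np.int64)
    # Work with Python ints (arbitrary precision) so the check itself
    # cannot hit the overflow we are trying to detect.
    for v in values.astype("int64").tolist():
        if not (i64.min <= v * 1000 <= i64.max):
            raise OverflowError(
                f"{np.datetime64(v, 'us')} cannot be represented "
                "with nanosecond precision"
            )
    return values.astype("datetime64[ns]")
```

With this helper, the year-1250 dates from the original example would raise an OverflowError instead of silently wrapping around.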

I agree this casting behavior is a bit surprising. If we wanted to be a little more transparent, we could also warn when attempting to cast non-nanosecond-precision datetime64 data to nanosecond precision. I'm not sure what others think; I know pandas doesn't do this, but it could be friendlier for users.
