
How do I copy my array forwards in time? #2547

Closed
tommylees112 opened this issue Nov 6, 2018 · 14 comments
Labels
needs mcve https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports topic-CF conventions

Comments

@tommylees112

I have this xr.Dataset:

<xarray.Dataset>
Dimensions:    (time: 1, x: 200, y: 200)
Coordinates:
  * time       (time) float64 9.505e+17
Dimensions without coordinates: x, y
Data variables:
    Rg         (time, y, x) float32 ...
    latitude   (y, x) float64 ...
    longitude  (y, x) float64 ...
    time       datetime64[ns] 2000-02-14
Attributes:
    Conventions:     CF-1.0
    content:         HARMONIZED WORLD SOIL DATABASE; first it was aggregated ...
    scaling_factor:  20

I want to copy it through time, adding the time dimension at a given daterange

Something like this:

times = pd.date_range('2000-01-01', '2000-12-31', name='time')

ds.time = times[0]

all_data = [ds]
for i, time in enumerate(times[1:]):
    ds_t1 = ds.copy()
    ds_t1.time = time

    all_data.append(ds_t1)
    ds = ds_t1

ds = xr.concat(all_data, dim='time')

So I should have output data like:

<xarray.Dataset>
Dimensions:    (time: 366, x: 200, y: 200)
Coordinates:
  * time       (time) float64 ...
Dimensions without coordinates: x, y
Data variables:
    Rg         (time, y, x) float32 ...
    latitude   (y, x) float64 ...
    longitude  (y, x) float64 ...
    time       datetime64[ns] 2000-02-14
Attributes:
    Conventions:     CF-1.0
    content:         HARMONIZED WORLD SOIL DATABASE; first it was aggregated ...
    scaling_factor:  20
@max-sixty
Collaborator

IIUC, this is actually much easier: reindex on your enlarged times index, and then use ffill. No need for the for loop, etc.
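A minimal sketch of this suggestion on a toy dataset (the variable name, sizes, and values are illustrative, not the original file):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the Rg dataset: a single timestamp on a small grid.
times = pd.date_range("2000-01-01", "2000-12-31", name="time")
ds = xr.Dataset(
    {"Rg": (("time", "y", "x"), np.ones((1, 2, 2), dtype="float32"))},
    coords={"time": [times[0]]},
)

# Enlarge the time index, then forward-fill the single original slice.
ds_full = ds.reindex(time=times).ffill("time")
print(ds_full.sizes["time"])  # 366 (2000 is a leap year)
```

Note that the original timestamp must be a member of the enlarged index, otherwise reindex finds no match and everything becomes NaN.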

@tommylees112
Author

tommylees112 commented Nov 6, 2018

Thanks for your help! You definitely understood me correctly!

This doesn't seem to work, as it fills my Rg arrays with NaN values.

ds = ds.reindex({"time":pd.date_range("2000-01-01","2000-12-31")})
ds = ds.ffill("time")

@max-sixty
Collaborator

Your original value needs to be in the time index in order to remain there after the reindex. Currently it looks like a float value:

Coordinates:
  * time       (time) float64 9.505e+17

If you have a repro example (link in issue template) it's easier to offer help on the whole issue

@tommylees112
Author

tommylees112 commented Nov 6, 2018

Data:
netcdf_files.zip

Code below:

import numpy as np
import pandas as pd
import xarray as xr
from shutil import copyfile
import os

data_dir = "./"
filename = "Rg_dummy.nc"

# get the datetime range
times = pd.date_range("2000-01-01", "2000-12-31", name="time")
var = "Rg"

copyfile(filename, "temp.nc")
ds = xr.open_dataset("temp.nc")
print("Temporary Data read to Python")

# FORWARD FILL FROM THE ORIGINAL DATA to new timesteps
ds.reindex({"time":times})
ds.ffill("time")

# ds.to_netcdf(filename, format="NETCDF3_CLASSIC")
# print(filename, "Written!")

# remove temporary file
os.remove(data_dir+"temp.nc")
print("Temporary Data Removed")

del ds

@max-sixty
Collaborator

Can you ensure times[0].item() == ds.time[0].item() prior to the reindex? Otherwise the reindex won't find the original index value and will fill with NaNs...

@tommylees112
Author

tommylees112 commented Nov 7, 2018

ds.time[0] won't let me set its value to a datetime. Instead it returns a float:

In [19]: ds.time[0].item()
Out[19]: 10509.0

And none of the following work:

# doesn't change the time value
ds.time[0].values = times[0]

# returns an error because I can't assign to a function call
ds.time[0].item() = times[0]

# returns ValueError: replacement data must match the Variable's shape
ds['time'].values = np.array(times[0])

Thanks for your help!

@spencerkclark
Member

@tommylees112 I had a look at your dataset and noticed that the time coordinate was not encoded in a way that allows xarray to automatically decode the time values into datetimes. Specifically, the units attribute is not in the format of some unit of time (e.g. 'seconds') since a given reference date.

$ ncdump -h Rg_dummy.nc
netcdf Rg_dummy {
dimensions:
	time = 1 ;
	y = 200 ;
	x = 200 ;
variables:
	double time(time) ;
		time:_FillValue = NaN ;
		time:standard_name = "time" ;
		time:units = "day as %Y%m%d.%f" ;
		time:calendar = "proleptic_gregorian" ;
	short Rg(time, y, x) ;
		Rg:_FillValue = -1s ;
		Rg:long_name = "HWSD sub sum content" ;
		Rg:units = "percent wt" ;
		Rg:valid_range = 97., 103. ;
	double latitude(y, x) ;
		latitude:_FillValue = -99999. ;
	double longitude(y, x) ;
		longitude:_FillValue = -99999. ;

// global attributes:
		:Conventions = "CF-1.0" ;
		:content = "HARMONIZED WORLD SOIL DATABASE; first it was aggregated to one global file; then the missing areas were filled with interpolated data; then separate tiles were extracted" ;
		:scaling_factor = "20" ;
}

This is why ds.time is an array of floats when you load the file in. We could discuss how to address that, but depending on your broader needs here that might not be a major concern.
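For reference, values in that "day as %Y%m%d.%f" scheme encode the date digit-by-digit (e.g. 20000214.0 for 2000-02-14), so they can be decoded by hand. The helper below is a hypothetical sketch, not part of xarray, and it ignores the fractional-day part:

```python
import numpy as np
import pandas as pd

def decode_yyyymmdd(values):
    """Decode 'day as %Y%m%d.%f' floats into pandas datetimes.

    Hypothetical helper; the fractional-day part is dropped.
    """
    return pd.to_datetime(
        [str(int(v)) for v in np.atleast_1d(values)], format="%Y%m%d"
    )

print(decode_yyyymmdd([20000214.0])[0].date())  # 2000-02-14
```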

Regarding assigning a new value to the time coordinate, the following should work:

ds['time'] = np.array([times[0]])

@tommylees112
Author

tommylees112 commented Nov 7, 2018

That all worked great until I tried to write out to a .nc file.

data_dir = "./"
filename = "Rg_dummy.nc"

# get the datetime range
times = pd.date_range("2000-01-01", "2000-12-31", name="time")
var = "Rg"

copyfile(data_dir + filename, "temp.nc")
ds = xr.open_dataset("temp.nc")
print("Temporary Data read to Python")

# FORWARD FILL FROM THE ORIGINAL DATA to new timesteps
ds['time'] = np.array([times[0]])
ds.reindex({"time":times})
ds.ffill("time")

ds.to_netcdf(filename, format="NETCDF3_CLASSIC")
print(filename, "Written!")

# remove temporary file
os.remove(data_dir+"temp.nc")
print("Temporary Data Removed")

del ds

I get the following Error message:

Temporary Data read to Python
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-228-e3d645224353> in <module>()
     15 ds.ffill("time")
     16
---> 17 ds.to_netcdf(filename, format="NETCDF3_CLASSIC")
     18 print(filename, "Written!")
     19

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/core/dataset.pyc in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute)
   1148                          engine=engine, encoding=encoding,
   1149                          unlimited_dims=unlimited_dims,
-> 1150                          compute=compute)
   1151
   1152     def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/api.pyc in to_netcdf(dataset, path_or_file, mode, format, group, engine, writer, encoding, unlimited_dims, compute)
    721     try:
    722         dataset.dump_to_store(store, sync=sync, encoding=encoding,
--> 723                               unlimited_dims=unlimited_dims, compute=compute)
    724         if path_or_file is None:
    725             return target.getvalue()

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/core/dataset.pyc in dump_to_store(self, store, encoder, sync, encoding, unlimited_dims, compute)
   1073
   1074         store.store(variables, attrs, check_encoding,
-> 1075                     unlimited_dims=unlimited_dims)
   1076         if sync:
   1077             store.sync(compute=compute)

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/common.pyc in store(self, variables, attributes, check_encoding_set, unlimited_dims)
    366         self.set_dimensions(variables, unlimited_dims=unlimited_dims)
    367         self.set_variables(variables, check_encoding_set,
--> 368                            unlimited_dims=unlimited_dims)
    369
    370     def set_attributes(self, attributes):

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/netCDF4_.pyc in set_variables(self, *args, **kwargs)
    405     def set_variables(self, *args, **kwargs):
    406         with self.ensure_open(autoclose=False):
--> 407             super(NetCDF4DataStore, self).set_variables(*args, **kwargs)
    408
    409     def encode_variable(self, variable):

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/common.pyc in set_variables(self, variables, check_encoding_set, unlimited_dims)
    403             check = vn in check_encoding_set
    404             target, source = self.prepare_variable(
--> 405                 name, v, check, unlimited_dims=unlimited_dims)
    406
    407             self.writer.add(source, target)

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/netCDF4_.pyc in prepare_variable(self, name, variable, check_encoding, unlimited_dims)
    451                 least_significant_digit=encoding.get(
    452                     'least_significant_digit'),
--> 453                 fill_value=fill_value)
    454             _disable_auto_decode_variable(nc4_var)
    455

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.createVariable()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__init__()

TypeError: illegal primitive data type, must be one of ['i8', 'f4', 'f8', 'S1', 'i2', 'i4', 'u8', 'u4', 'u1', 'u2', 'i1'], got datetime64[ns]

and if I try with the default netcdf writing options

ds.to_netcdf(filename)
print(filename, "Written!")

I get this error message:

Temporary Data read to Python
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-229-453d5f074d33> in <module>()
     15 ds.ffill("time")
     16
---> 17 ds.to_netcdf(filename)
     18 print(filename, "Written!")
     19

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/core/dataset.pyc in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute)
   1148                          engine=engine, encoding=encoding,
   1149                          unlimited_dims=unlimited_dims,
-> 1150                          compute=compute)
   1151
   1152     def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/api.pyc in to_netcdf(dataset, path_or_file, mode, format, group, engine, writer, encoding, unlimited_dims, compute)
    721     try:
    722         dataset.dump_to_store(store, sync=sync, encoding=encoding,
--> 723                               unlimited_dims=unlimited_dims, compute=compute)
    724         if path_or_file is None:
    725             return target.getvalue()

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/core/dataset.pyc in dump_to_store(self, store, encoder, sync, encoding, unlimited_dims, compute)
   1073
   1074         store.store(variables, attrs, check_encoding,
-> 1075                     unlimited_dims=unlimited_dims)
   1076         if sync:
   1077             store.sync(compute=compute)

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/common.pyc in store(self, variables, attributes, check_encoding_set, unlimited_dims)
    366         self.set_dimensions(variables, unlimited_dims=unlimited_dims)
    367         self.set_variables(variables, check_encoding_set,
--> 368                            unlimited_dims=unlimited_dims)
    369
    370     def set_attributes(self, attributes):

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/netCDF4_.pyc in set_variables(self, *args, **kwargs)
    405     def set_variables(self, *args, **kwargs):
    406         with self.ensure_open(autoclose=False):
--> 407             super(NetCDF4DataStore, self).set_variables(*args, **kwargs)
    408
    409     def encode_variable(self, variable):

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/common.pyc in set_variables(self, variables, check_encoding_set, unlimited_dims)
    403             check = vn in check_encoding_set
    404             target, source = self.prepare_variable(
--> 405                 name, v, check, unlimited_dims=unlimited_dims)
    406
    407             self.writer.add(source, target)

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/netCDF4_.pyc in prepare_variable(self, name, variable, check_encoding, unlimited_dims)
    418                          unlimited_dims=None):
    419         datatype = _get_datatype(variable, self.format,
--> 420                                  raise_on_invalid_encoding=check_encoding)
    421         attrs = variable.attrs.copy()
    422

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/netCDF4_.pyc in _get_datatype(var, nc_format, raise_on_invalid_encoding)
     99 def _get_datatype(var, nc_format='NETCDF4', raise_on_invalid_encoding=False):
    100     if nc_format == 'NETCDF4':
--> 101         datatype = _nc4_dtype(var)
    102     else:
    103         if 'dtype' in var.encoding:

/home/mpim/m300690/miniconda3/envs/holaps/lib/python2.7/site-packages/xarray/backends/netCDF4_.pyc in _nc4_dtype(var)
    122     else:
    123         raise ValueError('unsupported dtype for netCDF4 variable: {}'
--> 124                          .format(var.dtype))
    125     return dtype
    126

ValueError: unsupported dtype for netCDF4 variable: datetime64[ns]

Version information

If this is useful:

In [230]: xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-696.18.7.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

xarray: 0.10.7
pandas: 0.23.0
numpy: 1.11.3
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 1.5.1
cartopy: None
seaborn: None
setuptools: 39.1.0
pip: 18.1
conda: None
pytest: None
IPython: 5.7.0
sphinx: None

@spencerkclark
Member

Thanks @tommylees112 -- note that the reindex and ffill methods do not operate in-place, so you'll need to assign their results to new variables. You can also do both in one step:

In [1]: import xarray as xr; import pandas as pd; import numpy as np

In [2]: ds = xr.open_dataset('Rg_dummy.nc')

In [3]: times = pd.date_range("2000-01-01", "2000-12-31", name="time")

In [4]: ds['time'] = np.array([times[0]])

In [5]: ds2 = ds.reindex(time=times, method='ffill')

In [6]: ds2.to_netcdf('result.nc')

Regarding the issue saving to files -- I can reproduce that issue with older xarray versions. It is related to
#2512 and was fixed in #2513 (i.e. it works with the master version of xarray). The good news is this bug only applies to saving ds in my example, not ds2, so you should be able to do this with your current setup just fine.

@tommylees112
Author

tommylees112 commented Nov 7, 2018

@spencerkclark Thanks very much that is awesome!

One final Q: how do I set the fill value not to NaN but to -1 (this is what I need for the Land Surface Model)?

The current output of ncdump -h Rg_dummy.nc is:

...
variables:
	double time(time) ;
		time:_FillValue = NaN ;
		time:standard_name = "time" ;
		time:units = "day as %Y%m%d.%f" ;
		time:calendar = "proleptic_gregorian" ;
	short Rg(time, y, x) ;
		Rg:_FillValue = NaN ;
		Rg:long_name = "HWSD sub sum content" ;
		Rg:units = "percent wt" ;
		Rg:valid_range = 97., 103. ;
	double latitude(y, x) ;
		latitude:_FillValue = -99999. ;
	double longitude(y, x) ;
		longitude:_FillValue = -99999. ;

What I want is:

ncdump -h Rg_dummy.nc

...
variables:
	double time(time) ;
		time:_FillValue = -1s ;
		time:standard_name = "time" ;
		time:units = "day as %Y%m%d.%f" ;
		time:calendar = "proleptic_gregorian" ;
	short Rg(time, y, x) ;
		Rg:_FillValue = -1s ;
		Rg:long_name = "HWSD sub sum content" ;
		Rg:units = "percent wt" ;
		Rg:valid_range = 97., 103. ;
	double latitude(y, x) ;
		latitude:_FillValue = -99999. ;
	double longitude(y, x) ;
		longitude:_FillValue = -99999. ;

I want to do something like:

ds2.to_netcdf(filename, set_fill_value=-1)

I saw these:
Issue #1598
Issue #1865

But failed to understand where/how to use them

Thank you so much for helping me out with xarray. It's crazy powerful. It's also just very big!
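For later readers: to_netcdf accepts a per-variable encoding dict, and _FillValue can be set there. A minimal sketch (the variable name, dtype, and file name are assumptions mirroring this thread):

```python
import numpy as np
import xarray as xr

# Toy dataset standing in for Rg.
ds = xr.Dataset({"Rg": (("y", "x"), np.ones((2, 2), dtype="int16"))})

# The encoding dict controls the on-disk representation per variable.
ds.to_netcdf("rg_fill.nc", encoding={"Rg": {"_FillValue": np.int16(-1)}})

# Reading it back, the fill value appears in the variable's encoding.
with xr.open_dataset("rg_fill.nc") as ds_back:
    print(ds_back["Rg"].encoding["_FillValue"])  # -1
```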

@spencerkclark
Member

The _FillValue encoding seems to be preserved with your desired value in my example:

$ ncdump -h result.nc
netcdf result {
dimensions:
	time = 366 ;
	y = 200 ;
	x = 200 ;
variables:
	int64 time(time) ;
		time:units = "days since 2000-01-01 00:00:00" ;
		time:calendar = "proleptic_gregorian" ;
	short Rg(time, y, x) ;
		Rg:_FillValue = -1s ;
		Rg:long_name = "HWSD sub sum content" ;
		Rg:units = "percent wt" ;
		Rg:valid_range = 97., 103. ;
	double latitude(y, x) ;
		latitude:_FillValue = -99999. ;
	double longitude(y, x) ;
		longitude:_FillValue = -99999. ;

// global attributes:
		:Conventions = "CF-1.0" ;
		:content = "HARMONIZED WORLD SOIL DATABASE; first it was aggregated to one global file; then the missing areas were filled with interpolated data; then separate tiles were extracted" ;
		:scaling_factor = "20" ;

Is the time encoding important for your land surface model? That does change in my example (see the units attribute); you might need some special logic to handle that.

@jhamman
Copy link
Member

jhamman commented Feb 3, 2019

@tommylees112 - this issue has sat for a bit of time now. Did you end up with a solution here? If so, is it okay with you if we close this out?

@jhamman jhamman added topic-CF conventions needs mcve https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports labels Feb 3, 2019
@tommylees112
Author

Sorry for the silence! I got pulled away to another project. Unfortunately I wasn't able to complete the task in xarray, but I found that the easiest way around the problem was a combination of two functions:

import os

import numpy as np
import xarray as xr


def change_missing_vals_to_9999f(ds, variable):
    """ Change the missing values from np.nan to -9999.0f"""
    arr = ds[variable].values

    # set the values to -9999
    arr[np.isnan(arr)] = -9999

    # reassign the values back to the array
    ds[variable] = (ds[variable].dims, arr)

    return ds

def change_missing_data_values(filename):
    """ change the values INSIDE the .nc file to -9999.0f """
    assert (
            filename.split(".")[-1] == "nc"
    ), "This function only works with .nc files. Filename: {}".format(filename)
    print("** Processing {} **".format(filename))

    # ONLY OPEN THE DATASET ONCE
    ds = xr.open_dataset(filename)
    variables = ds.data_vars.keys()

    for variable in variables:
        print("** Working on variable {} **".format(variable))
        ds = change_missing_vals_to_9999f(ds, variable)

    # ds.map(change_missing_vals_to_9999f, variable)

    # rewrite to netcdf file
    ds.to_netcdf(filename)
    print("** Written variables {} to filename {} **".format(variables, filename))

    return

and then another function using the NCO command:

def change_nc_FillValue(filename):
    """ use the NCO command to change the fillvalue metadata in the .nc files"""
    command = "ncatted -a _FillValue,,m,f,-9999.0 {}".format(filename)
    os.system(command)
    print("** _FillValue changed on {} file **".format(filename))
    return

RUN HERE:

@click.command()
@click.argument("filename", type=str)
def main(filename):
    """ Run the two commands
    a) change the Values INSIDE the .nc file                    [python, numpy, xarray]
    b) change the associated METADATA for the .nc file headers  [nco]
    """
    change_missing_data_values(filename)
    change_nc_FillValue(filename)

    print("**** PROCESS DONE FOR {} ****".format(filename))

    return
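As an aside, the per-variable numpy loop above can be expressed with xarray's own fillna, which replaces NaNs across all data variables in one call. A minimal sketch with toy data:

```python
import numpy as np
import xarray as xr

# Toy dataset with one missing value.
ds = xr.Dataset({"Rg": (("y", "x"), np.array([[1.0, np.nan], [3.0, 4.0]]))})

# One call replaces NaN in every data variable.
filled = ds.fillna(-9999.0)
print(float(filled["Rg"][0, 1]))  # -9999.0
```

Combined with an `encoding={"Rg": {"_FillValue": -9999.0}}` argument to to_netcdf, this would also record the fill value in the file's metadata without the external ncatted step.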

@tommylees112
Author

But the original question was answered so thank you very much!
