
Data from years 2018 onwards cannot be extracted with xarray #1

Closed · darkquasar opened this issue Feb 23, 2020 · 10 comments · Fixed by #23
Labels: bug, help wanted

darkquasar (Collaborator) commented Feb 23, 2020

Xarray seems to have some issues extracting slices of data from SILO NetCDF4 files from 2018 onwards. The error thrown is:

Traceback (most recent call last):
  File "bestiapop.py", line 638, in <module>
  File "bestiapop.py", line 624, in main
    if __name__ == '__main__':
  File "bestiapop.py", line 247, in process_records
    output_format="MET")
  File "bestiapop.py", line 472, in generate_climate_dataframe
    # Note: there is a better method for obtaining this by looking at the
  File "bestiapop.py", line 352, in get_values_from_array
    data_values = [np.round(x, decimals=1) for x in value_array[variable_short_name].sel(lat=lat, lon=lon).values]
  File "C:\tools\Anaconda3\envs\bestiapop2\lib\site-packages\xarray\core\common.py", line 233, in __getattr__
    "{!r} object has no attribute {!r}".format(type(self).__name__, name)
AttributeError: 'DataArray' object has no attribute 'values'

Specifically, the error is here:

File "bestiapop.py", line 352, in get_values_from_array
    data_values = [np.round(x, decimals=1) for x in value_array[variable_short_name].sel(lat=lat, lon=lon).values]

This seems to be related to the way files from 2018 onwards are encoded. We need to investigate further by obtaining one such file and exploring it interactively in Jupyter.

To troubleshoot this, we need to import xarray and then load the NetCDF4 file inside a Jupyter Notebook:

import xarray as xr
value_array = xr.open_dataset("path_to_NetCDF_file", engine='h5netcdf')

This will store the dataset in the variable value_array as an xarray Dataset. We need to figure out why the code value_array[variable_short_name].sel(lat=some_lat, lon=some_lon).values (where "variable_short_name" is the name of the variable stored in the NetCDF file, like "daily_rain") doesn't work for files from 2018 onwards. It is likely that the values for that lat/lon combination are stored inside another layer in that array.
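For example, a minimal exploration sketch (the file path and variable name below are placeholders for illustration) that prints the structure and the on-disk encoding of one variable:

import xarray as xr

# Hypothetical file path and variable name, for illustration only.
value_array = xr.open_dataset("2018.daily_rain.nc", engine='h5netcdf')

# Print the overall structure: dimensions, coordinates, variables, attributes.
print(value_array)

# Drill into one variable and inspect its dtype and on-disk encoding,
# which is where pre-2018 and post-2018 files might differ.
da = value_array["daily_rain"]
print(da.dims, da.dtype)
print(da.encoding)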

darkquasar added the bug label Feb 23, 2020
darkquasar added this to the February 2020 Code Review milestone Feb 23, 2020
darkquasar self-assigned this Feb 23, 2020
JJguri (Owner) commented Mar 31, 2020

I was looking through several NetCDF files and climate variables from 2015 to 2018 and, unfortunately, I did not find any differences in terms of data structure. See an example for max temp at https://github.com/JJguri/bestiapop/blob/master/sample_data/netcdf_exploration.ipynb

darkquasar (Collaborator, Author) commented

Can you try to get to a single data point in both of those? Use a combination of lat-lon-time to get there and, if possible, display the data values for a whole week or so. We need to see that they are both reachable via the same code.
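Something along these lines should do it; a minimal sketch, with placeholder file names and coordinates:

import xarray as xr

def inspect_point(path, var, lat, lon, start, end):
    # Open the file, select the grid cell nearest to lat/lon,
    # then print one week of daily values for that cell.
    ds = xr.open_dataset(path, engine='h5netcdf')
    cell = ds[var].sel(lat=lat, lon=lon, method='nearest')
    print(path, cell.sel(time=slice(start, end)).values)

inspect_point("2015.max_temp.nc", "max_temp", -41.15, 145.5, "2015-01-01", "2015-01-07")
inspect_point("2018.max_temp.nc", "max_temp", -41.15, 145.5, "2018-01-01", "2018-01-07")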

JJguri (Owner) commented Apr 15, 2020

Here is the difference between the NetCDF files. I extracted the max temperature values across all latitudes and longitudes for two years, 2015 and 2018. For 2015, the elements in the array are NaN values, while the 2018 file has 572,721 max_temp values, exactly one per grid cell (681 lats × 841 lons = 572,721).
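For reference, a sketch of how that count can be reproduced (the file paths are assumptions):

import xarray as xr

for path in ("2015.max_temp.nc", "2018.max_temp.nc"):
    ds = xr.open_dataset(path, engine='h5netcdf')
    # Count the grid cells holding a real (non-NaN) value on the first day.
    first_day = ds["max_temp"].isel(time=0)
    print(path, int(first_day.notnull().sum()))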

darkquasar (Collaborator, Author) commented

OK, this is good. We then need to extract value_array2018 to get to the actual array of values.

JJguri (Owner) commented Apr 26, 2020

After playing with this a little more, I found that the 2015 and 2018 files are essentially equal. The only difference I found was the position of the Attributes in the arrays. I am not sure whether that could be affecting the download; if not, the error is not in the structure of the file. Please see the netcdf_exploration file for details.

JJguri (Owner) commented Apr 29, 2020

SILO used two different formats of FillValues because the NetCDF files were constructed using different software tools:

- Data up to 2016 is in 64-bit format, with fill values of -32768.
- Data from 2017 (or 2018) onwards is in 32-bit format, with fill values of -32767.

BestiaPop should be able to read and skip all of them.
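A hedged sketch of how both conventions could be normalized before further processing (the file path is hypothetical; if xarray has already decoded the _FillValue attribute, the affected cells are NaN and the extra masking is a no-op):

import numpy as np
import xarray as xr

ds = xr.open_dataset("some_year.max_temp.nc", engine='h5netcdf')  # hypothetical path
da = ds["max_temp"]

# Mask both raw SILO fill values explicitly, in case they were not
# decoded to NaN; cells that are already NaN are unaffected.
da = da.where(~da.isin([-32768, -32767]))

# Downstream code can now treat missing data uniformly as NaN.
values = np.round(da.sel(lat=-41.15, lon=145.5, method='nearest').values, decimals=1)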

JJguri added the help wanted label Apr 29, 2020
darkquasar (Collaborator, Author) commented

I cannot find the "-32768" fill values (assuming that by "fill values" you mean non-existent values). I've tested with multiple variables for year 2018 and they are all still "NaN".

I still can't determine why the data exploration slows down for 2018 data. Perhaps the error is not where we thought it was?

JJguri (Owner) commented May 2, 2020

I think it is also related to the dtype, as discussed in pydata/xarray#2304.
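One way to check this (a sketch with a hypothetical file path): open the file once with CF decoding disabled to see the on-disk dtype and raw _FillValue, and once with defaults to see what xarray promotes it to:

import xarray as xr

# Raw view: on-disk dtype and the _FillValue attribute as stored.
raw = xr.open_dataset("2018.max_temp.nc", engine='h5netcdf', decode_cf=False)
print(raw["max_temp"].dtype, raw["max_temp"].attrs.get("_FillValue"))

# Decoded view: xarray applies mask/scale, typically promoting the
# dtype to float and replacing fill values with NaN.
decoded = xr.open_dataset("2018.max_temp.nc", engine='h5netcdf')
print(decoded["max_temp"].dtype)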

JJguri (Owner) commented Jun 27, 2020

Could this bug generate an issue when we want to publish the package? Which 'easy' options do we have to avoid it?

darkquasar linked a pull request Aug 11, 2020 that will close this issue (merged)
darkquasar (Collaborator, Author) commented

Closing this issue as the problem was not with BestiaPop but rather with how SILO compiled its NetCDF4 files. As explained by the SILO team, as of Jun 2020, SILO has refactored all its NetCDF4 files to perform better when extracting spatial data points rather than time-based data points.

This effectively means that it is slower to extract data for all days of the year for a single lat/lon combination than it is to extract data for all lat/lon combinations for a single day. Since SILO NetCDF4 files are split into year-variable units, you will always have to extract data from different files when using multiple years.

These factors create a double bottleneck when directly loading NetCDF4 files from AWS S3 buckets:

  1. To generate a MET, BestiaPop must read from multiple files: one per year required in the final MET, for each climate variable involved.
  2. Given the way chunking was configured for SILO NetCDF4 files, it was very slow for otherwise fast Python packages to read the data directly from the cloud and parse it accordingly (see the sketch below).
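A sketch illustrating the asymmetry (hypothetical path; absolute timings will vary):

import time
import xarray as xr

ds = xr.open_dataset("2018.daily_rain.nc", engine='h5netcdf')  # hypothetical path

# Spatially-oriented chunking: one day across all lat/lon touches few chunks.
t0 = time.perf_counter()
ds["daily_rain"].isel(time=0).values
print("all lat/lon, one day:", time.perf_counter() - t0, "s")

# The same layout makes a full-year series for a single cell touch
# every chunk along the time axis, which is the slow path BestiaPop hit.
t0 = time.perf_counter()
ds["daily_rain"].sel(lat=-41.15, lon=145.5, method='nearest').values
print("one cell, all days:", time.perf_counter() - t0, "s")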

This issue has now been circumvented by leveraging SILO's cloud API, which is most likely backed by a fast database and/or reads directly from their local NetCDF4 files while leveraging the power of cloud computing.
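For context, a hedged sketch of what such an API call can look like; the endpoint and parameter names below are assumptions and should be verified against SILO's current API documentation, not taken as BestiaPop's actual implementation:

import requests

# Hypothetical SILO "Data Drill" request; endpoint and parameters are
# assumptions for illustration only.
response = requests.get(
    "https://www.longpaddock.qld.gov.au/cgi-bin/silo/DataDrillDataset.php",
    params={
        "lat": -41.15,
        "lon": 145.5,
        "start": "20180101",
        "finish": "20181231",
        "format": "csv",
        "username": "your_email@example.com",
        "password": "apirequest",
    },
    timeout=60,
)
response.raise_for_status()
print(response.text[:500])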
