
Data from years 2018 onwards cannot be extracted with xarray #1

Closed · darkquasar opened this issue Feb 23, 2020 · 10 comments · Fixed by #23
Labels: bug, help wanted

darkquasar (Collaborator) commented Feb 23, 2020

Xarray seems to have some issues extracting slices of data from SILO NetCDF4 files from 2018 onwards. The error thrown is:

Traceback (most recent call last):
  File "bestiapop.py", line 638, in <module>
  File "bestiapop.py", line 624, in main
    if __name__ == '__main__':
  File "bestiapop.py", line 247, in process_records
    output_format="MET")
  File "bestiapop.py", line 472, in generate_climate_dataframe
    # Note: there is a better method for obtaining this by looking at the
  File "bestiapop.py", line 352, in get_values_from_array
    data_values = [np.round(x, decimals=1) for x in value_array[variable_short_name].sel(lat=lat, lon=lon).values]
  File "C:\tools\Anaconda3\envs\bestiapop2\lib\site-packages\xarray\core\common.py", line 233, in __getattr__
    "{!r} object has no attribute {!r}".format(type(self).__name__, name)
AttributeError: 'DataArray' object has no attribute 'values'

Specifically, the error is here:

File "bestiapop.py", line 352, in get_values_from_array
    data_values = [np.round(x, decimals=1) for x in value_array[variable_short_name].sel(lat=lat, lon=lon).values]

This seems to be related to the way files from 2018 onwards are encoded. We need to investigate further by obtaining one such file and exploring it interactively in Jupyter.

To troubleshoot this, we need to import xarray and then load the NetCDF4 file inside a Jupyter Notebook:

import xarray as xr
value_array = xr.open_dataset("path_to_NetCDF_file", engine='h5netcdf')

This will store the dataset in the variable value_array as an xarray Dataset. We need to figure out why the code value_array[variable_short_name].sel(lat=some_lat, lon=some_lon).values (where "variable_short_name" is the name of the variable stored in the NetCDF file, like "daily_rain") doesn't work for files from 2018 onwards. It is likely that the values for that lat/lon combination are stored inside another layer in that array.
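For example, a minimal exploration sketch (the file path and variable name below are placeholders for illustration) that prints the structure and the on-disk encoding of one variable:

import xarray as xr

# Hypothetical file path and variable name, for illustration only.
value_array = xr.open_dataset("2018.daily_rain.nc", engine='h5netcdf')

# Print the overall structure: dimensions, coordinates, variables, attributes.
print(value_array)

# Drill into one variable and inspect its dtype and on-disk encoding,
# which is where pre-2018 and post-2018 files might differ.
da = value_array["daily_rain"]
print(da.dims, da.dtype)
print(da.encoding)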

darkquasar added the bug label Feb 23, 2020
darkquasar added this to the February 2020 Code Review milestone Feb 23, 2020
darkquasar self-assigned this Feb 23, 2020
JJguri (Owner) commented Mar 31, 2020

I was looking through several NetCDF files and climate variables from 2015 to 2018 and, unfortunately, I did not find any differences in terms of data structure. See an example for max temp at https://github.com/JJguri/bestiapop/blob/master/sample_data/netcdf_exploration.ipynb

darkquasar (Collaborator, Author) commented

Can you try to get to a single data point in both of those? Use a combination of lat-lon-time to get there and, if possible, display the data values for a whole week or so. We need to see that they are both reachable via the same code.
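Something along these lines should do it; a minimal sketch, with placeholder file names and coordinates:

import xarray as xr

def inspect_point(path, var, lat, lon, start, end):
    # Open the file, select the grid cell nearest to lat/lon,
    # then print one week of daily values for that cell.
    ds = xr.open_dataset(path, engine='h5netcdf')
    cell = ds[var].sel(lat=lat, lon=lon, method='nearest')
    print(path, cell.sel(time=slice(start, end)).values)

inspect_point("2015.max_temp.nc", "max_temp", -41.15, 145.5, "2015-01-01", "2015-01-07")
inspect_point("2018.max_temp.nc", "max_temp", -41.15, 145.5, "2018-01-01", "2018-01-07")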

JJguri (Owner) commented Apr 15, 2020

Here is the difference between the NetCDF files. I extracted the max temperature values across all latitudes and longitudes for two years, 2015 and 2018. For 2015, the elements in the array are NaN values, while the 2018 file has 572,721 max_temp values, exactly one per grid cell (681 lats × 841 lons = 572,721).
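For reference, a sketch of how that count can be reproduced (the file paths are assumptions):

import xarray as xr

for path in ("2015.max_temp.nc", "2018.max_temp.nc"):
    ds = xr.open_dataset(path, engine='h5netcdf')
    # Count the grid cells holding a real (non-NaN) value on the first day.
    first_day = ds["max_temp"].isel(time=0)
    print(path, int(first_day.notnull().sum()))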

darkquasar (Collaborator, Author) commented

OK, this is good. We then need to extract value_array2018 to get to the actual array of values.

JJguri (Owner) commented Apr 26, 2020

After playing with this a little more, I found that the 2015 and 2018 files are essentially equal. The only difference I found was the position of the Attributes in the arrays. I am not sure whether that could be affecting the download; if not, the error is not in the structure of the file. Please see the netcdf_exploration file for details.

JJguri (Owner) commented Apr 29, 2020

SILO used two different formats of FillValues because the NetCDF files were constructed using different software tools:

- Data up to 2016 is in 64-bit format, with fill values of -32768.
- Data from 2017 (or 2018) onwards is in 32-bit format, with fill values of -32767.

BestiaPop should be able to read and skip all of them.
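A hedged sketch of how both conventions could be normalized before further processing (the file path is hypothetical; if xarray has already decoded the _FillValue attribute, the affected cells are NaN and the extra masking is a no-op):

import numpy as np
import xarray as xr

ds = xr.open_dataset("some_year.max_temp.nc", engine='h5netcdf')  # hypothetical path
da = ds["max_temp"]

# Mask both raw SILO fill values explicitly, in case they were not
# decoded to NaN; cells that are already NaN are unaffected.
da = da.where(~da.isin([-32768, -32767]))

# Downstream code can now treat missing data uniformly as NaN.
values = np.round(da.sel(lat=-41.15, lon=145.5, method='nearest').values, decimals=1)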

JJguri added the help wanted label Apr 29, 2020
darkquasar (Collaborator, Author) commented

I cannot find the "-32768" fill values (assuming that by "fill values" you mean non-existent values). I've tested with multiple variables for year 2018 and they are all still "NaN".

I still can't determine why the data exploration slows down for 2018 data. Perhaps the error is not where we thought it was?

JJguri (Owner) commented May 2, 2020

I think it is also related to the dtype, as discussed in pydata/xarray#2304.
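One way to check this (a sketch with a hypothetical file path): open the file once with CF decoding disabled to see the on-disk dtype and raw _FillValue, and once with defaults to see what xarray promotes it to:

import xarray as xr

# Raw view: on-disk dtype and the _FillValue attribute as stored.
raw = xr.open_dataset("2018.max_temp.nc", engine='h5netcdf', decode_cf=False)
print(raw["max_temp"].dtype, raw["max_temp"].attrs.get("_FillValue"))

# Decoded view: xarray applies mask/scale, typically promoting the
# dtype to float and replacing fill values with NaN.
decoded = xr.open_dataset("2018.max_temp.nc", engine='h5netcdf')
print(decoded["max_temp"].dtype)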

JJguri (Owner) commented Jun 27, 2020

Could this bug generate an issue when we want to publish the package? Which 'easy' options do we have to avoid it?

darkquasar linked a pull request Aug 11, 2020 that will close this issue (merged)
darkquasar (Collaborator, Author) commented

Closing this issue as the problem was not with BestiaPop but rather with how SILO compiled its NetCDF4 files. As explained by the SILO team, as of Jun 2020, SILO has refactored all its NetCDF4 files to perform better when extracting spatial data points rather than time-based data points.

This effectively means that it is slower to extract data for all days of the year for a single lat/lon combination than it is to extract data for all lat/lon combinations for a single day. Since SILO NetCDF4 files are split into year-variable units, you will always have to extract data from different files when using multiple years.

These factors create a double bottleneck when directly loading NetCDF4 files from AWS S3 buckets:

  1. To generate a MET, BestiaPop must read from multiple files: one per year required in the final MET, for each climate variable involved.
  2. Given the way chunking was configured for SILO NetCDF4 files, it was very slow for otherwise fast Python packages to read the data directly from the cloud and parse it accordingly (see the sketch below).
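A sketch illustrating the asymmetry (hypothetical path; absolute timings will vary):

import time
import xarray as xr

ds = xr.open_dataset("2018.daily_rain.nc", engine='h5netcdf')  # hypothetical path

# Spatially-oriented chunking: one day across all lat/lon touches few chunks.
t0 = time.perf_counter()
ds["daily_rain"].isel(time=0).values
print("all lat/lon, one day:", time.perf_counter() - t0, "s")

# The same layout makes a full-year series for a single cell touch
# every chunk along the time axis, which is the slow path BestiaPop hit.
t0 = time.perf_counter()
ds["daily_rain"].sel(lat=-41.15, lon=145.5, method='nearest').values
print("one cell, all days:", time.perf_counter() - t0, "s")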

This issue has now been circumvented by leveraging SILO's cloud API, which is most likely backed by a fast database and/or reads directly from their local NetCDF4 files while leveraging the power of cloud computing.
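For context, a hedged sketch of what such an API call can look like; the endpoint and parameter names below are assumptions and should be verified against SILO's current API documentation, not taken as BestiaPop's actual implementation:

import requests

# Hypothetical SILO "Data Drill" request; endpoint and parameters are
# assumptions for illustration only.
response = requests.get(
    "https://www.longpaddock.qld.gov.au/cgi-bin/silo/DataDrillDataset.php",
    params={
        "lat": -41.15,
        "lon": 145.5,
        "start": "20180101",
        "finish": "20181231",
        "format": "csv",
        "username": "your_email@example.com",
        "password": "apirequest",
    },
    timeout=60,
)
response.raise_for_status()
print(response.text[:500])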
