Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atlite ESGF interface for downloading and preparing CMIP6 data #180

Open
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

Ovewh
Copy link
Contributor

@Ovewh Ovewh commented Jun 29, 2021

Change proposed in this Pull Request

Add a interface in atlite to the ESGF CMIP database for downloading and preparing CORDEX and CMIP6 data.

Description

An interface in atlite for working with Climate model output have been developed. There is an example on how to use this interface available in examples/cmip_interface_example.ipynb. The search parameters for the ESGF database can be specified as either as dictionary when setting up the cutout or in the atlite/datasets/cmip.ymal file. Determining the search parameters have to be done manually by searching the ESGF database.

Variables required by atlite are:

  • rsds, surface downwelling radiation shortwave
  • rsus, surface upwelling radiation shortwave
  • sfcWind, wind speed 10m
  • mrro ,runoff
  • tas, surface temperature

This similar to the variables required by the old CORDEX interface, however surface roughness doesn't seem to be available from CMIP.
Based on the provided search parameters, it uses the pyesf-search python api to find matching results, if there are more than one result it will take the most resent result. Then the OPeNDAP urls for that result are obtained and which can be loaded lazily using xarray. This means that the data can be subset according to the cutout and the download and computation will be triggered by cutout.prepare(). Be aware that some models and dataservers doesn't provide OPeNDAP urls, which means that you might have to try different ensamble to find a model that has. The current example uses data from the EC-Earth3 model. There is also a possibility to download netCDF files with 1 year of data individually, however this haven't been implemented in atlite. That might be more robust, but atleast so far the OPeNDAP interface in xarray have been working flawlessly.

Caveats:

The highest temporal resolution that are available CMIP is 3hr, however some models only have some of variables required by atlite at 3hr resolution while others are at 6hr resolution. CMIP6 also have quite coarse resolution ~ 100km, CORDEX has higher resolution. The surface roughness is not available in CMIP, currently averaged roughness is taken from ERA5.

Motivation and Context

Explore influence of future climate change on energy systems. Related issue #59

How Has This Been Tested?

The functionally have been tested for calculating wind and pv capacities. Tested with python > 3.9.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I tested my contribution locally and it seems to work fine.
  • I locally ran pytest inside the repository and no unexpected problems came up.
  • I have adjusted the docstrings in the code appropriately.
  • I have documented the effects of my code changes in the documentation doc/.
  • I have added newly introduced dependencies to environment.yaml file.
  • I have added a note to release notes doc/release_notes.rst.
  • I have used pre-commit run --all to lint/format/check my contribution

@Ovewh Ovewh marked this pull request as ready for review June 29, 2021 12:29
Copy link
Contributor

@FabianHofmann FabianHofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Ovewh this is great! I just had a brief look through without testing the code. You should also write a guide on how to introduce new datasets ;)

Here are only some general comments. I have to go through it all when I have time (which still might happen this week). Everything looks good. The only thing I do not like too much is having a yaml for the model config. In contrast to the windturbine/solar config, these are parameters which should be easily adjustable and not that hidden. I would rather prefer a slim version where only esgf_params are supported (no model argument and no yaml file) and in the documentation (or example) we provide some reasonable sets of params. Sounds sensible?

Comment on lines +33 to +38
features = {
"wind": ["wnd10m"],
"influx": ["influx", "outflux"],
"temperature": ["temperature"],
"runoff": ["runoff"],
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these features are all the features provided by cmip? Just wondering because they look very similar to era5 (Idea is that a feature is a umbrella term for the xarray variable which will be retrieved)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's all the features. Though I'm not sure I get your question, isn't the point that the features should have the same name across the different datasets?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, totally correct. I just wanted to be sure this isn't a copy/paste artefact

Comment on lines +56 to +57
if v in datavars:
ds[v].attrs["feature"] = [k for k, l in fd if v in l].pop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give me a hint why this is necessary? It is hard to comprehend without the context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CMIP data have longitude bins, latitude bins and time bins included as variables and these aren't defined as variables in the features dict. If I don't include if statement it will try to pop an empty list. Suggestion on a better way to do this?

Data variables:
    time_bnds    (time, bnds) datetime64[ns] 2021-01-01 ... 2021-02-01
    lat_bnds     (y, bnds) float64 31.65 32.35 32.65 33.35 ... 82.35 82.65 83.35
    lon_bnds     (x, bnds) float64 346.6 347.4 347.6 348.4 ... 44.35 44.65 45.35
    wnd10m       (time, y, x) float32 dask.array<chunksize=(100, 52, 59), meta=np.ndarray>
    influx       (time, y, x) float32 dask.array<chunksize=(100, 52, 59), meta=np.ndarray>
    outflux      (time, y, x) float32 dask.array<chunksize=(100, 52, 59), meta=np.ndarray>
    temperature  (time, y, x) float32 dask.array<chunksize=(100, 52, 59), meta=np.ndarray>
    runoff       (time, y, x) float32 dask.array<chunksize=(100, 52, 59), meta=np.ndarray>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather question whether we want to carry these variables along? If not needed, I'd argue to delete them before (in get_data) since they not requested from the feature list anyway

Comment on lines 162 to 163
CMIP sometimes specify the time in the center of the output intervall this shifted
to the beginning.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather shift it to the end of the period, see https://atlite.readthedocs.io/en/master/conventions.html#time-points

atlite/datasets/cmip.py Show resolved Hide resolved
cutout : atlite.Cutout
feature : str
Name of the feature data to retrieve. Must be in
`atlite.datasets.era5.features`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be cmip features

@@ -126,6 +126,14 @@ def __init__(self, path, **cutoutparams):
gebco_path: str
Path to find the gebco netcdf file. Only necessary when including
the gebco module.
esgf_params: dict
Parameters to be used in search on the ESGF database.
model: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep that we should rename this keyword argument into something more specific, like esgf_model

@FabianHofmann
Copy link
Contributor

Reuse compliance requires a comment with the license at the beginning of each new file (can just be copied from the other .py/.yaml files)
Could you also merge the up-to-date master so that tests run through?

@Ovewh
Copy link
Contributor Author

Ovewh commented Jun 29, 2021

@FabianHofmann Something I would like your thoughts on. As I mentioned the surface roughness isn't available from CMIP, I have yet to address this issue in my code. My idea is to just have a keyword argument path with some external roughness dataset, however I could also make atlite prepare a static roughness dataset from ERA5? Atleast requiring the path to roughness dataset to be provided during the creation of the cutout would avoid any confusion of the data source.

@FabianHofmann
Copy link
Contributor

@Ovewh for retrieving features from other sources one has to add the module to the cutout.
But: The surface roughness is only used for extrapolating the wind speed to the turbine hub height. But the ESGF can retrieve wind speed at arbitrary heights right?

@Ovewh
Copy link
Contributor Author

Ovewh commented Jul 5, 2021

@Ovewh for retrieving features from other sources one has to add the module to the cutout.
But: The surface roughness is only used for extrapolating the wind speed to the turbine hub height. But the ESGF can retrieve wind speed at arbitrary heights right?

No, only a few models provide wind speed at 100m, most models provide only the surface wind speed. So the windspeed has to be extrapolated.

Ovewh and others added 5 commits July 7, 2021 12:31
Some models have the output stored in files containing 10-15 years of data rather than one year. In addition not the time index isn't either always the same this has been fixed by converting dataset following cftime to numpy datetime64
@FabianHofmann
Copy link
Contributor

FabianHofmann commented Jul 8, 2021

No, only a few models provide wind speed at 100m, most models provide only the surface wind speed. So the windspeed has to be extrapolated.

Okay then the roughness data has to come from the era5 dataset. atlite allows to mix datasources. So the best way would be to retrieve all variable from ESGF and fill up with era5 data which is principally done with

cutout = atlite.Cutout('my_cutout', module=['esgf', 'era5'], time=....)

Then it will retrieve all availabe features from esgf and the fill up missing variables (in that case the roughness data) from era5. Could you try that out?

@Ovewh
Copy link
Contributor Author

Ovewh commented Jul 8, 2021

No, only a few models provide wind speed at 100m, most models provide only the surface wind speed. So the windspeed has to be extrapolated.

Okay then the roughness data has to come from the era5 dataset. atlite allows to mix datasources. So the best way would be to retrieve all variable from ESGF and fill up with era5 data which is principally done with

cutout = atlite.Cutout('my_cutout', module=['esgf', 'era5'], time=....)

Then it will retrieve all availabe features from esgf and the fill up missing variables (in that case the roughness data) from era5. Could you try that out?

@FabianHofmann Yes ,so the issue is that CMIP contains future climate projections, and ERA5 is a reanalysis. It only makes sense to take the averaged roughness from ERA5, either based on one year or a single month. I did some sensitivity tests calculating capacity factors using constant and forecasted roughness for ERA5. There where only a slight difference in the offshore capacities.

@FabianHofmann
Copy link
Contributor

Let's also have a look at https://py-cordex.readthedocs.io/en/stable/index.html

@Ovewh
Copy link
Contributor Author

Ovewh commented Dec 10, 2021

@FabianHofmann It doesn't look like py-cordex have an interface for downloading data, but I did not work with the CORDEX data.

The first attempt on creating a CMIP interface I made turned out to be bit of a dead end. Integrating downloading of the CMIP data directly in atlite did not work out that well, since the CMIP datafiles are formated slightly different from model to model (e.g. some models provide yearly files or 10 years in one file, and then the models also use different calendars). It is probably simpler and more robust to make a very general interface for sideloading locally stored climate and weather data into atlite. Then it would be up to the user to preprocess the data into a format that atlite can understand.

@euronion
Copy link
Collaborator

It is probably simpler and more robust to make a very general interface for sideloading locally stored climate and weather data into atlite. Then it would be up to the user to preprocess the data into a format that atlite can understand.

Interesting! This would be similar to what we have for the SARAH2 dataset (cutout(...) get's called with an additional argument sarah_dir pointing to a local directory containing the manually downlaoded SARAH2 data due to a lack of API.

(Just a comment)

Outsider question (I'm not familiar with CMIP/COREDEX datasets): Is there like a central repository from which one can manually downloaded the data?

@Ovewh
Copy link
Contributor Author

Ovewh commented Dec 12, 2021

Interesting! This would be similar to what we have for the SARAH2 dataset (cutout(...) get's called with an additional argument sarah_dir pointing to a local directory containing the manually downlaoded SARAH2 data due to a lack of API.

Yes, that's my idea, though perhaps even more general, instead of path, it would be a xarray.Dataset.

Outsider question (I'm not familiar with CMIP/COREDEX datasets): Is there like a central repository from which one can manually downloaded the data?

Yes, all the CMIP6/CORDEX data is stored at ESGF data nodes. It provides different ways of downloading the data e..g. OPeNDAP and wget scripts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants