-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use coordinates directly from xarray.DataArrays, not from parent Model #14
Comments
Now that we have migrated aospy to being largely xarray-based, this has been mostly resolved, in that the functions are now passed Some functions and variables remain, however, that aren't defined in their own netCDF files but reasonably could be needed by themselves. Currently these are still pulled from the model. This aspect still has room for improvement/refactoring. |
One potential solution, although it could be heavy-handed, is to copy all of the grid fields into each DataArray when they are loaded from disk. So, for example, a t_surf DataArray, which is defined only in lat, lon, and time, would also get However, at least the last two aren't coordinates but data arrays themselves, meaning that we would need to pass in Datasets, not DataArrays, with them included among the variables. So still need to think this one through. |
This might actually work better than you think; multi-dimensional coordinates are supported within xray (so Multi-dimensional coordinates are supported within xray; the catch is that to include them in a DataArray as coordinates, they must be related to the existing dimensions (that is the dimensions that the variable described by the DataArray lies upon). What this means is that the likes of This is what it looks like raw upon being read in as a Dataset:
I can now set
If I look to pick off the
If I pick off
I think the question we may need to ask is: are there frequently scenarios when a 2D variable needs access to the level dimensions (when NOT already within the context of a 3D variable which already does)? The only case I can think of is in computing pressure on sigma levels -- if this is the only case perhaps we should just continue to handle pressure as a special case. On the flip-side, is there ever a case where a variable defined only in the vertical needs access to 2D grid information? The only variables that I can think of that are only defined in the vertical are For that reason I see no harm in adding all of these "grid" variables (including surface area as a function of latitude and longitude) as coordinates to the Datasets1 we read in, and then simply letting xray determine which ones to pass on as coordinates when we isolate them as DataArrays in passing them to our calculation functions. Under this system, any variable that is defined in lat-lon would automatically get the spatial grid attributes ( You could accomplish this within def _create_input_data_obj(self, var, start_date=False,
end_date=False, n=0, set_dt=False):
"""Create xray.DataArray for the Var from files on disk."""
paths = self._get_input_data_paths(var, start_date, end_date, n)
ds_chunks = []
for file_ in paths:
test = xray.open_dataset(file_, decode_cf=False,
drop_variables=['time_bounds', 'nv',
'average_T1',
'average_T2'])
if start_date.year < 1678:
for v in ['time']:
test[v].attrs['units'] = ('days since 1900-01-01 '
'00:00:00')
test['time'].attrs['calendar'] = 'noleap'
test = xray.decode_cf(test)
ds_chunks.append(test)
ds = xray.concat(ds_chunks, dim='time')
##### Add multi-dimensional coordinate information to Dataset here. #####
for name in var.names:
try:
##### This stays the same; let xray self-select relevant multi-dimensional grid attributes. #####
arr = ds[name]
except KeyError:
pass
else:
break
else:
raise KeyError('Variable not found: {}'.format(var)) 1Note that you can add all the coordinates you want to a Dataset -- as opposed to a DataArray, for a Dataset there is no stipulation that they must be related to a variable (take |
This looks really solid overall. Regarding the special case of pressure, what about (in the case of data on model coordinates), pre-computing the full 4D pressure array, and then tacking it on as a coordinate defined in (time, pfull, lat, lon)? In that case it would persist into the DataArrays, right? (Could we even potentially re-assign the variable's coordinates to include it? This sounds more tricky. But at least for the functions I use (in particular finite differencing ones), there would be upside.) At the same time, this basically doubles the amount of data in memory, since the pressure array is the same shape as the variable array. This is non-negligible for high frequency output. |
Including pressure as a coordinate (regardless of whether sigma or interpolated) would help simplify things a bunch. It's a bit confusing to me how currently In contrast to horizontal coordinates where we have a choice, we must allow different vertical coordinates from Run to Run from the same Model. This is especially true for As such, to remove the possibility of using the wrong pressure levels, I think in handling pressure we should allow The memory issue is an important consideration; however, if we eventually move to passing Datasets as arguments to functions rather than DataArrays then we could pass just one copy of pressure at a time. That said, if a calculation doesn't use pressure (like if it's just averaging a certain quantity that is defined in 4D), then it would be a bit of a waste. |
One other thing to note: beyond memory at runtime, without some logic to strip off secondary grid-attributes before saving output, these extra coordinates will be written out to .nc files. In the case of the 1D and 2D attributes this isn't a big deal, but like you mention, if we add sigma pressure to the mix, then it becomes non-negligiable. I don't think it would be too hard to add such logic, but I just want to bring it up here so we don't forget. |
I agree, the 'p' and 'dp' strings are unnecessary. We can just flag them as special using We should allow for different horizontal grids within a single model: in the GFDL ana4mips repo, at least one of the products (MERRA, I believe), outputs vertically defined data on a coarser lat-lon grid and non-vertically defined data on a more refined lat-lon grid. But these still in my mind should fall under the umbrella of a single Model and Run. So, in my mind, But even for those two, at least for the full GCMs (let me know if the idealized models differ), each run has an At the same time, we should retain the "chain-of-responsibility" framework that already exists for some attributes (via the Any function (at least of mine) that requires knowledge of the pressure values explicitly passes it in as an argument, either Good point re: writing extra coordinates to disk. For now let's not worry about it, since the current user base (=you and me) has little storage limitation, and unless lat-lon |
I understand completely. If we go this route (which I'm fully on board with) what is the purpose of a Model object? Might we be able to phase it out? Once you allow grid attributes to be defined at the run level, you could accomplish this "grid inheritance" feature of the Model object by just having an optional argument within a Run constructor along the lines of "use grid from 'pre-existing Run name'" (if you didn't have a Run to inherit from, you'd provide the paths to all the grid files, like you do currently in the Model object). You could have a similar argument for inheriting the What else does the Model object do besides aggregate grid information (and group Runs to some extent)? |
I would like to retain the Model class. Even if most of the grid specifications can be made at the Run level, the Model level is still useful.
I like a lot the idea of inheriting grid information from another object. Maybe it doesn't even require a separate argument; if the method gets passed a Proj/Model/Run for that argument, then use that object's value for that argument. |
That makes sense -- thanks for bringing those aspects up! That's an interesting thought about passing objects directly as arguments. Ultimately (in continuing with what's already done aospy) I think we should strive to code things in a way such that the user has to input as little repeat information as possible (since often in experiments the grids and time periods are the same across Runs); if you think that's the most intuitive way to do that for non-default Model grids and time periods, I'm all for it. |
Currently, each
Model
object is instantiated with paths to various netCDF files which are used to supply basic information about its grid: latitude, longitude, etc. Some physical calculations require this grid information in addition to physical variables themselves. However, the values of these grid properties are not necessarily unique to aModel
: simulations run in the same model may be output to different grids (horizontal or vertical), or changes in software may make, e.g. the latitude values off on the 5th digit, causing xray to throw an error (the last example comes from @spencerkclark ).The netCDF files from which data is loaded all contain the coordinate data locally already, and so do the xray.Dataset and/or DataArray objects derived from those files. So there is no reason to always pull from the Model's grid attributes -- they should be pulled from the local variable itself.
As a starting point,
Calc
should just grab the needed coordinate/grid values from the 1st variable that it loads in that has those values. Pressure values and vertical pressure thicknesses will likely need to be handled as their own unique case, which is what's done already anyways.The text was updated successfully, but these errors were encountered: