Add units and description to output netcdf files #232
Conversation
Thanks for starting on this @micahkim23. I have a few initial inline style comments as well as one general question to discuss with @spencerahill.

I think it might make things easier to test if you made these methods independent of the `Calc` class (i.e. methods that don't take `self` as the first argument), passing `units`, `description`, and `dtype_out_vert` as additional arguments:

```python
def add_units_and_description_attrs(data, units, description, dtype_out_vert):
    ...
```

@spencerahill perhaps we could consider moving these methods to the `utils/io.py` module? Or maybe create a new `utils/metadata.py` module? They seem general and simple enough that they don't need to be part of `Calc`.
aospy/calc.py (Outdated)

```python
@@ -767,3 +767,19 @@ def load(self, dtype_out_time, dtype_out_vert=False, region=False,
        if plot_units:
            data = self.var.to_plot_units(data, dtype_vert=dtype_out_vert)
        return data

    def addAttrs(self, data):
        if (isinstance(data, xr.DataArray)):
```
Remove the outer set of parentheses (e.g. only `isinstance(data, xr.DataArray)` is needed).
aospy/calc.py (Outdated)

```python
        for name, da in data.data_vars.items():
            self.addAttrsDataArray(da)

    def addAttrsDataArray(self, data):
```
Similar to the above comment regarding function naming, I recommend calling this one `add_units_and_description_attrs_da`.
aospy/calc.py (Outdated)

```python
        else:
            data.attrs['units'] = self.var.units
        if self.var.description != '':
            data.attrs['description'] = self.var.description
```
nit: make sure to add a newline at the end of the file (that's what the red "no" symbol at the end of this line on GitHub signifies).
I'm actually inclined to go ahead and write the description to the file even if it's blank. That makes the logic simpler and also makes the netCDF files outputted by aospy more homogeneous. I.e. once this is merged, users can always rely on there being a 'description' attr.
aospy/calc.py (Outdated)

```python
    def addAttrsDataArray(self, data):
        if self.var.units != '':
            if self.var.dtype_out_vert == 'vert_int':
                data.attrs['units'] = '(vertical integral of {0}): {0} kg m^-2)'.format(self.var.units)
```
I think this line is longer than 79 characters; one way to break it up would be to define the string first in a separate variable, then format and assign to `data.attrs['units']` on a separate line:

```python
units = '(Vertical integral of {0}): {0} kg m^-2'
data.attrs['units'] = units.format(self.var.units)
```
Similar to my comment below re: always adding a description even if empty, I think we should do the same thing for units.
One question, though, is what units to use for vertical integrals when the original units are empty. As written, we don't include them. My initial thought is, in that case, to use e.g. `(Vertical integral of quantity with unspecified units)`.
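One way that fallback could be implemented, sketched as a standalone helper (the function name is hypothetical and the exact wording of the strings is an assumption based on the suggestions in this thread):

```python
def format_units_attr(units, dtype_out_vert):
    """Return a units string to write to the output file; never empty."""
    if dtype_out_vert == 'vert_int':
        if units:
            return '(vertical integral of {0}): {0} kg m^-2'.format(units)
        # Original units empty: still record that this is a vertical
        # integral, per the suggestion above.
        return '(vertical integral of quantity with unspecified units)'
    return units
```

With this shape, every output file gets a 'units' attr regardless of whether the `Var` defined one.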
aospy/calc.py (Outdated)

```python
@@ -767,3 +767,19 @@ def load(self, dtype_out_time, dtype_out_vert=False, region=False,
        if plot_units:
            data = self.var.to_plot_units(data, dtype_vert=dtype_out_vert)
        return data

    def addAttrs(self, data):
```
Note that PEP8 suggests using function names that are all lowercase, with words separated by underscores. For this function I would recommend using something more descriptive like `add_units_and_description_attrs`.
In terms of testing, should I test the individual functions themselves and also test that the output netcdf files are getting the attributes?

@micahkim23 exactly, I would do both. If you go the route of writing the functions to be independent of `Calc`, they can be tested directly. For the output files, see the existing tests in aospy/test/test_calc_basic.py (lines 32 to 39 in 1a39157): after the `assert isfile` statements I would open up the `calc.path_out['av']` file and check that the attributes were added to it.

It might also be good to do the same thing for a regional calculation like the following one (since the logic for saving those is slightly different): aospy/test/test_calc_basic.py (lines 86 to 94 in 1a39157).
I've added a few more things in addition to @spencerkclark's, all of which I agree with.
Also, I'd like to highlight this functionality in our docs beyond a What's New entry. Specifically, let's modify our Examples page to demonstrate it, e.g. in this section.
aospy/calc.py (Outdated)

```diff
@@ -601,8 +602,7 @@ def _save_files(self, data, dtype_out_time):
             reg_data = xr.open_dataset(path)
         except (EOFError, RuntimeError, IOError):
             reg_data = xr.Dataset()
-        # Add the new data to the dictionary or Dataset.
-        # Same method works for both.
+        # Add the new data to the Dataset.
```
Actually, this logic is now self-explanatory, so I'd just omit the comment altogether.
I agree that they shouldn't be part of `Calc`.

I still have to finish testing the functions and editing the Examples page.

My thinking was just that

@micahkim23 this still means that these methods should be defined outside of the `Calc` class.
aospy/test/test_calc_basic.py (Outdated)

```python
@@ -37,6 +39,10 @@ def test_annual_mean(self):
        calc.compute()
        assert isfile(calc.path_out['av'])
        assert isfile(calc.path_tar_out)
        data = xr.open_mfdataset(calc.path_out['av'], decode_times=False)
        for name, da in data.data_vars.items():
            assert 'units' in da.attrs
```
Instead of testing just that the attribute was added, I recommend testing that the value of the attribute is what you expect. This simultaneously tests both that the attribute is in the dictionary and that the proper value was encoded.

I think you can pick off the expected values of the units and description attributes from the `Var` object associated with the test:

```python
expected_units = calc.var.units
expected_description = calc.var.description
```
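The resulting checks might then read along these lines (a sketch only; a plain dict stands in for a DataArray's `.attrs` mapping so the example runs standalone):

```python
def check_attr_values(attrs, expected_units, expected_description):
    # Compare values, not just key presence: a missing attribute raises
    # KeyError and a wrongly encoded one fails the equality assertion,
    # so both problems are caught in a single pass.
    assert attrs['units'] == expected_units
    assert attrs['description'] == expected_description


attrs = {'units': 'K', 'description': 'air temperature'}
check_attr_values(attrs, 'K', 'air temperature')
```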
aospy/test/test_calc_basic.py (Outdated)

```python
@@ -37,6 +39,10 @@ def test_annual_mean(self):
        calc.compute()
        assert isfile(calc.path_out['av'])
        assert isfile(calc.path_tar_out)
        data = xr.open_mfdataset(calc.path_out['av'], decode_times=False)
```
I think `xr.open_dataset` should be all that is needed here. I suspect `decode_times=False` is not needed either.
@micahkim23 we are almost there! This should be my last round of review.

A couple cleanup things; also, your new tests sparked me to come up with a way to clean up all of the tests in `TestCalcBasic`.
aospy/calc.py (Outdated)

```python
@@ -767,3 +768,23 @@ def load(self, dtype_out_time, dtype_out_vert=False, region=False,
        if plot_units:
            data = self.var.to_plot_units(data, dtype_vert=dtype_out_vert)
        return data

def _add_units_and_description_attrs(data, units, description, dtype_out_vert):
```
Now that I see this, the function name seems clunky. How about `_add_metadata_as_attrs` instead? That is less of a mouthful and also leaves the door open to us adding additional attrs within the function in the future, without renaming.
aospy/calc.py (Outdated)

```python
                                     dtype_out_vert)

def _add_units_and_description_attrs_da(data, units, description,
                                        dtype_out_vert):
```
Both of these helper functions should have a one-line docstring. (Since they're private API and pretty self explanatory, a full one listing parameters, return value, etc., isn't necessary.)
aospy/calc.py (Outdated)

```python
def _add_units_and_description_attrs_da(data, units, description,
                                        dtype_out_vert):
    units = units
```
This line doesn't do anything; it's assigning something to itself. So it can be removed.
aospy/test/test_calc_basic.py (Outdated)

```python
@@ -143,5 +146,55 @@ def setUp(self):
        'dtype_out_vert': 'vert_int'
    }

class TestCalcAttrs(unittest.TestCase):
```
Sorry if I didn't mention this before, but we are transitioning away from the class-based, `unittest.TestCase` style of tests in favor of `pytest`'s more function-based tests. The latter requires less boilerplate and provides additional functionality.

So all you have to do here is delete this `TestCalcAttrs` class and turn its `test_units_description_attrs` method into a module-level function.

When you do that, I think it will flow better if you place it after your helper functions, rather than before.
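For illustration, the conversion might look roughly like this (a sketch with stand-in logic in the body; the point is only the shape — pytest collects module-level functions whose names start with `test_`, with no class or `setUp` boilerplate):

```python
# Before (class-based):
#
#     class TestCalcAttrs(unittest.TestCase):
#         def test_units_description_attrs(self):
#             ...
#
# After: a plain module-level function that pytest discovers directly.
def test_units_description_attrs():
    attrs = {}
    # Stand-in for the real attr-setting helper under test.
    attrs['units'] = 'K'
    attrs['description'] = 'air temperature'
    assert attrs['units'] == 'K'
    assert attrs['description'] == 'air temperature'


# The function is also just a callable, so it runs without a test runner.
test_units_description_attrs()
```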
aospy/test/test_calc_basic.py (Outdated)

```python
_add_units_and_description_attrs(ds, units, description, dtype_out_vert)
assert expected_units == da.attrs['units']
assert expected_description == da.attrs['description']
for name, d_arr in ds.data_vars.items():
```
The convention is to use the name `arr` for `xr.DataArray`s.
aospy/test/test_calc_basic.py (Outdated)

```python
    assert expected_units == d_arr.attrs['units']
    assert expected_description == d_arr.attrs['description']

def _test_output_attrs(calc, file):
```
Similar to the above comment, I would move this above the `TestCalcBasic` definition, since it's used within that class.
aospy/test/test_calc_basic.py (Outdated)

```python
@@ -37,6 +39,7 @@ def test_annual_mean(self):
        calc.compute()
        assert isfile(calc.path_out['av'])
        assert isfile(calc.path_tar_out)
        _test_output_attrs(calc, 'av')
```
I would like to add this test to all of the cases. In fact, we should bundle it along with the two `assert` statements that appear in all of these. E.g. define at the module level:

```python
def _test_files_and_attrs(calc, dtype_out):
    assert isfile(calc.path_out[dtype_out])
    assert isfile(calc.path_tar_out)
    _test_output_attrs(calc, dtype_out)
```

And then replace the two assert statements (and `_test_output_attrs` where you added it) in each of these tests with this single function call.
Woops, two more things:
Thanks @micahkim23 -- just a couple more things from me as well!
aospy/test/test_calc_basic.py (Outdated)

```python
@@ -143,5 +146,55 @@ def setUp(self):
        'dtype_out_vert': 'vert_int'
    }

class TestCalcAttrs(unittest.TestCase):
    def test_units_description_attrs(self):
```
Building off of @spencerahill's comment regarding using `pytest`, this is a classic use case for `pytest.mark.parametrize`, so let's make sure to use that too. See here for an example use in our code base, aospy/test/test_automate.py (lines 144 to 150 in 2ef0b9a):

```python
@pytest.mark.parametrize(
    ('type_', 'expected'),
    [(Var, [condensation_rain, convection_rain, precip, ps, sphum]),
     (Proj, [example_proj])])
def test_get_all_objs_of_type(obj_lib, type_, expected):
    actual = _get_all_objs_of_type(type_, obj_lib)
    assert set(expected) == set(actual)
```
The benefit of using `pytest.mark.parametrize` is that it splits up this big group of tests (where, if one case fails, the entire test function fails and the remaining cases don't get run) into individual tests, each of which is run separately. This gives us a more complete picture for debugging if something goes wrong.
that's cool. I'll use that.
aospy/calc.py (Outdated)

```python
@@ -587,6 +587,9 @@ def compute(self, write_to_tar=True):
        reduced = self._apply_all_time_reductions(full, monthly, eddy)
        logging.info("Writing desired gridded outputs to disk.")
        for dtype_time, data in reduced.items():
            _add_units_and_description_attrs(data, self.var.units,
```
Maybe it's just me, but for some reason the fact that this function doesn't return anything (and just mutates the input xarray object) feels odd. It would feel more readable to me if this line were something along the lines of:

```python
data = _add_units_and_description_attrs(data, self.var.units,
                                        self.var.description,
                                        self.dtype_out_vert)
```

So I might consider re-writing the functions such that this would be possible (i.e. add `return` statements where appropriate).
Weird, for the Travis build, the error is:

It has problems with …
Indeed this is a puzzling issue. Things seem to work fine in the Python 3.4 environment, which uses an earlier version of netCDF4. @micahkim23 which version of netCDF4 are you using?
The Windows test failures for Python 3.4 on AppVeyor seem different:

To address those I might try closing the Dataset opened in the new tests, i.e. adding `data.close()`.
Good call. Even better and more idiomatic would be to use a context manager, i.e.

```python
with xr.open_dataset(calc.path_out[dtype_out]) as data:
    ...
```

The file is then automatically closed once the context manager block is finished, even in the case of crashes.
@micahkim23 FYI I swear there aren't usually this many annoying problems with the CI! Usually they just work.
Micah and I confirmed offline that it is indeed the netCDF4 version causing the failures. So a viable solution that I can live with is to pin it in all of the environments. @spencerkclark does that sound alright to you?
We also confirmed that the …
Hmm... I tried reproducing the issue locally, but I wasn't able to. What platform are you trying this on? I've tried on both my Mac and GFDL's Linux systems. Locally, is there any kind of debugging you can do to determine the root of the issue?

In looking at prior issues citing this error message (e.g. Unidata/netcdf4-python#143) it seems a common thread is how the filepath is encoded. There was a recent PR in netcdf4-python that touched this part of their code base, Unidata/netcdf4-python#693 (and as far as I can tell, version 1.3.1 is the only version on conda-forge that contains this change). Maybe what we're seeing here is related to those changes? (I guess I'm a little hesitant to pin …)
This was on my Mac. Note that the errors are occurring only on the new tests by @micahkim23. See this branch: https://github.com/spencerahill/aospy/tree/micahkim23-attrs, and in particular its Travis log. We pinned netcdf to 1.2.7 only in the 3.6 environment, and the tests pass there but not in 3.5, where it's still 1.3.1.

Good point. Also, notice the …

Is that related?

This is a totally valid point. Worth digging into a little more in order to avoid pinning if possible.
FYI re: the …

Just to confirm, so you got the tests to fail on your Mac using …?
Huh, that's odd. Yes, 1.3.1:
@spencerahill many thanks for that detail! Based on my experience, I think now it's actually libnetcdf that's the problem. I was still on …

It occurred to me that, in lieu of finding the root of the issue, we could (just for these tests) open the files using …
In doing a little more digging, I don't think it has anything to do with the filenames. I have issues opening the files the test suite creates using …

With …
Given their dependency setup, it appears that xarray's CI has not been hit by …
@spencerkclark nice work! Yes, I'd say you've pinpointed libnetcdf as the problem. I will pass word to xarray and netCDF4. In terms of the CI, do you think we should pin the libnetcdf version, or live with the test failures for now, or...?
Thanks for doing that; we'll see what they suspect the problem might be.

Hopefully the issues you've posted will spark discussion upstream, so I'd be fine pinning libnetcdf for now.
Sounds good to me. So @micahkim23, please walk back our pinning of netCDF4 and instead pin libnetcdf, in all our test environments.
@spencerahill @micahkim23 based on discussion upstream, I think a (perhaps better?) alternative fix would be to save all our files using the 'netcdf4' engine going forward (rather than the 'scipy' engine). That seems to be where the issues stem from. Due to #69 (i.e. the …)

So the suggestion is to change aospy/calc.py line 612 (in 2ef0b9a) to:

```python
data_out.to_netcdf(path, engine='netcdf4', format='NETCDF3_64BIT')
```

Maybe we should have been doing that all along...
@spencerkclark thanks for engaging with the netcdf folks on this. I agree with your suggestion. Much better than pinning the libnetcdf version for our tests. Once we do this, scipy is no longer a dependency, so when we implement this we should also remove scipy from all our environments, and I'll go ahead and submit a PR on this shortly.
@micahkim23 this means that you can now disregard my previous comment about pinning libnetcdf (or netCDF4... no pinning required).
in …

Woops! Nice catch. One final request on this: can you add a sentence immediately after that …

@spencerkclark, if you wouldn't mind doing one last review also, then I'll merge.
@micahkim23 one final minor thing and then I think we're good to merge! Thanks for your patience as we sorted out what was going on with the CI.
docs/examples.rst (Outdated)

```
Notice that the variable's name and description have been copied
to the resulting Dataset (and hence also to the netCDF file saved
to disk). This enables you to better understand what the physical
quantity is, even if you don't have the original Var definition
```
Make `Var` a literal here (i.e. surround it by double backticks).
docs/examples.rst (Outdated)

```diff
@@ -404,7 +404,7 @@ and the results of each output type
 Notice that the variable's name and description have been copied
 to the resulting Dataset (and hence also to the netCDF file saved
 to disk). This enables you to better understand what the physical
-quantity is, even if you don't have the original Var definition
+quantity is, even if you don't have the original `Var` definition
```
Double backticks :) i.e. ``Var``
@spencerahill feel free to merge when you're ready!

In it goes! Thanks much @micahkim23
Closes #201