Use Case Notebook for "Atmospheric Moisture Budgets" #1

Closed · 5 tasks · rabernat opened this issue Aug 30, 2017 · 54 comments

@rabernat
We need a preliminary notebook that uses Xarray to calculate something related to this use case. It doesn't have to be super long or complicated. It just has to express a real-world, scientifically relevant calculation on a large dataset.

Requirements:

  • Carefully explain the science behind the calculation in words and equations.
  • Clearly document the input dataset (preferably a standard product like ERA-Interim or the NCAR Large Ensemble)
  • Use Xarray to load the dataset and, as much as possible, to perform the calculations
  • Where Xarray can't accomplish the necessary tasks, clearly document the other functions and/or outside packages that need to be employed
  • If the analysis has been implemented in other languages (e.g., MATLAB or Ingrid), provide a link to the code and, if possible, the output (which can be used as validation)

cc: @naomi-henderson

I would like to use this as a template for the other three use cases, so I would appreciate general feedback on the checklist above. Are these the right requirements?

@rabernat

We should add something about setting up the environment and enumerating the dependencies.

Here is my guide to setting up a conda environment for xarray work:
https://medium.com/@rabernat/custom-conda-environments-for-data-science-on-hpc-clusters-32d58c63aa95

But @darothen's "getting started" guide for gcpy is better and more comprehensive:
http://www.danielrothenberg.com/gcpy/getting_started.html

@rabernat

You also asked about the gist extension:
https://github.com/mozilla/jupyter-notebook-gist

@darothen

Please feel free to cannibalize my guide. If I can help in any way, do let me know.

@spencerahill

Curious: what motivated this use case (and "the other three")? Just something that satisfies your constraints above?

I ask because I have worked on the problem of closing atmospheric column tracer budgets computed post-hoc. Depending on how concerned you are with closure, this can be pretty involved. The Appendices of our recent paper touch on this: http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-16-0785.1.

But I suspect this is more detail than you are looking for? If not, I can provide the Python functions that I defined for these efforts (they relied on my infinite-diff package and were computed via an aospy pipeline).

@rabernat

@spencerahill: hopefully the project description I published earlier today clarifies what I meant by "use cases": these are the official use cases from the NSF proposal. However, we are certainly in need of as many use cases as possible and welcome contributions from everyone. If you have use cases you want to contribute, I would suggest submitting a PR to the pangeo-tutorials repo (currently empty, but hopefully it will fill up with cool stuff!).

@rabernat commented Oct 4, 2017

Just a heads-up that the lack of any use case notebooks is now blocking progress on our project. It would be great to get some basic use cases checked in. They don't have to be fancy! Any sort of semi-realistic workflow on real data is enough for the systems people to move forward with analyzing performance.

@naomi-henderson

The calculation of the eddy moisture transports is the most I/O-intensive part of the budgets, so I am developing a preliminary notebook which computes the monthly mean covariances (e.g., u'q') from 6-hourly data. I will get the notebook working on cheyenne for the CMIP5 version of CESM. Once it is working, I will submit it to the group for critique before cleaning it up. Then I will start working on the vertical integrals, etc. Sound OK?
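For reference, the covariance being described reduces to "monthly mean of the product minus the product of the monthly means." A minimal sketch in xarray, not the actual notebook: the variable names ('ua', 'hus') and the file pattern are illustrative, and it uses the xarray 0.9-era resample API that appears later in this thread.

import xarray as xr

# lazily open a year of hypothetical 6-hourly model output
ds = xr.open_mfdataset('ua_hus_6hr_*1960*.nc')

# monthly means of u, q, and their product (how='mean' is the default)
u_bar = ds['ua'].resample(freq='M', dim='time')
q_bar = ds['hus'].resample(freq='M', dim='time')
uq_bar = (ds['ua'] * ds['hus']).resample(freq='M', dim='time')

# eddy covariance u'q': mean(uq) minus the product of the means
uq_eddy = uq_bar - u_bar * q_bar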

@mrocklin commented Oct 5, 2017

@naomi-henderson, if you run into performance issues (like I/O bottlenecks) and want to try scaling things, let me know. I'd be happy to collaborate to try to accelerate things.

@rabernat commented Oct 5, 2017

@naomi-henderson your plan sounds perfect

Also feel free to ping us for help with any xarray or general python questions you may encounter while developing your notebook. I know you're still spinning up on these tools, so we are here to help with that too.

@naomi-henderson

@rabernat - I have made a preliminary version of an eddy covariance notebook, and it works fine, albeit slowly, on my local machine (one year requires processing 24 GB). I am having a few issues with the time grids, groupby, etc., as you can see here: covar notebook. I may have wandered off into the deep end ...

@rabernat

Awesome @naomi-henderson! This looks like a great start. I'm very pleased you have gotten your calculation to a preliminary working version.

I would love to sit with you for an hour on Wednesday and go through this in detail. It should be possible to eliminate the loops over files. Dask is unable to parallelize such operations, and they can also be confusing to read. The idea is to first create one big xarray dataset with everything in it (although not in memory) and then perform the desired grouping / aggregation on it using xarray's built-in capabilities. You should never have to manually accumulate an average, as you do in your notebook, although this is a very natural place to start for someone coming from MATLAB or another lower-level language!
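A minimal sketch of that pattern (the file glob is illustrative, not the actual notebook):

import xarray as xr

# one big lazy dataset instead of a manual loop over files
ds = xr.open_mfdataset('output/*/ua_6hr_*.nc')

# let xarray do the grouping and aggregation; no manual accumulation needed
monthly_climatology = ds.groupby('time.month').mean('time')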

I'm free all day on Wednesday, so please suggest a time that works for you!

@mrocklin

It might be useful to keep this notebook around to show before-and-after examples. This Rosetta-stone-style comparison might help other scientists more accustomed to C++/Fortran-style development understand XArray-style development, and vice versa.

@mrocklin

We only have a few opportunities for this kind of comparison before everyone gets exposed to both types.

@naomi-henderson

Thanks @rabernat, I was hoping you would have time to help me with this! Does 10am on Wed work for you?

@rabernat commented Oct 10, 2017 via email

@naomi-henderson

Thanks to @rabernat this morning, I have a new version of the preliminary eddy covariance notebook: New version, no loops. I love the suggestion by @mrocklin to keep my preliminary notebooks around to help others migrate to the xarray/dask world!

I hope to check it out on cheyenne (using the 'export LD_LIBRARY_PATH=' temporary fix) and perhaps even use the dask dashboard at some point.

My next step is to make a new notebook which does the vertical interpolation of the monthly data from sigma levels to pressure levels. I know that there are python scripts to do this on the command line, file by file, but we need something that works on the whole xarray Dataset. Expect many questions.
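One possible shape for that interpolation is a column-wise np.interp wrapped with xr.apply_ufunc so it applies across the whole Dataset. This is only a sketch under assumptions: the names ds['ua'] and ds['pres'] (the 3D pressure computed from the sigma coefficients), the 'lev' dimension, and the target levels are all hypothetical, and apply_ufunc requires a recent xarray (for dask-backed data it would also need dask='parallelized').

import numpy as np
import xarray as xr

def column_to_pressure(data, pres, plevs):
    # np.interp expects increasing x; model levels often run top-down,
    # so sort each column by pressure first
    order = np.argsort(pres)
    return np.interp(plevs, pres[order], data[order])

plevs = np.array([100000., 85000., 50000., 25000.])  # target pressures [Pa]

# vectorize the 1-D interpolation over every (time, lat, lon) column
ua_on_p = xr.apply_ufunc(
    column_to_pressure, ds['ua'], ds['pres'],
    input_core_dims=[['lev'], ['lev']],
    output_core_dims=[['plev']],
    kwargs={'plevs': plevs},
    vectorize=True,
).assign_coords(plev=plevs)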

@mrocklin

Cool-looking notebook, @naomi-henderson! Makes me wish I understood what you all were doing :)

Do you have any thoughts about the experience? Both on what felt pleasant and on what felt frustrating at first? It would be useful to understand your thought process as you're going through this.

@mrocklin

(only if you feel like sharing, didn't mean to put you on the spot)

@naomi-henderson

I forgot to mention the (many) reasons why the eddy covariance notebook is still preliminary. Although it works for the NCAR/CCSM4 model, it will not work for the models with multiple ensemble members. This will require adding a new 'member' dimension.

The notebook also needs various checks for the existence of the correct variables as well as debugging for all possible calendars (leap, no leap, 360 days/year, etc).

@rabernat commented Oct 11, 2017

> Makes me wish I understood what you all were doing :)

@mrocklin: the markdown cell at the end of the notebook gives a description of the calculation and data used. But it probably still assumes too much domain knowledge and uses a lot of jargon. Let us know how we can improve this to make it more understandable to a wider audience.

@rabernat commented Oct 11, 2017

> Although it works for the NCAR/CCSM4 model, it will not work for the models with multiple ensemble members. This will require adding a new 'member' dimension.
>
> The notebook also needs various checks for the existence of the correct variables as well as debugging for all possible calendars (leap, no leap, 360 days/year, etc).

@spencerahill: any thoughts on how aospy might fit into the CMIP data processing workflow that Naomi is describing above?

@spencerahill

> Although it works for the NCAR/CCSM4 model, it will not work for the models with multiple ensemble members. This will require adding a new 'member' dimension.
>
> The notebook also needs various checks for the existence of the correct variables as well as debugging for all possible calendars (leap, no leap, 360 days/year, etc).

> any thoughts on how aospy might fit into the CMIP data processing workflow that Naomi is describing above?

These kinds of things are definitely in the spirit of aospy, but it may be overkill to implement a full aospy-based workflow in this case. Frankly, we require a fair amount of boilerplate at present, which I think cuts against the intent of the notebook. (Perhaps, @rabernat, a separate notebook specifically presenting an aospy workflow for automating calculations across CMIP runs would be welcome?)

For ensemble members, this is where xarray's data model shines: insert a new dimension, called e.g. ens_mem, into each dataset, and then combine them via xr.concat. It could be a dimension without a coordinate, in which case it simply indexes the different ensemble members. Or you could add an ens_mem coordinate as well, storing each ensemble member's string identifier, which for CMIP data looks like 'r1i1p1', 'r2i1p1', etc. Also, aospy's support for ensemble members is currently broken: see spencerahill/aospy#76 😆
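A minimal sketch of that pattern (the paths and member IDs are illustrative):

import pandas as pd
import xarray as xr

members = ['r1i1p1', 'r2i1p1', 'r3i1p1']
datasets = [xr.open_mfdataset('CCSM4/%s/*.nc' % m) for m in members]

# concatenate over a new 'ens_mem' dimension labeled with the member IDs
ds_ens = xr.concat(datasets, dim=pd.Index(members, name='ens_mem'))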

For the calendars, again xarray does the heavy lifting: so long as the calendars and time data are CF-compliant (which all CMIP data should be), xarray should decode them properly without much trouble. Where aospy adds additional functionality is in dealing with dates outside the (dare I say narrow) range of dates permissible for np.datetime64 (@naomi-henderson, see e.g. here for background). This is not an issue for the example currently given in the notebook, but it is in many other cases.

See e.g. aospy.utils.times.numpy_datetime_workaround_encode_cf, which re-encodes the time array of a not-yet-CF-decoded Dataset with out-of-bounds dates so that, when it is decoded, the time array starts at (nearly) the earliest permissible date. @spencerkclark has much more expertise on calendars/datetimes than I do, both within aospy and without, so Spencer, feel free to chime in.
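To make the limitation concrete: np.datetime64 at nanosecond resolution can only represent dates from roughly 1678 to 2262, so CF times like 'days since 0001-01-01' in long control runs overflow when decoded. A conceptual sketch of the re-encoding idea (not aospy's actual implementation; path is a placeholder, and the decoded dates end up offset from the true model dates):

import xarray as xr

ds = xr.open_dataset(path, decode_times=False)  # keep time as raw numbers
# shift the reference date into the representable range before decoding
ds['time'].attrs['units'] = 'days since 1678-01-01'
ds = xr.decode_cf(ds)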

@naomi-henderson

Thanks @spencerahill for the hints on creating a new dimension for ensemble member.

I have now tried the eddy covariance notebook on cheyenne, using JupyterLab and the Dask dashboard. @mrocklin and @rabernat, I am certainly having performance issues, but it looks like most of the time is spent reading in the CMIP5 data from '/glade/p/CMIP/CMIP5/output1/...'

The open_mfdataset call
%time ds = xr.open_mfdataset(all_files, preprocess=set_correct_coords)
takes over 5 min on cheyenne:

CPU times: user 4.93 s, sys: 3.17 s, total: 8.1 s
Wall time: 5min 13s

Whereas the same call on my linux box here at Lamont (data stored on an internal 4 TB drive) takes less than 10 s:

CPU times: user 5.91 s, sys: 5.51 s, total: 11.4 s
Wall time: 8.31 s

However the resampling and load() from 6-hourly to monthly
%time dsr = ds.resample(freq='M', dim='time').load()
on cheyenne takes:

CPU times: user 2min 24s, sys: 2min 11s, total: 4min 35s
Wall time: 3min 26s

compared to a slightly longer time on my linux box:

CPU times: user 4min 12s, sys: 5min 44s, total: 9min 57s
Wall time: 4min 15s

This is using the sample dask.sh script: two nodes, 36 cores each, on cheyenne.

@mrocklin

How many files are in your dataset? @rabernat, is this possibly the slow-filesystem issue, where we might want to avoid opening every individual file to check metadata?

@pwolfram

Is this related to the 'too many open files' issue we had in xarray a while back?

@naomi-henderson commented Oct 13, 2017

In this test case, I open only 12 files in order to process a single year of a 55-year dataset. Is there a way to turn off the metadata and coordinate checking entirely? Ryan and I did pre-process the files, trying to avoid the overhead of checking metadata.

Since it is only a problem on cheyenne and not on my linux box, I was assuming it was something in the file system. The directory structure and files are the same on both.

@pwolfram

I'm guessing that if you try %prun ds = xr.open_mfdataset(all_files, preprocess=set_correct_coords, autoclose=False) you get the same results, because autoclose=False by default: https://github.com/pydata/xarray/blob/27132fba1e465955e76ed155ba6cbb769c0904df/xarray/backends/api.py#L135. So this is something else, as @rabernat notes.

@mrocklin

Seeing all the time being spent in _thread.lock.acquire is a sign that this code is actually running within a multi-threaded system (like local-machine dask) and that the profiler can't easily see inside it.

The solution is to tell dask to use a single local thread and then try again:

import dask
dask.set_options(get=dask.local.get_sync)

I get the following profile output:

         33579 function calls (32771 primitive calls) in 18.746 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    12065   17.978    0.001   17.978    0.001 {method 'is_dir' of 'posix.DirEntry' objects}
      2/1    0.300    0.150    0.326    0.326 {built-in method _imp.create_dynamic}
        6    0.209    0.035    0.209    0.035 {built-in method _operator.getitem}
       38    0.098    0.003    0.098    0.003 {built-in method posix.lstat}
        1    0.074    0.074    0.074    0.074 netCDF4_.py:182(_open_netcdf4_group)
      835    0.029    0.000   18.012    0.022 glob.py:114(_iterdir)
        3    0.011    0.004    0.011    0.004 {method 'read' of '_io.FileIO' objects}
       54    0.007    0.000    0.007    0.000 {built-in method posix.stat}
       82    0.004    0.000    0.004    0.000 {built-in method posix.scandir}
        3    0.004    0.001    0.016    0.005 <frozen importlib._bootstrap_external>:830(get_data)
        2    0.003    0.001    0.003    0.001 {netCDF4._netCDF4.num2date}
        2    0.002    0.001    0.002    0.001 conventions.py:222(nctime_to_nptime)
  350/216    0.001    0.000    0.002    0.000 indexing.py:363(shape)
       49    0.001    0.000    0.001    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}
      238    0.001    0.000    0.001    0.000 posixpath.py:73(join)
     2243    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     31/4    0.001    0.000    0.002    0.000 sre_parse.py:470(_parse)
      717    0.001    0.000    0.001    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
       36    0.001    0.000    0.001    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Dataset' objects}
       38    0.001    0.000    0.008    0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
       26    0.000    0.000    0.009    0.000 <frozen importlib._bootstrap>:861(_find_spec)
       43    0.000    0.000   17.779    0.413 glob.py:79(_glob1)
       14    0.000    0.000    0.002    0.000 netCDF4_.py:220(open_store_variable)
   145/39    0.000    0.000    0.236    0.006 glob.py:132(_rlistdir)

@mrocklin

So there is a ton of filesystem activity going on (which is not surprising). I'm also working through the notebook now with a cluster active and am seeing some odd behavior. If anyone with knowledge of XArray internals has some time, I would appreciate a video chat where I screenshare and we look at a few things. There is a lot of short, sequential behavior going on.

cc @jhamman or @shoyer if he is interested in seeing things run on a cluster.

@mrocklin

This line can take a surprisingly long time when we first start up:

all_files = glob(os.path.join(ddir, '**/*/latest/[h,u,v]*/*1960*.nc'), recursive=True)

@mrocklin

The odd behavior that I'm seeing is tons of very small sequential computations. It is as though somewhere in XArray there is a for loop calling compute many times over each of the chunks.

@rabernat

Is that sequential computation happening during open_mfdataset?

@mrocklin

Yes, also resample

@mrocklin

If you'd like to see, I can easily spin up a cluster and screenshare.

@mrocklin

Or maybe I'll make a quick screencast

@rabernat commented Oct 13, 2017

Neither of these operations is parallelized. They are basically loops.

open_mfdataset just opens each file in a list and then loops over them again when concatenating.

resample works like groupby: it loops over each group. The inability to parallelize groupby is a known issue (pydata/xarray#585), which may have just very recently been fixed by pydata/xarray#1517. (Not sure that PR gets all the way to groupby, though.)
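Conceptually, the serial pattern looks something like this (a sketch, not xarray's actual source; paths is a placeholder):

# open_mfdataset: one file at a time, then a serial concatenation
datasets = [xr.open_dataset(p) for p in paths]
combined = xr.concat(datasets, dim='time')

# groupby/resample: the reduction is applied group by group in a loop
means = [grp.mean('time') for _, grp in combined.groupby('time.month')]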

You have arrived at one of the cruxes of xarray's performance bottlenecks. Welcome!

@rabernat

> If you'd like to see, I can easily spin up a cluster and screenshare.

I personally don't need to see it. I think I know exactly what your bokeh dashboard will show, because I see it myself every time I do these things: serial dispatching of dask tasks to the cluster.

A screencast would be very useful, however, to communicate the problem to a wider audience (e.g. xarray devs).

@mrocklin

@rabernat, can you point me to that loop?

@rabernat

> @rabernat, can you point me to that loop?

You mean the one in groupby? I'm pretty sure the core loop is here:
https://github.com/pydata/xarray/blob/master/xarray/core/groupby.py#L609

@jhamman commented Oct 13, 2017

@mrocklin - I'm guessing you're working off the last release of xarray. I think pydata/xarray#1551 and a few other associated improvements may help with some of what you're seeing here. These changes are on the master branch now and may be worth trying out.

@mrocklin

Yes, that helped considerably. I also made the chunk sizes along time smaller, which both helped a lot with the parallelism and avoided crashing the main machine:

ds = xr.open_mfdataset(all_files, chunks={'time': 10}, preprocess=set_correct_coords)

The computations at the end now happen in around 20-30 seconds. I think the next largest performance boost would be to use persist so that we can keep data in RAM rather than recompute it each time.
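A sketch of what that would look like, assuming an xarray version with a .persist() method:

# keep the monthly means in distributed memory so the plotting cells
# reuse them instead of re-reading and re-reducing from disk
dsr = ds.resample(freq='M', dim='time').persist()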

@mrocklin

Yeah, most of the time in the plotting lines is just redoing the data loading from disk.

Updated Notebook: https://gist.github.com/4800484676155745f3cf318bce2e78a4

@shoyer commented Oct 13, 2017

Currently, a call to open_mfdataset calls open_dataset many times in a loop, which means that coordinate data is loaded serially to create indexes. We could significantly speed this up by refactoring to either:

  1. load coordinate data in parallel, or
  2. skip loading coordinate data in cases where we know it should be consistent ahead of time. This should probably use a defined standard like NcML. Googling turns up a few Python libraries: 1, 2. It would probably be a good idea to chat with the THREDDS developers about this, too.

@naomi-henderson

Much better, @mrocklin! The new version is also working much faster on my machine.

However, I am having trouble with your new resample line:

dsr = ds.resample(time='M').mean()

which does not work on my linux box:

TypeError                                 Traceback (most recent call last)
<timed exec> in <module>()

TypeError: resample() got an unexpected keyword argument 'time'

Instead I still need:

ds.resample(freq='M', dim='time')

Is this a difference in our versions of xarray? (I'm on 0.9.6.)

@mrocklin

Yes, I think so. I got a warning message about this when I switched to xarray master. I suspect that if you also switch, things will get even faster for you as well.

pip install git+https://github.com/pydata/xarray.git --upgrade

@rabernat

Now that we have the sphinx documentation site working, this use-case notebook can be added into docs/use_cases/.

@naomi-henderson: would you be comfortable making a pull request to add the latest and greatest version of your notebook to this repo?

@naomi-henderson

@rabernat: I will give it a try, but it will have to wait until tomorrow. I will let you know if I run into any trouble with Sphinx.

@rabernat

Great! You don't necessarily have to build the docs yourself locally. You can just add your notebook to the docs/use_cases/ folder and then add it to the toctree in docs/index.rst.

stale bot commented Jun 15, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Jun 15, 2018
stale bot commented Jun 22, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

stale bot closed this as completed on Jun 22, 2018
TomAugspurger pushed a commit that referenced this issue Jul 20, 2020